Record an incident where a network request connection timed out

This article is reprinted from the WeChat public account "Porter to Architecture", author cocodroid. Please contact the WeChat public account "Porter to Architecture" to reprint this article.

Preface

This article analyzes the cause of a production incident from the angles of HTTP request timeouts, the retry mechanism, and operating system networking, and describes how the business problem was finally solved.

Here are two questions:

1) Have you ever encountered production incidents due to network connection or request timeouts?

2) Do you know how many seconds the default network connection timeout is for the operating system?

Think about it first, and then write your answer in the comment section.

Background

Recently, a colleague ran into the following problem in a simple business scenario:

Service A calls interface m of service B over HTTP. Service A starts a scheduled task:

The task queries more than 1,200 records from the database. Each record corresponds to one request, and interface m is called in a loop. On receiving a request, service B connects over TCP to other server machines and exchanges commands with them. Note that the requests are deliberately not sent concurrently: with asynchronous concurrent requests, service B's processing threads could be exhausted quickly, leaving it unable to handle further requests and possibly causing large-scale request timeouts or service downtime.

The scheduled task started on time. Not long afterwards, service A's requests to B began to hang and eventually timed out.

The timeout recorded in the log is Read timed out.

Service A could still query the DB and other services normally, but the interaction between service A and service B is also very important. A problem between the two inevitably affects business processing or the system as a whole.

So why is this happening and what are the issues involved here?

Problem Solving

1. Retry mechanism accelerates the occurrence of problems

Checking service A's logs in ELK, we found that the number of exception logs had increased sharply.

Exception log details:

The log points to org.apache.http.impl.execchain.RetryExec, so the problem appears to be related to the HTTP retry mechanism.

The RetryExec source code shows that when a request executes normally, the result is returned immediately; if an IOException is thrown, the retry phase begins.

Note that the retry for loop has no upper bound of its own. Whether to retry, and how many times, is decided by the configured retry handler. When the handler decides not to retry, the exception is thrown directly.
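The control flow described above can be sketched in plain Java. This is a simplified illustration, not the actual RetryExec source: RetryLoopSketch, RetryPolicy and Call are hypothetical names that stand in for the HttpClient classes.

```java
import java.io.IOException;

class RetryLoopSketch {

    /** Hypothetical stand-in for a retry handler: decides whether to try again. */
    interface RetryPolicy {
        boolean shouldRetry(IOException e, int execCount);
    }

    /** Hypothetical stand-in for the request being executed. */
    interface Call<T> {
        T execute() throws IOException;
    }

    /** The loop itself has no upper bound; only a success or the policy ends it. */
    static <T> T executeWithRetry(Call<T> call, RetryPolicy policy) throws IOException {
        for (int execCount = 1; ; execCount++) {
            try {
                return call.execute();      // normal response: return immediately
            } catch (IOException e) {
                if (!policy.shouldRetry(e, execCount)) {
                    throw e;                // policy says stop: propagate the exception
                }
                // otherwise loop again with the next execution count
            }
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        try {
            executeWithRetry(() -> {
                attempts[0]++;
                throw new IOException("simulated failure");
            }, (e, execCount) -> execCount < 3);
        } catch (IOException e) {
            System.out.println("gave up after " + attempts[0] + " attempts");
        }
    }
}
```

With the policy used in main, the call is attempted three times before the exception reaches the caller. A policy that always returns true would loop forever on a permanently failing call, which is why the handler must enforce a limit.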

View the custom implementation source code of RetryHandler:

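    import java.io.IOException;
    import org.apache.http.NoHttpResponseException;
    import org.apache.http.client.HttpRequestRetryHandler;
    import org.apache.http.protocol.HttpContext;
    import org.slf4j.Logger;          // assumed: the LoggerFactory call matches SLF4J
    import org.slf4j.LoggerFactory;
    import org.springframework.stereotype.Component;

    @Component
    public class HttpRequestRetryHandlerServer implements HttpRequestRetryHandler {
        protected static final Logger LOG = LoggerFactory.getLogger(HttpRequestRetryHandlerServer.class);

        @Override
        public boolean retryRequest(IOException e, int retryCount, HttpContext httpCtx) {
            // Stop once the request has already been executed 3 times
            if (retryCount >= 3) {
                LOG.warn("Maximum tries reached, exception would be thrown to outer block");
                return false;
            }
            // Retry only when the server returned no HTTP response at all
            if (e instanceof NoHttpResponseException) {
                LOG.warn("No response from server on {} call", retryCount);
                return true;
            }
            return false;
        }
    }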

The source code shows that the handler gives up once the execution count reaches 3 (at most three attempts in total) and that it retries only on NoHttpResponseException. As the name suggests, this exception means that no HTTP response was received (the source code comment reads: Signals that the target server failed to respond with a valid HTTP response.).

So why does service A enter the retry process?

The exception above tells us that the problem is not a connection timeout: the request was sent normally but, for some reason, no valid response came back. The earlier Read timed out exception indicates a read timeout, so the cause is probably related to the data transfer (read) timeout setting.

View the default configuration:

This shows that 6 seconds is the maximum time to wait for data (the read timeout). If an HTTP request waits more than 6 seconds for response data, the request is interrupted and a Read timed out exception is thrown. This largely explains the exception.
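The read timeout behavior can be reproduced with the JDK alone. The following is a minimal sketch, assuming a local listener that accepts the TCP connection but never sends a byte. ReadTimeoutSketch and readFromSilentServer are hypothetical names, and Socket.setSoTimeout plays the role that socketTimeout plays in the HTTP client configuration.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

class ReadTimeoutSketch {

    /**
     * Connects to a local listener that never sends data, then reads with the
     * given timeout. Returns the message of the resulting timeout exception.
     */
    static String readFromSilentServer(int readTimeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            client.setSoTimeout(readTimeoutMillis);  // maximum time a single read may block
            try {
                client.getInputStream().read();      // blocks: the server never answers
                return "unexpected data";
            } catch (SocketTimeoutException e) {
                return e.getMessage();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // A short timeout keeps the demo fast; the article's original setting was 6000 ms.
        System.out.println(readFromSilentServer(500));
    }
}
```

The connection itself succeeds here, which matches the incident: the connect phase is fine, and only the wait for response data exceeds the limit.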

2. Retry mechanism accelerates the occurrence of problems - solution:

After analyzing the scenario, we made the following adjustments:

1) The HTTP request does not need to be retried in this scenario, so retries are disabled:

    @Bean
    public CloseableHttpClient noRetryHttpClient(HttpClientBuilder clientBuilder) {
        // The number of retries is 0, so no retries are performed
        clientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(0, false));
        return clientBuilder.build();
    }

2) For this business request, service B may take more than 6 seconds to process, so socketTimeout is raised to 15 seconds:

    # http pool config
    http:
      maxTotal: 500
      defaultMaxPerRoute: 100
      connectTimeout: 5000
      connectionRequestTimeout: 3000
      socketTimeout: 15000
      maxIdleTime: 1
      keepAliveTimeOut: 65

3. The machine connection timeout

Next, let's look at what happened on the service B side. Why did processing take so long?

When service A sends an HTTP request, service B receives it and connects to the target server to exchange data. Service B communicates with the server machine over SSH on top of TCP, that is, over the network.

Checking the log of service B:

The log shows that an exception occurred while connecting to the server, and that the connection attempt took about 63 seconds. We also confirmed that the target server was indeed not working: it had been down for a long time.

The connection is made from a Unix-like platform; the operating system is Linux (CentOS). So why is the timeout 63 seconds rather than 5, 15, or 60 seconds?

Next, look at the code that makes the network connection:

    connection.connect();

No connection timeout or other parameters are specified here, so the default parameters of the operating system kernel apply.

By default, establishing a TCP connection on Linux times out after 127 seconds. This is usually too long for a client, so most applications do not rely on the default and instead set a connection timeout that suits the business scenario. So where does this value come from? Why 127?

This value is determined by the net.ipv4.tcp_syn_retries setting.

net.ipv4.tcp_syn_retries specifies how many times the kernel retransmits the SYN packet after the first send when an application calls connect() and the peer does not return SYN + ACK (that is, on timeout). It therefore also determines the total waiting time.

The default on Linux is net.ipv4.tcp_syn_retries = 6. When the local machine actively opens a connection (sends the first SYN of the TCP three-way handshake) and the peer never returns SYN + ACK, the application waits at most 127 seconds:

After the 1st SYN is sent, wait 1s (2 to the power of 0). If it times out, retry.

After the 2nd send, wait 2s (2 to the power of 1). If it times out, retry.

After the 3rd send, wait 4s (2 to the power of 2). If it times out, retry.

After the 4th send, wait 8s (2 to the power of 3). If it times out, retry.

After the 5th send, wait 16s (2 to the power of 4). If it times out, retry.

After the 6th send, wait 32s (2 to the power of 5). If it times out, retry.

After the 7th send, wait 64s (2 to the power of 6). If it times out, the connection attempt fails.

Next, check the TCP SYN parameters on our machine:

On our server tcp_syn_retries is set to 5, so the default timeout is 1+2+4+8+16+32 = 63 seconds. This matches the problem exactly and explains the 63-second timeout.
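The arithmetic can be checked with a few lines of Java. This is a sketch of the calculation only, assuming the 1-second initial wait and the doubling described above. SynBackoffSketch and connectTimeoutSeconds are hypothetical names, not kernel or library APIs.

```java
class SynBackoffSketch {

    /**
     * Total seconds a blocking connect() waits before failing:
     * one initial SYN plus synRetries retransmissions, with waits of 1s, 2s, 4s, ...
     */
    static int connectTimeoutSeconds(int synRetries) {
        int total = 0;
        int wait = 1;                       // wait after the first SYN, in seconds
        for (int send = 0; send <= synRetries; send++) {
            total += wait;
            wait *= 2;                      // each wait is double the previous one
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(connectTimeoutSeconds(6)); // 127 (Linux default)
        System.out.println(connectTimeoutSeconds(5)); // 63  (the server in this incident)
    }
}
```

In closed form, the total is 2^(tcp_syn_retries + 1) - 1 seconds.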

4. What about the Windows platform?

(Originally, I didn’t plan to elaborate on this part. I hope readers can look up the information on their own, but I’ll still make it complete.)

Because I use Windows 10 as a development machine, I want to know what the timeout is on Windows. I wrote a test case and found that it was about 21 seconds. What is the principle behind this?

The relevant documentation says:

TcpMaxConnectRetransmissions

Determines how many times TCP retransmits an unanswered request for a new connection. TCP retransmits new connection requests until they are answered, or until this value expires.

TCP/IP adjusts the frequency of retransmissions over time. The delay between the original transmission and the first retransmissions for each interface is determined by the value of TcpInitialRTT (by default, it is 3 seconds). This delay is doubled after each attempt. After the final attempt, TCP/IP waits for an interval equal to double the last delay and then abandons the connection request.

TcpInitialRTT

Determines how quickly TCP retransmits a connection request if it doesn't receive a response to the original request for a new connection.

By default, the retransmission timer is initialized to 3 seconds, and the request (SYN) is sent twice, as specified in the value of TcpMaxConnectRetransmissions.

According to this documentation, on Windows the timeout is controlled by two parameters, TcpMaxConnectRetransmissions and TcpInitialRTT. The default value of TcpMaxConnectRetransmissions is generally 2, and the default value of TcpInitialRTT is 3 seconds.

That is, there are two retransmissions, each wait is twice as long as the previous one, and the total is 3 + 3*2 + (3*2)*2 = 3 + 6 + 12 = 21 seconds.

The Windows parameters can be queried with the following command:

    netsh interface tcp show global

The maximum number of SYN retransmissions on my company's development machine is 2, but on my personal machine it is 4 (then the default connection timeout is: 3+6+12+24+48=93 seconds). Although both are Windows 10 systems, I don't know why they are different.
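Both figures follow from the same doubling rule and can be verified with a short sketch. It assumes the behavior quoted from the documentation above (an initial wait of TcpInitialRTT seconds that doubles after every attempt). WindowsSynTimeoutSketch and its method are hypothetical names.

```java
class WindowsSynTimeoutSketch {

    /**
     * Closed form of initialRtt * (1 + 2 + 4 + ... + 2^n), where n is the
     * maximum number of SYN retransmissions.
     */
    static int connectTimeoutSeconds(int initialRttSeconds, int maxSynRetransmissions) {
        return initialRttSeconds * ((1 << (maxSynRetransmissions + 1)) - 1);
    }

    public static void main(String[] args) {
        System.out.println(connectTimeoutSeconds(3, 2)); // 21 (company development machine)
        System.out.println(connectTimeoutSeconds(3, 4)); // 93 (personal machine)
    }
}
```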

5. The machine connection timeout - solution:

When service B connects to the server, set the connection timeout to 5 seconds:

    connection.connect(null, 5000, kexTimout);

This way, if the connection cannot be established within 5 seconds, a timeout exception is thrown, resources are released early, and the processing thread is no longer blocked.
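The same idea applies to any TCP client, not only to the SSH library used by service B. The following is a minimal sketch using the JDK's java.net.Socket rather than the article's connection object. ConnectTimeoutSketch and canConnect are hypothetical names.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class ConnectTimeoutSketch {

    /**
     * Tries to open a TCP connection, giving up after timeoutMillis instead of
     * waiting for the operating system's SYN retransmission schedule to run out.
     */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers SocketTimeoutException (no answer in time) and
            // ConnectException (for example, connection refused).
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo against a throwaway local listener so the example needs no external network.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            boolean ok = canConnect(server.getInetAddress().getHostAddress(),
                                    server.getLocalPort(), 5000);
            System.out.println("connected: " + ok);
        }
    }
}
```

Against a host that silently drops SYN packets, the call returns after roughly timeoutMillis rather than after the 63 or 127 seconds dictated by the kernel defaults.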

6. Results:

After these adjustments and optimizations, the services were re-released and verified, and they have been running stably without any anomalies.

Perfect!

Summary

1) Although the HTTP retry mechanism of service A was not the root cause of this incident, it did accelerate the problem.

Therefore, be clear about whether a retry mechanism is actually needed. If it is not, do not enable it; otherwise it wastes resources and may even overwhelm the service provider's system.

2) Network calls generally go over TCP or HTTP. To keep timeouts from affecting the business or even causing serious problems such as service downtime, set reasonable timeouts (connection timeout, data transfer timeout, and so on), because the operating system's defaults are generic and do not take specific business scenarios into account.

Network data transfer time: Business scenarios differ widely. For example, the configured data transfer timeout of 6 seconds may not be reasonable in practice and needs to be adjusted to the actual situation. In my case, it was raised to 15 seconds.

Network connection timeout: On Windows, the default connection timeout is generally 21 seconds. Linux (CentOS) uses a stepped backoff mechanism with a default of 127 seconds; on my machine it is 63 seconds.

3) Learn from the operating system's timeout mechanism. For example, on Linux and Windows, each connection retry waits a multiple of the previous wait. The same idea can be applied in your own business development.
