Hard-core guide: HTTP timeouts and duplicate requests, the pitfalls and solutions you must know


1 Timeouts: an unavoidable pain

An HTTP call executes a network request over the HTTP protocol. Any network request can time out (your network may be lagging, or the server's may be), so pay attention to the following during development:

  • Is the framework's default timeout reasonable?
  • Too short, and the request fails before the server has had a chance to process it.
  • Too long, and you keep waiting on a request that has already exceeded its normal response time and effectively died.
  • Given network instability, a timed-out request can be retried later, for example by a scheduled task.
  • Pay attention to idempotency in the server's interface design, that is, whether retries are safe.
  • Check whether the framework limits concurrent connections the way browsers do, so that HTTP call concurrency does not become a bottleneck under high load.

1.1 HTTP call framework technology selection

  • Spring Cloud Family Bucket

Use Feign for declarative service calls.

  • Using only Spring Boot

Use an HTTP client such as Apache HttpClient to make service calls.

1.2 Connection timeout and read timeout parameters

Although the application-layer protocol is HTTP, the transport underneath is always TCP/IP, a connection-oriented protocol that requires a connection to be established before data is transmitted. Network frameworks therefore provide two timeout parameters:

  • Connection timeout parameter ConnectTimeout

The maximum time to wait for a connection to be established.

  • Read timeout parameter ReadTimeout

The maximum time to wait for data to be read from the socket.
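As a concrete sketch of these two knobs, here is how they map onto the JDK's built-in java.net.http.HttpClient (the endpoint URL is hypothetical; note the JDK client exposes a per-request overall timeout rather than a separate socket read timeout, but it plays the same role here):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class TimeoutConfig {
    public static void main(String[] args) {
        // Connection timeout: max time to establish the TCP connection (1-5s is usually enough)
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

        // Per-request timeout: max time to wait for the response after sending the request
        // (the JDK client has no separate socket read timeout; this overall bound plays that role)
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:45678/server")) // hypothetical endpoint
                .timeout(Duration.ofSeconds(3))
                .GET()
                .build();

        System.out.println(client.connectTimeout().orElseThrow()); // PT2S
        System.out.println(request.timeout().orElseThrow());       // PT3S
    }
}
```

Apache HttpClient expresses the same pair through RequestConfig's connect and socket timeout settings; whichever client you use, both values should be set explicitly.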

1.3 Common pitfalls

Setting the connection timeout too long

For example, 60 seconds. The TCP three-way handshake completes very quickly, anywhere from milliseconds to a few seconds at most. It cannot legitimately take tens of seconds; if it does, the cause is usually a network or firewall configuration problem, and if you cannot connect within a few seconds you probably never will. A very long connection timeout is therefore pointless: 1 to 5 seconds is enough.

For a pure intranet call you can set it even shorter, so the caller fails fast when the downstream service is unreachable.

Troubleshooting connection timeouts without checking what you are actually connecting to

A service usually has multiple nodes. If the client reaches the server through client-side load balancing, it establishes the TCP connection directly with a server node, so a connection timeout most likely indicates a problem on the server side.

If the service sits behind an Nginx reverse proxy for load balancing, the client actually connects to Nginx rather than the server; when a connection timeout occurs, check Nginx instead.

Read timeout parameters and read timeout pitfalls

Assuming that once a read timeout occurs, the server program's normal execution is interrupted

Case

The client calls the server interface through HttpClient; the client's read timeout is 2 seconds, while the server interface takes 5 seconds to execute.

After calling the client interface, check the log:

  • A SocketTimeoutException occurs on the client after 2 seconds, indicating a read timeout

  • The server nonetheless completes its execution calmly 3 seconds later

The Tomcat web server submits incoming requests to a thread pool for processing. Once the server has received a request, a network-level timeout or disconnection does not affect its execution. So when a read timeout occurs, you cannot assume anything about the server's processing state; decide how to proceed based on the business context.
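This decoupling can be demonstrated with a minimal socket sketch (all numbers scaled down): the client's read times out after 200ms, yet the server thread, whose "business logic" takes 500ms, still runs to completion:

```java
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

public class ReadTimeoutDemo {
    static final AtomicBoolean serverFinished = new AtomicBoolean(false);

    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0); // ephemeral port

        Thread serverThread = new Thread(() -> {
            try (Socket s = server.accept()) {
                Thread.sleep(500);                 // simulate 500ms of business logic
                serverFinished.set(true);          // the server still completes normally
                s.getOutputStream().write("done".getBytes());
            } catch (Exception ignored) { }
        });
        serverThread.start();

        boolean clientTimedOut = false;
        try (Socket client = new Socket("localhost", server.getLocalPort())) {
            client.setSoTimeout(200);              // read timeout: 200ms < 500ms processing time
            client.getInputStream().read();        // blocks, then throws
        } catch (SocketTimeoutException e) {
            clientTimedOut = true;                 // the client gave up waiting...
        }

        serverThread.join();
        // ...but the server finished its work anyway
        System.out.println(clientTimedOut + " " + serverFinished.get()); // true true
        server.close();
    }
}
```

The read timeout only governs how long the client is willing to wait; it sends no signal that would stop the server.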

Treating the read timeout as a purely socket-level concept, i.e. the maximum time for data transmission, and therefore configuring it very short.

For example, 100ms.

When a read timeout occurs, the network layer cannot distinguish between these causes:

  • The server did not return data to the client
  • The data was delayed or lost on the network

However, TCP transmits data only after the connection is established, so for service calls in reasonably healthy networks you can assume:

  • Connection timeout

A network problem, or the service is not online.

  • Read timeout

The service took too long to process the request. The read timeout is how long we wait for data to come back on the socket after writing our request to it, and that wait consists mostly of the time the server spends executing its business logic.

Assuming that the longer the read timeout, the higher the interface's success rate

HTTP requests generally need to return a result and are synchronous calls.

If the timeout is very long, the client thread (usually a Tomcat thread) blocks while waiting for the server to return data. When many downstream calls time out at once, the program accumulates a large number of blocked threads and can eventually be dragged down and crash.
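A toy simulation of this thread-pinning effect, using a hypothetical 2-thread pool standing in for the Tomcat pool: each blocked "downstream call" occupies a thread for its full duration, so throughput is capped at pool size divided by downstream latency:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BlockedThreadsDemo {
    // Returns the elapsed milliseconds for 6 "calls" on a 2-thread pool
    static long run() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2); // a tiny stand-in for the Tomcat pool
        CountDownLatch done = new CountDownLatch(6);

        long start = System.nanoTime();
        for (int i = 0; i < 6; i++) {
            pool.submit(() -> {
                try {
                    Thread.sleep(100); // the thread is pinned while "waiting for the downstream"
                } catch (InterruptedException ignored) { }
                done.countDown();
            });
        }
        done.await();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        // 6 calls / 2 threads => 3 sequential waves of ~100ms each, so at least ~300ms total
        System.out.println("6 calls on 2 threads took ~" + run() + "ms");
    }
}
```

Scale the numbers up (200 Tomcat threads, 30-second timeouts) and it is easy to see how a slow downstream exhausts the caller's entire pool.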

  • For scheduled or asynchronous tasks, a long read timeout is not a big problem
  • However, user-facing requests and synchronous microservice calls generally see high concurrency, so set a shorter read timeout to avoid being dragged down by a slow downstream service; the read timeout is usually kept under 30 seconds

Someone may ask in the comments: if the read timeout is set to 2 seconds and the server interface takes 3 seconds, won't we never get the execution result?

Indeed, so setting the read timeout should be based on the actual situation:

If it is too long, downstream jitter will drag you down.

If it is too short, it hurts the success rate. Sometimes you even need to set different client read timeouts for different server interfaces based on the downstream service's SLA.

1.4 Best Practices

The connection timeout covers establishing the TCP connection; the read timeout covers waiting for the remote end to return data, including the remote program's processing time. When troubleshooting connection timeouts, first figure out what you are actually connecting to. When facing read timeouts, weigh the downstream service's SLA against your own and set an appropriate value. And when using frameworks such as Spring Cloud Feign, always verify that the connect and read timeout configuration is correct and actually taking effect.

2 Feign & Ribbon

2.1 How to configure timeout

The difficulty in configuring timeout parameters for Feign is that Feign itself has two timeout parameters, and the load balancing component Ribbon it uses also has related configurations. What is the priority of these configurations?

2.2 Examples

  • The test server interface simply sleeps for 10 minutes, guaranteeing a timeout

  • A Feign client declares this interface:

  • The client calls the interface through the Feign Client

When the configuration file only specifies the server address:

    clientsdk.ribbon.listOfServers=localhost:45678

The following output is obtained:

    [21:46:24.222] [http-nio-45678-exec-4] [WARN] [ogtchfFeignAndRibbonController:26] - Execution time: 222ms Error: Connect to localhost:45679 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused (Connection refused) executing POST http://clientsdk/feignandribbon/server

Feign's default read timeout is 1 second; such a short read timeout is a pitfall in itself.

Analyze the source code

Customize the two global timeouts of the Feign client

The following parameters can be set:

    feign.client.config.default.readTimeout=3000
    feign.client.config.default.connectTimeout=3000

After modifying the configuration and trying again, the following log is obtained:

    [http-nio-45678-exec-3] [WARN] [ogtchfFeignAndRibbonController:26] - Execution time: 3006ms Error: Read timed out executing POST http://clientsdk/feignandribbon/server

The 3-second read timeout takes effect.

Note: There is a big pitfall here. If you only want to modify the read timeout, you may only configure this line:

    feign.client.config.default.readTimeout=3000

Testing shows that this configuration alone does not take effect: to configure Feign's read timeout, you must also configure the connection timeout.

Looking at the FeignClientFactoryBean source code:

  • Request.Options is overridden only when ConnectTimeout and ReadTimeout are both set

To set a timeout for a single Feign Client, replace default with the name of the Client:

    feign.client.config.default.readTimeout=3000
    feign.client.config.default.connectTimeout=3000
    feign.client.config.clientsdk.readTimeout=2000
    feign.client.config.clientsdk.connectTimeout=2000

The per-client timeout overrides the global timeout:

    [http-nio-45678-exec-3] [WARN] [ogtchfFeignAndRibbonController:26] - Execution time: 2006ms Error: Read timed out executing POST http://clientsdk/feignandribbon/server

Besides configuring Feign, you can also modify the two timeouts through the Ribbon component's parameters.

Note that these parameter names start with a capital letter, unlike Feign's configuration:

    ribbon.ReadTimeout=4000
    ribbon.ConnectTimeout=4000

The logs can prove that the parameters are effective:

    [http-nio-45678-exec-3] [WARN] [ogtchfFeignAndRibbonController:26] - Execution time: 4003ms Error: Read timed out executing POST http://clientsdk/feignandribbon/server

Configuring Feign and Ribbon parameters at the same time: which one takes effect?

    clientsdk.ribbon.listOfServers=localhost:45678
    feign.client.config.default.readTimeout=3000
    feign.client.config.default.connectTimeout=3000
    ribbon.ReadTimeout=4000
    ribbon.ConnectTimeout=4000

What finally takes effect is Feign's timeout:

    [http-nio-45678-exec-3] [WARN] [ogtchfFeignAndRibbonController:26] - Execution time: 3006ms Error: Read timed out executing POST http://clientsdk/feignandribbon/server

Configure the timeout for both Feign and Ribbon, with Feign taking precedence.

In the LoadBalancerFeignClient source code:

If Request.Options is not the default value, a FeignOptionsClientConfig is created to replace Ribbon's DefaultClientConfigImpl, so the Ribbon configuration is overridden by Feign's.

But with the configuration below, the Ribbon timeout (4 seconds) ends up taking effect. Does Ribbon override Feign after all? No, this is the second pitfall again: Feign's read timeout does not take effect when configured on its own:

    clientsdk.ribbon.listOfServers=localhost:45678
    feign.client.config.default.readTimeout=3000
    feign.client.config.clientsdk.readTimeout=2000
    ribbon.ReadTimeout=4000

3 Ribbon automatically retry requests

Some HTTP clients have built-in retry strategies. The intent is good: packet loss due to network problems is frequent but short-lived, and a retry often succeeds.

But check whether this behavior matches your expectations.

3.1 Examples

An SMS was sent repeatedly, yet the user service that calls the SMS service confirmed there was no retry logic in its code.

So where exactly is the problem?

The SMS-sending interface is a GET request that sleeps for 2 seconds to simulate processing time:

Configure a Feign for the client to call:

There is a Ribbon component in Feign that is responsible for client load balancing. The server it calls is set to two nodes through the configuration file:

    SmsClient.ribbon.listOfServers=localhost:45679,localhost:45678

Client interface, calling the server through Feign

Start the server on ports 45678 and 45679 respectively, and then access the client interface of 45678 for testing. Because the client and server controllers are in the same application, 45678 plays the role of both client and server.

In the 45678 log we can see that at second 29 the client received the request and began calling the server's SMS interface, and the server received the request at the same moment. Two seconds later (compare the timestamps of the first and third log lines), the client output a read timeout error:

    [http-nio-45678-exec-4] [INFO] [cdRibbonRetryIssueClientController:23] - client is called
    [http-nio-45678-exec-5] [INFO] [cdRibbonRetryIssueServerController:16] - http://localhost:45678/ribbonretryissueserver/sms is called, 13600000000=>a2aa1b32-a044-40e9-8950-7f0189582418
    [http-nio-45678-exec-4] [ERROR] [cdRibbonRetryIssueClientController:27] - send sms failed: Read timed out executing GET http://SmsClient/ribbonretryissueserver/sms?mobile=13600000000&message=a2aa1b32-a044-40e9-8950-7f0189582418

In the log of the other server, 45679, a request also appears, 1 second after the client interface was called:

    [http-nio-45679-exec-2] [INFO] [cdRibbonRetryIssueServerController:16] - http://localhost:45679/ribbonretryissueserver/sms is called, 13600000000=>a2aa1b32-a044-40e9-8950-7f0189582418

The client interface log appears once, but the server log appears twice. And although Feign's default read timeout is 1 second, the client's timeout error occurs after 2 seconds: the first attempt timed out after 1 second, Ribbon automatically retried on the other node, and that attempt timed out after another second.

In other words, the client retried on its own initiative, causing the SMS to be sent twice.

3.2 Source Code Revealed

Looking at the Ribbon source code, the MaxAutoRetriesNextServer parameter defaults to 1: when a GET request fails on one server node (for example with a read timeout), Ribbon automatically retries it once on the next node:
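Ribbon's behavior can be sketched with a simplified, hypothetical retry loop (not Ribbon's actual code): one logical call with MaxAutoRetriesNextServer=1 executes the server-side handler twice, even though the caller contains no retry logic:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RetryNextServerDemo {
    static final AtomicInteger smsSent = new AtomicInteger(0);

    // Simulated server handler: the SMS side effect happens, then the call
    // "times out" from the client's point of view
    static void callServer() {
        smsSent.incrementAndGet();                    // side effect happens regardless
        throw new RuntimeException("Read timed out"); // response never reaches the client in time
    }

    // Simplified stand-in for Ribbon's retry-on-next-server loop
    static void callWithRetry(int maxAutoRetriesNextServer) {
        for (int attempt = 0; attempt <= maxAutoRetriesNextServer; attempt++) {
            try {
                callServer();
                return;
            } catch (RuntimeException e) {
                // swallow the failure and retry on the "next server"
            }
        }
    }

    public static void main(String[] args) {
        callWithRetry(1);                   // 1 is the default for GET requests
        System.out.println(smsSent.get());  // 2: the SMS was sent twice
    }
}
```

Setting the retry count to 0 (the fix below) would make the handler run exactly once.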

Solution

1. Change the SMS interface from GET to POST

API design specification: interfaces with side effects should not be exposed as GET. Per the HTTP protocol specification, GET is for data queries, while POST submits data to the server for modification or creation. The choice between GET and POST should be based on the API's behavior, not on parameter size.

  • A common misunderstanding: because GET parameters live in the URL query string and are subject to browser length limits, some developers use POST with a JSON body for large parameters and GET for small ones.

2. Set the MaxAutoRetriesNextServer parameter to 0 to disable automatic retries on the next server node after a service call fails. Add a line to the configuration file:

    ribbon.MaxAutoRetriesNextServer=0

Accountability

So, is the problem with the user service or the SMS service?

Maybe there are problems on both sides.

  • GET requests should be stateless or idempotent; the SMS interface could be designed to support idempotent calls
  • Had the user-service developers understood Ribbon's retry mechanism, they could have avoided a long detour in troubleshooting

Best Practices

Regarding retries, because the HTTP protocol considers Get requests to be data query operations and are stateless, and considering that network packet loss is a common occurrence, some HTTP clients or proxy servers will automatically retry Get/Head requests. If your interface design does not support idempotence, you need to turn off automatic retries. However, a better solution is to follow the recommendations of the HTTP protocol and use the appropriate HTTP method.
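One minimal sketch of such an idempotent design (all names are illustrative, not from the original services): the caller attaches a unique request id that stays the same across retries, and the server deduplicates on it. A production version would persist the ids with an expiry rather than keeping them in memory:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class IdempotentSmsService {
    // Request ids already processed; a real service would store these with a TTL
    private final ConcurrentMap<String, Boolean> processed = new ConcurrentHashMap<>();
    private int actuallySent = 0;

    // requestId is supplied by the caller and reused on every retry of the same logical request
    public synchronized boolean send(String requestId, String mobile, String message) {
        if (processed.putIfAbsent(requestId, Boolean.TRUE) != null) {
            return false; // duplicate (e.g. an automatic retry): do nothing
        }
        actuallySent++;   // in reality: hand the message to the SMS gateway
        return true;
    }

    public int sentCount() { return actuallySent; }

    public static void main(String[] args) {
        IdempotentSmsService svc = new IdempotentSmsService();
        String id = "a2aa1b32-a044-40e9-8950-7f0189582418"; // the request id seen in the logs above
        svc.send(id, "13600000000", "hello");
        svc.send(id, "13600000000", "hello"); // the retry is absorbed
        System.out.println(svc.sentCount()); // 1
    }
}
```

With this design in place, a client-side retry becomes harmless, so automatic retries no longer need to be disabled.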

4 Concurrency limits that cap crawler throughput

HTTP calls have another common pitfall: a limit on the number of concurrent connections caps the program's processing capacity.

4.1 Examples

In a certain crawler project, overall crawl throughput was very low; increasing the thread pool size did nothing, and the only apparent option was to pile on more machines.

Now let’s simulate this scenario and explore the nature of the problem.

Assume that the server to be crawled is a simple implementation like this, which sleeps for 1 second and returns the number 1:

The crawler calls this interface repeatedly to fetch data. To ensure the thread pool is not the concurrency bottleneck, it uses an unbounded newCachedThreadPool, submits the HTTP request tasks (executed with HttpClient) to the pool in a loop, waits for all of them to finish, and prints the elapsed time:

Use the default CloseableHttpClient constructed from PoolingHttpClientConnectionManager to test the time it takes to crawl 10 times.

Although each request takes 1 second, the thread pool can expand to any number of threads, so 10 concurrent requests should logically take roughly as long as a single request, about 1 second. The log, however, shows that it actually takes 5 seconds:

4.2 Source code analysis

The PoolingHttpClientConnectionManager source code has two important parameters:

  • defaultMaxPerRoute=2: the maximum number of concurrent requests to the same host/route is 2. Our crawler needs 10 concurrent requests, so the default is clearly too small and throttles the crawler.
  • maxTotal=20: the maximum concurrency across all hosts, i.e. HttpClient's overall concurrency. With only 10 requests here, 20 is not the bottleneck. But note the interaction: if the same HttpClient accesses 10 domains and defaultMaxPerRoute is set to 10, maxTotal must be raised to 100 for each domain to actually reach 10 concurrent requests.
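The effect of defaultMaxPerRoute can be reproduced without Apache HttpClient by guarding simulated requests with a Semaphore (a sketch, with 100ms standing in for the 1-second requests): 10 tasks behind 2 permits finish in about 500ms, the same 5x slowdown observed above, while 50 permits bring it back to about 100ms:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

public class PerRouteLimitDemo {
    static long crawl(int maxPerRoute, int requests, long requestMs) throws Exception {
        Semaphore route = new Semaphore(maxPerRoute);           // plays the role of defaultMaxPerRoute
        ExecutorService pool = Executors.newCachedThreadPool(); // threads are not the bottleneck
        CountDownLatch done = new CountDownLatch(requests);

        long start = System.nanoTime();
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                try {
                    route.acquire();                   // wait for a "connection" to this host
                    try {
                        Thread.sleep(requestMs);       // the simulated HTTP request
                    } finally {
                        route.release();
                    }
                } catch (InterruptedException ignored) {
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("maxPerRoute=2:  ~" + crawl(2, 10, 100) + "ms");  // roughly 500ms
        System.out.println("maxPerRoute=50: ~" + crawl(50, 10, 100) + "ms"); // roughly 100ms
    }
}
```

The permit count is the real throughput knob here, just as defaultMaxPerRoute is for the connection pool.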

HttpClient is a widely used HTTP client, so why are the defaults so conservative?

Many early browsers also limited concurrent requests to the same domain to two. This limit on concurrent connections actually comes from the HTTP/1.1 specification, which says:

  • Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.
  • The HTTP/1.1 specification was written about 20 years ago. HTTP servers are far more capable today, so newer browsers no longer fully comply with the 2-connection limit and raise it to 8 or even more.
  • If you need to initiate a large number of concurrent requests through an HTTP client, no matter what client you use, be sure to confirm whether the default concurrency of the client implementation meets your needs.

Now declare a new HttpClient without those restrictive defaults, setting maxPerRoute to 50 and maxTotal to 100, and rerun the earlier test with it:

The output shows the ten requests completing in about one second: relaxing the default limit of two concurrent requests per host greatly improves the crawler's efficiency.

4.3 Best Practices

If your client has a relatively large number of concurrent request calls, such as crawling, or acting as a proxy, or if the program itself has a high concurrency, such a small default value can easily become a throughput bottleneck and needs to be adjusted in time.
