Recently, a strange phenomenon showed up during project testing: in the test environment, calls to a backend HTTP service made through Apache HttpClient took close to 39.2ms on average.
At first glance you might think this is normal; what is so strange about it? Actually, it is not. Some background first: the backend HTTP service has no business logic at all. It simply converts a string to uppercase and returns it, and the string is only 100 characters long. On top of that, the network Ping latency is only about 1.9ms. In theory the call should take about 2-3ms, so why does it average 39.2ms?

(Figure: call latency and Ping latency)

Because of my day job, call-latency problems are nothing new to me. I often help teams troubleshoot timeout issues with our internal RPC framework, but this was the first time I had run into a slow plain HTTP call. The troubleshooting routine is the same, though: the main methodologies are outside-in, top-down, and so on. Let's start with some peripheral indicators and see whether they offer any clues.

Peripheral indicators

System indicators

Start with the basic system indicators on both the calling and the called machine, such as load and CPU. A single top command gives a clear view of them. I confirmed that CPU and load were both idle on both sides. Since I didn't take a screenshot at the time, I won't post one here.

Process indicators

For a Java program, the process indicators to check are mainly GC and thread stacks (again, on both the caller and the callee). Young GC was rare and took less than 10ms, so there were no long stop-the-world pauses. Since the average call takes 39.2ms, which is quite long, the thread stacks should reveal something if the time were being spent in code. After looking at them, I found nothing: the service-related threads were mostly pool threads waiting for tasks, which means they were not busy at all.

Feeling out of tricks? What should you do next?

Local reproduction

If a problem can be reproduced locally (my local machine runs macOS), it helps troubleshooting a lot. So I wrote a simple local test program that calls the backend HTTP service directly through Apache HttpClient and found the average latency to be around 55ms. Why does that differ from the 39.2ms seen in the test environment? Mainly because my machine and the test-environment backend are in different regions, with a Ping latency of about 26ms, which adds to the total. Still, something is clearly wrong locally too: with a 26ms Ping latency and a backend whose logic takes almost no time, the average local call should be around 26ms. Why is it 55ms?

Getting more and more confused, with no idea where to start? At this point you might suspect Apache HttpClient itself. So I wrote another simple test using the JDK's built-in HttpURLConnection, and the result was the same.

Diagnosis

Judging from the system indicators, the process indicators, and the local reproduction, we can roughly rule out a problem in the program itself. What about the TCP level? Anyone with network programming experience will know which TCP option can cause this kind of behavior. Yes, you guessed it: TCP_NODELAY. So which side, the caller or the callee, has not set it?
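At the plain socket level in Java, this option is controlled through Socket.setTcpNoDelay(); whether it ends up enabled depends on whether the HTTP library sets it for you. A minimal sketch, with a placeholder host and port:

    import java.net.Socket;

    public class NoDelayExample {
        public static void main(String[] args) throws Exception {
            // Placeholder host and port, just to illustrate the API
            try (Socket socket = new Socket("example.com", 80)) {
                // true disables the Nagle algorithm, false enables it
                socket.setTcpNoDelay(true);
                System.out.println("TCP_NODELAY enabled: " + socket.getTcpNoDelay());
            }
        }
    }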
The caller uses Apache HttpClient, whose tcpNoDelay setting defaults to true. Now let's look at the callee, our backend HTTP service, which is built on the HttpServer that ships with the JDK:
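A minimal sketch of such an uppercase-echo service built on the JDK's com.sun.net.httpserver.HttpServer might look like this (the port and context path are assumptions, not the project's actual values):

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class UpperCaseServer {
        public static void main(String[] args) throws Exception {
            // Bind the JDK built-in HTTP server to an assumed port
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/upper", exchange -> {
                // Read the request body and convert it to uppercase
                byte[] body = exchange.getRequestBody().readAllBytes();
                byte[] resp = new String(body, StandardCharsets.UTF_8)
                        .toUpperCase().getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, resp.length);
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(resp);
                }
            });
            server.start();
        }
    }

Notice that nothing here exposes TCP_NODELAY directly, which is exactly what the next step digs into.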
I didn't see any interface for setting tcpNoDelay directly, so I went through the source code. There it is: in the ServerConfig class, a static block reads the startup parameters, and ServerConfig.noDelay defaults to false:
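In outline, the static initializer in sun.net.httpserver.ServerConfig reads the setting from a system property, and Boolean.getBoolean() returns false when the property is absent. A simplified paraphrase (the surrounding code differs between JDK versions):

    // Simplified paraphrase of the JDK-internal sun.net.httpserver.ServerConfig
    class ServerConfig {
        static boolean noDelay;

        static {
            // Reads -Dsun.net.httpserver.nodelay; Boolean.getBoolean returns false
            // when the property is absent, so TCP_NODELAY is off by default
            noDelay = Boolean.getBoolean("sun.net.httpserver.nodelay");
        }

        static boolean noDelay() {
            return noDelay;
        }
    }

Because the property is not set by default, the Nagle algorithm stays enabled on the server's sockets.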
Verify

Add the startup parameter "-Dsun.net.httpserver.nodelay=true" to the backend HTTP service and try again. The effect is obvious: the average time drops from 39.2ms to 2.8ms.

(Figure: call latency after the optimization)

The problem is solved, but stopping here would sell this case short and waste a good learning opportunity, because plenty of questions remain.
Let's strike while the iron is hot.

Questions and answers

① What is TCP_NODELAY?

In socket programming, the TCP_NODELAY option controls whether the Nagle algorithm is enabled. In Java, true turns the Nagle algorithm off and false turns it on. Which raises the next question: what is the Nagle algorithm?

② What is Nagle's algorithm?

Nagle's algorithm improves the efficiency of TCP/IP networks by reducing the number of packets sent over the network. It is named after its inventor, John Nagle, who first used it in 1984 to mitigate network congestion at Ford Aerospace. Imagine an application that generates data 1 byte at a time and sends each byte to the remote server as its own packet: the sheer number of packets can easily overload the network. In this worst case, a packet carrying only 1 byte of payload still needs 40 bytes of headers (20 bytes of IP header plus 20 bytes of TCP header), so payload utilization is extremely low. The algorithm itself is fairly simple. The following is the pseudo code:
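A commonly cited form of the pseudocode is:

    if there is new data to send then
        if the window size >= MSS and available data >= MSS then
            send a full MSS-sized segment immediately
        else
            if there is unacknowledged data still in flight then
                buffer the new data until an ACK arrives
            else
                send the data immediately
            end if
        end if
    end if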
In concrete terms: if the data to send fills a maximum segment (MSS), send it immediately; otherwise, if there is still unacknowledged data in flight, buffer the small segment until an ACK comes back or enough data accumulates to fill a segment; if nothing is unacknowledged, send the small segment right away.
③ What is Delayed ACK?

To guarantee reliable transmission, TCP requires the receiver to acknowledge every data packet it receives. A bare acknowledgment is relatively expensive (20 bytes of IP header plus 20 bytes of TCP header carrying no payload). TCP Delayed ACK (delayed acknowledgment) was designed to reduce this cost and improve network performance: it combines several ACKs into a single response, or piggybacks the ACK on outgoing response data, thereby reducing protocol overhead. Roughly, it works like this: when a packet arrives and there is response data to send, the ACK goes out immediately together with that data; if there is no response data, the ACK is delayed (up to about 40ms on Linux) in the hope that response data will show up to carry it; and if a second packet from the peer arrives while the ACK is being delayed, the ACK is sent immediately.
However, if three packets from the peer arrive one after another, whether an ACK is sent immediately when the third segment arrives depends on the latter two rules above.

④ What happens when Nagle and Delayed ACK are combined?

Each of Nagle and Delayed ACK improves network transmission efficiency on its own, but using them together can have the opposite effect. Consider a scenario where A and B are exchanging data: A runs the Nagle algorithm and B runs Delayed ACK. A sends a packet to B, and because of Delayed ACK, B does not respond immediately. Because A is running Nagle, it keeps waiting for B's ACK and will not send its second packet until that ACK arrives. If those two packets belong to the same request, the request is delayed by roughly 40ms.

⑤ Let's capture some packets

Let's capture packets to verify this. Running the following on the backend HTTP service machine is enough to complete the capture.
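A typical tcpdump command for this kind of capture (the interface name, service port, and output path below are assumptions):

    # Capture traffic on the service port and write it to a file for Wireshark
    tcpdump -i eth0 -w /tmp/http_capture.pcap tcp port 8080

Copy the resulting .pcap file to your workstation and open it in Wireshark.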
The figure below shows the capture analyzed in Wireshark; the red box marks one complete POST request.

(Figure: test environment packet analysis)

The gap between packet number 130 and packet number 149 is about 41ms (0.1859 - 0.1448 = 0.0411s). This is the interaction of Nagle and Delayed ACK at work: 10.48.159.165 is running Delayed ACK while 10.22.29.180 is running the Nagle algorithm, so 10.22.29.180 sits waiting for an ACK that 10.48.159.165 is holding back, and roughly 40ms is lost. This also explains why the test environment averages 39.2ms: most of it is the roughly 40ms Delayed ACK wait.

But in the local reproduction, why is the average latency 55ms rather than the 26ms Ping latency? Let's capture packets there too. In the figure below, the red box again marks one complete POST request.

(Figure: local environment packet analysis)

The gap between packet number 8 and packet number 9 is about 25ms. Subtracting half of the Ping round trip, about 13ms of network delay, leaves roughly 12ms, so the Delayed ACK on the local machine is about 12ms (macOS and Linux use different delayed-ACK defaults).
⑥ Why does TCP_NODELAY solve the problem?

tcpNoDelay disables the Nagle algorithm: the next packet is sent even if the ACK for the previous one has not arrived, which breaks the waiting game with Delayed ACK. In network programming it is generally strongly recommended to enable tcpNoDelay to improve response times. You could also solve the problem by tuning the system's Delayed ACK settings, but since changing machine-level configuration is inconvenient, that approach is not recommended.

Summary

This article walked through troubleshooting a simple HTTP call with unexpectedly high latency: first analyzing the problem from the outside in, then locating the cause and verifying the fix, and finally explaining Nagle's algorithm and Delayed ACK in TCP transmission to understand the case thoroughly.