I recently ran into a problem where the client kept throwing exceptions when connecting to the server. After repeatedly tracing and analyzing the issue and reading all sorts of material, I could not find a single article that clearly explained the two TCP queues involved and how to observe their indicators, so I wrote this article in the hope of clarifying the matter. Discussion and feedback are welcome.

Problem Description

Scenario: a Java client and server communicate over Sockets; the server uses NIO. Symptoms:
Analyzing the Problem

A normal TCP three-way handshake for establishing a connection consists of the following three steps:

1. The client sends a SYN to the server.
2. The server replies with a SYN + ACK.
3. The client replies with an ACK, and the connection is established.
From the problem description, it looked as if the full connection queue (the accept queue, discussed in detail below) was full while TCP connections were being established, especially given symptoms 2 and 4. To prove it, I immediately ran netstat -s | egrep "listen" to check the queue overflow statistics. After checking several times, I saw that the overflowed count kept increasing, so the full connection queue on the server had clearly overflowed.

The next step was to see how the OS handles the overflow. tcp_abort_on_overflow was 0, which means that if the full connection queue is full at the third step of the three-way handshake, the server discards the ACK sent by the client (from the server's point of view the connection has not been established yet).

To prove that the exceptions in the client application were related to the full connection queue, I first changed tcp_abort_on_overflow to 1. A value of 1 means that if the full connection queue is full at the third step, the server sends a RST packet to the client, indicating that both the handshake and the connection are aborted (the connection is never established on the server side). We then tested again and saw many "connection reset by peer" errors in the client's exceptions, which proves that the client errors were caused by the queue overflow (rigorous reasoning is the key to proving the cause quickly).

The developer then looked through the Java source code and found that the default backlog of ServerSocket (the value that controls the size of the full connection queue, described in detail later) is 50. We increased it in the code and ran the stress test again. After more than 12 hours, the error no longer occurred, and the overflowed counter stopped increasing.

To put the fix simply: after the TCP three-way handshake there is an accept queue, and only connections that enter this queue can be handed to the application by accept(). The default backlog is 50, which fills up easily. When it is full, the server ignores the ACK sent by the client in the third step of the handshake (and resends the SYN + ACK of the second step after a while). If a connection never gets into the queue, the client ends up with an exception.

However, I did not want to stop at merely solving the problem; I wanted to review the troubleshooting process and find out which knowledge points I was missing or only half understood. Besides the exception messages shown above, are there more specific indicators that can be checked to confirm this problem?

A Deeper Look at the Connection-Establishment Process and Queues in the TCP Handshake

Two queues are involved in establishing a connection: the SYN queue (semi-connection queue) and the accept queue (full connection queue). In the first step of the handshake, after the server receives the client's SYN, it puts the connection information into the semi-connection queue and replies with SYN + ACK (the second step). In the third step, the server receives the client's ACK. If the full connection queue is not full at that point, the connection information is taken out of the semi-connection queue and put into the full connection queue; otherwise, the server acts according to tcp_abort_on_overflow.
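The tcp_abort_on_overflow setting and the overflow counters themselves can be checked with a few commands. This is a sketch for a typical Linux system; the exact counter wording printed by netstat -s varies a little between versions:

```sh
# Accept-queue overflow counter: run it repeatedly and watch whether it grows.
# (A case-insensitive match, egrep -i, also catches the "SYNs to LISTEN sockets
# dropped" line, which relates to the SYN queue.)
netstat -s | egrep "listen"

# Current overflow behaviour: 0 = silently drop the client's final ACK,
# 1 = reply to the client with a RST.
cat /proc/sys/net/ipv4/tcp_abort_on_overflow

# Temporarily switch to RST so the failure shows up on the client as
# "connection reset by peer" (revert to 0 after the experiment).
sysctl -w net.ipv4.tcp_abort_on_overflow=1
```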
If the full connection queue is full and tcp_abort_on_overflow is 0, the server resends the SYN + ACK to the client after a while (that is, it repeats the second step of the handshake). If the client's timeout is relatively short, the client easily runs into errors. On our OS, the default number of retries of step 2 is 2 (the default on CentOS is 5); this retry count is controlled by the net.ipv4.tcp_synack_retries kernel parameter.

If the TCP Connection Queue Overflows, What Indicators Can Be Seen?

The troubleshooting process above was somewhat roundabout. What would be a faster and clearer way to confirm the problem the next time something similar happens? (Concrete, observable indicators help us absorb and retain the underlying knowledge.)

netstat -s

A line such as "667399 times the listen queue of a socket overflowed" gives the number of times the full connection queue has overflowed. Check it every few seconds: if the number keeps increasing, the full connection queue must occasionally be full.

The ss Command

For a socket in the LISTEN state, the Send-Q column (50 in our case) is the maximum size of the full connection queue on that listen port, and the Recv-Q column is how much of the full connection queue is currently in use. The size of the full connection queue is min(backlog, somaxconn): backlog is passed in when the socket is created, and somaxconn is an OS-level parameter (/proc/sys/net/core/somaxconn). This is where the OS behavior connects back to our code; for example, when Java creates a ServerSocket, it lets you pass in the backlog value (see the sketch at the end of this section). The size of the semi-connection queue is max(64, /proc/sys/net/ipv4/tcp_max_syn_backlog), although this may vary with the OS version.

When writing code we rarely think about this backlog; most of the time we do not pass a value at all (so the default of 50 is used) and simply ignore it. For one thing, it is a blind spot; for another, you may have seen the parameter in an article at some point, formed a vague impression of it, and then forgotten it, because the knowledge was never connected to anything else or made systematic. But if, like me, you first suffer through this problem and then dig out the cause yourself, and if you can trace the "why" from the code layer down to the OS layer, then you have a solid grasp of this knowledge point, and it becomes a powerful tool for growing your own knowledge of TCP and performance.

The netstat Command

Like ss, netstat can display Send-Q and Recv-Q, but for a connection that is not in the LISTEN state they mean something different: Recv-Q is the number of bytes that have been received into the socket buffer but not yet read by the process, and Send-Q is the number of bytes in the send queue that have not yet been acknowledged by the remote host.

The Recv-Q reported by netstat -tn therefore has nothing to do with the full or semi-connection queues. I point this out because it is easy to confuse with the Recv-Q of ss -lnt, and relating the two helps consolidate the knowledge into a system. For example, when a large amount of data piles up in the Recv-Q shown by netstat -t, it is usually because the receiving process cannot read the data fast enough (for example, the CPU is saturated).

The above uses concrete tools and indicators (a matter of engineering efficiency) to understand the full connection queue.
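To make the code side concrete, here is a minimal sketch, not the actual service code from this incident, of where the backlog appears in Java, both for the blocking ServerSocket API and for the NIO ServerSocketChannel used by the server in the problem description; the ports and the backlog of 1024 are arbitrary example values:

```java
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.nio.channels.ServerSocketChannel;

public class BacklogExample {
    public static void main(String[] args) throws Exception {
        // Blocking API: the second constructor argument is the backlog.
        // new ServerSocket(port) without a backlog uses the default of 50,
        // which is exactly what bit us in the incident above.
        ServerSocket server = new ServerSocket(8080, 1024);

        // NIO API: bind(SocketAddress, int backlog) takes the backlog explicitly;
        // bind(SocketAddress) falls back to an implementation-specific default.
        ServerSocketChannel channel = ServerSocketChannel.open();
        channel.bind(new InetSocketAddress(9090), 1024);

        // ... accept and handle connections here ...

        channel.close();
        server.close();
    }
}
```

On Linux this backlog value is eventually passed to the listen() system call, so the effective accept-queue size is still capped at min(backlog, somaxconn); raising only the Java-side value has no effect if net.core.somaxconn is smaller.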
Verifying the Understanding in Practice

Change the backlog in the Java code to 10 (the smaller the value, the easier it is to overflow) and keep the stress test running. The client starts reporting exceptions again; observing the server with the ss command and following the interpretation above, we can see that the maximum size of the full connection queue for the service on port 3306 is 10, yet there are currently 11 connections in, or waiting to enter, the queue. Some connection must therefore fail to enter the queue and overflow, and indeed the overflowed counter keeps increasing at the same time.

Accept Queue Parameters in Tomcat and Nginx

By default Tomcat uses short connections, and its backlog (called acceptCount in Tomcat's connector configuration) defaults to 200 in Ali-tomcat and 100 in Apache Tomcat. Nginx's default backlog is 511. Because Nginx runs in multi-process mode, port 8085 appears several times in the listing: multiple worker processes listen on the same port, which avoids context switching as much as possible and improves performance.

Summary

Overflow of the full connection queue and the semi-connection queue is easy to overlook but critical, especially for applications using short connections (such as Nginx and PHP, although they also support long connections), where the problem erupts more readily. Once overflow occurs, CPU usage and thread state on the server look normal, yet the load cannot be pushed any higher; from the client's point of view the RT is high (RT = network + queuing + actual service time), while the service time recorded in the server logs is very short. Some frameworks, such as the JDK and Netty, use a fairly small default backlog, which can lead to poor performance in some scenarios.

I hope this article helps you understand the concepts, principles, and roles of the semi-connection queue and the full connection queue in the TCP connection process, and, more importantly, which indicators make these problems clearly visible (engineering efficiency reinforces theoretical understanding). Every concrete problem is also the best opportunity to learn; merely reading about a topic is never deep enough. Treasure each concrete problem you encounter and trace it back to its root cause; every one of them is a chance to truly master a specific knowledge point. Finally, here are some related questions to think about:
By raising these questions, I hope this knowledge point can serve as a starting point from which your own knowledge system begins to grow.