In-depth analysis of common three-way handshake exceptions


This article is reprinted from the WeChat public account "Kaida Neigong Xiuxian", written by Zhang Yanfei allen. To reprint this article, please contact the WeChat public account "Kaida Neigong Xiuxian".

Hello everyone, I am Fei Ge!

One of the most important performance metrics for a backend interface is its response time, including the average latency and the TP90 and TP99 latency values. The lower these values are, the better; generally they are a few milliseconds to a few tens of milliseconds. If the response time gets too long, say more than 1 second, users will notice obvious lag. If that persists, they may simply vote with their feet and uninstall our app.

Under normal circumstances, establishing a TCP connection takes a little more than one RTT. But things do not always go so smoothly; there are always surprises. In some cases the handshake may take longer, cost more CPU, or even fail with a timeout.

Today, I will talk about the various abnormal situations related to TCP handshake that I have encountered online.

1. Client connect exception

Port numbers and CPU consumption may not seem to have much to do with each other. But I have encountered a situation where CPU consumption increased significantly due to insufficient port numbers. Let Fei Ge analyze why this problem occurs!

When the client initiates the connect system call, the main task is port selection (see How is the client's port number determined in a TCP connection?).

In the selection process, there is a large loop that starts from a random position in ip_local_port_range and traverses the range. When an available port is found, the loop is exited. If the ports are sufficient, the loop only needs to be executed a few times to exit. But suppose that many ports are consumed and are no longer sufficient, or there are no available ports at all. Then this loop has to be executed many times. Let's take a look at the detailed code.

    //file: net/ipv4/inet_hashtables.c
    int __inet_hash_connect(...)
    {
        inet_get_local_port_range(&low, &high);
        remaining = (high - low) + 1;

        for (i = 1; i <= remaining; i++) {
            // offset is a random number
            port = low + (i + offset) % remaining;
            head = &hinfo->bhash[inet_bhashfn(net, port,
                    hinfo->bhash_size)];

            // lock
            spin_lock(&head->lock);

            // a long section of port-selection logic
            // ......
            // if selection succeeds, goto ok
            // if unsuccessful, goto next_port

    next_port:
            // unlock
            spin_unlock(&head->lock);
        }
    }

In each iteration of the loop, the kernel has to wait on a lock and perform several hash-table lookups. Note that this is a spin lock, a non-blocking lock: if the resource is held by someone else, the process is not put to sleep but keeps occupying the CPU while repeatedly trying to acquire the lock.

But suppose the port range ip_local_port_range is configured as 10000 - 30000 and has been exhausted. Then every time a connection is initiated, the loop has to run roughly 20,000 times before exiting. This means a large amount of hash lookup and spin-lock waiting overhead, and system CPU usage rises significantly.

This is the normal connect system call duration captured online, which is 22 us (microseconds).

This is the connection overhead of one of our servers when there are not enough ports, which is 2581 us (microseconds).

From the above two figures, we can see that the connect call under abnormal conditions takes more than 100 times as long as under normal conditions. Although it is only a little over 2 ms when converted into milliseconds, keep in mind that this is all CPU time.
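The two figures above were captured with kernel-level tracing in production. If you just want a rough user-space approximation, a sketch like the following (my own example, not the author's tooling; the target 127.0.0.1:8080 is arbitrary) times a single connect() call. Note that it measures the entire blocking call, not only the in-kernel port-selection loop.

    // Rough sketch: time one connect() call from user space with clock_gettime().
    // This measures the whole blocking call (including the network round trip),
    // not just the kernel's port-selection work; 127.0.0.1:8080 is an example target.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(8080);
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        int ret = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        clock_gettime(CLOCK_MONOTONIC, &end);

        long us = (end.tv_sec - start.tv_sec) * 1000000L +
                  (end.tv_nsec - start.tv_nsec) / 1000L;
        printf("connect() returned %d after %ld us\n", ret, us);

        close(fd);
        return 0;
    }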

2. Packet loss during the first handshake

When the server receives the first handshake (SYN) packet from the client, it checks whether the half-connection queue or the full connection queue has overflowed. If either has overflowed, the handshake packet may be discarded directly, with no feedback to the client. Let's look at each case in turn.

2.1 The half-connection queue is full

Let's look at the circumstances under which the half-connection queue can cause packet loss.

    //file: net/ipv4/tcp_ipv4.c
    int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
    {
        // check whether the half-connection queue is full
        if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
            want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
            if (!want_cookie)
                goto drop;
        }

        // check whether the full connection queue is full
        ...

    drop:
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
        return 0;
    }

In the above code, if inet_csk_reqsk_queue_is_full returns true, the half-connection queue is full. tcp_syn_flood_action then checks whether the kernel parameter tcp_syncookies is enabled; if it is not, the function returns false.

    //file: net/ipv4/tcp_ipv4.c
    bool tcp_syn_flood_action(...)
    {
        bool want_cookie = false;

        if (sysctl_tcp_syncookies) {
            want_cookie = true;
        }
        return want_cookie;
    }

That is to say, if the half-connection queue is full and net.ipv4.tcp_syncookies is set to 0, the handshake packet from the client goes to drop, meaning it is discarded outright!

A SYN flood attack works by filling up the server's half-connection queue so that connection requests from normal users get no response. However, in current Linux kernels, as long as tcp_syncookies is enabled, the handshake can still complete normally even when the half-connection queue is full.

2.2 The full connection queue is full

Right after the half-connection queue check, there is a check on whether the full connection queue is full. If that condition is met, the server also jumps to drop and discards the handshake packet. Let's look at the source code:

    //file: net/ipv4/tcp_ipv4.c
    int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
    {
        // check whether the half-connection queue is full
        ...

        // check whether the full connection queue is full
        if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1) {
            NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
            goto drop;
        }
        ...
    drop:
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENDROPS);
        return 0;
    }

sk_acceptq_is_full is used to determine whether the full connection queue is full, and inet_csk_reqsk_queue_young is used to determine whether there is a young_ack (unprocessed half-connection request).

From this code, we can see that if the full connection queue is full and there is a young_ack at the same time, the kernel will also directly discard the SYN handshake packet.

2.3 Client initiates retry

Assume that a full/half connection queue overflow occurs on the server side, causing packet loss. From the client's perspective, the SYN packet does not receive any response.

Fortunately, the client starts a retransmission timer when it sends the handshake packet. If the expected SYN+ACK is not received, the timeout retransmission logic kicks in. However, the retransmission timer works on a scale of seconds, which means that once a handshake retransmission happens, even if the very first retransmission succeeds, the fastest possible interface response time is already over 1 second. This has a huge impact on interface latency.

Let's take a closer look at the retransmission logic. The client starts the retransmission timer right after sending the SYN in connect.

    //file: net/ipv4/tcp_output.c
    int tcp_connect(struct sock *sk)
    {
        ...
        // actually send the SYN
        err = tp->fastopen_req ? tcp_send_syn_data(sk, buff) :
              tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);

        // start the retransmission timer
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
    }

The inet_csk(sk)->icsk_rto passed in the timer setting is the timeout period, which is initially set to 1 second.

    //file: net/ipv4/tcp_output.c
    void tcp_connect_init(struct sock *sk)
    {
        // initialize to TCP_TIMEOUT_INIT
        inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
        ...
    }

    //file: include/net/tcp.h
    #define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))

In some older versions of the kernel, such as 2.6, the initial value of the retransmission timer is 3 seconds.

    // kernel version: 2.6.32
    //file: include/net/tcp.h
    #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))

If the synack response from the server is received normally, the client's timer will be cleared. This logic is in tcp_rearm_rto. (tcp_rcv_state_process -> tcp_rcv_synsent_state_process -> tcp_ack -> tcp_clean_rtx_queue -> tcp_rearm_rto)

    //file: net/ipv4/tcp_input.c
    void tcp_rearm_rto(struct sock *sk)
    {
        inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
    }

If packet loss occurs on the server side, the callback function tcp_write_timer will be used for retransmission after the timer expires.

In fact, this code handles not only handshake retransmission but also timeout retransmission for established connections. Here, however, we only discuss handshake retransmission.

    //file: net/ipv4/tcp_timer.c
    static void tcp_write_timer(unsigned long data)
    {
        tcp_write_timer_handler(sk);
        ...
    }

    void tcp_write_timer_handler(struct sock *sk)
    {
        // get the timer type
        event = icsk->icsk_pending;

        switch (event) {
        case ICSK_TIME_RETRANS:
            icsk->icsk_pending = 0;
            tcp_retransmit_timer(sk);
            break;
        ......
        }
    }

tcp_retransmit_timer is the main retransmission function. It performs the retransmission and also sets the timer's next expiration time.

    //file: net/ipv4/tcp_timer.c
    void tcp_retransmit_timer(struct sock *sk)
    {
        ...

        // give up if the retransmission limit has been exceeded
        if (tcp_write_timeout(sk))
            goto out;

        // retransmit
        if (tcp_retransmit_skb(sk, tcp_write_queue_head(sk)) > 0) {
            // retransmission failed
            ......
        }

        // reset the next timeout before exiting
    out_reset_timer:
        // calculate the timeout
        if (sk->sk_state == TCP_ESTABLISHED) {
            ......
        } else {
            icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
        }

        // arm the timer
        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
    }

tcp_write_timeout is used to determine whether there are too many retries. If so, the retry logic is exited.

The logic in tcp_write_timeout is actually a little involved. For SYN handshake packets, the main basis is net.ipv4.tcp_syn_retries, but rather than simply comparing a retry count, it compares elapsed time. So if the actual number of retransmissions you observe online does not match the corresponding kernel parameter, don't be too surprised.

Then tcp_retransmit_timer resends the first element in the send queue and sets the next timeout to twice the previous one (the left shift by 1 in the code is equivalent to multiplying by 2).

2.4 Actual packet capture results

Let's take a look at a screenshot of the handshake process where the server loses packets in response to the first handshake.

From the figure, we can see that the client retried the handshake for the first time 1 second after the original SYN. Since there was still no response, it retransmitted again at 3 s, 7 s, 15 s, 31 s, and 63 s, six retries in total (my tcp_syn_retries was set to 6 at the time).

If the half/full connection queue overflows and causes packet loss during the first handshake on our server, then our interface response time will be at least 1 second (on some older kernel versions, the first SYN retry will take 3 seconds). If the handshake fails for two or three consecutive times, then 7 or 8 seconds will pass. Do you think this will have a big impact on users?
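To make the cumulative cost concrete, here is a small sketch (mine, not from the original article) that prints the retransmission schedule implied by an initial RTO of 1 second that doubles on every retry; the retry count of 6 and the 120-second TCP_RTO_MAX cap are assumed values matching the kernel code quoted above.

    // Sketch: print the SYN retransmission schedule implied by a 1-second initial
    // RTO that doubles after each retry, capped at TCP_RTO_MAX (120 s). The retry
    // count of 6 is an assumed net.ipv4.tcp_syn_retries value.
    #include <stdio.h>

    int main(void)
    {
        const int tcp_timeout_init = 1;   /* seconds, TCP_TIMEOUT_INIT on newer kernels */
        const int tcp_rto_max = 120;      /* seconds, TCP_RTO_MAX */
        const int syn_retries = 6;        /* assumed net.ipv4.tcp_syn_retries */

        int rto = tcp_timeout_init;
        int elapsed = 0;

        for (int i = 1; i <= syn_retries; i++) {
            elapsed += rto;               /* wait one RTO, then retransmit */
            printf("retry %d: %d s after the original SYN\n", i, elapsed);
            rto = (rto * 2 > tcp_rto_max) ? tcp_rto_max : rto * 2;  /* icsk_rto << 1, capped */
        }
        return 0;
    }

Running it prints retries at 1 s, 3 s, 7 s, 15 s, 31 s, and 63 s after the original SYN, which matches the capture above.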

3. The third handshake packet loss

When the client receives the SYN+ACK from the server, it considers the connection established, sets its own state to ESTABLISHED, and sends the third handshake packet (the ACK). However, accidents can also happen during this third handshake.

    //file: net/ipv4/tcp_ipv4.c
    struct sock *tcp_v4_syn_recv_sock(struct sock *sk, ...)
    {
        // check whether the accept (full connection) queue is full
        if (sk_acceptq_is_full(sk))
            goto exit_overflow;
        ...
    exit_overflow:
        NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
        ...
    }

From the above code, we can see that during the third handshake, if the server's full connection queue is full, the ack handshake packet from the client is directly discarded.

This is easy to understand if you think about it: once the three-way handshake completes, the request is supposed to be moved into the full connection queue. If that queue is full, the handshake still cannot succeed.

But what's interesting is that when the third handshake is lost, recovery is not driven by the client retrying; instead, the server retransmits the SYN+ACK.

Let's look at a real case with an actual packet capture. I wrote a simple server that only listens but never calls accept, then used one client to fill up its connection queue. The capture below shows what happens when another client then initiates a connection to it.
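For reference, a minimal sketch of such a listen-but-never-accept server looks like the following (my own reconstruction, not the author's original program; port 8080 and the backlog of 2 are arbitrary example values). Point enough clients at it and the full connection queue fills up almost immediately.

    // Minimal sketch of a server that listens but never calls accept(), so
    // completed connections pile up in the full connection (accept) queue.
    // Port 8080 and backlog 2 are arbitrary example values.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); exit(1); }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }

        // small backlog so the accept queue fills quickly
        if (listen(fd, 2) < 0) { perror("listen"); exit(1); }

        // never call accept(); just sleep forever
        for (;;)
            pause();
    }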

The first red box is the third handshake (the ACK). In fact, this packet has already been discarded on the server side, but the client is not aware of that and blithely assumes the three-way handshake is complete. Fortunately, the request created during the first handshake is still recorded in the server's half-connection queue.

The server waits until the half-connection timer expires and then retransmits the SYN+ACK to the client; after receiving it, the client replies with the third handshake ACK again. If the server's full connection queue stays full during this period, the server gives up after 5 retries (controlled by the kernel parameter net.ipv4.tcp_synack_retries).

In this case, you should also pay attention to another problem. In practice, the client often thinks that the connection is successfully established and starts sending data. In fact, the connection has not been established yet. The data it sends, including retries, will be ignored by the server until the connection is actually established.

4. Conclusion

One of the criteria for measuring whether an engineer is good is whether he can locate and handle various problems that occur online. Even a seemingly simple TCP three-way handshake may lead to various accidents in engineering practice. If you don't have a deep understanding of the handshake, you may not be able to handle various faults that occur online.

Today's article mainly describes the situations when there are insufficient ports, the half-connection queue is full, and the full-connection queue is full.

When there are not enough ports, the connect system call will cause too many spin lock waits and hash searches, which will increase CPU overhead. In severe cases, the CPU will be exhausted, affecting the execution of user business logic. There are several ways to deal with this problem.

Increase the available port range by adjusting ip_local_port_range (a small sketch for checking the current range follows below)

Reuse connections: use long-lived connections to cut down on frequent handshakes

A third option, which works but is not recommended, is to enable tcp_tw_reuse and tcp_tw_recycle
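For the first option, the sketch below (my own example, not from the article) shows where the range lives and how many ephemeral ports it provides; it simply reads /proc/sys/net/ipv4/ip_local_port_range.

    // Sketch: read the configured ephemeral port range from
    // /proc/sys/net/ipv4/ip_local_port_range and print how many local ports
    // are available for outgoing connections.
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
        if (!f) {
            perror("fopen");
            return 1;
        }

        int low = 0, high = 0;
        if (fscanf(f, "%d %d", &low, &high) != 2) {
            fprintf(stderr, "unexpected format\n");
            fclose(f);
            return 1;
        }
        fclose(f);

        printf("ip_local_port_range: %d - %d (%d usable ports)\n",
               low, high, high - low + 1);
        return 0;
    }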

The server may lose packets during the first handshake, which may happen in the following two situations.

The half-connection queue is full and tcp_syncookies is 0

The full connection queue is full and there are unfinished half-connection requests

In both cases, from the client's perspective, it looks no different from a network outage: the SYN it sent gets no response at all, and the handshake packet is retransmitted when the timer expires. The first retransmission happens after 1 second, and the waiting interval then doubles each time: 2 seconds, 4 seconds, 8 seconds... The total number of retransmissions is influenced by the net.ipv4.tcp_syn_retries kernel parameter (note that I said influenced, not determined).

The server may also run into problems during the third handshake. If the full connection queue is full, packet loss still occurs. However, when the third handshake fails, only the server knows about it (the client mistakenly believes the connection has been established). The server retransmits the SYN+ACK based on the handshake information kept in its half-connection queue; the number of retries is controlled by net.ipv4.tcp_synack_retries.

Once these connection queue overflow problems appear in your production environment, your service will be seriously affected. Even if the first retry succeeds, your interface response time jumps straight to 1 second (3 seconds on old kernels). If the retry fails two or three times, Nginx is likely to report an access timeout.

Because handshake retries have a great impact on our services, it is necessary to deeply understand these abnormal situations in the three-way handshake. Let's talk about how we should deal with the problem of packet loss.

Method 1: Enable syncookies

In modern Linux kernels, we can turn on tcp_syncookies to keep excessive requests, including SYN flood attacks, from overwhelming the half-connection queue, thereby solving the packet loss caused by that queue filling up on the server.

Method 2: Increase the connection queue length

In "Why do server programs need to listen first?", we discussed that the length of the full connection queue is min(backlog, net.core.somaxconn) and the length of the semi-connection queue is. The length of the semi-connection queue is a little complicated, which is min(backlog, somaxconn, tcp_max_syn_backlog) + 1 rounded up to the power of 2, but the minimum cannot be less than 16.

If you need to increase the full/half connection queue length, please adjust one or more of the above parameters to achieve the goal. As long as the queue length is appropriate, the probability of handshake exceptions can be greatly reduced.

Method 3: Accept as soon as possible

Although this is generally not a problem, it is still worth paying attention to. Your application should accept new connections as soon as possible after the handshake completes; don't let it get so busy with other business logic that the full connection queue fills up.
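A minimal sketch of this pattern (my own example, not from the article) is a dedicated accept loop that drains the queue immediately and hands every connection to a worker thread; port 8080 and the backlog of 128 are arbitrary values.

    // Sketch of a dedicated accept loop that drains the accept queue quickly and
    // hands each connection off to a worker thread, so business logic never
    // blocks accept(). Error handling is minimal; port 8080 is an example value.
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void *worker(void *arg)
    {
        int conn = (int)(long)arg;
        // ... handle the request here (business logic) ...
        close(conn);
        return NULL;
    }

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(fd, 128) < 0) {
            perror("bind/listen");
            exit(1);
        }

        for (;;) {
            int conn = accept(fd, NULL, NULL);   // accept as soon as possible
            if (conn < 0)
                continue;

            pthread_t tid;
            pthread_create(&tid, NULL, worker, (void *)(long)conn);
            pthread_detach(tid);                 // the worker handles the rest
        }
    }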

Method 4: Minimize the number of TCP connections

If the above methods do not solve your problem, it means that the TCP connection requests on your server are too frequent. At this time, you should consider whether you can use long connections instead of short connections to reduce the overly frequent three-way handshakes. This method can not only solve the possibility of handshake problems, but also cut down on the various memory, CPU, and time overheads of the three-way handshake, which is also very helpful in improving performance.
