In-depth understanding of Linux TCP three-way handshake


Author | zorrozou

Preface

The TCP protocol seems familiar to everyone, yet also unfamiliar. Familiar because we use it almost every day, and everyone has at least some notion of concepts such as the three-way handshake, the four-way close, sliding windows, slow start, congestion avoidance, and congestion control. Unfamiliar because the protocol is genuinely complex: the network environment keeps changing while it runs, and TCP's mechanisms adapt to those changes in different ways, so explaining its concepts and behavior clearly is really not easy.

This series of articles tries to explain some details of the TCP implementation on Linux from a different perspective. My ability is limited, so please forgive any parts that remain unclear. This article starts with the three-way handshake that establishes a TCP connection, and I hope it is helpful to you. The kernel code in this article is based on Linux 5.3.

What is reliable and connection-oriented?

When it comes to TCP, the first thing to mention is that it is a connection-oriented, reliable transport layer protocol; by contrast, UDP is connectionless and unreliable. In fact, IP delivery itself is connectionless and unreliable, and UDP merely adds transport layer port encapsulation on top of IP, so it naturally inherits IP's delivery quality. TCP is complex precisely because it has to implement a connection-oriented, reliable transport layer protocol on top of a connectionless, unreliable IP layer. So we first need to understand, from an engineering perspective, what connection-oriented means and what reliable means; only then can we understand why TCP is so complex.

Let’s first outline these issues:

What is connection oriented:

Connection: the data transmitted within a connection carries shared state; for example, each end needs to know whether the other end is currently waiting to send or to receive, and relationships within the transmitted data, such as the order of the byte stream, must be maintained. A typical example is a phone call.

Connectionless: there is no need to care whether the other end is online. Each data segment sent is an independent entity; there is no relationship between one segment and the next, and no relationship needs to be maintained. A typical example is text messaging.

What is reliable:

Reliable mainly means that data is neither corrupted nor lost in transit, i.e., it is guaranteed to arrive correctly. Without that guarantee, the protocol is unreliable.

How to solve the connection-oriented problem:

Use three phases, establishing the connection, transmitting data, and tearing the connection down, to create a long-lived transmission mechanism in which all data sent within the same connection shares context. The following concepts are therefore derived:

  • A seq sequence number field must be maintained to preserve the order of the data, ensure in-order delivery, and detect duplicate segments.
  • Packets carrying special flag bits are needed to create, tear down, and maintain a connection: SYN, ACK, FIN, RST.

How to solve reliability issues:

Introduce an acknowledgment mechanism for data transmission: after sending data, wait for the other side to confirm it. This requires maintaining the Acknowledgement field and ack state, which gives us the stop-and-wait protocol.

The acknowledgment mechanism (stop-and-wait) causes poor bandwidth utilization. How do we solve that? Introduce window-based acknowledgment and a sliding window: instead of confirming after every single packet, confirm after sending a batch of packets.

After introducing the window, how do we choose different window sizes on networks with different delays? The solution is to introduce window variables and window advertisement:

The sender maintains:

  • The offset that has been sent and acknowledged (left edge of the window)
  • The offset that has been sent but not yet acknowledged (the current send position within the window)
  • The offset that is yet to be sent (right edge of the window)

The receiver maintains:

  • The offset that has been received and acknowledged (left edge of the window)
  • The window size it can still receive (right edge of the window)

The receiver replies to the sender with an ACK that carries the latest window size, so that the sender can adjust its window accordingly. This is also where SACK (selective acknowledgment) and the persist timer, used when the advertised window is 0, come in.

Once the sliding window is introduced, bandwidth can be fully utilized, but the network environment is complex and heavy traffic can cause congestion at any time, so a congestion control mechanism is needed: when congestion occurs, TCP should ensure that bandwidth is shared fairly among TCP connections. That means connections occupying a lot of bandwidth should be adjusted downward and connections occupying little bandwidth should be allowed to grow, so that resources end up being shared fairly.

Congestion control adjusts bandwidth usage by adjusting the size of the sliding window, so a new variable, cwnd (congestion window), is introduced at the sending end to reflect the network's current transmission capacity, while the previously advertised window is denoted awnd. The window actually available to the sender is then the smaller of cwnd and awnd.
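To make that last relation concrete, here is a tiny sketch (purely illustrative; the variable and function names are made up for this example and this is not kernel code) of how much new data a sender could still put on the wire:

    #include <stdio.h>

    /* Illustrative only: the usable send window is min(cwnd, awnd),
     * minus the data that is already in flight. */
    static unsigned int usable_send_window(unsigned int cwnd_bytes,
                                           unsigned int awnd_bytes,
                                           unsigned int bytes_in_flight)
    {
        unsigned int wnd = cwnd_bytes < awnd_bytes ? cwnd_bytes : awnd_bytes;

        return wnd > bytes_in_flight ? wnd - bytes_in_flight : 0;
    }

    int main(void)
    {
        /* cwnd of 10 * 1460 bytes, peer advertises 64240 bytes, 8760 bytes unacked */
        printf("%u\n", usable_send_window(14600, 64240, 8760)); /* prints 5840 */
        return 0;
    }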

There are many questions and concepts that arise from this, such as: How to determine the actual advertised window size? What is slow start? How does the congestion avoidance process work? How does congestion control work? And so on...

The fundamental reason why TCP is complex is to solve these problems in engineering. Now that we have outlined the ideas, let's first look at what the three-way handshake is for.

Why three times?

Why three handshakes, not two, or four, or any other number?

First we need to understand the purpose of establishing a connection. There are two purposes:

  • Confirm that the other party is online, that is, that it can respond immediately when I send a request. (Connection-oriented)
  • If a lot of data is to be transmitted, the order of the segments must be guaranteed, so the starting sequence number used in this connection must be agreed on. Because data flows in both directions, each side must confirm the other side's starting sequence number.

Given the second purpose, we can see why a two-way handshake is not enough: at minimum, one side cannot be sure the other side knows its starting sequence number. Suppose I am the server: the other side sends me its sequence number in a SYN, and I send my own sequence number back, but what if the packet I sent is lost? I cannot tell whether the other side received it, so I need the other side to confirm to me that it really did.

A four-way handshake is not impossible, it is just needlessly long-winded, so three is the most reasonable number. Following common practice, let's use the classic diagram to look at the three-way handshake process.

When interviewing people I often ask a rather silly question at this point: if server A, after receiving the SYN sent by client B and replying with SYN+ACK, then receives an ACK packet sent from a different client C, will A establish an ESTABLISHED connection with C?

If drawn as a picture, it looks like this:

The reason this question is rather silly is that most people think it won't work, but when asked why, few can actually answer. So why? It is actually very simple: the new client C's ip+port form a four-tuple the server has never seen, and per the TCP protocol a bare ACK that matches no connection in progress is answered with RST, so naturally no connection is created. This leads to a real question, though: the kernel must be able to tell whether an incoming ACK belongs to a connection that has already gone through SYN and SYN+ACK, or is simply the first packet it has ever seen from that peer. The kernel looks the packet up by its four-tuple. This lookup happens in tcp_v4_rcv(), the main entry point for TCP receive processing, which calls __inet_lookup() to do the search.

    static inline struct sock *__inet_lookup(struct net *net,
                                             struct inet_hashinfo *hashinfo,
                                             struct sk_buff *skb, int doff,
                                             const __be32 saddr, const __be16 sport,
                                             const __be32 daddr, const __be16 dport,
                                             const int dif, const int sdif,
                                             bool *refcounted)
    {
        u16 hnum = ntohs(dport);
        struct sock *sk;

        sk = __inet_lookup_established(net, hashinfo, saddr, sport,
                                       daddr, hnum, dif, sdif);
        *refcounted = true;
        if (sk)
            return sk;
        *refcounted = false;
        return __inet_lookup_listener(net, hashinfo, skb, doff, saddr,
                                      sport, daddr, hnum, dif, sdif);
    }

The search happens in two steps: first check whether the connection exists in the established table, then check the listener table; if neither matches, send_reset directly. Once the connection is found, if it is in TCP_ESTABLISHED state the data is received directly in tcp_rcv_established(); otherwise tcp_rcv_state_process() is entered to handle the various TCP states. For the first handshake the socket is in TCP_LISTEN state and we enter:

acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;

Here conn_request is tcp_v4_conn_request(), and the first handshake is handled in this function. For the third handshake, the TCP state should be TCP_SYN_RECV.

While the server is in TCP_SYN_RECV state it has to keep the contents of the client's SYN in a cache so that it can be found when processing the subsequent packets; this occupies part of the slab cache. The cache has an upper limit in the kernel: /proc/sys/net/ipv4/tcp_max_syn_backlog limits how many entries it can hold. Under normal conditions this value determines how many connections in TCP_SYN_RECV state TCP can maintain at the same time, i.e., the number of half-open connections on the server. On a typical server it defaults to roughly 1024-2048; the default is derived from the total memory size, so the more memory, the larger the value.

What happens when the half-open connection queue is exhausted? We can find the answer in the kernel. In tcp_conn_request() we see the following:

    if (!want_cookie && !isn) {
        /* Kill the following clause, if you dislike this way. */
        if (!net->ipv4.sysctl_tcp_syncookies &&
            (net->ipv4.sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
             (net->ipv4.sysctl_max_syn_backlog >> 2)) &&
            !tcp_peer_is_proven(req, dst)) {
            /* Without syncookies last quarter of
             * backlog is filled with destinations,
             * proven to be alive.
             * It means that we continue to communicate
             * to destinations, already remembered
             * to the moment of synflood.
             */
            pr_drop_req(req, ntohs(tcp_hdr(skb)->source),
                        rsk_ops->family);
            goto drop_and_release;
        }

        isn = af_ops->init_seq(skb);
    }

Here are some related concepts:

  • What is a syncookie?
  • What is inet_csk_reqsk_queue_len(sk)?

We will discuss the syncookie mechanism in detail later. For now we only need this conclusion: when syncookies are enabled, the half-open connection queue can be considered to have no upper limit. From the definition of inet_csk_reqsk_queue_len() we can see that it reads qlen from the request_sock_queue structure, defined as follows:

    /*
     * For a TCP Fast Open listener -
     *  lock - protects the access to all the reqsk, which is co-owned by
     *      the listener and the child socket.
     *  qlen - pending TFO requests (still in TCP_SYN_RECV).
     *  max_qlen - max TFO reqs allowed before TFO is disabled.
     *
     *  XXX (TFO) - ideally these fields can be made as part of "listen_sock"
     *  structure above. But there is some implementation difficulty due to
     *  listen_sock being part of request_sock_queue hence will be freed when
     *  a listener is stopped. But TFO related fields may continue to be
     *  accessed even after a listener is closed, until its sk_refcnt drops
     *  to 0 implying no more outstanding TFO reqs. One solution is to keep
     *  listen_opt around until sk_refcnt drops to 0. But there is some other
     *  complexity that needs to be resolved. Eg, a listener can be disabled
     *  temporarily through shutdown()->tcp_disconnect(), and re-enabled later.
     */
    struct fastopen_queue {
        struct request_sock *rskq_rst_head; /* Keep track of past TFO */
        struct request_sock *rskq_rst_tail; /* requests that caused RST.
                                             * This is part of the defense
                                             * against spoofing attack.
                                             */
        spinlock_t  lock;
        int         qlen;       /* # of pending (TCP_SYN_RECV) reqs */
        int         max_qlen;   /* != 0 iff TFO is currently enabled */

        struct tcp_fastopen_context __rcu *ctx; /* cipher context for cookie */
    };

    /** struct request_sock_queue - queue of request_socks
     *
     * @rskq_accept_head - FIFO head of established children
     * @rskq_accept_tail - FIFO tail of established children
     * @rskq_defer_accept - User waits for some data after accept()
     *
     */
    struct request_sock_queue {
        spinlock_t          rskq_lock;
        u8                  rskq_defer_accept;

        u32                 synflood_warned;
        atomic_t            qlen;
        atomic_t            young;

        struct request_sock *rskq_accept_head;
        struct request_sock *rskq_accept_tail;
        struct fastopen_queue fastopenq;  /* Check max_qlen != 0 to determine
                                           * if TFO is enabled.
                                           */
    };

This brings up a new concept, TFO (TCP Fast Open), which we will skip for now and come back to later. The qlen in this structure is increased as tcp_conn_request() executes:

    if (fastopen_sk) {
        af_ops->send_synack(fastopen_sk, dst, &fl, req,
                            &foc, TCP_SYNACK_FASTOPEN);
        /* Add the child socket directly into the accept queue */
        if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) {
            reqsk_fastopen_remove(fastopen_sk, req, false);
            bh_unlock_sock(fastopen_sk);
            sock_put(fastopen_sk);
            goto drop_and_free;
        }
        sk->sk_data_ready(sk);
        bh_unlock_sock(fastopen_sk);
        sock_put(fastopen_sk);
    } else {
        tcp_rsk(req)->tfo_listener = false;
        if (!want_cookie)
            inet_csk_reqsk_queue_hash_add(sk, req,
                                          tcp_timeout_init((struct sock *)req));
        af_ops->send_synack(sk, dst, &fl, req, &foc,
                            !want_cookie ? TCP_SYNACK_NORMAL :
                                           TCP_SYNACK_COOKIE);
        if (want_cookie) {
            reqsk_free(req);
            return 0;
        }
    }

It can be understood that qlen is the current length of the half-open connection queue of the server's listening port, so the condition above can be read as:

    if (!net->ipv4.sysctl_tcp_syncookies &&
        (net->ipv4.sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
         (net->ipv4.sysctl_max_syn_backlog >> 2)) &&
        !tcp_peer_is_proven(req, dst)) {

When syncookies are not enabled, if the remaining capacity of the half-open connection pool is less than one quarter of its maximum length, new connection requests from peers not proven to be alive are no longer processed. This is the principle behind the famous synflood attack:

Against a server without syncookies, any client can fill up the server's half-open connection pool by deliberately leaving the three-way handshake incomplete, sending only SYNs and never returning the third-handshake ACK, so that the server can no longer establish new TCP connections with any client.

Then we also know the original intention of the design of the syncookie function: to prevent synflood.

How do syncookies prevent synflood?

Since synflood is clearly an attack on the upper limit of the half-open connection pool, we need a way to bypass that pool. Could the server avoid recording the information from the first SYN and instead verify it at the third handshake? It turns out it can: the second handshake is the server's reply, so why not put the information derived from the first handshake into that reply, let the client carry it back in the third handshake, and then verify it against the four-tuple of the third-handshake packet? To keep the packet as small as possible, the information to be recorded is run through a hash; the resulting value is called a cookie.

The specific processing method is described as follows:

Call the following code in tcp_conn_request() to generate a cookie:

    if (want_cookie) {
        isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
        req->cookie_ts = tmp_opt.tstamp_ok;
        if (!tmp_opt.tstamp_ok)
            inet_rsk(req)->ecn_ok = 0;
    }

The method that actually generates the cookie is:

    static __u32 secure_tcp_syn_cookie(__be32 saddr, __be32 daddr, __be16 sport,
                                       __be16 dport, __u32 sseq, __u32 data)
    {
        /*
         * Compute the secure sequence number.
         * The output should be:
         *   HASH(sec1,saddr,sport,daddr,dport,sec1) + sseq + (count * 2^24)
         *      + (HASH(sec2,saddr,sport,daddr,dport,count,sec2) % 2^24).
         * Where sseq is their sequence number and count increases every
         * minute by 1.
         * As an extra hack, we add a small "data" value that encodes the
         * MSS into the second hash value.
         */
        u32 count = tcp_cookie_time();
        return (cookie_hash(saddr, daddr, sport, dport, 0, 0) +
                sseq + (count << COOKIEBITS) +
                ((cookie_hash(saddr, daddr, sport, dport, count, 1) + data)
                 & COOKIEMASK));
    }

The hash is computed from the packet's four-tuple and the current time and recorded in isn. The SYN+ACK is sent by tcp_v4_send_synack(), which calls tcp_make_synack(); there, if cookie_ts is set, the TCP timestamp is initialized by cookie_init_timestamp(), which encodes option information in the low 6 bits of the timestamp value.

    #ifdef CONFIG_SYN_COOKIES
        if (unlikely(req->cookie_ts))
            skb->skb_mstamp_ns = cookie_init_timestamp(req);
        else
    #endif

In this way the SYN+ACK carrying the cookie is sent back to the client. The client's final ACK acknowledges that sequence number plus 1, so when the server receives the final ACK it only has to subtract 1 from the acknowledgment number to recover the cookie it sent earlier. It then recomputes the cookie from the packet's four-tuple and checks whether the result matches the cookie that came back. The details are in cookie_v4_check(); you can read the code yourself if you are interested.
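To make the stateless idea concrete, here is a toy user-space illustration. This is not the kernel's code: toy_hash() merely stands in for the kernel's keyed hash, and the real cookie additionally encodes the peer's initial sequence number, a time counter, and the MSS, as seen in secure_tcp_syn_cookie() above.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdbool.h>

    /* Stand-in for the kernel's hash; for demonstration only. */
    static uint32_t toy_hash(uint32_t saddr, uint32_t daddr,
                             uint16_t sport, uint16_t dport, uint32_t secret)
    {
        uint32_t h = secret;

        h ^= saddr;
        h = h * 2654435761u + daddr;
        h ^= ((uint32_t)sport << 16) | dport;
        return h * 2246822519u;
    }

    /* The server uses the cookie as its ISN in the SYN+ACK and keeps no state. */
    static uint32_t make_cookie(uint32_t saddr, uint32_t daddr,
                                uint16_t sport, uint16_t dport, uint32_t secret)
    {
        return toy_hash(saddr, daddr, sport, dport, secret);
    }

    /* The third-handshake ACK acknowledges our ISN + 1, so subtract 1,
     * then recompute the cookie from the same four-tuple and compare. */
    static bool check_cookie(uint32_t saddr, uint32_t daddr,
                             uint16_t sport, uint16_t dport,
                             uint32_t ack_seq, uint32_t secret)
    {
        return ack_seq - 1 == make_cookie(saddr, daddr, sport, dport, secret);
    }

    int main(void)
    {
        uint32_t secret = 0x5eed5eed;
        uint32_t isn = make_cookie(0xC0A8F782, 0xC0A8F781, 45790, 8888, secret);

        /* The client echoes isn + 1 back in the ack field of its final ACK. */
        printf("valid: %d\n", check_cookie(0xC0A8F782, 0xC0A8F781, 45790, 8888,
                                           isn + 1, secret));
        return 0;
    }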

With this kind of verification, work that used to consume memory is turned entirely into CPU computation. Even under a synflood, the attack is no longer bounded by a memory limit but by CPU, which greatly weakens its effect.

The syncookie function is enabled by default in the kernel. The switch is:

/proc/sys/net/ipv4/tcp_syncookies

The default value of this file is 1, which means syncookies are enabled; in this mode, syncookies are used for new connections only after the tcp_max_syn_backlog limit is exhausted. Setting it to 0 turns syncookies off, and setting it to 2 ignores the tcp_max_syn_backlog half-open connection queue and uses syncookies unconditionally.

listen backlog

Here we need to explain the backlog parameter of the listen system call. As we all know, putting a port into the listening state requires three system calls: socket, bind, and listen; it is the listen call that finally puts TCP into the LISTEN state. The man page explains listen's second parameter, backlog, as follows:

 The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.

From this description, backlog appears to limit the TCP half-open connection queue, but if you scroll further down the man page you find this:

    NOTES
        To accept connections, the following steps are performed:

            1.  A socket is created with socket(2).
            2.  The socket is bound to a local address using bind(2), so that
                other sockets may be connect(2)ed to it.
            3.  A willingness to accept incoming connections and a queue limit
                for incoming connections are specified with listen().
            4.  Connections are accepted with accept(2).

        POSIX.1-2001 does not require the inclusion of <sys/types.h>, and this
        header file is not required on Linux. However, some historical (BSD)
        implementations required this header file, and portable applications
        are probably wise to include it.

        The behavior of the backlog argument on TCP sockets changed with
        Linux 2.2. Now it specifies the queue length for completely
        established sockets waiting to be accepted, instead of the number of
        incomplete connection requests. The maximum length of the queue for
        incomplete sockets can be set using
        /proc/sys/net/ipv4/tcp_max_syn_backlog. When syncookies are enabled
        there is no logical maximum length and this setting is ignored. See
        tcp(7) for more information.

        If the backlog argument is greater than the value in
        /proc/sys/net/core/somaxconn, then it is silently truncated to that
        value; the default value in this file is 128. In kernels before
        2.4.25, this limit was a hard coded value, SOMAXCONN, with the value
        128.

This section explains the true meaning of backlog. In simple terms, establishing a usable TCP connection on the server involves 4 steps:

(1) Create a socket. socket()

(2) Bind the socket to a local address and port. bind()

(3) Set the socket value to the listen state. listen()

At this point a client can connect to the socket on the corresponding port. A connection created at this stage, however, has only completed the three-way handshake; it is not yet a connection the application can read from and write to. To get a truly usable connection, a fourth step is needed:

(4) The server accepts a new connection and creates a new fd returned by accept().

After that, the server uses the new fd returned by accept() to communicate with the client. So the listen backlog limits, for a socket in the LISTEN state, how long the queue of connections that have completed the handshake but have not yet been accepted may grow when the server does not call accept() in time. That raises another question: what happens when this queue exceeds the limit?

We can write a simple server program to test this state. The server code is as follows:

    [root@localhost basic]# cat server.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>     /* pause(), write(), close() */
    #include <strings.h>    /* bzero() */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>
    #include <string.h>

    int main()
    {
        int sfd, afd;
        socklen_t socklen;
        struct sockaddr_in saddr, caddr;

        sfd = socket(AF_INET, SOCK_STREAM, 0);
        if (sfd < 0) {
            perror("socket()");
            exit(1);
        }

        bzero(&saddr, sizeof(saddr));
        saddr.sin_family = AF_INET;
        saddr.sin_port = htons(8888);
        if (inet_pton(AF_INET, "0.0.0.0", &saddr.sin_addr) <= 0) {
            perror("inet_pton()");
            exit(1);
        }
        //saddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        if (bind(sfd, (struct sockaddr *)&saddr, sizeof(saddr)) < 0) {
            perror("bind()");
            exit(1);
        }

        if (listen(sfd, 5) < 0) {
            perror("listen()");
            exit(1);
        }

        pause();

        while (1) {
            bzero(&caddr, sizeof(caddr));
            afd = accept(sfd, (struct sockaddr *)&caddr, &socklen);
            if (afd < 0) {
                perror("accept()");
                exit(1);
            }
            if (write(afd, "hello", strlen("hello")) < 0) {
                perror("write()");
                exit(1);
            }
            close(afd);
        }

        exit(0);
    }

The code is very simple: socket, bind, listen, and then pause immediately. Let's look at the current state:

    [root@localhost basic]# ./server &
    [1] 14141
    [root@localhost basic]# ss -tnal
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port
    LISTEN    0         5               0.0.0.0:8888         0.0.0.0:*

For a connection in the LISTEN state shown by the ss command, the Send-Q column is the length of the listen backlog. Now let's use telnet as a client to connect to port 8888 and capture packets to watch the TCP connection process and how the state changes.

    [root@localhost basic]# tcpdump -i ens33 -nn port 8888
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
    10:54:41.863704 IP 192.168.247.130.45790 > 192.168.247.129.8888: Flags [S], seq 3982567931, win 64240, options [mss 1460,sackOK,TS val 1977602046 ecr 0,nop,wscale 7], length 0
    10:54:41.863788 IP 192.168.247.129.8888 > 192.168.247.130.45790: Flags [S.], seq 3708893655, ack 3982567932, win 28960, options [mss 1460,sackOK,TS val 763077058 ecr 1977602046,nop,wscale 7], length 0
    10:54:41.864005 IP 192.168.247.130.45790 > 192.168.247.129.8888: Flags [.], ack 1, win 502, options [nop,nop,TS val 1977602046 ecr 763077058], length 0

The three-way handshake is established without any problem.

    [root@localhost zorro]# ss -antl
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port
    LISTEN    1         5               0.0.0.0:8888         0.0.0.0:*

From the ss output we can see that, for a socket in the LISTEN state, Recv-Q is the number of connections currently queued in the backlog queue. Let's create a few more connections and see what happens when the limit is exceeded:

    [root@localhost zorro]# ss -antl
    State     Recv-Q    Send-Q    Local Address:Port    Peer Address:Port
    LISTEN    6         5               0.0.0.0:8888         0.0.0.0:*

    [root@localhost basic]# tcpdump -i ens33 -nn port 8888
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
    11:00:40.674176 IP 192.168.247.130.45804 > 192.168.247.129.8888: Flags [S], seq 3183080621, win 64240, options [mss 1460,sackOK,TS val 1977960856 ecr 0,nop,wscale 7], length 0
    11:00:41.682431 IP 192.168.247.130.45804 > 192.168.247.129.8888: Flags [S], seq 3183080621, win 64240, options [mss 1460,sackOK,TS val 1977961864 ecr 0,nop,wscale 7], length 0
    11:00:43.728894 IP 192.168.247.130.45804 > 192.168.247.129.8888: Flags [S], seq 3183080621, win 64240, options [mss 1460,sackOK,TS val 1977963911 ecr 0,nop,wscale 7], length 0
    11:00:47.761967 IP 192.168.247.130.45804 > 192.168.247.129.8888: Flags [S], seq 3183080621, win 64240, options [mss 1460,sackOK,TS val 1977967944 ecr 0,nop,wscale 7], length 0
    11:00:56.017547 IP 192.168.247.130.45804 > 192.168.247.129.8888: Flags [S], seq 3183080621, win 64240, options [mss 1460,sackOK,TS val 1977976199 ecr 0,nop,wscale 7], length 0
    11:01:12.402559 IP 192.168.247.130.45804 > 192.168.247.129.8888: Flags [S], seq 3183080621, win 64240, options [mss 1460,sackOK,TS val 1977992584 ecr 0,nop,wscale 7], length 0
    11:01:44.657797 IP 192.168.247.130.45804 > 192.168.247.129.8888: Flags [S], seq 3183080621, win 64240, options [mss 1460,sackOK,TS val 1978024840 ecr 0,nop,wscale 7], length 0

Once the number of queued connections exceeds 6, a newly created connection can no longer complete the three-way handshake: the client's SYN receives no response and is retransmitted. After 6 retries the attempt ends and the client reports an error:

    [root@localhost zorro]# telnet 192.168.247.129 8888
    Trying 192.168.247.129...
    telnet: connect to address 192.168.247.129: Connection timed out

The number of times the first-handshake SYN is retried when it gets no response is limited by this kernel parameter:

/proc/sys/net/ipv4/tcp_syn_retries

This sets the maximum number of retries of the first-handshake SYN when no SYN+ACK is received; the default is 6. You can change the retry count, but not the timing rule: the interval doubles each time, i.e., the first retry comes after 1 second, the second after 2 seconds, then 4, 8, and so on. So by default tcp_syn_retries implies waiting up to 63 seconds. There is also a file that sets the number of retries for the second handshake:

/proc/sys/net/ipv4/tcp_synack_retries

This sets the maximum number of times the second-handshake SYN+ACK is retransmitted when the final ACK is not received; the default is 5, so tcp_synack_retries implies waiting up to 31 seconds.
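As a quick sanity check of that timing rule (a rough sketch, not kernel code), summing the doubling intervals reproduces the 63-second and 31-second figures:

    #include <stdio.h>

    /* Retry i is sent 2^(i-1) seconds after the previous attempt, so the
     * intervals for 6 retries sum to 1+2+4+8+16+32 = 63 seconds and for
     * 5 retries to 1+2+4+8+16 = 31 seconds. */
    static unsigned int total_retry_wait(unsigned int retries)
    {
        unsigned int total = 0, interval = 1;

        for (unsigned int i = 0; i < retries; i++) {
            total += interval;
            interval *= 2;
        }
        return total;
    }

    int main(void)
    {
        printf("tcp_syn_retries=6    -> %u seconds\n", total_retry_wait(6));
        printf("tcp_synack_retries=5 -> %u seconds\n", total_retry_wait(5));
        return 0;
    }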

From the tests above we can see that when the listen backlog queue is exhausted, new connections cannot complete the three-way handshake. This is sometimes confused with a synflood attack because the effects look similar. In the kernel's tcp_conn_request() we can see how synflood and a full listen backlog are handled respectively:

    int tcp_conn_request(struct request_sock_ops *rsk_ops,
                         const struct tcp_request_sock_ops *af_ops,
                         struct sock *sk, struct sk_buff *skb)
    {
        struct tcp_fastopen_cookie foc = { .len = -1 };
        __u32 isn = TCP_SKB_CB(skb)->tcp_tw_isn;
        struct tcp_options_received tmp_opt;
        struct tcp_sock *tp = tcp_sk(sk);
        struct net *net = sock_net(sk);
        struct sock *fastopen_sk = NULL;
        struct request_sock *req;
        bool want_cookie = false;
        struct dst_entry *dst;
        struct flowi fl;

        /* TW buckets are converted to open requests without
         * limitations, they conserve resources and peer is
         * evidently real one.
         */
        if ((net->ipv4.sysctl_tcp_syncookies == 2 ||
             inet_csk_reqsk_queue_is_full(sk)) && !isn) {
            want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
            if (!want_cookie)
                goto drop;
        }

        if (sk_acceptq_is_full(sk)) {
            NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
            goto drop;
        }
        ......

When sk_acceptq_is_full(sk) is true, the packet is simply dropped and the counter corresponding to LINUX_MIB_LISTENOVERFLOWS is incremented. Looking up the counter mapping, LINUX_MIB_LISTENOVERFLOWS corresponds to:

SNMP_MIB_ITEM("ListenOverflows", LINUX_MIB_LISTENOVERFLOWS)

That is the ListenOverflows count in /proc/net/netstat, which also corresponds to the "listen queue of a socket overflowed" line shown by netstat -s. From the code of sk_acceptq_is_full() we can also see why, with the listen backlog set to 5, connections only start timing out after the count exceeds 5+1: the current queue length has to be strictly greater than the maximum limit:

    static inline bool sk_acceptq_is_full(const struct sock *sk)
    {
        return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
    }

From the man page of listen, we can know that the kernel limit file for listen backlog is:

/proc/sys/net/core/somaxconn

For a listening socket, the effective limit is the minimum of somaxconn and the backlog argument.

In general this value does not need tuning. Think about when an application would fail to accept connections in time: in most cases it happens when the system is under such heavy load that it cannot keep up with accepting new connections, and in that situation scaling out matters more than enlarging the queue. Sometimes we should even shrink the queue and reduce the client's SYN retries so that clients fail faster, preventing a pile-up of connections from turning into an avalanche. Of course, software with a poorly designed concurrency architecture can also exhaust the queue without any real load pressure; in that case what needs adjusting is the software architecture or other settings.

TFO - TCP Fast Open

From the code above we already know that the current Linux TCP stack supports TFO, short for TCP Fast Open. As the name suggests, its main purpose is to streamline the three-way handshake so that TCP connections open faster on high-latency networks. So how does TFO work? Let's first observe its behavior with a practical example.

We start a web service on one server, then from another machine use curl to request port 80 of that web server, fetch its / page, and exit. The captured packets look like this:

    [root@localhost zorro]# tcpdump -i ens33 -nn port 80
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
    11:31:07.390934 IP 192.168.247.130.58066 > 192.168.247.129.80: Flags [S], seq 4136264279, win 64240, options [mss 1460,sackOK,TS val 667677346 ecr 0,nop,wscale 7], length 0
    11:31:07.390994 IP 192.168.247.129.80 > 192.168.247.130.58066: Flags [S.], seq 1980017862, ack 4136264280, win 28960, options [mss 1460,sackOK,TS val 4227985538 ecr 667677346,nop,wscale 7], length 0
    11:31:07.391147 IP 192.168.247.130.58066 > 192.168.247.129.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 667677347 ecr 4227985538], length 0
    11:31:07.391177 IP 192.168.247.130.58066 > 192.168.247.129.80: Flags [P.], seq 1:80, ack 1, win 502, options [nop,nop,TS val 667677347 ecr 4227985538], length 79: HTTP: GET / HTTP/1.1
    11:31:07.391185 IP 192.168.247.129.80 > 192.168.247.130.58066: Flags [.], ack 80, win 227, options [nop,nop,TS val 4227985538 ecr 667677347], length 0
    11:31:07.391362 IP 192.168.247.129.80 > 192.168.247.130.58066: Flags [.], seq 1:2897, ack 80, win 227, options [nop,nop,TS val 4227985538 ecr 667677347], length 2896: HTTP: HTTP/1.1 200 OK
    11:31:07.391441 IP 192.168.247.129.80 > 192.168.247.130.58066: Flags [P.], seq 2897:4297, ack 80, win 227, options [nop,nop,TS val 4227985539 ecr 667677347], length 1400: HTTP
    11:31:07.391497 IP 192.168.247.130.58066 > 192.168.247.129.80: Flags [.], ack 2897, win 496, options [nop,nop,TS val 667677347 ecr 4227985538], length 0
    11:31:07.391632 IP 192.168.247.130.58066 > 192.168.247.129.80: Flags [.], ack 4297, win 501, options [nop,nop,TS val 667677347 ecr 4227985539], length 0
    11:31:07.398223 IP 192.168.247.130.58066 > 192.168.247.129.80: Flags [F.], seq 80, ack 4297, win 501, options [nop,nop,TS val 667677354 ecr 4227985539], length 0
    11:31:07.398336 IP 192.168.247.129.80 > 192.168.247.130.58066: Flags [F.], seq 4297, ack 81, win 227, options [nop,nop,TS val 4227985545 ecr 667677354], length 0
    11:31:07.398480 IP 192.168.247.130.58066 > 192.168.247.129.80: Flags [.], ack 4298, win 501, options [nop,nop,TS val 667677354 ecr 4227985545], length 0

This is a complete three-way handshake and connection close, along with an HTTP data transfer. We have not yet enabled TFO, so let's use a diagram to illustrate the connection process:

We happened to observe the connection being closed with only three segments here (the server's ACK and FIN were combined), but that is not today's topic. The rest of the connection process is basically standard TCP behavior. Now let's turn on TFO and see what changes.

Our web server uses nginx and the client uses curl. Both software support TFO by default. First, enable TFO support in the kernel:

    [root@localhost zorro]# echo 3 > /proc/sys/net/ipv4/tcp_fastopen
    [root@localhost zorro]# cat /proc/sys/net/ipv4/tcp_fastopen
    3

This file is the TFO switch: 0 means off; 1 enables client support (the default); 2 enables server support; 3 enables both. Normally, when needed, we set it to 1 on client machines and 2 on servers; for convenience, setting both to 3 also works. Next, configure the nginx server to enable TFO:

    server {
        listen       80 default_server fastopen=128;
        listen       [::]:80 default_server fastopen=128;
        server_name  _;

Find the listen directive in the nginx configuration file and append a fastopen parameter with a value, as shown above, then restart the server.

The client side is simpler: just use curl's --tcp-fastopen option. We run this command on the client:

 [root@localhost zorro]# curl --tcp-fastopen http://192.168.247.129/

At the same time, capture the packet on the server and take a look:

Packet capture of the first request:

    [root@localhost zorro]# tcpdump -i ens33 -nn port 80
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on ens33, link-type EN10MB (Ethernet), capture size 262144 bytes
    11:44:03.774234 IP 192.168.247.130.58074 > 192.168.247.129.80: Flags [S], seq 3253027385, win 64240, options [mss 1460,sackOK,TS val 668453730 ecr 0,nop,wscale 7,tfo cookiereq,nop,nop], length 0
    11:44:03.774361 IP 192.168.247.129.80 > 192.168.247.130.58074: Flags [S.], seq 3812865995, ack 3253027386, win 28960, options [mss 1460,sackOK,TS val 4228761923 ecr 668453730,nop,wscale 7,tfo cookie 336fe8751f5cca4b,nop,nop], length 0
    11:44:03.774540 IP 192.168.247.130.58074 > 192.168.247.129.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 668453730 ecr 4228761923], length 0
    11:44:03.774575 IP 192.168.247.130.58074 > 192.168.247.129.80: Flags [P.], seq 1:80, ack 1, win 502, options [nop,nop,TS val 668453730 ecr 4228761923], length 79: HTTP: GET / HTTP/1.1
    11:44:03.774597 IP 192.168.247.129.80 > 192.168.247.130.58074: Flags [.], ack 80, win 227, options [nop,nop,TS val 4228761923 ecr 668453730], length 0
    11:44:03.774786 IP 192.168.247.129.80 > 192.168.247.130.58074: Flags [.], seq 1:2897, ack 80, win 227, options [nop,nop,TS val 4228761923 ecr 668453730], length 2896: HTTP: HTTP/1.1 200 OK
    11:44:03.774889 IP 192.168.247.129.80 > 192.168.247.130.58074: Flags [P.], seq 2897:4297, ack 80, win 227, options [nop,nop,TS val 4228761923 ecr 668453730], length 1400: HTTP
    11:44:03.774997 IP 192.168.247.130.58074 > 192.168.247.129.80: Flags [.], ack 2897, win 496, options [nop,nop,TS val 668453731 ecr 4228761923], length 0
    11:44:03.775022 IP 192.168.247.130.58074 > 192.168.247.129.80: Flags [.], ack 4297, win 489, options [nop,nop,TS val 668453731 ecr 4228761923], length 0
    11:44:03.775352 IP 192.168.247.130.58074 > 192.168.247.129.80: Flags [F.], seq 80, ack 4297, win 501, options [nop,nop,TS val 668453731 ecr 4228761923], length 0
    11:44:03.775455 IP 192.168.247.129.80 > 192.168.247.130.58074: Flags [F.], seq 4297, ack 81, win 227, options [nop,nop,TS val 4228761924 ecr 668453731], length 0
    11:44:03.775679 IP 192.168.247.130.58074 > 192.168.247.129.80: Flags [.], ack 4298, win 501, options [nop,nop,TS val 668453731 ecr 4228761924], length 0

Packet capture of the second request:

    11:44:11.476255 IP 192.168.247.130.58076 > 192.168.247.129.80: Flags [S], seq 3310765845:3310765924, win 64240, options [mss 1460,sackOK,TS val 668461432 ecr 0,nop,wscale 7,tfo cookie 336fe8751f5cca4b,nop,nop], length 79: HTTP: GET / HTTP/1.1
    11:44:11.476334 IP 192.168.247.129.80 > 192.168.247.130.58076: Flags [S.], seq 2616505126, ack 3310765925, win 28960, options [mss 1460,sackOK,TS val 4228769625 ecr 668461432,nop,wscale 7], length 0
    11:44:11.476601 IP 192.168.247.129.80 > 192.168.247.130.58076: Flags [.], seq 1:2897, ack 1, win 227, options [nop,nop,TS val 4228769625 ecr 668461432], length 2896: HTTP: HTTP/1.1 200 OK
    11:44:11.476619 IP 192.168.247.129.80 > 192.168.247.130.58076: Flags [P.], seq 2897:4297, ack 1, win 227, options [nop,nop,TS val 4228769625 ecr 668461432], length 1400: HTTP
    11:44:11.476657 IP 192.168.247.130.58076 > 192.168.247.129.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 668461432 ecr 4228769625], length 0
    11:44:11.476906 IP 192.168.247.130.58076 > 192.168.247.129.80: Flags [.], ack 4297, win 489, options [nop,nop,TS val 668461433 ecr 4228769625], length 0
    11:44:11.477100 IP 192.168.247.130.58076 > 192.168.247.129.80: Flags [F.], seq 1, ack 4297, win 501, options [nop,nop,TS val 668461433 ecr 4228769625], length 0
    11:44:11.477198 IP 192.168.247.129.80 > 192.168.247.130.58076: Flags [F.], seq 4297, ack 2, win 227, options [nop,nop,TS val 4228769625 ecr 668461433], length 0
    11:44:11.477301 IP 192.168.247.130.58076 > 192.168.247.129.80: Flags [.], ack 4298, win 501, options [nop,nop,TS val 668461433 ecr 4228769625], length 0

We can see that with TFO enabled, the overall interaction of the first request is basically the same as before, except that the SYN of the first handshake carries an extra "tfo cookiereq" option and the server's second-handshake reply carries "tfo cookie 336fe8751f5cca4b". This is the main difference in the first TCP connection from a given client after TFO is enabled:

The client's SYN carries an empty cookie request. If the server also supports TFO, it sees this empty cookie field, computes a TFO cookie, and returns it to the client. The client uses this cookie the next time it establishes a TCP connection with the server: a SYN carrying the cookie may also carry application layer data, so the subsequent three-way handshake is no longer just a handshake but can carry HTTP data at the same time. The two interactions are shown below:

Let's take a look at other kernel parameters related to TFO:

/proc/sys/net/ipv4/tcp_fastopen

From the content above we already know what the values 1, 2, and 3 mean. Besides these, the file can also take the following flags:

  • 0x4: client side. Send data in the SYN regardless of cookie availability and without a cookie option.
  • 0x200: server side. Accept data in a SYN without any cookie option present.
  • 0x400: server side. Enable TFO on all listening sockets by default, without requiring the TCP_FASTOPEN socket option.

It should also be noted that, in general, besides turning on the kernel switches, the application itself has to support TFO and make corresponding changes. On the client side, data has to be sent with sendmsg() or sendto() with the MSG_FASTOPEN flag set; on the server side, the TCP_FASTOPEN socket option must be set with setsockopt() after the socket is created to enable TFO support. A sketch of both sides follows below.
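Here is a minimal sketch of those application-side changes. Error handling is mostly omitted; the address 192.168.247.129 and port 8888 are reused from earlier examples purely as placeholders, and MSG_FASTOPEN needs a reasonably recent kernel and libc. This is an illustration of the API, not production code.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <arpa/inet.h>

    /* Server side: enable TFO on the listening socket before listen(). */
    static int tfo_server(void)
    {
        int qlen = 128;    /* like nginx's fastopen=128: pending TFO requests */
        int sfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in saddr = { .sin_family = AF_INET,
                                     .sin_port = htons(8888),
                                     .sin_addr.s_addr = htonl(INADDR_ANY) };

        bind(sfd, (struct sockaddr *)&saddr, sizeof(saddr));
        setsockopt(sfd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
        listen(sfd, 5);
        return sfd;
    }

    /* Client side: sendto() with MSG_FASTOPEN both connects and sends data,
     * so once a TFO cookie is cached the request rides in the SYN. */
    static int tfo_client(const char *msg)
    {
        int cfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in daddr = { .sin_family = AF_INET,
                                     .sin_port = htons(8888) };

        inet_pton(AF_INET, "192.168.247.129", &daddr.sin_addr);
        if (sendto(cfd, msg, strlen(msg), MSG_FASTOPEN,
                   (struct sockaddr *)&daddr, sizeof(daddr)) < 0)
            perror("sendto(MSG_FASTOPEN)");
        return cfd;
    }

    int main(int argc, char **argv)
    {
        if (argc > 1)                      /* any argument: act as the client */
            return tfo_client("GET / HTTP/1.0\r\n\r\n") < 0;

        int afd = accept(tfo_server(), NULL, NULL);
        char buf[256];
        ssize_t n = read(afd, buf, sizeof(buf));   /* data carried in the SYN */

        printf("got %zd bytes\n", n);
        return 0;
    }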

/proc/sys/net/ipv4/tcp_fastopen_key

This file holds the secret key the server uses to generate TFO cookies.

/proc/sys/net/ipv4/tcp_fastopen_blackhole_timeout_sec

Because TFO modifies the normal TCP three-way handshake, the special SYN of the first packet may be treated as abnormal traffic and dropped by some routers or firewall rules on its way to the server. We call this phenomenon a TFO firewall blackhole. The default behavior is that, if such a blackhole is detected, TFO is temporarily disabled; this file sets the length of that disable period. The default value is 3600, meaning that after a blackhole is detected TFO is turned off for 3600 seconds. If a blackhole is detected again after a disable period ends, the next disable period grows exponentially. A value of 0 disables the blackhole detection.

In addition, what does fastopen=128 in the nginx configuration mean? It limits the maximum length of the queue of connections on this port that have not yet completed the three-way handshake after fastopen is enabled. The limit can be set larger; for details see the nginx configuration documentation: nginx.org/en/docs/ht... .

We will not go into the kernel implementation of TFO in detail here; you can find the TFO-related processing in the code paths described above and, with the basics covered here, study it yourself. Whether servers should enable TFO has been debated for a long time: in complex network environments, TFO's real-world performance often falls short of expectations. That is not because TFO itself is bad, but because networks frequently reject the relevant packets, so TFO simply never takes effect.

For more detailed descriptions of TFO, you can also refer to RFC7413: tools.ietf.org/html/...

Other kernel parameters

During the three handshakes, there are several retry settings:

/proc/sys/net/ipv4/tcp_synack_retries

This sets the maximum number of times the second-handshake SYN+ACK is retransmitted when the final ACK is not received; the default is 5.

Only the retry counts can be set here; the retry interval cannot. The interval doubles each time: the first retry after 1 second, the second after 2 seconds, then 4 seconds, 8 seconds, and so on. So by default tcp_syn_retries implies waiting up to 63 seconds and tcp_synack_retries up to 31 seconds.
