Author | zorrozou

Preface

TCP is a protocol that everyone seems to know well, yet hardly knows at all. We say "know well" because we use it almost every day, and everyone has at least heard of the three-way handshake, the four-way close, sliding windows, slow start, congestion avoidance, and congestion control. We say "hardly knows" because TCP is genuinely complex: the network environment changes while a connection is running, and TCP's mechanisms adapt to those changes in different ways, so explaining its concepts and behavior precisely is not easy. This series of articles tries to explain some details of the TCP implementation on Linux from a different angle. My ability is of course limited, and I hope you will forgive the parts that remain unclear. This article starts from the three-way handshake that TCP uses to establish a connection, and I hope it is helpful to you. The kernel code in this article is based on Linux 5.3.

What is reliable and connection-oriented?

Whenever TCP comes up, it is described as a connection-oriented, reliable transport-layer protocol, in contrast to UDP, which is connectionless and unreliable. In fact, IP delivery itself is connectionless and unreliable, and UDP merely adds transport-layer port encapsulation on top of the IP layer, so it naturally inherits IP's delivery guarantees. The reason TCP is complex is that it must implement a connection-oriented, reliable transport layer on top of a connectionless, unreliable IP layer. So we first need to understand, from an engineering perspective, what "connection-oriented" means and what "reliable" means; only then can we understand why TCP is built the way it is. Let's outline these questions first.

What "connection-oriented" means:

Connection: the data transmitted within a connection carries state. For example, each end needs to know whether the other end is currently waiting to send or to receive, and the relationship between pieces of data, such as the order of the data stream, must be maintained. A typical analogy is a phone call.

Connectionless: neither end needs to care whether the other is online. Each data segment sent is an independent entity; there is no relationship between one segment and the next, and no such relationship needs to be maintained. A typical analogy is text messaging.

What "reliable" means: primarily that data is neither corrupted nor lost in transit, so it is guaranteed to arrive correctly. Anything that does not make this guarantee is unreliable.

How to solve the connection-oriented problem: use the three phases of establishing a connection, transmitting data, and closing the connection to create a long-lived transmission mechanism, so that all data sent within one connection shares context. The following concepts are derived from this:
How to solve the reliability problem: introduce a confirmation mechanism for data transmission, i.e., after sending data, wait for the other side to confirm receipt. This requires maintaining an acknowledgement field and ack state; this is the stop-and-wait protocol. But after introducing this confirmation mechanism (stop-and-wait), bandwidth utilization becomes very low. How do we fix that? The answer is a window-based confirmation mechanism, the sliding window: instead of confirming after every single packet, confirmation happens after sending multiple packets. Once windows are introduced, how do we choose an appropriate window size on networks with different delays? The answer is to introduce window variables and window advertisement:

The sender maintains: a send window, whose size it adjusts according to the window the receiver advertises.
The receiver maintains: a receive window, reflecting how much data it is currently able to buffer and accept; it advertises this window back to the sender, as shown in the sketch below.
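To make the window mechanism concrete, here is a minimal sketch (not from the original article) of a window-limited send loop. TOTAL_SEGS, the window values, the min_u32 helper, and the simulated ACKs are all invented for illustration; real TCP tracks bytes rather than whole segments, and the congestion window cwnd used here is only introduced in the next paragraph:

```c
#include <stdio.h>

#define TOTAL_SEGS 20

static unsigned int min_u32(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

int main(void)
{
	unsigned int awnd = 4;   /* window advertised by the receiver */
	unsigned int cwnd = 2;   /* congestion window (see next paragraph) */
	unsigned int next_to_send = 0, unacked = 0;

	while (unacked < TOTAL_SEGS) {
		/* The usable window is the smaller of the two windows. */
		unsigned int wnd = min_u32(cwnd, awnd);

		/* Keep at most wnd segments in flight. */
		while (next_to_send < TOTAL_SEGS &&
		       next_to_send - unacked < wnd) {
			printf("send seg %u (in flight: %u/%u)\n",
			       next_to_send, next_to_send - unacked + 1, wnd);
			next_to_send++;
		}

		/* Pretend the oldest in-flight segment is ACKed; a real
		 * ACK would also carry the receiver's latest window. */
		printf("ack  seg %u\n", unacked);
		unacked++;
	}
	return 0;
}
```

With wnd fixed at 1 this degenerates into the stop-and-wait protocol; raising wnd keeps more segments in flight and makes better use of the bandwidth-delay product.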
The receiver replies to the sender with ACK confirmations, and each ACK carries the latest window length so that the sender can adjust its window. This is also where SACK (selective acknowledgement) and the persist timer, used when the advertised window drops to 0, come into play.

Once the sliding window is introduced, bandwidth can be used efficiently, but the network environment is complex and heavy traffic can cause congestion at any time, so a congestion control mechanism is also needed: when congestion occurs, TCP should ensure that bandwidth is shared fairly among TCP connections. Under congestion, connections occupying a lot of bandwidth should be throttled back and connections occupying little should be allowed to grow, so that resources end up fairly shared. Congestion control adjusts bandwidth usage by adjusting the size of the sliding window, so a new variable, cwnd (congestion window), is introduced at the sender to reflect the network's current transmission capacity, while the previously advertised window can be written as awnd. The window actually usable by the sender is then the smaller of cwnd and awnd. A string of questions and concepts follows from this: how is the actual advertised window determined? What is slow start? How does the congestion avoidance process work? How does congestion control operate? And so on. The fundamental reason TCP is complex is that it has to solve all of these problems in engineering terms. Now that we have the outline, let's first look at what the three-way handshake is for.

Why three times?

Why three handshakes, and not two, four, or some other number? First we need to understand the purpose of establishing a connection. There are two purposes: 1. each side confirms that the other can both send and receive data; 2. each side synchronizes its initial sequence number (ISN) with the other and confirms that the other side received it.
Looking at the second purpose, we can see that a two-way handshake at the very least leaves you unable to determine whether the other side knows your initial sequence number. Suppose I am the server: the other side sends me its sequence number in a SYN, and I send my own sequence number back, but what if the packet I sent gets lost? I cannot confirm whether the other side received it, so I need the other side to confirm to me that it did. Four handshakes would not be impossible, just redundant, so three is the most economical choice. There is no escaping convention here, so we will use the classic diagram to walk through the three-way handshake.

When interviewing, I often ask a rather silly question at this point: if server A, after receiving a SYN from client B and replying with SYN+ACK, then receives an ACK packet sent from a different client C, will A establish an ESTABLISHED connection with C? Drawn as a picture: B completes the first and second handshakes with A as normal, and then the third-handshake ACK arrives from C instead of B. The reason this question is rather silly is that most people sense it will not work, but if you ask why, few can really answer. So why? It is actually simple: the new client C has a different ip+port. If it sends me a bare ACK, the TCP protocol says I reply with an RST, and naturally no connection is created. This leads to a real question, though: the kernel must be able to tell whether an incoming ACK is arriving out of nowhere, or belongs to a peer that already sent me a SYN and got my SYN+ACK. The kernel looks the packet up by its four-tuple. This lookup happens in tcp_v4_rcv(), the main entry point of TCP receive processing, which calls __inet_lookup() to search. The search has two steps: first check whether a connection exists in the established table, then check the listener table; if neither matches, the send_reset path is taken directly. Once a matching connection is found, if it is in TCP_ESTABLISHED state the packet goes straight to tcp_rcv_established() to receive data; otherwise it enters tcp_rcv_state_process(), which handles all the other TCP states. For the first handshake the socket is in TCP_LISTEN state and we reach:

acceptable = icsk->icsk_af_ops->conn_request(sk, skb) >= 0;

Here conn_request is tcp_v4_conn_request(), and the first handshake is processed in that method. For the third handshake the TCP state should be TCP_SYN_RECV. While the server sits in SYN_RECV it must record the contents of the client's SYN packet in a cache so that it can be found again during later packet processing, occupying part of the slab cache. This cache has an upper limit in the kernel, controlled by /proc/sys/net/ipv4/tcp_max_syn_backlog. This value determines how many connections in TCP_SYN_RECV state TCP can hold at the same time under normal conditions, i.e., the number of half-open connections on the server. On a typical server it defaults to 1024-2048; the default is derived automatically from total memory size, so machines with more memory get a larger value. What happens when the half-connection queue is exhausted? We can still find the answer in the kernel. In tcp_conn_request() we can see the following passage:
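The passage in question, lightly abridged from tcp_conn_request() in net/ipv4/tcp_input.c (quoted from memory of the 5.x source, so treat exact field names as approximate):

```c
if ((net->ipv4.sysctl_tcp_syncookies == 2 ||
     inet_csk_reqsk_queue_is_full(sk)) && !isn) {
	want_cookie = tcp_syn_flood_action(sk, rsk_ops->slab_name);
	if (!want_cookie)
		goto drop;
}

if (sk_acceptq_is_full(sk)) {
	NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
	goto drop;
}
```

And later in the same function:

```c
if (!want_cookie && !isn) {
	/* Drop new requests when the half-open queue is nearly
	 * full and syncookies are disabled. */
	if (!net->ipv4.sysctl_tcp_syncookies &&
	    (net->ipv4.sysctl_max_syn_backlog - inet_csk_reqsk_queue_len(sk) <
	     (net->ipv4.sysctl_max_syn_backlog >> 2)) &&
	    !tcp_peer_is_proven(req, dst)) {
		goto drop_and_release;
	}

	isn = af_ops->init_seq(skb);
}
```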
A few related concepts appear in this code: the syncookie mechanism (sysctl_tcp_syncookies) and the half-connection queue length returned by inet_csk_reqsk_queue_len().
We will discuss the syncookie mechanism in detail later. For now we only need this conclusion: when syncookie is enabled, the half-connection queue can be considered to have no upper limit. From the definition of inet_csk_reqsk_queue_len we can see that it reads the qlen field of the request_sock_queue structure, whose definition is shown below. One of its fields brings up a new concept, TFO - TCP Fast Open, which we will skip here and discuss later. The qlen in this structure is incremented in the course of tcp_conn_request():
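A reconstruction of the structure (from memory of the 5.x headers, so take exact field order with a grain of salt), followed by the helper that bumps qlen:

```c
/* include/net/request_sock.h */
struct request_sock_queue {
	spinlock_t		rskq_lock;
	u8			rskq_defer_accept;
	u32			synflood_warned;
	atomic_t		qlen;   /* current half-connection queue length */
	atomic_t		young;
	struct request_sock	*rskq_accept_head;
	struct request_sock	*rskq_accept_tail;
	struct fastopen_queue	fastopenq;  /* the TFO queue mentioned above */
};

/* Called on the way out of tcp_conn_request() when the request
 * sock is hashed into the listener's queue. */
static inline void reqsk_queue_added(struct request_sock_queue *queue)
{
	atomic_inc(&queue->young);
	atomic_inc(&queue->qlen);
}
```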
qlen, then, can be understood as the current length of the half-connection queue of the server's listening port. The passage above can therefore be read as: when syncookie is not enabled, if the remaining capacity of the half-connection pool is less than a quarter of its maximum length, new connection requests are no longer processed. This is the principle behind the famous synflood attack: against a server without syncookie, any client can fill the server's half-connection pool by constructing incomplete three-way handshakes, sending only SYNs and never returning the third-handshake ACK, leaving the server unable to establish a new TCP connection with any client. This also tells us the original motivation for the syncookie feature: to defend against synflood.

How do syncookies prevent synflood?

Since synflood is clearly an attack on the upper limit of the half-connection pool, we need a way to bypass the pool entirely. Could the server avoid recording the four-tuple information from the first SYN altogether, and verify it during the third handshake instead? In fact it can: the second handshake is the server's reply, so why not fold the information obtained from the first handshake into that reply packet, let the client carry it back during the third handshake, and then check it against the four-tuple of the third-handshake packet? Of course, to keep the packet as small as possible, the information to be recorded is run through a hash, and the resulting value is called a cookie. The concrete handling is as follows. tcp_conn_request() calls the following code to generate a cookie:
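The relevant lines in tcp_conn_request() look roughly like this (again quoted from memory of the 5.x source):

```c
if (want_cookie) {
	/* Encode the connection information into the ISN that will be
	 * sent back in the SYN+ACK, instead of storing it locally. */
	isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
	req->cookie_ts = tmp_opt.tstamp_ok;
	if (!tmp_opt.tstamp_ok)
		inet_rsk(req)->ecn_ok = 0;
}
```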
The function that actually generates the cookie (for IPv4, cookie_v4_init_sequence()) computes a hash over the packet's four-tuple and the current time, and records it in isn. The SYN+ACK is sent by tcp_v4_send_synack(), which calls tcp_make_synack(); that function checks whether cookie_ts is set, and if so initializes the TCP option information in the low 6 bits of the timestamp. The SYN+ACK thus travels back to the client with the cookie embedded in it. When the client replies with the final ACK it acknowledges seq+1, so when the server receives that ACK it only needs to subtract 1 from the acknowledged sequence number to recover the cookie it sent earlier. It then recomputes the cookie from the packet's four-tuple and verifies that the computed cookie matches the returned one. The details are in cookie_v4_check(); you can search the code yourself if you are interested. With this verification, a process that used to consume memory is converted entirely into CPU work. Even under a synflood, the attack no longer presses against a memory limit but against CPU cycles, which greatly weakens its effect.

The syncookie feature is enabled by default in the kernel. The switch is:

/proc/sys/net/ipv4/tcp_syncookies

The default value of this file is 1, meaning syncookie is on; note that in this setting, syncookies are only used for new connections after the tcp_max_syn_backlog limit has been exhausted. Setting it to 0 turns syncookie off; setting it to 2 ignores the tcp_max_syn_backlog half-connection queue and uses syncookies unconditionally.

listen backlog

Here we need to explain the backlog parameter of the listen system call. As everyone knows, putting a port into the listening state takes three system calls: socket, bind, and listen. It is the listen call that finally moves TCP into the LISTEN state. The man page first describes its second parameter, backlog, roughly as the maximum length to which the queue of pending connections may grow. From that description, backlog seems to limit the TCP half-connection queue, but if you read the man page carefully and scroll down, you find that since Linux 2.2 backlog specifies the queue length for completely established sockets waiting to be accepted, not the number of incomplete connection requests, whose maximum is set by /proc/sys/net/ipv4/tcp_max_syn_backlog. That passage explains the true meaning of backlog. In simple terms, establishing a usable TCP connection involves four steps:

(1) Create a socket. socket()
(2) Bind the socket to a local address and port. bind()
(3) Put the socket into the listen state. listen()

At this point a client can already make a connection to the corresponding port. Of course, such a connection can only complete the three-way handshake; it is not yet a connection the application can read and write. To create a truly usable connection, a fourth step is required:

(4) The server accepts the new connection, and accept() returns a new fd. The server then uses this accepted fd to communicate with the client.

The listen backlog limits the following: if the server is in LISTEN state and clients have completed handshakes with it, but the server has not yet accepted those connections, how long may the queue of not-yet-accepted requests grow? And a further question: what happens when this queue overflows? We can write a simple server program to test this.
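The article's original listing was not preserved, so here is a minimal sketch matching its description (socket, bind, listen on port 8888 with backlog 5, then pause):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		exit(1);
	}

	struct sockaddr_in addr;
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(8888);

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("bind");
		exit(1);
	}

	/* backlog = 5: at most 5 (+1, see below) fully established
	 * connections may queue up waiting for accept(). */
	if (listen(fd, 5) < 0) {
		perror("listen");
		exit(1);
	}

	/* Deliberately never call accept(): block forever so we can
	 * watch the accept queue fill up with ss. */
	pause();
	return 0;
}
```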
The code is very simple: socket, bind, listen, then pause. Let's look at the current state: for a LISTEN-state socket shown by the ss command, the Send-Q column is the length of the listen backlog. We use telnet as the client to connect to port 8888 and capture packets to watch the TCP connection process and how things change. The three-way handshake completes without any problem, and in the ss output the Recv-Q of the LISTEN socket is the number of connections currently queued in the backlog. Let's create a few more connections and see what happens when the limit is exceeded: once the number of queued connections exceeds 6 and a new connection is attempted, the new connection can no longer complete the three-way handshake. The client's SYN gets no response and starts retrying; after 6 retries the attempt ends and the client reports an error.

The number of retries when the first-handshake SYN gets no response is limited by this kernel parameter:

/proc/sys/net/ipv4/tcp_syn_retries

It sets the maximum number of retries of the first-handshake SYN when no SYN+ACK is received; the default is 6. You can change the count, but not the timing rule: the interval doubles each time, i.e., the first retry comes after 1 second, the second after 2 seconds, then 4 seconds, 8 seconds, and so on. So by default tcp_syn_retries waits at most 1+2+4+8+16+32 = 63 seconds. Another file specifies the retries of the second handshake:

/proc/sys/net/ipv4/tcp_synack_retries

It sets the maximum number of retries of the SYN+ACK when the final ACK is not received; the default is 5, so tcp_synack_retries waits at most 1+2+4+8+16 = 31 seconds.

From the test above we found that when the listen backlog queue is exhausted, new connections cannot complete the three-way handshake. Because the symptom resembles a synflood attack, the two are sometimes confused. In the tcp_conn_request() excerpt shown earlier we can see how the kernel reacts to synflood and to a full listen backlog respectively: when sk_acceptq_is_full(sk), the packet is dropped directly and the counter behind LINUX_MIB_LISTENOVERFLOWS is incremented. Looking up the counter mapping:

SNMP_MIB_ITEM("ListenOverflows", LINUX_MIB_LISTENOVERFLOWS)

That is the ListenOverflows count in /proc/net/netstat, which also corresponds to the "listen queue of a socket overflowed" line shown by netstat -s. From the code of sk_acceptq_is_full, which is essentially return sk->sk_ack_backlog > sk->sk_max_ack_backlog; we can also see why, with the listen backlog set to 5, the number of queued connections must exceed 5+1 before new connections start timing out: the current queue length must be strictly greater than the maximum before the check fires.

From the listen man page we know that the kernel-wide cap for the listen backlog is:

/proc/sys/net/core/somaxconn

For a listening socket, the effective limit is the minimum of somaxconn and the backlog argument. In general this value does not need tuning. When would our application fail to accept connections in time? In most cases, when system load is so high that there is no time left to accept new connections; in that situation the right fix is to add capacity, not to lengthen the queue.
In such cases we should sometimes even shrink the queue and reduce the client's SYN retry count, so that clients fail fast instead of piling up connections and causing an avalanche. Of course, some software with a poorly designed concurrency architecture can exhaust the queue even without high load pressure; there, the main thing to adjust is the software architecture or its other settings.

TFO - TCP Fast Open

From the code above we already know that the current Linux TCP stack supports TFO, short for TCP Fast Open. As the name suggests, its main purpose is to streamline the three-way handshake so that TCP opens faster on high-latency networks. So how does TFO work? We can first observe its behavior in a practical example. We start a web service on one server, then from another server use curl to access port 80 of the web server, fetch its / page, and exit. The capture shows a complete three-way handshake and four-way close, plus an HTTP data exchange. Of course, we have not yet enabled TFO, so the diagram of the connection process is just the standard one. We were lucky enough to observe a three-segment close of the connection in this capture, but that is not today's topic; the rest of the process is standard TCP behavior. Now let's turn TFO on and see what changes. Our web server runs nginx and the client uses curl; both support TFO out of the box. First, enable TFO support in the kernel:

/proc/sys/net/ipv4/tcp_fastopen

This file is the TFO switch: 0 means off; 1 enables client support, which is also the default; 2 enables server support; 3 enables both. Generally, when needed, we set it to 1 on the client machine and 2 on the server machine; for convenience, setting both to 3 is also fine. Then configure nginx on the server to enable TFO:
Find the listen directive in the nginx configuration file and add a fastopen parameter with a value, e.g. listen 80 fastopen=128; then restart the server. The client side is simpler: just use curl's --tcp-fastopen option. We run the command on the client while capturing packets on the server, once for a first request and again for a second request. We find that with TFO enabled, the overall TCP interaction is basically the same as before, except that the SYN of the first handshake carries an extra TFO cookie-request field, and in the second handshake the server responds with a cookie field (336fe8751f5cca4b in our capture). This is the main difference in the first TCP connection from a given client after TFO is enabled: the client's SYN carries an empty cookie field; if the server also supports TFO, it sees this empty cookie field, computes a TFO cookie, and replies with it. The cookie is used the next time the client establishes a TCP connection with the server: a SYN carrying a cookie may also carry application-layer data, so the subsequent three-way handshake is no longer just a handshake, it transports HTTP protocol data at the same time. The two interactions are shown below. Let's look at the other TFO-related kernel parameters:

/proc/sys/net/ipv4/tcp_fastopen

From the content above we already know what the values 1, 2 and 3 in this file mean. Beyond those, the file is a bitmap and can take additional flags; according to the kernel's ip-sysctl documentation these include 0x4 (client: send data in the opening SYN regardless of cookie availability), 0x200 (server: accept data-in-SYN without any cookie option present), and 0x400 (server: enable all listeners to support TFO without the TCP_FASTOPEN socket option).
It should also be added that, in general, besides turning on the kernel switches, the application must support TFO and make corresponding changes. The client needs to send data with sendmsg() or sendto(), adding the MSG_FASTOPEN flag to the flags parameter. The server needs to call setsockopt() with the TCP_FASTOPEN option after opening the socket to turn on TFO support (a minimal code sketch of both sides is given at the end of this article).

/proc/sys/net/ipv4/tcp_fastopen_key

This file holds the secret key the server uses to generate TFO cookies, written as four 8-digit hexadecimal words separated by dashes.

/proc/sys/net/ipv4/tcp_fastopen_blackhole_timeout_sec

Because TFO modifies the normal TCP three-way handshake, it is possible that, as the first SYN travels through the network to the server, some router or firewall rule treats this unusual SYN as abnormal traffic and drops it. We call this phenomenon a TFO firewall blackhole. The default mechanism is that when a firewall blackhole is detected, TFO is temporarily disabled; this file sets the length of that disable period. The default value is 3600, meaning that after a blackhole is detected, TFO stays off for 3600 seconds, and if a blackhole is found again after a disable period ends, the next period grows exponentially. A value of 0 disables the blackhole detection logic altogether.

And what does fastopen=128 in the nginx configuration mean? It limits the maximum length of the queue of connections that have not yet completed the three-way handshake on this port once fastopen is enabled. The limit can be set larger; for details see the nginx documentation: nginx.org/en/docs/ht...

The kernel code implementing TFO will not be walked through in detail here; with the background above you can find the TFO-related processing in the code paths already described and study it yourself. Whether servers should enable TFO has been debated for a long time: in complex network environments, TFO's real-world performance seems to fall somewhat short of the ideal. It is not that TFO itself is bad, but that networks often reject the relevant packets, so TFO simply never takes effect. For a more detailed description of TFO, see RFC 7413: tools.ietf.org/html/...

Other kernel parameters

During the three-way handshake there are several retry settings:

/proc/sys/net/ipv4/tcp_synack_retries

This sets the maximum number of retries after the second-handshake SYN+ACK is sent, if the final ACK is not received; the default is 5. Only the retry count can be set here; there is no setting for the retry interval, which doubles each time: the first retry after 1 second, the second after 2 seconds, then 4 seconds, 8 seconds, and so on. So by default tcp_syn_retries waits at most 63 seconds and tcp_synack_retries at most 31 seconds.
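To close, here is a minimal sketch of the TFO programming model described in the TFO section above. The MSG_FASTOPEN flag and TCP_FASTOPEN socket option are real Linux constants; the port, payload, and queue length are made up for illustration, and error handling is omitted for brevity:

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/tcp.h>   /* TCP_FASTOPEN */
#include <sys/socket.h>    /* MSG_FASTOPEN */

/* Server side: enable TFO on the listening socket. */
static int tfo_server(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = { .sin_family = AF_INET,
				    .sin_port = htons(8888),
				    .sin_addr.s_addr = htonl(INADDR_ANY) };
	int qlen = 128;  /* analogous to nginx's fastopen=128 */

	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	/* Turn on TFO for this listener; qlen bounds the queue of
	 * connections still in the data-carrying handshake stage. */
	setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen));
	listen(fd, 5);
	return fd;
}

/* Client side: send data in the SYN itself with MSG_FASTOPEN. */
static void tfo_client(const char *ip)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = { .sin_family = AF_INET,
				    .sin_port = htons(8888) };
	const char req[] = "GET / HTTP/1.0\r\n\r\n";

	inet_pton(AF_INET, ip, &addr.sin_addr);
	/* No connect() beforehand: sendto() with MSG_FASTOPEN performs
	 * the handshake, carrying the payload in the SYN when a TFO
	 * cookie for this server is already cached by the kernel. */
	sendto(fd, req, sizeof(req) - 1, MSG_FASTOPEN,
	       (struct sockaddr *)&addr, sizeof(addr));
	close(fd);
}

int main(void)
{
	(void)tfo_server;          /* run the server in one process... */
	tfo_client("127.0.0.1");   /* ...and the client in another */
	return 0;
}
```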