Preface

The previous article, "Whether it is hard or not, you decide! Nearly 40 illustrations to explain the TCP three-way handshake and four-way wave interview questions that are asked thousands of times", was well received by many readers. Thank you all for the recognition; it warms everyone's heart.
Here I come! Today I am here to illustrate TCP again. Xiaolin may be late, but he will never be absent. The main reason for the delay is that TCP is extremely complex: to guarantee reliability, it relies on a huge number of mechanisms. It really is a "great" protocol, but the more I wrote, the more complicated I found it. All the pictures in this article were drawn by Xiaolin; it was hard, tiring work. Without further ado, let's get straight to the main text. Go!

Text

I believe everyone knows that TCP is a reliable transport protocol, so how does it ensure reliability? To achieve reliable transmission, many problems need to be considered, such as data corruption, packet loss, duplication, and out-of-order segments. If these problems cannot be solved, there is no way to talk about reliable transmission. TCP achieves reliable transmission through mechanisms such as sequence numbers, acknowledgments, retransmission control, connection management, and window control. Today, we will focus on TCP's retransmission mechanism, sliding window, flow control, and congestion control.

Outline

1. Retransmission Mechanism

One of the ways TCP achieves reliable transmission is through sequence numbers and acknowledgments. In TCP, when data from the sender reaches the receiving host, the receiving host returns an acknowledgment to indicate that the data has been received.

Normal data transmission

But in a complex network, data transmission may not go as smoothly as shown above. What if the data is lost in transit? TCP uses a retransmission mechanism to solve the problem of packet loss. Next, let's look at the common retransmission mechanisms:
1. Timeout retransmission

One retransmission mechanism is to set a timer when sending data. If no ACK from the other party arrives within the specified time, the data is resent; this is what we commonly call timeout retransmission. TCP retransmits on timeout in the following two situations:
Two cases of timeout retransmission

(1) What should the timeout be set to?

Let's first understand what RTT (Round-Trip Time) is. From the following figure we can see:

RTT

RTT is the time it takes for data to travel from one end of the network to the other and back, that is, the round-trip time of a packet. The retransmission timeout is called RTO (Retransmission Timeout). What happens if the RTO is set too long or too short?

Longer vs. shorter timeouts

The figure above shows the two cases of a mis-set timeout:
Measuring the RTO accurately is very important; it makes the retransmission mechanism more efficient. From the two cases above, we know that the RTO should be slightly larger than the round-trip RTT of the packet.

RTO should be slightly larger than RTT

At this point, you may think calculating the RTO is not complicated: the sender records t0 when it sends a packet and t1 when the ACK arrives, so RTT = t1 - t0. It is not that simple. That is just one sample and cannot represent the general case. In fact, the round-trip RTT changes frequently, because the network itself changes frequently. Since the RTT fluctuates, the RTO should be a dynamically changing value. Let's look at how Linux calculates the RTO. To estimate the round-trip time, two values are usually sampled:
RFC 6298 recommends the following formulas to calculate the RTO:

SRTT = SRTT + α (RTT - SRTT)
DevRTT = (1 - β) × DevRTT + β × |RTT - SRTT|
RTO = μ × SRTT + ∂ × DevRTT

Here SRTT is the smoothed RTT, and DevRTT is the deviation between the smoothed RTT and the latest RTT sample. Under Linux, α = 0.125, β = 0.25, μ = 1, ∂ = 4. Don't ask how these values came about; they were simply obtained through extensive experiments.

If data that has already been retransmitted times out again, TCP's strategy is to double the timeout: each time a retransmission timer expires, the next timeout is set to twice the previous value. Two timeouts in a row indicate a poor network, so frequent retransmission is not advisable.

The problem with timeout-triggered retransmission is that the timeout period may be relatively long. Is there a faster way? Yes: the fast retransmit mechanism avoids waiting for the timer to expire.

2. Fast retransmit

TCP also has a fast retransmit mechanism, which is driven by data rather than by time. How does it work? It's actually very simple; a picture is worth a thousand words.

Fast retransmission mechanism

In the figure above, the sender sent five segments of data, 1 through 5:
Therefore, fast retransmit works like this: when three duplicate ACKs are received, the missing segment is retransmitted before the timer expires.

Fast retransmit solves the problem of waiting for the timeout, but it still faces another question: when retransmitting, should only the first missing segment be resent, or all segments after it? In the example above, should only Seq2 be retransmitted, or Seq2, Seq3, Seq4, and Seq5 all together? The sender does not know which segments triggered the three duplicate ACK 2s, so depending on the TCP implementation, either behavior is possible. It is a double-edged sword.

To solve the problem of not knowing which packets to retransmit, the SACK method was developed.

3. SACK method

Another way to implement retransmission is SACK (Selective Acknowledgment). This method adds a SACK field to the "Options" field of the TCP header, which sends a map of the received data back to the sender, so the sender knows which data has arrived and which has not. With this information, only the lost data needs to be retransmitted.

As shown in the figure below, the sender receives the same ACK three times, which triggers fast retransmission. Through the SACK information, it finds that only the segment 200~299 is missing, so only that TCP segment is retransmitted.

Selective Acknowledgment

For SACK to be used, both parties must support it. In Linux, it can be enabled via the net.ipv4.tcp_sack parameter (enabled by default since Linux 2.4).

4. Duplicate SACK

Duplicate SACK, also known as D-SACK, uses SACK to tell the sender which data has been received more than once. The following two examples illustrate the role of D-SACK.

Example 1: ACK packet loss

ACK packet loss
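In this first example, the data arrives but the ACK is lost, so the sender's timer expires and it retransmits data the receiver already has. The receiver can then report the duplicate range. The following toy sketch models that idea; all names are illustrative, and real D-SACK information travels in TCP header options, not return values. The cumulative ACK here is simplified to assume in-order delivery.

```python
# Toy receiver for the ACK-loss scenario: when a segment it already holds
# arrives again, it reports the duplicate (start, end) range -- the D-SACK idea.
def receive(received_ranges, seg):
    """seg is a (start, end) byte range; returns (ack, dsack_or_None)."""
    dsack = seg if seg in received_ranges else None  # duplicate => D-SACK range
    if dsack is None:
        received_ranges.append(seg)
    # Simplified cumulative ACK (assumes segments arrived in order).
    ack = max(end for (_, end) in received_ranges)
    return ack, dsack

got = []
assert receive(got, (3000, 3500)) == (3500, None)          # first copy: plain ACK
assert receive(got, (3000, 3500)) == (3500, (3000, 3500))  # retransmitted copy
```

The D-SACK in the second reply tells the sender that its data did arrive and only the ACK was lost, so there is no need to slow down as if the network had dropped data.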
Example 2: Network Delay Network delay
It can be seen that D-SACK has the following advantages:
In Linux, this feature can be enabled or disabled via the net.ipv4.tcp_dsack parameter (enabled by default since Linux 2.4).

2. Sliding Window

1. Reasons for introducing the window concept

We all know that TCP requires an acknowledgment for every packet sent: only when the previous packet is acknowledged is the next one sent. This mode is a bit like a face-to-face chat between you and me: you say a few words, then I say a few words. The drawback is that it is inefficient. If you finish a sentence while I am busy with something else and don't reply in time, you have to wait until I am done and have replied before you can say the next sentence. Clearly this does not scale.

Acknowledgment per packet

So this transmission mode has a disadvantage: the longer the round-trip time of a packet, the lower the communication efficiency. To solve this, TCP introduced the concept of a window, so that efficiency does not drop even when the round-trip time is long.

With a window, you can specify the window size: the maximum amount of data that can be sent without waiting for an acknowledgment. The window is actually a buffer allocated by the operating system. The sending host must keep sent data in the buffer until the acknowledgment returns; once the acknowledgment arrives on time, the data can be removed from the buffer.

Assuming the window size is 3 TCP segments, the sender can send 3 segments back to back, and if an ACK is lost in the middle, a later acknowledgment can cover it. As shown in the following figure:

Parallel processing using sliding windows

It doesn't matter if the ACK 600 message is lost, because the next ACK can be used for confirmation.
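The idea that a later acknowledgment covers an earlier lost one can be sketched in a few lines. This is an illustration only, with segment boundaries taken from the example above; the function name is made up.

```python
# Cumulative acknowledgment sketch: the highest ACK number received confirms
# all data before it, so a lost intermediate ACK does no harm.
def unacked_segments(sent, acks):
    """sent: list of (start, end) byte ranges; acks: cumulative ACKs seen."""
    highest = max(acks, default=0)
    return [seg for seg in sent if seg[1] > highest]

sent = [(400, 500), (500, 600), (600, 700)]   # three in-flight segments
# ACK 600 was lost in transit, but ACK 500 and ACK 700 arrived:
assert unacked_segments(sent, [500, 700]) == []   # everything is confirmed
```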
As long as the sender receives ACK 700, it means the receiver has received all data before 700. This mode is called cumulative acknowledgment.

(1) Who determines the window size?

There is a field in the TCP header called Window, which is the window size. The receiver uses this field to tell the sender how much buffer space it still has available to receive data. The sender can then send data according to the receiver's processing capacity, without overwhelming it. Therefore, the window size is usually determined by the receiver. The amount of data the sender transmits must not exceed the receiver's window, otherwise the receiver will not be able to accept it.

(2) Sender's sliding window

Let's first look at the sender's window. The following figure shows the data buffered by the sender, divided into four parts according to its status. The dark blue box is the send window, and the purple box is the available window:
In the figure below, when the sender has sent all the data at once, the available window size becomes 0, indicating that the available window is exhausted and no more data can be sent until an ACK arrives.

Available window exhausted

In the figure below, after the ACK for the previously sent bytes 32~36 is received, if the send window size has not changed, the sliding window moves 5 bytes to the right, because 5 bytes have been acknowledged. Bytes 52~56 then become part of the available window again, so bytes 52~56 can be sent next.

32 ~ 36 bytes confirmed

(3) How does the program represent the four parts of the sender?

The TCP sliding window scheme uses three pointers to track the bytes in the four categories. Two are absolute pointers (referring to a specific sequence number) and one is a relative pointer (an offset).

SND.WND, SND.UNA, SND.NXT
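The pointers from the figure can be modeled as plain integers to replay the 32~36 example. This is a minimal sketch under simplified assumptions, not a real TCP implementation; the class and method names are made up.

```python
# Sender-side pointers: SND.UNA, SND.NXT, SND.WND, as in the figure.
class SendWindow:
    def __init__(self, snd_una, snd_nxt, snd_wnd):
        self.snd_una = snd_una  # first sent-but-unacknowledged byte
        self.snd_nxt = snd_nxt  # next byte to send
        self.snd_wnd = snd_wnd  # window size advertised by the receiver

    def usable(self):
        # Bytes that may still be sent without waiting for an ACK.
        return self.snd_wnd - (self.snd_nxt - self.snd_una)

    def on_ack(self, ack):
        # A cumulative ACK slides the left edge of the window up to `ack`.
        if ack > self.snd_una:
            self.snd_una = ack

# Numbers from the example: a 20-byte window starting at 32, everything sent.
w = SendWindow(snd_una=32, snd_nxt=52, snd_wnd=20)
assert w.usable() == 0   # available window exhausted
w.on_ack(37)             # bytes 32~36 acknowledged: window slides 5 to the right
assert w.usable() == 5   # bytes 52~56 become sendable again
```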
Then the calculation of the available window size can be: Available window size = SND.WND - (SND.NXT - SND.UNA) (4) Receiver’s Sliding Window Next, let's look at the receiver's window. The receiver window is relatively simple and is divided into three parts based on the processing situation:
Receive Window The three receiving parts are divided using two pointers:
(5) Are the receive window and the send window the same size?

Not exactly. The receive window is approximately equal to the send window, because the sliding window is not static. For example, when the receiving application reads data very quickly, the receive window empties quickly, and the new receive window size is told to the sender through the Window field in the TCP header. Since there is a delay in this notification, the receive window and the send window are only approximately equal.

3. Flow Control

The sender cannot blindly push data to the receiver; the receiver's processing capacity must be considered. If you keep sending data that the other side cannot process, the retransmission mechanism will be triggered, wasting network bandwidth. To solve this, TCP provides a mechanism that lets the sender control the amount of data it sends according to the receiver's actual capacity. This is called flow control.

For simplicity, let's assume the following scenario:
Flow Control According to the flow control in the figure above, each process is described below:
1. Relationship between the operating system buffer and the sliding window

In the previous flow control example, we assumed the send and receive windows themselves do not change. In reality, the bytes in the send and receive windows live in operating system memory buffers, and those buffers are adjusted by the operating system. When the application process cannot read the buffer contents in time, the windows are affected too.

(1) How does the operating system buffer affect the send window and receive window?

Let's look at the first example: how the send and receive windows change when the application does not read the buffered data in time. Consider the following scenario:
According to the flow control in the figure above, each process is described below:
It can be seen that the window eventually shrinks to 0, that is, the window is closed. When the sender's available window becomes 0, the sender actually sends a window probe message at regular intervals to learn whether the receiver's window has changed. This will be discussed later, so I only mention it briefly here.

Now the second example. When the server's system resources are very tight, the operating system may directly reduce the size of the receive buffer. If the application then fails to read the buffered data in time, something serious happens: packets are lost. Describing each step:
Therefore, if the buffer is reduced first and the window shrunk afterwards, packet loss occurs. To prevent this, TCP does not allow shrinking the buffer and the window at the same time. Instead, the window is shrunk first, and the buffer is reduced after a while, thus avoiding packet loss.

2. Window closed

As we saw earlier, TCP performs flow control by letting the receiver indicate how much data it is willing to receive (the window size). If the window size is 0, the sender stops passing data to the receiver until the window becomes non-zero; this is what window closed means.

(1) Potential danger of closing the window

The receiver advertises its window size to the sender in ACK messages. When the window has been closed and the receiver later finishes processing its data, it notifies the sender with an ACK carrying a non-zero window. If that ACK is lost in the network, there is big trouble.

Window closing potential danger

The sender keeps waiting for the receiver's non-zero window notification, and the receiver keeps waiting for the sender's data. If nothing is done, this mutual waiting becomes a deadlock.

(2) How does TCP solve the potential deadlock when the window is closed?

To solve this, TCP sets a persistence timer for each connection. As soon as one side of the connection receives a zero-window notification from the other side, it starts the persistence timer. When the timer expires, it sends a window probe message, and the other side replies to the probe with its current receive window size.

Window detection
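The probe loop above can be sketched as follows. The receiver is simulated by the list of window sizes it would advertise in reply to each probe; the function name and return convention are made up for illustration.

```python
# Sketch of the zero-window deadlock fix: keep probing until the peer reports
# a non-zero window, or give up after max_probes attempts.
def probe_until_open(advertised_windows, max_probes=3):
    """Returns (probes_sent, final_window); a None window means we gave up
    (some TCP stacks then send an RST to terminate the connection)."""
    probes = 0
    for wnd in advertised_windows:
        if probes == max_probes:
            break
        probes += 1
        if wnd > 0:
            return probes, wnd   # window reopened: resume sending
    return probes, None

assert probe_until_open([0, 0, 500]) == (3, 500)  # reopened on the third probe
assert probe_until_open([0, 0, 0]) == (3, None)   # still closed: give up
```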
The number of window probes is generally 3, each about 30-60 seconds apart (implementations may differ). If the receive window is still 0 after 3 probes, some TCP implementations send an RST message to terminate the connection.

3. Silly Window Syndrome

If the receiver is too busy to take data out of the receive window, the sender's send window will become smaller and smaller. Eventually, if the receiver frees up just a few bytes and advertises a window of those few bytes, the sender will send those few bytes without hesitation. This is the silly window syndrome.

Remember, the TCP + IP headers take 40 bytes. Incurring that much overhead to transmit just a few bytes of data is not economical. It is like a bus that can carry 50 people leaving every time one or two passengers board. Only a bus driver with a mine at home would dare to do that; otherwise he would go bankrupt sooner or later. The solution is not hard: the driver waits until more than 25 passengers have boarded before deciding the bus can leave.

As an example of silly window syndrome, consider the following scenario: the receiver's window is 360 bytes, but the receiver is stuck for some reason. Assume the receiver's application layer has the following reading capability: for every 3 bytes the receiver receives, the application can only read 1 byte from the buffer;
Silly Window Syndrome

The changes in the window size at each step are shown clearly in the figure. The window keeps shrinking, and the data sent each time is quite small. The silly window syndrome can occur on both the sender and the receiver:
So, to solve the silly window syndrome, we just need to solve these two problems.
(1) How can the receiver avoid advertising a small window?

The receiver's usual strategy is: when the window size is less than min(MSS, buffer space / 2), that is, less than the smaller of the MSS and half the buffer size, advertise a window of 0 to the sender, which stops the sender from sending more data. Once the receiver has processed some data and the window size is >= MSS, or at least half of the receive buffer is free, the window can be reopened to let the sender send data again.

(2) How can the sender avoid sending small amounts of data?

The sender's usual strategy is to use the Nagle algorithm, whose idea is delayed processing: data may be sent only when one of the following two conditions is met:
As long as neither condition is met, the sender keeps accumulating data until one of the sending conditions is satisfied.

The Nagle algorithm is on by default. For programs that exchange small packets interactively, such as telnet or ssh, the Nagle algorithm needs to be turned off. You can set the TCP_NODELAY option on the socket to disable it (there is no global parameter for disabling Nagle; it must be disabled per application according to its characteristics).
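Setting TCP_NODELAY looks like this in Python's standard socket API. The snippet only creates a socket and sets the option; no connection is made.

```python
import socket

# Disable the Nagle algorithm on a per-socket basis with TCP_NODELAY.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Verify the option took effect on this socket.
assert s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
s.close()
```

Other languages expose the same option through their socket APIs, since TCP_NODELAY is a standard socket-level knob rather than a system-wide setting.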
4. Congestion Control 1. Why do we need congestion control? Isn't there flow control? The previous flow control is to prevent the "sender"'s data from filling up the "receiver"'s cache, but it does not know what happened in the network. Generally speaking, computer networks are in a shared environment, so it is possible that the network will be congested due to communications between other hosts. When the network is congested, if a large number of data packets continue to be sent, it may cause data packet delays and loss. At this time, TCP will retransmit the data, but retransmission will cause a heavier burden on the network, which will lead to greater delays and more packet loss. This situation will enter a vicious cycle and be continuously amplified... Therefore, TCP cannot ignore what happens on the network. It is designed as a selfless protocol. When the network is congested, TCP will sacrifice itself and reduce the amount of data sent. So, there is congestion control, the purpose of which is to prevent the "sender"'s data from filling up the entire network. In order to regulate the amount of data to be sent on the "sender side", a concept called "congestion window" is defined. 2. What is the congestion window? How does it relate to the send window? The congestion window cwnd is a state variable maintained by the sender, which changes dynamically according to the degree of network congestion. We mentioned earlier that the sending window swnd and the receiving window rwnd are approximately equal. Now, after the concept of congestion window is introduced, the value of the sending window is swnd = min(cwnd, rwnd), which is the minimum value of the congestion window and the receiving window. The rules for changing the congestion window cwnd are as follows:
3. So how do you know if the current network is congested? In fact, as long as the "sender" does not receive the ACK response message within the specified time, that is, a timeout retransmission occurs, it will be considered that the network is congested. 4. What are the control algorithms for congestion control? Congestion control mainly consists of four algorithms:
(1) Slow start After TCP has just established a connection, it first has a slow start process. This slow start means increasing the number of data packets sent little by little. If a large amount of data is sent right away, wouldn't this cause congestion to the network? The slow start algorithm only requires one rule to be remembered: every time the sender receives an ACK, the size of the congestion window cwnd increases by 1. Here it is assumed that the congestion window cwnd and the send window swnd are equal. Here is an example:
Slow start algorithm It can be seen that with the slow start algorithm, the number of packets sent increases exponentially. So when will the slow start increase end? There is a state variable called ssthresh (slow start threshold).
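The doubling behavior can be checked with a toy simulation. This is a simplified model in MSS units, not kernel code: real TCP adds 1 to cwnd per ACK, which amounts to doubling once per round trip, and hands over to congestion avoidance at ssthresh (here simply modeled as a cap).

```python
# Toy model of slow start: cwnd (in MSS units) doubles every round trip
# until it reaches ssthresh, where congestion avoidance takes over.
def slow_start(rounds, ssthresh, cwnd=1):
    history = [cwnd]
    for _ in range(rounds):
        acks = cwnd                         # one ACK per segment sent this round
        cwnd = min(cwnd + acks, ssthresh)   # +1 per ACK, capped at ssthresh
        history.append(cwnd)
    return history

assert slow_start(4, ssthresh=8) == [1, 2, 4, 8, 8]   # exponential, then capped
```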
(2) Congestion Avoidance Algorithm As mentioned earlier, when the congestion window cwnd "exceeds" the slow start threshold ssthresh, the congestion avoidance algorithm will be entered. Generally speaking, the size of ssthresh is 65535 bytes. Then after entering the congestion avoidance algorithm, its rule is: every time an ACK is received, cwnd increases by 1/cwnd. Continuing with the previous slow start example, let's assume that ssthresh is 8: When 8 ACK acknowledgements arrive, each confirmation increases by 1/8, and the cwnd of 8 ACK confirmations increases by 1 in total. Therefore, 9 MSS-sized data can be sent this time, which becomes a linear growth. Congestion Avoidance Therefore, we can find that the congestion avoidance algorithm changes the exponential growth of the original slow start algorithm into linear growth. It is still in the growth stage, but the growth rate is slower. As the traffic keeps growing, the network will gradually become congested, causing packet loss. At this time, the lost data packets will need to be retransmitted. When the retransmission mechanism is triggered, the "congestion occurrence algorithm" is entered. (3) Congestion occurs When the network is congested, data packets will be retransmitted. There are two main retransmission mechanisms:
The congestion occurrence algorithms used in these two cases are different; we discuss them separately below.

a. Congestion occurrence algorithm for timeout retransmission

When a timeout retransmission occurs, the congestion occurrence algorithm is used. At this point, the values of ssthresh and cwnd change:
Congestion occurrence - timeout retransmission

Then slow start is restarted, which suddenly reduces the data flow. Once a timeout retransmission occurs, it really is like going back to the pre-liberation era overnight. However, this reaction is too drastic, and it can make the network stutter. It is like drifting at high speed on Mount Akina and suddenly slamming on the brakes: can the tires take it?

b. Congestion occurrence algorithm for fast retransmit

There is a better way. We covered the fast retransmit algorithm earlier: when the receiver finds that an intermediate packet is lost, it sends the ACK of the last in-order packet three times, and the sender retransmits quickly without waiting for a timeout. TCP considers this situation less serious, because most packets got through and only a small part was lost. The changes to ssthresh and cwnd are then as follows:
(4) Quick recovery Fast retransmit and fast recovery algorithms are usually used at the same time. The fast recovery algorithm believes that if you can still receive 3 duplicate ACKs, it means that the network is not that bad, so there is no need to be as strong as RTO timeout. As mentioned before, before entering fast recovery, cwnd and ssthresh have been updated:
Then, enter the fast recovery algorithm as follows:
Fast retransmit and fast recovery

In other words, the situation does not go back to the pre-liberation era overnight as it does with timeout retransmission; instead, cwnd stays at a relatively high value and then grows linearly.
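The two congestion reactions described above can be summarized in a small sketch. The values follow the classic Reno-style behavior (cwnd in MSS units); exact constants vary between implementations, and the function names are illustrative.

```python
# Sketch of the two congestion reactions: each returns (new_cwnd, new_ssthresh).
def on_timeout(cwnd, ssthresh):
    # Timeout retransmission: halve the threshold and restart slow start
    # from cwnd = 1 -- the drastic "back to pre-liberation" reaction.
    return 1, max(cwnd // 2, 2)

def on_triple_dup_ack(cwnd, ssthresh):
    # Fast retransmit + fast recovery: halve the threshold, then set cwnd
    # to ssthresh + 3 (the 3 duplicate ACKs mean 3 segments left the network).
    new_ssthresh = max(cwnd // 2, 2)
    return new_ssthresh + 3, new_ssthresh

assert on_timeout(10, 16) == (1, 5)          # collapse to 1, threshold halved
assert on_triple_dup_ack(10, 16) == (8, 5)   # stays high, then grows linearly
```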