TCP Things 1: TCP Protocol, Algorithm and Principle

TCP is a very complex protocol because it has to solve many problems, and those problems in turn spawn many sub-problems and dark corners. Learning TCP is therefore a painful process, but one that pays off handsomely. For the details of the protocol, I still recommend reading W. Richard Stevens's "TCP/IP Illustrated, Volume 1: The Protocols" (and, of course, RFC 793 and the many later RFCs). In addition, I will use the English terms in this article, so that you can find the relevant technical documents through these keywords.

I want to write this article for three reasons:

  • One is to exercise my ability to describe the complex TCP protocol clearly in a simple article.
  • Another is that many programmers today don't read that book seriously and prefer fast-food culture. So I hope this fast-food article can help you understand TCP, a classic piece of technology, appreciate the difficulties in software design, and pick up some software design lessons along the way.
  • Most importantly, I hope these basics can help you clear up many things that only seemed to make sense before, and realize how important the fundamentals are.

Therefore, this article will not cover everything, but will only provide a general introduction to the TCP protocol, algorithms, and principles.

Without further ado, first of all we need to know where things sit: in the seven-layer OSI model, TCP is at the fourth layer, the Transport layer; IP is at the third layer, the Network layer; and ARP is at the second layer, the Data Link layer. Data at the second layer is called a Frame, at the third layer a Packet, and at the fourth layer a Segment.

Our application's data is first encapsulated into a TCP Segment, the TCP Segment is then encapsulated into an IP Packet, and the IP Packet into an Ethernet Frame. After transmission to the other end, each layer parses its own protocol and hands the payload up to the protocol above it.

TCP header format

Next, let's take a look at the format of the TCP header:

TCP header format (Image source)

You need to pay attention to the following points:

  • TCP packets carry no IP addresses; that is the IP layer's business. But they do carry source and destination ports.
  • A TCP connection is identified by a four-tuple (src_ip, src_port, dst_ip, dst_port), or strictly a five-tuple if you add the protocol. Since we are only talking about TCP here, the four-tuple is enough.
  • Note four very important fields in the picture above:
  • Sequence Number is the sequence number of the segment, used to solve the problem of packets being reordered by the network.
  • Acknowledgement Number is the ACK, used to confirm receipt and to solve the problem of packet loss.
  • Window, also called the Advertised Window, is the famous sliding window, used for flow control.
  • TCP Flags indicate the type of the segment and are mainly used to drive the TCP state machine.
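
To make the header layout concrete, here is a small sketch (mine, not from the original article) that packs and unpacks the fixed 20-byte TCP header with Python's struct module. The field names follow the diagram above; the checksum is left at zero purely for illustration.

    import struct

    # Fixed 20-byte TCP header: src port, dst port, seq, ack,
    # data offset + flags, window, checksum, urgent pointer.
    TCP_HDR = struct.Struct("!HHIIHHHH")

    def pack_tcp_header(src_port, dst_port, seq, ack, flags, window):
        # Data offset = 5 (no options, 5 * 4 = 20 bytes) lives in the top 4 bits;
        # the flag bits (FIN/SYN/RST/PSH/ACK/URG/...) live in the low bits.
        offset_flags = (5 << 12) | (flags & 0x01FF)
        # A real stack computes the checksum over a pseudo-header; 0 here.
        return TCP_HDR.pack(src_port, dst_port, seq, ack,
                            offset_flags, window, 0, 0)

    def unpack_tcp_header(data):
        src, dst, seq, ack, offset_flags, window, _cksum, _urg = TCP_HDR.unpack(data[:20])
        return {"src_port": src, "dst_port": dst, "seq": seq, "ack": ack,
                "data_offset": offset_flags >> 12,   # header length in 32-bit words
                "flags": offset_flags & 0x01FF,
                "window": window}

    # Example: a SYN segment (flag bit 0x02) with ISN 1000 and a 65535-byte window.
    syn = pack_tcp_header(12345, 80, seq=1000, ack=0, flags=0x02, window=65535)
    print(unpack_tcp_header(syn))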

For other things, please refer to the following diagram:

(Image source)

TCP state machine

In fact, transmission on the Internet is connectionless, and that includes TCP. TCP's so-called "connection" is really just state maintained at the two communicating endpoints that makes it look as if there were a connection. That is why TCP's state transitions are so important.

Below are the "TCP protocol state machine" (picture source) and, side by side, the diagrams for "TCP connection establishment", "TCP connection teardown" and "data transmission", so that you can compare them. These two pictures are very, very important; you must remember them. (Aside: seeing such a complex state machine tells you how complex this protocol is, and complex things always hide lots of traps, so TCP is in fact quite a tricky protocol.)

Many people ask: why does establishing a connection take a three-way handshake, while closing one takes four steps?

  • The three-way handshake for connection establishment mainly exists to initialize the Sequence Number. The two communicating parties must tell each other their Initial Sequence Number (ISN for short) - hence the name SYN, which stands for Synchronize Sequence Numbers. These are the x and y in the figure above. This number is the base for the sequence numbers of the subsequent data exchange, so that the application layer does not receive data out of order because of reordering on the network (TCP uses the sequence number to splice the data back together).
  • For the teardown, if you look closely it is really two exchanges: because TCP is full-duplex, each side must send its own FIN and receive an ACK for it. Since one side usually closes passively, it looks like four steps. If both sides close at the same time, they both enter the CLOSING state and then reach TIME_WAIT. The following figure shows a simultaneous close (you can also trace it on the TCP state machine):

Both ends disconnected at the same time (Image source)
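
To relate the diagrams above to application code, here is a minimal sketch (my own; host and port are placeholders) using Python sockets. connect() is where the three-way handshake happens, and close() starts the teardown; the side that closes first is the one that ends up in TIME_WAIT.

    import socket

    # connect() returns once the three-way handshake (SYN, SYN-ACK, ACK) completes;
    # at that point the kernel has the connection in the ESTABLISHED state.
    s = socket.create_connection(("example.com", 80), timeout=5)
    print("connected:", s.getsockname(), "->", s.getpeername())

    # close() sends our FIN and starts the four-step teardown. Because this side
    # closes first (the active close), it is the one that will linger in
    # TIME_WAIT for 2*MSL afterwards (observable with `ss -tan` on Linux).
    s.close()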

In addition, there are a few things to note:

  • About the SYN timeout during connection establishment. Imagine the server receives the client's SYN and replies with a SYN-ACK, but the client then drops offline, so the server never receives the client's ACK; the connection is stuck in a half-open state, neither established nor failed. The server therefore retransmits the SYN-ACK if no ACK arrives within a certain time. Under Linux the default number of retries is 5, and the retry interval starts at 1s and doubles each time, so the five retries happen after 1s, 2s, 4s, 8s and 16s, 31s in total. After the fifth retry it still has to wait another 32s to decide that the fifth attempt has also timed out. So it takes 1s + 2s + 4s + 8s + 16s + 32s = 2^6 - 1 = 63s in total before TCP gives up on the connection (the sketch after this list reads the relevant kernel knobs and reproduces this arithmetic).
  • About SYN Flood. Attackers exploit exactly this: send a SYN to the server and then disappear, so the server has to hold the half-open connection for 63 seconds by default before dropping it. Done at scale, this lets a hacker exhaust the server's SYN queue so that normal connection requests can no longer be handled. Linux therefore provides a parameter called tcp_syncookies: when the SYN queue is full, TCP builds a special Sequence Number from the source address and port, the destination address and port, and a timestamp, and sends it back (the so-called cookie). An attacker will never respond; a legitimate client will echo the cookie back, and the server can then establish the connection from the cookie alone, even though the request is no longer in the SYN queue. Please note: do not use tcp_syncookies to handle normal, heavily loaded services, because syncookies are a degraded, not-quite-rigorous version of TCP. For legitimate load you have three TCP parameters to tune instead: tcp_synack_retries, to reduce the number of retries; tcp_max_syn_backlog, to enlarge the SYN queue; and tcp_abort_on_overflow, to simply reject connections that cannot be handled.
  • About ISN initialization. The ISN cannot be a hard-coded constant, otherwise there will be problems. For example, suppose 1 were always used as the ISN: the client sends 30 segments, the network drops, the client reconnects and again uses 1 as the ISN, and then packets from the previous connection arrive and are taken as packets of the new connection - the client may be at sequence number 3 while the server believes the client is at 30, and everything is messed up. RFC 793 says the ISN is bound to a (possibly fake) clock that increments the ISN every 4 microseconds, wrapping past 2^32 back to 0, so one ISN cycle is about 4.55 hours. Since we assume a TCP segment does not survive on the network longer than the Maximum Segment Lifetime (MSL for short - see the Wikipedia entry), as long as MSL is smaller than 4.55 hours an old ISN will not be reused.
  • About MSL and TIME_WAIT. From the ISN description above you can see where MSL comes from. Notice that in the TCP state diagram the transition from TIME_WAIT to CLOSED has a timeout of 2*MSL (RFC 793 defines MSL as 2 minutes; Linux uses 30s). Why have TIME_WAIT at all instead of going straight to CLOSED? Two main reasons: 1) TIME_WAIT gives the other end enough time to receive the final ACK; if the passively closing side does not receive it, it will resend its FIN, and that round trip is exactly 2 MSL. 2) It leaves enough time to keep this connection from getting mixed up with a later one (remember that some wilful routers cache IP packets; if the four-tuple is reused, delayed old packets could land in the new connection). You can read the article "TIME_WAIT and its design implications for protocols and scalable client server systems".
  • About having too many TIME_WAIT sockets. From the description above we know TIME_WAIT is an important state, but with many short-lived connections under high concurrency you end up with a huge number of them, which also eats system resources. Search the web and most of the advice is to set two parameters, tcp_tw_reuse and tcp_tw_recycle, both of which default to off. The latter, recycle, is more aggressive; reuse is the gentler of the two. Also, tcp_tw_reuse only works if tcp_timestamps=1 is set. Be very careful: turning these on is a big trap - it can cause strange TCP connection problems, because, as described above, reusing a connection without waiting out the timeout may prevent the new connection from being established at all. As the official documentation says, "It should not be changed without advice/request of technical experts".
  • About tcp_tw_reuse. The official documentation says that tcp_tw_reuse together with tcp_timestamps (which enables PAWS, Protection Against Wrapped Sequence numbers) is safe from the protocol's point of view, but it requires tcp_timestamps to be on at both ends (you can read the source of tcp_twsk_unique). Personally I think there will still be problems in some scenarios.
  • About tcp_tw_recycle. If tcp_tw_recycle is turned on, the kernel assumes the peer also has tcp_timestamps enabled and compares timestamps: a connection may be reused only if the timestamp grows. However, if the peer sits behind NAT (for example a whole company going out through one public IP) or the peer's IP has been taken over by another machine, things get complicated: the SYN that tries to establish the connection may simply be dropped (you are likely to see connection-timed-out errors). If you want to look at the Linux kernel code, see tcp_timewait_state_process.
  • About tcp_max_tw_buckets. This caps the number of concurrent TIME_WAIT sockets; the default is 180000. If the limit is exceeded, the system destroys the excess and logs a warning (such as: time wait bucket table overflow). The official documentation says this parameter exists to resist DDoS and that the default of 180000 is not small; you still have to judge it against your actual situation.
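
For reference, the sketch below (mine, Linux-specific) reads the knobs discussed above from /proc/sys/net/ipv4 and reproduces the 1+2+4+8+16+32 = 63s arithmetic from the SYN-ACK retry count. Paths and defaults vary between kernel versions (tcp_tw_recycle, for instance, no longer exists on newer kernels), so treat it as illustrative.

    from pathlib import Path

    PARAMS = [
        "tcp_synack_retries",   # how many times the server resends the SYN-ACK
        "tcp_syncookies",       # SYN-cookie fallback when the SYN queue overflows
        "tcp_max_syn_backlog",  # size of the SYN (half-open) queue
        "tcp_abort_on_overflow",
        "tcp_tw_reuse",
        "tcp_tw_recycle",       # removed on newer kernels; shows "n/a" there
        "tcp_max_tw_buckets",
        "tcp_timestamps",
    ]

    def read_sysctl(name):
        p = Path("/proc/sys/net/ipv4") / name
        return p.read_text().strip() if p.exists() else "n/a"

    for name in PARAMS:
        print(f"{name:22s} = {read_sysctl(name)}")

    # The SYN-ACK retry interval starts at 1s and doubles each time; with the
    # default of 5 retries that is 1+2+4+8+16 = 31s of retrying plus a final
    # 32s wait to declare the last retry dead, i.e. 63s in total.
    def synack_give_up_seconds(retries=5, first_interval=1):
        return sum(first_interval * 2 ** i for i in range(retries + 1))

    print("give up after ~%ds" % synack_give_up_seconds())   # 63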

Again, using tcp_tw_reuse and tcp_tw_recycle to solve the TIME_WAIT problem is very, very dangerous, because these two parameters violate the TCP protocol (RFC 1122).

In fact, TIME_WAIT shows up on the side that actively closed the connection, so this is a case of "if you don't ask for trouble, trouble won't find you". Think about it: if you let the other end close first, the problem lands on the other end instead. Also, if your server is an HTTP server, it matters a lot to enable HTTP keep-alive (so the browser reuses one TCP connection for multiple HTTP requests) and to let the client be the one that closes the connection (but be careful: browsers can be quite greedy and will not close the connection unless they have to).

Sequence Number in Data Transmission

Below is a Wireshark capture of the data transfer from a visit to coolshell.cn, to show you how SeqNum changes. (Use Statistics -> Flow Graph... in the Wireshark menu.)

You can see that SeqNum grows by the number of bytes transmitted. In the figure above, after the three-way handshake two packets with Len: 1440 arrive, and the SeqNum of the second one is 1441. The first ACK sent back is 1441, meaning the first 1440 bytes have been received.

Note: if you capture the three-way handshake with Wireshark, you will see the SeqNum apparently starting at 0. It does not really; to make the display friendlier, Wireshark shows a Relative SeqNum. Turn that off in the protocol preferences (right-click menu) and you will see the absolute SeqNum.
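
The bookkeeping is simply "next SeqNum = previous SeqNum + bytes carried", which these toy lines illustrate with the numbers from the capture above:

    # SeqNum counts bytes: each segment advances it by its payload length.
    seq = 1                        # first data byte after the handshake (relative)
    for payload_len in (1440, 1440):
        print(f"segment seq={seq}, len={payload_len}")
        seq += payload_len         # 1 -> 1441 -> 2881
    print("receiver ACKs", 1 + 1440)   # ack 1441: "I have everything before 1441"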

TCP retransmission mechanism

TCP must ensure that all data packets can arrive, so a retransmission mechanism is necessary.

Note that the receiver's ACK only acknowledges the last contiguous packet. For example, the sender sends five packets, 1, 2, 3, 4 and 5. The receiver gets 1 and 2, so it replies ack 3; then it receives 4 (note that 3 has not arrived). What does TCP do now? As mentioned earlier, SeqNum and Ack count bytes, so the receiver cannot acknowledge past a gap: it can only acknowledge the largest contiguous run it has received, otherwise the sender would believe everything before that point had arrived.
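
Here is a minimal sketch of that receiver-side rule (my own illustration, numbering whole packets instead of bytes for brevity):

    def cumulative_ack(expected, received, segment):
        """Return the ACK the receiver will send after getting `segment`."""
        received.add(segment)
        # The ACK may only advance over a contiguous prefix: skipping ahead
        # would tell the sender that everything before the ACK had arrived.
        while expected in received:
            expected += 1
        return expected

    received, expected = set(), 1
    for seg in (1, 2, 4, 5):                   # packet 3 is lost
        expected = cumulative_ack(expected, received, seg)
        print(f"got {seg}, reply ack={expected}")
    # got 1 -> ack=2, got 2 -> ack=3, got 4 -> ack=3, got 5 -> ack=3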

Timeout retransmission mechanism

One approach is simply not to ACK and to wait for 3. When the sender's timer expires without an ACK for 3 having arrived, it retransmits 3. Once the receiver gets 3, it replies ack 5, meaning that 3 and 4 have both been received.

This approach has a fairly serious problem, though: because everything waits for 3, even though 4 and 5 have already arrived, the sender has no idea what happened to them. Having received no ACK for them, the sender may pessimistically conclude that they too were lost, which can cause 4 and 5 to be retransmitted as well.

There are two options for this:

  • One is to retransmit only the packet that timed out, i.e. packet #3.
  • The other is to retransmit everything from the timeout onwards, i.e. packets #3, #4 and #5.

Each approach has its pros and cons: the first saves bandwidth but is slow; the second repairs things faster but wastes bandwidth and may retransmit data that had in fact arrived. Neither is great, though, because both sit and wait for the timeout, and the timeout can be long (the next article will explain how TCP computes the timeout dynamically).
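
The difference between the two options can be written down directly; a toy sketch (mine), at the moment packet 3 times out while 3, 4 and 5 are still unacknowledged:

    in_flight = [3, 4, 5]     # unacknowledged packets when the timer fires
    timed_out = 3

    # Option 1: retransmit only what timed out - frugal, but slow if more
    # than one packet was actually lost.
    only_lost = [timed_out]

    # Option 2: retransmit everything from the timed-out packet onwards -
    # repairs faster, but 4 and 5 may already be at the receiver.
    everything_after = [p for p in in_flight if p >= timed_out]

    print("option 1 retransmits:", only_lost)          # [3]
    print("option 2 retransmits:", everything_after)   # [3, 4, 5]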

Fast retransmission mechanism

Therefore, TCP introduced an algorithm called Fast Retransmit, which is driven by data rather than by time: if segments stop arriving in order, the receiver keeps ACKing the last in-order position, i.e. the point where data may have been lost, and when the sender sees the same ACK three times in a row it retransmits immediately. The advantage of Fast Retransmit is that you do not have to wait for the timeout.

For example: the sender transmits 1, 2, 3, 4 and 5. Packet 1 arrives first, so the receiver acks 2. Packet 2 is lost somewhere, and 3 arrives, so the receiver acks 2 again. Then 4 and 5 arrive, but the receiver still acks 2, because 2 has never shown up. The sender thus receives three identical acks of 2, knows that 2 did not arrive, and retransmits it immediately. The receiver then gets 2, and since 3, 4 and 5 are already there, it acks 6. The schematic diagram is as follows:

Fast Retransmit only solves part of the problem: the waiting for the timeout. It still faces the same hard choice of whether to retransmit just the one packet or everything after it. For the example above, should we retransmit #2, or #2, #3, #4 and #5? The sender does not know which segments triggered those three duplicate acks of 2: perhaps it has 20 segments outstanding and the duplicates were generated by the arrival of #6, #10 and #20. So the sender may well end up retransmitting everything from 2 to 20 (some TCP implementations actually do this). Clearly, this is a double-edged sword.
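
A sketch of the sender-side trigger (my own simplification; a real stack tracks this per connection and ties it to the congestion window):

    def fast_retransmit(acks, dup_threshold=3):
        """Yield a sequence number to retransmit when the same ACK repeats."""
        last_ack, dups = None, 0
        for ack in acks:
            if ack == last_ack:
                dups += 1
                if dups == dup_threshold:
                    yield ack        # the segment the peer keeps asking for
            else:
                last_ack, dups = ack, 0

    # ACK stream from the example above: packet 1 arrives (ack 2), then 3, 4, 5
    # arrive while 2 is missing, so the receiver keeps answering ack=2.
    for seq in fast_retransmit([2, 2, 2, 2]):
        print("fast retransmit segment", seq)    # -> 2, no timeout needed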

SACK Method

A better approach is called Selective Acknowledgment (SACK, see RFC 2018). It adds a SACK option to the TCP header: the ACK field still works as in Fast Retransmit, while the SACK option reports which later chunks of data have been received. See the figure below:

In this way, the sender knows from the returned SACK which data has arrived and which has not, and can optimize the Fast Retransmit algorithm accordingly. Of course, this requires support on both sides; under Linux it is controlled by the tcp_sack parameter (on by default since Linux 2.4).
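
To see what SACK buys the sender, here is a sketch (mine, using byte ranges) that derives the holes worth retransmitting from the cumulative ACK plus the SACK blocks reported by the receiver:

    def sack_holes(cum_ack, sack_blocks):
        """Gaps between the cumulative ACK and the ranges reported via SACK.

        cum_ack:     every byte below this has been acknowledged
        sack_blocks: (start, end) ranges the receiver holds beyond cum_ack
        """
        holes, edge = [], cum_ack
        for start, end in sorted(sack_blocks):
            if start > edge:
                holes.append((edge, start))   # neither ACKed nor SACKed: likely lost
            edge = max(edge, end)
        return holes

    # Receiver has everything up to 1000, plus 1500-2000 and 2500-3000.
    print(sack_holes(1000, [(1500, 2000), (2500, 3000)]))
    # -> [(1000, 1500), (2000, 2500)]  : only these need retransmitting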

One more thing to watch out for here is receiver reneging: the receiver is allowed to throw away data it has already reported in a SACK. This is discouraged because it complicates everything, but a receiver may do it in extreme situations, for example to give the memory to something more important. The sender therefore cannot rely on SACK alone; it must still rely on the cumulative ACK and maintain its retransmission timer, and if the ACK stops advancing it still has to retransmit the SACKed data. Also, SACKed data must never be treated as finally acknowledged: the sender cannot drop it from its retransmission buffer until the cumulative ACK covers it.

Note: SACK consumes the sender's resources. Imagine a hacker sending the data sender a pile of bogus SACK options: this can make the sender retransmit, or even walk through, data it has already sent, consuming a lot of resources on the sending side. For details see "TCP SACK Performance Tradeoffs".

Duplicate SACK – Issue with duplicate data received

Duplicate SACK, also called D-SACK, uses SACK to tell the sender which data has been received more than once. RFC 2883 has detailed descriptions and examples; a few of them (taken from RFC 2883) follow below.

D-SACK uses the first SACK block as a marker:

  • If the range of the first SACK block is covered by the cumulative ACK, it is a D-SACK.
  • If the range of the first SACK block is covered by the second SACK block, it is a D-SACK (see the sketch below).
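
The two rules translate almost directly into code; a sketch (mine), writing ranges as (start, end) pairs as in the RFC examples:

    def is_dsack(cum_ack, sack_blocks):
        """Apply the two RFC 2883 checks to the first SACK block."""
        if not sack_blocks:
            return False
        first_start, first_end = sack_blocks[0]
        # Rule 1: the first block is already covered by the cumulative ACK.
        if first_end <= cum_ack:
            return True
        # Rule 2: the first block is covered by the second block.
        if len(sack_blocks) >= 2:
            second_start, second_end = sack_blocks[1]
            if second_start <= first_start and first_end <= second_end:
                return True
        return False

    print(is_dsack(4000, [(3000, 3500)]))   # True  - Example 1 below (duplicate data)
    print(is_dsack(1000, [(1500, 2000)]))   # False - an ordinary SACK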

Example 1: ACK packet loss

In the example below, two ACKs are lost, so the sender times out and retransmits the first packet (3000-3499). The receiver sees that it is a duplicate and sends back SACK=3000-3500. Since the cumulative ACK has already reached 4000, everything below 4000 has been received, so this SACK is a D-SACK. Its purpose is to tell the sender "I received this data twice", and the sender then knows that the data packet was not lost; it was the ACKs that were lost.

    Transmitted    Received      ACK Sent
    Segment        Segment       (Including SACK Blocks)

    3000-3499      3000-3499     3500 (ACK dropped)
    3500-3999      3500-3999     4000 (ACK dropped)
    3000-3499      3000-3499     4000, SACK=3000-3500
Example 2: Network delay

In the example below, packet 1000-1499 is delayed in the network, so the sender receives no ACK for it. The three packets that arrive after it trigger the Fast Retransmit algorithm, so 1000-1499 is retransmitted. Meanwhile the delayed original finally arrives, so the receiver returns SACK=1000-1500; since the cumulative ACK has already reached 3000, this SACK is a D-SACK, indicating that a duplicate packet was received.

In this case the sender learns that the retransmission triggered by Fast Retransmit was caused neither by a lost data packet nor by a lost ACK, but by network delay.

    Transmitted    Received      ACK Sent
    Segment        Segment       (Including SACK Blocks)

    500-999        500-999       1000
    1000-1499      (delayed)
    1500-1999      1500-1999     1000, SACK=1500-2000
    2000-2499      2000-2499     1000, SACK=1500-2500
    2500-2999      2500-2999     1000, SACK=1500-3000
    1000-1499      1000-1499     3000
                   1000-1499     3000, SACK=1000-1500

It can be seen that introducing D-SACK gives the sender the following benefits:

1) It lets the sender know whether the data packet was lost or the returned ACK was lost.

2) It lets the sender know whether its timeout was too small, causing a spurious retransmission.

3) It lets the sender know that earlier-sent data arrived later on the network (i.e., it was reordered).

4) It lets the sender know whether its packets were duplicated somewhere in the network.

Knowing these things can help TCP understand the network situation and thus better perform flow control on the network.

The tcp_dsack parameter in Linux is used to enable this feature (enabled by default after Linux 2.4).
