What exactly is the performance problem with TCP?

Overview

The performance problems of TCP come down, at their core, to a trade-off between fairness and efficiency.

The core of TCP's reliable transport layer is three-fold:

(1) Acknowledgement and retransmission: by itself this already satisfies the "reliability" requirement, but it can introduce performance problems.

(2) Sliding window (flow control): improves throughput by keeping multiple segments in flight, making full use of link bandwidth (so the sender is not too slow), while the receiver's advertised window keeps the sender from overrunning it.

(3) Congestion control: prevents the sender from sending too fast and overloading the network link, which would cause packet loss.

  • The sliding window handles flow control between sender and receiver.
  • Congestion control handles flow control at the level of the entire network link.

The sliding window and congestion control constrain each other, letting the sender automatically adjust its sending rate from a global view of the network link. In this sense, TCP's significance to the network as a whole goes beyond the "transport layer".

Congestion Control

Compared with the sliding window, congestion control has a more comprehensive perspective and takes into account all hosts and routers in the entire network link, as well as related factors that reduce network transmission performance.

Since congestion control has to consider so many factors, it is inevitable that there will be so-called "performance issues" in certain scenarios. Let's analyze them in detail below.

1. Slow start

Slow start by itself does not cause performance problems: during slow start, cwnd (the congestion window) grows exponentially, so "slow start" is not actually slow. This was covered in the earlier article on how TCP congestion control is implemented.

However, in certain scenarios (such as HTTP), slow start will increase the number of round trips for data transmission.

Take Linux as an example. Since kernel 3.0, following Google's recommendation, cwnd is initialized to 10 MSS. With the default MTU of 1500, the MSS is 1460 bytes, so the total amount of TCP data (segments) sent in the first round trip is: 10 × 1460 bytes = 14,600 bytes ≈ 14 KB.

By default, the ssthresh value is 65 KB, the threshold at which the slow start phase gives way to congestion avoidance.

Let's visit the homepage of an arbitrary website (say, stackoverflow.com) and use the size of its homepage HTML (68.8 KB) to illustrate how slow start affects the server sending the response data.

  • The total amount of data sent for the first time is: 14 KB
  • After 1 RTT, cwnd doubles
  • The total amount of data sent the second time is: 28 KB
  • After 1 RTT, cwnd doubles again
  • The total amount of data sent for the third time is: 56 KB

After 3 sends, 14 + 28 + 56 = 98 KB, the homepage HTML has been fully transmitted, taking 3 round trips in total.

By the time the page has loaded, little further data remains to be sent, yet cwnd has only just approached the ssthresh threshold. In other words, the transfer finishes before slow start ever reaches full speed.

Suppose instead we skip the slow start phase and transmit at full speed, sending 65 KB in the first round. Then only 2 round trips are needed to complete the transfer.
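The round-trip arithmetic above can be sketched in a few lines. This is a simplified model of our own (cwnd doubles every RTT until ssthresh; ACK pacing, congestion-avoidance growth, and header overhead are ignored), not the kernel's actual logic:

```python
def slow_start_rounds(total_kb, init_cwnd_kb=14, ssthresh_kb=65):
    """Count round trips needed to deliver total_kb of data,
    doubling cwnd each RTT while below ssthresh (simplified model)."""
    sent, cwnd, rounds = 0, init_cwnd_kb, 0
    while sent < total_kb:
        sent += cwnd          # one window of data per round trip
        rounds += 1
        if cwnd < ssthresh_kb:
            cwnd = min(cwnd * 2, ssthresh_kb)  # slow-start doubling
    return rounds
```

For the 68.8 KB homepage this yields 3 round trips with the default 14 KB initial window, and 2 if the transfer could start at the 65 KB ssthresh, matching the figures above.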

2. Congestion occurs (packet loss)

Loss-based congestion control algorithms (such as Reno, NewReno, and Cubic) treat any packet loss as a sign that the link is congested, so the sender sharply shrinks its sending window or even enters a brief waiting state (retransmission timeout).

A 1% packet loss rate does not merely reduce transmission performance by 1%; it can cut it by 50% or more (depending on the TCP implementation). In the extreme case, the network spends more time retransmitting lost packets than sending new ones, which is the biggest culprit behind the so-called TCP performance problem.

Packet loss also worsens link congestion. Suppose a TCP segment is forwarded through N routers: the first N-1 forward it successfully, but it is dropped at the Nth. The lost segment has then wasted the bandwidth consumed at every preceding router.

When packet loss occurs in the TCP Reno algorithm, performance is halved

When packet loss occurs in the TCP Tahoe algorithm, cwnd is reset to 1 segment and slow start begins again
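The contrast between the two reactions can be sketched as follows (a toy model of multiplicative decrease only; real stacks also implement fast recovery, retransmission timers, and more — the function name and segment units are our own):

```python
def on_loss(cwnd, ssthresh, algo="reno"):
    """Simplified reaction to a single packet loss, in segments.
    Returns the new (cwnd, ssthresh)."""
    new_ssthresh = max(cwnd // 2, 2)
    if algo == "tahoe":
        # Tahoe: restart slow start from cwnd = 1 segment
        return 1, new_ssthresh
    # Reno: halve the window and continue in congestion avoidance
    return new_ssthresh, new_ssthresh
```

Starting from cwnd = 40 segments, one loss leaves Reno at 20 but drops Tahoe all the way back to 1, which is why Tahoe suffers far more per loss event.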

Here, we take HTTP as an example. The impact of packet loss is more serious in HTTP/2 because HTTP/2 only uses one TCP connection for transmission. Therefore, one packet loss will slow down the download speed of all resources. However, HTTP/1.1 may have multiple independent TCP connections, and one packet loss will only affect one of the TCP connections. Therefore, in this scenario, HTTP/1.1 will achieve better performance.

3. Sequential reliability guarantee

Although TCP guarantees that data is delivered to the application in order, this ordering guarantee can cause a problem similar to HTTP head-of-line blocking.

TCP uses sequence numbers (Seq) to identify the order of data during transmission. Once a piece of data is lost, subsequent data needs to be saved in the receiver's TCP buffer and wait for the lost data to be retransmitted before the next step of processing (passing it to the application layer) can be performed.

The application layer cannot see the state of the TCP receive buffer, so it must wait until the sequence (Seq) is complete before it can read the data. Yet the segments already received may contain data the application could process immediately; hence this is also called the TCP head-of-line blocking problem.
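The blocking behavior can be illustrated with a toy receive buffer keyed by sequence number (an illustrative model of our own, not a real TCP stack):

```python
def deliverable(buffer, next_seq):
    """Return the in-order payloads the application can read, given
    out-of-order segments held in the receive buffer (a dict keyed by
    starting sequence number). Delivery stops at the first gap."""
    out = []
    while next_seq in buffer:
        data = buffer.pop(next_seq)
        out.append(data)
        next_seq += len(data)   # advance past the delivered bytes
    return out, next_seq
```

If segments at Seq 0 and Seq 4 have arrived but Seq 2 was lost, only the first segment is deliverable; the one at Seq 4 sits in the buffer, blocked, until the retransmission of Seq 2 arrives.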

4. Improvement and optimization

The most obvious improvement over loss-based congestion control is a more suitable algorithm such as BBR, which models the link's available bandwidth and round-trip time rather than reacting to loss, adapting better to high-bandwidth, high-latency paths and tolerating a moderate packet loss rate.

If TCP is guaranteed to transmit data with zero packet loss, bandwidth utilization can be maximized.

🤔 Think about it: if the packet loss rate is high, would switching to UDP give better performance?

Three-way handshake

In addition to the "performance issues" caused by congestion control, the three-way handshake mechanism when TCP establishes a connection can also cause performance issues in some scenarios.

For most TCP usage scenarios (long-lived connections with frequent data transfer), the cost of the three-way handshake is almost negligible. The real impact appears with a large number of short-lived connections over a long period. To address this, consider converting short connections into long ones, or use TFO technology [1].
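On Linux, TFO can be enabled per-socket on the server side. A minimal sketch (Linux-only, assuming kernel >= 3.7; the function name, port choice, and queue length of 16 are our own illustration):

```python
import socket

def listen_with_tfo(port=0, qlen=16):
    """Create a listening socket and try to enable TCP Fast Open.
    Returns (socket, tfo_enabled); falls back gracefully when the
    platform lacks TFO support."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    enabled = False
    if hasattr(socket, "TCP_FASTOPEN"):
        try:
            # qlen caps the queue of pending TFO connection requests
            srv.setsockopt(socket.IPPROTO_TCP, socket.TCP_FASTOPEN, qlen)
            enabled = True
        except OSError:
            pass  # kernel built without TFO support
    srv.listen()
    return srv, enabled
```

Clients additionally need the `net.ipv4.tcp_fastopen` sysctl to permit TFO and must send their first data with `MSG_FASTOPEN` for the handshake round trip to be saved.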

In addition, there are two scenarios that may cause performance issues: HTTP and network switching.

1. HTTP

In HTTP/1.1, fetching different resources (CSS, JavaScript, images, ...) uses multiple TCP connections, which adds considerable latency, as shown in the figure below.

Waterfall chart when accessed using HTTP/1.1

The solution is also simple: directly upgrade to use HTTP/2. During the entire communication process, there will only be one TCP connection.

Waterfall chart when accessing using HTTP/2

In addition, some readers may think of the usage scenario of "weak network" (such as crowded subway cars). However, since it is a "weak network", it is difficult to circumvent this problem by using other transmission protocols.

2. Network switching

TCP connection migration: because a TCP connection is bound to its four-tuple (source IP, source port, destination IP, destination port), if the source IP changes, the connection must be re-established, introducing delay (for example, when a device switches from Wi-Fi to a cellular network).

Similarly, different physical locations may use different egress public IPs, such as a school's library and dormitories, or a company's meeting rooms and office areas. When users move between these spaces, the TCP connection must likewise be re-established.

Of course, this problem can also be optimized by using TFO technology [2].

Acknowledgement and Retransmission

There are three main ways this can cause TCP performance issues:

  • the performance impact of TCP timeout retransmission
  • the limitations of TCP fast retransmit
  • the problems that TCP selective retransmission (SACK) solves

The details have been discussed in detail in a previous article [3] and will not be repeated in this article.

Summary

Under ideal transmission conditions, the performance of modern TCP is limited only by the speed of light and the size of the receiver's buffer (memory), that is, by hardware and physics.

On the hardware side, there are various assists and accelerations such as TOE (TCP offload engine) and NIC offloads.

So ultimately TCP's hardware performance under ideal conditions is limited by:

  • Minimum bandwidth of the link
  • The slowest hardware processing in the link
  • Minimum receive buffer size in the link

The combination of the three is the so-called "bottleneck link" in the communication process.

If there are no hardware performance limitations, that is, if there is sufficient bandwidth, sufficient memory, and sufficient processing speed, TCP's performance is theoretically limited only by physics, that is, the speed of light.
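These limits are tied together by the bandwidth-delay product (BDP): to keep the bottleneck link full, the receiver's buffer must hold at least one BDP of in-flight data, and throughput can never exceed one window per RTT. A small worked example (the function names are our own):

```python
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes that must be in flight
    to keep the pipe full."""
    return int(bandwidth_bps / 8 * rtt_s)

def max_throughput_bps(rwnd_bytes, rtt_s):
    """Window-limited ceiling: at most one receive window per RTT."""
    return rwnd_bytes * 8 / rtt_s
```

On a 100 Mbit/s path with 50 ms RTT, the BDP is 625,000 bytes; a receiver stuck at a 64 KB window would cap the transfer near 10.5 Mbit/s no matter how fast the link is.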

Finally, let me quote the expert once more:

About 90% of the practical problems encountered by developers in network programming are related to their understanding of TCP/IP. Do not make any a priori assumptions about the relative performance of TCP and UDP. Even a small change in parameters can have a serious impact on performance.
