1. Dilemma of Traditional TCP/IP Network Transmission

1.1 Traditional Ethernet end-to-end transmission has excessive system overhead

When describing the relationship between software and hardware during communication, we usually divide the model into user space, kernel, and hardware. User space and the kernel actually share the same physical memory, but for security reasons Linux separates memory into user space and kernel space: user-level code has no permission to access or modify kernel-space memory and can only enter the kernel by trapping into kernel mode through system calls. (Linux memory management itself is fairly complex and is not expanded on here.)

In a typical communication process over traditional Ethernet, the data flow is roughly as follows. The data is first copied from user space into kernel space; this copy is performed by the CPU, which moves the data block from the user buffer into the socket buffer in kernel space. The software TCP/IP protocol stack in the kernel then adds the headers and checksums of each layer, such as the TCP header and IP header. Finally, the network card fetches the data from memory via DMA and sends it over the physical link to the network card at the other end. The receiving side performs the reverse process: the hardware copies the packet into memory via DMA, the CPU parses and verifies it layer by layer through the TCP/IP stack, and finally copies the payload into user space.

The key point in this process is that the CPU must participate in the entire data path: copying data between user space and kernel space, and assembling and parsing packets. When the data volume is large, this places a heavy burden on the CPU. In traditional networks, "node A sends a message to node B" really means "moving a piece of data from node A's memory to node B's memory over the network link", and on both the sending and receiving end this requires the command and control of the CPU, including controlling the network card, handling interrupts, and encapsulating and parsing messages. It follows that even with DMA technology, the process remains strongly dependent on the CPU.
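To make the overhead concrete, here is a minimal sketch of the traditional sending path using plain POSIX sockets (the address and port are placeholders). Every send() is a system call: the CPU copies the user buffer into the kernel socket buffer before the TCP/IP stack and the NIC take over:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Ordinary TCP socket: everything sent through it traverses the
     * kernel TCP/IP stack. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in peer = {0};
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(9000);                    /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);  /* placeholder addr */

    if (connect(fd, (struct sockaddr *)&peer, sizeof peer) < 0) {
        perror("connect");
        return 1;
    }

    char buf[4096];
    memset(buf, 'x', sizeof buf);

    /* send() traps into the kernel: the CPU copies buf from user space
     * into the kernel socket buffer, the kernel stack adds TCP/IP/
     * Ethernet headers, and only then does the NIC DMA the frames out. */
    if (send(fd, buf, sizeof buf, 0) < 0)
        perror("send");

    close(fd);
    return 0;
}
```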
1.2 The TCP protocol itself has inherent shortcomings in the long fat pipe scenario

A "long fat network" (LFN, often called a long fat pipe) is a path whose bandwidth-delay product is very large. It combines two characteristics:

① The transmission (serialization) delay is very small: packets are put on the wire very quickly, so a large amount of data can be injected into the network in a very short time.

② The propagation delay is very large: from the moment a packet is sent, it takes a very long time (relative to the transmission delay) to reach the receiving end.
Such pipes expose several weaknesses of TCP:

① The bandwidth-delay product of an LFN is very large (data is sent quickly but takes a long time to propagate), so a large amount of data is in flight at any moment. TCP flow control stops the sender when the advertised window reaches 0, but the window size field in the original TCP header is only 16 bits, so the maximum window is 65535 bytes; the sender may therefore have at most 65535 bytes outstanding and unacknowledged. That is 65535 × 8 ≈ 0.5 Mbit per round trip, so on a path with a 100 ms round-trip time the achievable throughput is capped at roughly 0.5 Mbit / 0.1 s ≈ 5 Mbps. On any link faster than that, the sender exhausts the window before the first acknowledgment returns, then must idle for at least the full round trip waiting for the receiver's window update before it can continue, so the extra bandwidth is simply wasted. The TCP window scale option was therefore introduced so that a larger window can be declared.

② The high latency of an LFN can drain the pipeline. Under TCP congestion control, packet loss signals congestion: even if duplicate ACKs trigger fast recovery, the congestion window is halved, and if a timeout forces slow start, the congestion window collapses to 1 segment. Either way, the amount of data the sender may have in flight drops sharply, the long pipe empties, and network throughput plummets.

③ An LFN hampers TCP's RTT measurement. Classically, each TCP connection has a single RTT timer and times only one segment at a time; until the timed data is acknowledged, TCP cannot start measuring the next RTT. In a long fat pipe the propagation delay is large, so each RTT measurement cycle is very long and the estimate adapts slowly.

④ An LFN can cause sequence-number ambiguity (apparent reordering) at the receiver. TCP identifies each byte with a 32-bit unsigned sequence number and relies on the maximum segment lifetime (MSL) to bound how long a segment can survive in the network. Because the sequence space is finite, numbers are reused after 2^32 = 4294967296 bytes have been sent. If the network is fast enough that the sequence numbers wrap around within one MSL, two different segments with the same sequence number can coexist in the network, and the receiver cannot distinguish their order. On a gigabit link (1000 Mb/s), transmitting 4294967296 bytes takes only about 34 seconds.
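The two figures above can be checked with a few lines of arithmetic; this small illustrative program reproduces the window-limited throughput and the 1 Gb/s wraparound time:

```c
#include <stdio.h>

int main(void)
{
    /* Window-limited throughput: at most one 64 KB window per RTT. */
    double window_bits = 65535.0 * 8.0;          /* max 16-bit window, in bits */
    double rtt_s       = 0.100;                  /* 100 ms round-trip time     */
    printf("window-limited throughput: %.2f Mbps\n",
           window_bits / rtt_s / 1e6);           /* ~5.24 Mbps */

    /* Sequence-number wraparound: time to send 2^32 bytes at 1 Gb/s. */
    double seq_space_bits = 4294967296.0 * 8.0;  /* 2^32 bytes, in bits */
    double link_bps       = 1000e6;              /* 1000 Mb/s           */
    printf("wraparound time at 1 Gb/s: %.1f s\n",
           seq_space_bits / link_bps);           /* ~34.4 s */
    return 0;
}
```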
2. The Overall Framework of RDMA

2.1 Basic Principles

RDMA (Remote Direct Memory Access) means remote direct memory access. Through RDMA, the local node can "directly" access the memory of the remote node. "Directly" means that remote memory can be read and written much like local memory, bypassing the complex TCP/IP protocol stack of traditional Ethernet; the other end is unaware of the process, and most of the work is done by hardware rather than software.

With RDMA, copying a piece of data from local memory to the peer's memory becomes much simpler. The CPUs at both ends hardly participate in the data transfer (they are involved only on the control plane). The local network card DMAs the data directly from user-space memory into its internal buffers, the hardware assembles the messages of each layer, and sends them over the physical link to the peer network card. Upon receiving the data, the peer RDMA network card strips the headers and checksums of each layer and DMAs the payload directly into user-space memory. In short, RDMA moves application data straight from memory to an intelligent network card that implements the RDMA protocol in hardware; the NIC performs the message encapsulation for the transfer, freeing the operating system and the CPU.

2.2 Core Advantages

1) Zero copy: data does not need to be copied into the operating-system kernel or have its headers processed there, so transmission latency drops significantly.

2) Kernel bypass: no operating-system kernel involvement is needed, and there is no heavyweight header-processing logic on the data path, which both reduces latency and saves considerable CPU resources.

3) Protocol offload: RDMA can read and write remote memory without the remote node's CPU participating in the communication, because message encapsulation and parsing are moved into hardware. In traditional Ethernet communication, both CPUs must parse every layer of every message; with large data volumes and frequent interaction this is considerable overhead, and those CPU cycles could be doing more valuable work.

Compared with traditional Ethernet, RDMA achieves both higher bandwidth and lower latency, so it is valuable in bandwidth-sensitive scenarios, such as massive data exchange, and in latency-sensitive scenarios, such as data synchronization between multiple computing nodes.
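To make these advantages concrete, below is a minimal sketch of the RDMA data path using the standard libibverbs API (the verbs interface shared by InfiniBand, RoCE, and iWARP). Connection establishment is deliberately omitted: creating the queue pair, transitioning it to the connected state, and exchanging the peer's buffer address, rkey, and QP number out of band are assumed to have happened already, so the qp, remote_addr, and remote_rkey parameters are placeholders from that setup step. This is a sketch under those assumptions, not a complete program:

```c
/* Minimal sketch of an RDMA WRITE using libibverbs (link with -libverbs).
 * Assumes `qp` is already connected and the peer's buffer address and
 * rkey were exchanged out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp,
                       uint64_t remote_addr, uint32_t remote_rkey)
{
    size_t len = 4096;
    char *buf = malloc(len);
    if (!buf) return -1;
    memset(buf, 0xab, len);

    /* Register the user-space buffer with the NIC: the pages are pinned
     * and the NIC can DMA straight from them (the "zero copy" part). */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) { free(buf); return -1; }

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };

    /* RDMA WRITE work request: the local NIC writes buf into the remote
     * node's registered memory; the remote CPU is not involved at all. */
    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = remote_rkey;

    /* Posting hands the transfer to hardware: no kernel protocol stack
     * and no per-byte CPU copy on the data path (kernel bypass). */
    if (ibv_post_send(qp, &wr, &bad_wr)) { ibv_dereg_mr(mr); free(buf); return -1; }

    /* Busy-poll the completion queue; production code would block on a
     * completion channel instead. */
    struct ibv_wc wc;
    while (ibv_poll_cq(qp->send_cq, 1, &wc) == 0)
        ;
    int ok = (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    ibv_dereg_mr(mr);
    free(buf);
    return ok;
}
```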
2.3 Basic Classification of RDMA Networks

Currently there are roughly three types of RDMA networks: InfiniBand, RoCE (RDMA over Converged Ethernet), and iWARP (RDMA over TCP). RDMA was originally exclusive to the InfiniBand network architecture, which guarantees reliable transmission at the hardware level, while RoCE and iWARP are RDMA technologies based on Ethernet.

1) InfiniBand

InfiniBand is a network designed specifically for RDMA. It was proposed by the IBTA (InfiniBand Trade Association) in 2000 and specifies a complete set of specifications from the link layer up to the transport layer (not the transport layer of the traditional OSI seven-layer model, but a layer sitting above the link layer in the IB stack). It mainly adopts cut-through forwarding to reduce forwarding delay and a credit-based flow control mechanism to ensure no packet loss. However, IB has an unavoidable cost drawback: because it is not compatible with existing Ethernet, enterprises that want to deploy it must purchase not only IB-capable network cards but also dedicated supporting switching equipment.

2) RoCE

RoCE has two versions. RoCEv1 is implemented on the Ethernet link layer: its network layer still follows the IB specification, so its packets cannot be routed and can only be transmitted within an L2 domain. RoCEv2 instead carries RDMA over UDP+IP as the network layer, so its packets can be routed and it can be deployed across a layer-3 network (a simplified header-layout sketch follows section 2.4 below). RoCE can be considered a low-cost alternative to IB: deploying a RoCE network requires RDMA-capable smart network cards, but no dedicated switches or routers (the switches should support technologies such as ECN/PFC to reduce packet loss). Its network construction cost is the lowest of the three RDMA network models.

3) iWARP

iWARP implements RDMA on top of the TCP layer of the Ethernet TCP/IP stack and supports transmission across L2/L3 networks. In large-scale deployments, TCP connections consume a great deal of CPU, so iWARP is rarely used. iWARP only requires the network card to support RDMA and needs no dedicated switches or routers; its network construction cost sits between InfiniBand and RoCE.

2.4 Implementation Comparison

InfiniBand technology is advanced but expensive, and its application has been limited to the HPC (high-performance computing) field. With the emergence of RoCE and iWARP, the cost of using RDMA has dropped further, which has promoted the popularization of RDMA technology. Using any of these three RDMA networks in high-performance storage and computing data centers can significantly reduce data transmission latency and leave more CPU resources available to applications. InfiniBand brings extreme performance to data centers, with transmission latency as low as 100 nanoseconds, an order of magnitude lower than that of Ethernet devices. RoCE and iWARP bring high cost-effectiveness: carrying RDMA over Ethernet exploits RDMA's high performance and low CPU usage while keeping network construction costs low. RoCE, based on UDP, performs better than iWARP, based on TCP, and combined with the flow control of lossless Ethernet it addresses RDMA's sensitivity to packet loss; RoCE networks are now widely used in high-performance data centers across industries.
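To make the RoCEv1/v2 distinction from section 2.3 concrete, the following sketch shows the header stacking of a RoCEv2 packet as simplified C structs (field layouts are abbreviated and unpacked; real implementations use packed, big-endian fields). Because the IB transport header rides inside routable UDP/IP, identified by the IANA-assigned UDP destination port 4791, a RoCEv2 packet can cross L3 boundaries, whereas RoCEv1 places the IB headers directly after the Ethernet header and is confined to one L2 domain:

```c
#include <stdint.h>

/* Simplified view of RoCEv2 encapsulation: Ethernet / IP / UDP / IB BTH.
 * Field sets are abbreviated for illustration only. */
struct eth_hdr  { uint8_t dst[6], src[6]; uint16_t ethertype; };  /* 0x0800 = IPv4 */
struct ipv4_hdr { uint8_t ver_ihl, tos; uint16_t len, id, frag;
                  uint8_t ttl, proto; uint16_t csum;
                  uint32_t saddr, daddr; };                        /* proto = 17 (UDP) */
struct udp_hdr  { uint16_t sport, dport, len, csum; };             /* dport = 4791     */
struct ib_bth   { uint8_t opcode, flags; uint16_t pkey;
                  uint32_t dest_qp, psn; };  /* IB Base Transport Header, simplified */

struct rocev2_packet {
    struct eth_hdr  eth;   /* L2: ordinary Ethernet frame              */
    struct ipv4_hdr ip;    /* L3: routable, the key v2 difference      */
    struct udp_hdr  udp;   /* L4: UDP, destination port 4791           */
    struct ib_bth   bth;   /* IB transport header, then payload + CRC  */
};
```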
3. Exploring RDMA in Home Broadband Networks

With the comprehensive promotion of national strategies such as "Network Power, Digital China, and Smart Society", digital, networked, and intelligent digital homes have become the embodiment of the smart-city concept at the family level. In the "14th Five-Year Plan" and the 2035 Vision, digital homes are positioned as an important part of building a "new picture of a better digital life". Supported by new-generation information technology, digital homes are evolving into smart homes, completing the transformation from "digital" to "smart". The scale of China's smart home market is expanding year by year: China has become the world's largest smart-home consumer market, accounting for about 50% to 60% of the global share (data sources: CSHIA, AIME, National Bureau of Statistics). According to CCID Consulting, China's smart home market will reach 1.57 trillion yuan in 2030, with a compound annual growth rate (CAGR) of 14.6% from 2021 to 2030.

The rapid development of the home broadband market is accompanied by two contradictions.

① The contradiction between home broadband's undifferentiated, best-effort service model and services' differentiated, deterministic network-quality requirements. Currently, broadband access does not differentiate network connections for different services and serves everything best-effort: under congestion, all services have the same priority and receive the same treatment. But services differ in their network-quality requirements and in their sensitivity to delay and packet loss. Latency-sensitive services such as gaming and cloud PCs require deterministic network guarantees; their user experience drops sharply when the network becomes congested and loses packets, so such services need different handling strategies during congestion.

② The contradiction between bandwidth growth and experience degradation in long fat pipe scenarios. According to statistics from the Ministry of Industry and Information Technology, more than 94% of the country's broadband users have access speeds above 100 Mbps, yet users still experience stuttering and slow downloads when accessing long-distance content. The cause is not insufficient access bandwidth but the inherent inadequacy of TCP's congestion control in the long fat network (LFN) scenario. TCP is a decades-old protocol that no longer matches current network conditions and application requirements; new protocols and algorithms are urgently needed to guarantee service experience over long fat pipes.

In summary, from the perspective of industry trends, under the wave of computing-power upgrades in home broadband networks, RDMA can achieve a deeper integration of computing and networking than TCP. Data can be transferred directly from the memory of one computer to another without the intervention of either operating system and without time-consuming processing by the CPU, ultimately delivering high bandwidth, low latency, and low resource usage.