If we want to reduce network latency from 10 ms to 1 ms, the first step is to analyze where that latency actually comes from; in many cases the bottleneck is not network transmission at all. If the goal is to push network latency down to hundreds of microseconds or even single-digit microseconds, then a high-performance network technology such as RDMA is needed.

1. Limitations of the TCP/IP protocol stack

Server access bandwidth inside data centers is being upgraded from 10G to 25G, and some servers used for machine learning already use 100G links. Quantitative change leads to qualitative change: the higher access bandwidth poses a serious performance challenge to the traditional TCP/IP protocol stack. In current operating systems, transmitting data over TCP requires heavy CPU involvement, including packet encapsulation, parsing, and flow-control processing. Each TCP packet is handled by a CPU core, so the maximum throughput of a single TCP stream is limited by the processing power of a single core. Unfortunately, single-core performance is barely improving, which means data transmission over the TCP/IP stack will inevitably hit a throughput bottleneck. To fully utilize 100G of bandwidth, multiple streams must be transmitted in parallel, which increases application complexity on the one hand and, on the other, means spending ever more expensive compute resources on network transmission.

Besides bandwidth, latency is another major pain point of the traditional TCP/IP stack. For an application to send data, it must go through the socket API: the data is copied from user space into the kernel, processed by the kernel, and handed to the network protocol stack for transmission. At the receiving end, the kernel receives the packet, processes it, extracts the data, and hands it to user space. Along the way the operating system switches between kernel mode and user mode, the CPU processes each packet, and delivery depends on operating-system interrupts. As a result, for small transfers the dominant delay is not the propagation time on the physical network but the processing delay in the sending and receiving software stacks.

2. The high-performance IB network

In high-performance computing clusters, the interconnect is generally not Ethernet plus TCP/IP but InfiniBand (IB), a complete protocol stack from the physical layer up to the transport layer. The physical interface of InfiniBand is entirely different from an Ethernet interface, and the network is architecturally unlike the Ethernet networks of ordinary data centers. InfiniBand is built around centralized control, which simplifies the switching hardware and helps reduce forwarding latency, but it also means that IB networks scale less well than Ethernet. As a natural extension of the supercomputer concept, InfiniBand presents itself to the upper layers like a bus inside a computer. It does not use best-effort forwarding; instead it uses a lossless design that avoids packet loss, and its upper-layer programming interface exposes direct Read/Write access to remote memory.
InfiniBand's transport-layer protocol is RDMA, or Remote Direct Memory Access. It takes advantage of the lossless underlying network and exposes a programming interface called the verbs API to upper-layer applications. From the beginning, RDMA was designed so that the network card ASIC performs the transmission-related work in hardware. The first InfiniBand standard was published in 2000, and IB has been widely used in HPC ever since. It was not until 2010 that the first RDMA over Converged Ethernet (RoCE) standard appeared, allowing RDMA, the IB transport protocol, to run over Ethernet as an overlay; RoCE performance is still somewhat inferior to native InfiniBand. From that point on the IB and Ethernet worlds officially overlapped, and RDMA began to enter general-purpose data centers, providing key support for the coming explosive growth of AI computing and cloud computing.

3. Characteristics of RDMA

With RDMA, the data an application sends does not need to be processed or copied by the kernel. Instead, the RDMA network card DMAs the data directly from user-space memory into the NIC, where encapsulation and related processing happen in hardware. On the receiving side, the NIC likewise decapsulates the packet and DMAs the payload directly into user-space memory. This is the real meaning of RDMA's two defining features: kernel bypass and zero copy. From kernel bypass and zero copy follow three derived properties: low latency, high bandwidth, and low CPU consumption. These are the advantages most often cited for RDMA over the TCP/IP stack. Note that both kernel bypass and zero copy require hardware support in the network card, which means that to use RDMA you must have an RDMA-capable NIC. In the figure above, the path on the left that goes through libibverbs is generally called the command channel; it is used when the application issues verbs API calls and is not the actual data path.

3.1 The RDMA verbs API

Applications use RDMA through the verbs API, not the socket API. RDMA defines three kinds of queues: the Send Queue (SQ), the Receive Queue (RQ), and the Completion Queue (CQ). An SQ and an RQ usually come in pairs, collectively called a Queue Pair (QP). The SQ holds send requests, the RQ holds receive requests, and the CQ holds completions of send and receive requests. When an application wants to send data, it submits a send request to the SQ (API: ibv_post_send). The request does not contain the data itself, only a pointer to the data and its length. The request is passed to the network card, which fetches the data at the given address and length and transmits it; when transmission completes, a send completion entry is generated in the CQ. On the receiving side, the application must post a receive request to the RQ in advance (API: ibv_post_recv). The receive request contains a pointer to the memory where incoming data should be stored and the maximum length that can be received. When data arrives, the network card writes it to the memory specified by the receive request at the head of the RQ and then generates a receive completion entry in the CQ.
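As a concrete illustration of the flow just described, the fragment below registers a buffer, posts a send request, and waits for its completion. It is only a minimal sketch: the protection domain pd, connected queue pair qp, and completion queue cq are assumed to have been created elsewhere, and the helper name send_buffer is made up for this example; ibv_reg_mr, ibv_post_send, and ibv_poll_cq are the actual libibverbs calls (link with -libverbs).

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Hypothetical helper: send `len` bytes from user buffer `buf` over an
 * already-connected queue pair. */
int send_buffer(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                void *buf, uint32_t len)
{
    /* Register the user-space buffer so the NIC can DMA it directly
     * (this is where "zero copy" comes from). */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    /* The work request carries only the buffer address, length, and key,
     * never the data itself. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = { 0 }, *bad_wr = NULL;
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;        /* request a CQ entry */

    int ret = ibv_post_send(qp, &wr, &bad_wr); /* submit to the SQ */
    if (ret)
        goto out;

    /* The call is asynchronous: poll the CQ until the completion shows up.
     * (Busy-polling for brevity; real code may block on a completion channel.) */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;
    ret = (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;

out:
    ibv_dereg_mr(mr);
    return ret;
}
```

For this send to complete, the peer must already have posted a matching receive request with ibv_post_recv, exactly as described above.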
Sends and receives in the verbs API are asynchronous, non-blocking calls; the application checks the CQ to find out when a request has completed. A QP can be thought of as roughly analogous to a socket. In addition to Send/Recv, RDMA also provides special RDMA Write/Read requests that directly access the remote application's virtual memory. RDMA Write/Read requests are likewise submitted to the initiator's SQ, but unlike Send/Recv, the remote side does not need to post a receive request to its RQ in advance. This one-sided mode of transfer performs better than Send/Recv, but the initiator must know the exact address of the data on the remote side (and its memory key) in advance.

3.2 Packet loss-free implementation of RDMA

Packet loss is a performance killer for network transport, and recovering from it quickly requires complex processing logic that is hard to implement in hardware. Ethernet can be made lossless with the pause mechanism, and implementing losslessness is crucial for RDMA. When pause is enabled for RoCE, its side effects can be limited by separating traffic into different priority queues and enabling pause only on specific queues. This is Ethernet Priority Flow Control (PFC), which has become part of the DCB standard; most data-center switches now support PFC. When PFC is configured, RDMA traffic runs in a priority class with pause enabled, while TCP traffic runs in a priority class without pause, minimizing the side effects of the pause mechanism. In principle RDMA could do without PFC given specially designed equipment, but such devices have not yet really materialized, so it is still assumed that RDMA requires a lossless underlying network. Using RDMA reliably therefore depends heavily on correctly configuring the switches and network cards in the network.

3.3 RDMA Related Technologies

rdma_cm, short for RDMA communication manager, is a wrapper library on top of the verbs API. Its interface resembles the socket API, with calls such as rdma_connect, rdma_listen, and rdma_accept; it is essentially a simplified way of using part of the verbs API (a connection-setup sketch appears at the end of this subsection). rsocket, a tool included in librdmacm, wraps RDMA behind a socket interface and can replace the system's default sockets, so that applications can use RDMA without modifying their code. DPDK originated as Intel's kernel-bypass technology for network cards and is now widely used in data centers. DPDK and RDMA are not directly related, but both bypass the kernel, so they are sometimes compared. Compared with RDMA, DPDK differs in two main ways: first, DPDK must poll, which consumes CPU resources; second, DPDK still processes packets in user space, so that part of the latency and overhead is not removed. DPDK does have one advantage: it does not require the communication peer to provide the same support. RDMA and DPDK each have their own application scenarios.
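To show how rdma_cm hides most of the queue-pair setup behind a socket-like workflow, here is a minimal, hedged sketch of a client connecting with librdmacm. The address 192.0.2.10 and port 7471 are placeholders invented for this example; rdma_getaddrinfo, rdma_create_ep, and rdma_connect are the real librdmacm calls, while error handling and the actual data exchange are omitted.

```c
#include <stdio.h>
#include <string.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_addrinfo hints, *res = NULL;
    struct ibv_qp_init_attr attr;
    struct rdma_cm_id *id = NULL;

    /* Resolve the peer, much like getaddrinfo() for sockets.
     * Address and port are placeholders for this sketch. */
    memset(&hints, 0, sizeof(hints));
    hints.ai_port_space = RDMA_PS_TCP;      /* reliable connected service */
    if (rdma_getaddrinfo("192.0.2.10", "7471", &hints, &res)) {
        perror("rdma_getaddrinfo");
        return 1;
    }

    /* One call creates the CM id together with its QP and CQs. */
    memset(&attr, 0, sizeof(attr));
    attr.cap.max_send_wr = attr.cap.max_recv_wr = 16;
    attr.cap.max_send_sge = attr.cap.max_recv_sge = 1;
    attr.sq_sig_all = 1;
    if (rdma_create_ep(&id, res, NULL, &attr)) {
        perror("rdma_create_ep");
        rdma_freeaddrinfo(res);
        return 1;
    }
    rdma_freeaddrinfo(res);

    /* Establish the connection; afterwards verbs work requests can be
     * posted on id->qp just as with a hand-built queue pair. */
    if (rdma_connect(id, NULL)) {
        perror("rdma_connect");
        rdma_destroy_ep(id);
        return 1;
    }

    printf("connected over RDMA\n");

    rdma_disconnect(id);
    rdma_destroy_ep(id);
    return 0;
}
```

Compile and link with -lrdmacm -libverbs; the server side would use rdma_listen, rdma_get_request, and rdma_accept in a similar style. rsocket goes one step further: existing socket applications can typically be redirected onto RDMA by preloading the rsocket library shipped with librdmacm, with no code changes at all.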
3.4 Typical Application Scenarios of RDMA

There are two main application scenarios for RDMA. One is high-performance computing, including distributed machine-learning training (especially with GPUs), which cares most about high bandwidth; representative users include TensorFlow and PaddlePaddle. The other is the separation of compute and storage, which cares most about low latency; representative cases include NVMe over Fabrics (NVMe-oF) and SMB Direct. In addition, RDMA is used in scenarios such as big data processing (for example Spark) and virtualization (for example virtual machine migration), and more usage scenarios are still being explored.

4. Summary

For high-performance networks, RDMA achieves low latency, high bandwidth, and low CPU consumption through kernel bypass and zero copy. Applications generally need a dedicated API (verbs) to use RDMA, and RDMA requires the underlying network to be configured for zero packet loss. Its main application scenarios are high-performance computing and the separation of compute and storage.