Remote Direct Memory Access (RDMA) is a direct memory access technology that transfers data from the memory of one computer directly to the memory of another without involving the operating system of either computer. RDMA was first implemented on InfiniBand networks. Vendors later ported RDMA to traditional Ethernet, which lowered its cost and promoted its adoption. On Ethernet, RDMA comes in two forms that differ in how much of the existing protocol stack they reuse: iWARP and RoCE. RoCE in turn has two versions, RoCEv1 and RoCEv2; the biggest improvement in RoCEv2 is support for IP routing.

With the rapid development of technologies such as high-performance computing, big data analysis, artificial intelligence, and the Internet of Things, and the spread of centralized storage, distributed storage, and cloud databases, business applications have to fetch more and more data over the network. This places ever higher demands on the switching speed and performance of data center networks. The traditional TCP/IP hardware and software architecture suffers from high latency in network transmission and data processing, multiple data copies and interrupt handling, and complex TCP/IP protocol processing.

RDMA was created to reduce the server-side data processing delay in network transmission. It moves data from a user application directly into the memory of the remote system over the network, eliminating the multiple data copies and context switches otherwise needed during transmission and reducing CPU load. The principle of RDMA technology and its comparison with the TCP/IP architecture are shown in the figure below.
RDMA transfers data directly between the data buffers of two nodes. The local node can send data over the network straight into the memory of the remote node, bypassing the multiple memory copies made inside the operating system. Compared with traditional network transmission, RDMA keeps the operating system and its TCP/IP protocol stack off the data path, which allows very low latency and very high throughput. It also does not require the remote node's CPU to take part in the transfer, so few resources are spent on processing and moving data.

RDMA technology mainly includes:
• IB (InfiniBand): RDMA technology based on the InfiniBand architecture, proposed by the IBTA (InfiniBand Trade Association). Building an RDMA network on IB requires dedicated IB network cards and IB switches.
• iWARP (Internet Wide Area RDMA Protocol): RDMA technology based on TCP/IP, defined by IETF standards. iWARP runs on standard Ethernet infrastructure, but the servers need network cards that support iWARP.
• RoCE (RDMA over Converged Ethernet): RDMA technology based on Ethernet, also proposed by the IBTA. RoCE runs on standard Ethernet infrastructure, but the switches must support lossless Ethernet transmission and the servers must use RoCE network cards.

InfiniBand Technology Introduction

InfiniBand is an RDMA technology based on the InfiniBand architecture. It provides a channel-based, point-to-point message queue forwarding model. Each application can obtain its own data messages directly through the virtual channel it creates, without the intervention of the operating system or other protocol stacks.
The application layer of the InfiniBand architecture uses RDMA, which provides RDMA read and write access between remote nodes and offloads that work from the CPU. The network layer uses high-bandwidth transmission. The link layer uses a specific retransmission mechanism to ensure quality of service and does not require data buffering. InfiniBand must run in an InfiniBand network environment and can only be implemented with IB switches and IB network cards.

InfiniBand technology has the following characteristics:
• The application layer uses RDMA to reduce the latency of data processing on the host side.
• Message forwarding is controlled by the subnet manager, without the complex protocol interaction and computation that Ethernet requires.
• The link layer ensures quality of service through a retransmission mechanism, with no need for data buffering and no packet loss.
• It offers low latency, high bandwidth and low processing overhead.

iWARP Technology Introduction

iWARP is an RDMA technology based on Ethernet and TCP/IP and can run on standard Ethernet infrastructure. iWARP does not specify the physical layer, so it can work on any network that carries TCP/IP. iWARP allows many kinds of traffic to share the same physical connection, such as network, I/O, file system, block storage and inter-processor message communication.

iWARP Protocol Stack

iWARP consists of three sub-protocols: MPA, DDP, and RDMAP.
• The RDMAP layer converts RDMA read and write operations into RDMA messages and forwards the RDMA messages to the DDP layer.
• The DDP layer fragments overlong RDMA messages into DDP packets and forwards them to the MPA layer.
• The MPA layer adds markers, the data message length, CRC check data and other fields at fixed marker positions in the DDP data segment to form the MPA data segment, which is carried over TCP.

iWARP Technical Features

iWARP reduces the host-side network load in the following ways:
• TCP/IP processing is offloaded from the CPU to the RDMA network card, reducing CPU load.
• Memory copies are eliminated: applications can transfer data directly to the peer application's memory, significantly reducing CPU load.
• Application context switching is reduced: applications can bypass the operating system and issue commands to the RDMA network card directly from user space, which reduces overhead and significantly cuts the delay caused by context switches.

Because TCP provides flow control and congestion management, iWARP does not require Ethernet to support lossless transmission. It needs only ordinary Ethernet switches and iWARP network cards, so it can be used across wide area networks and scales well.

Introduction to RoCE Technology

RoCE carries the IB protocol on Ethernet and so implements RDMA over Ethernet. RoCE and InfiniBand share the same software application layer and transport control layer; they differ only in the network layer and the Ethernet link layer.

The RoCE protocol has two versions:
• RoCE v1: Carries RDMA directly over Ethernet and can only be deployed on a Layer 2 network. Its message structure adds a Layer 2 Ethernet header to the original IB architecture message, and RoCE messages are identified by Ethertype 0x8915.
• RoCE v2: Carries RDMA over UDP/IP and can be deployed on a Layer 3 network. Its message structure adds a UDP header, an IP header, and a Layer 2 Ethernet header to the original IB architecture message, and RoCE messages are identified by UDP destination port 4791. RoCE v2 supports hashing on the UDP source port, so ECMP can balance load across paths, which improves network utilization.

RoCE enables Ethernet-based data transmission to:
• Improve data transfer throughput.
• Reduce network latency.
• Reduce CPU load.

RoCE can be implemented with ordinary Ethernet switches, but the servers need RoCE network cards and the network must support lossless Ethernet. This is because, under the IB packet loss handling mechanism, the loss of any message causes a large number of retransmissions and seriously degrades data transmission performance. A RoCE network therefore has to be built on lossless Ethernet so that no packets are lost in transit. For more information about lossless Ethernet technology, refer to the article "FCoE Full Solution Series" - Enhanced Ethernet Technology. Building a lossless Ethernet requires the following key features:
In a RoCE environment, PFC and ECN need to be used together to guarantee bandwidth without packet loss. The functional comparison between the two is as follows:

Although IB, Ethernet RoCE, and Ethernet iWARP use the same API, they have different physical and link layers. Among the Ethernet solutions, RoCE has clear advantages over iWARP in latency, throughput, and CPU load. RoCE is supported by many mainstream solutions and is included in Windows Server software.

RDMA builds on the concepts of traditional networks but differs from IP networks in some respects. The most important difference is that RDMA provides a message service that lets applications directly access virtual memory on remote computers. The message service can be used for inter-process communication (IPC) across the network, for communication with remote servers, and, with the help of upper-layer protocols, for data transfer with storage devices. There are many Upper Layer Protocols (ULPs), such as the iSCSI Extensions for RDMA (iSER) and the SCSI RDMA Protocol (SRP). Mainstream software such as SMB, Samba, Lustre, and ZFS also supports RDMA.

RoCE and InfiniBand: one defines how to run RDMA on Ethernet, the other how to run RDMA in an IB network. RoCE aims to migrate IB applications (mainly cluster-based applications) to converged Ethernet, while for other applications the IB network can still provide higher bandwidth and lower latency than RoCE. Technical differences between RoCE and IB protocols:
RoCE and iWARP: one is based on the connectionless protocol UDP, the other on a connection-oriented protocol (such as TCP). RoCEv1 is confined to a single Layer 2 broadcast domain, while RoCEv2 and iWARP support Layer 3 routing. In large-scale networks, iWARP's large number of TCP connections occupies a large amount of memory and places higher demands on system specifications than RoCE does. In addition, RoCE supports multicast, while iWARP has no relevant standard definition.
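As a rough illustration of the encapsulation rules given earlier for the two RoCE versions (Ethertype 0x8915 for RoCE v1, UDP destination port 4791 for RoCE v2), the following Python sketch classifies a raw Ethernet frame and shows how a varying UDP source port can spread RoCE v2 flows over several ECMP paths. It is a simplified sketch, not how any switch or network card is actually implemented: the function names classify_frame, build_frame, and ecmp_path are hypothetical, VLAN tags and IPv6 are ignored, and the CRC32-modulo hash only stands in for the vendor-specific hash a real switch would use.

```python
import struct
import zlib

ROCE_V1_ETHERTYPE = 0x8915  # from the text: identifies RoCE v1 frames
ROCE_V2_UDP_PORT = 4791     # from the text: UDP destination port of RoCE v2
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17
ETH_HEADER_LEN = 14


def classify_frame(frame: bytes) -> str:
    """Return 'RoCEv1', 'RoCEv2', or 'other' for a raw Ethernet frame.

    Assumptions of this sketch: untagged frames (no VLAN) and IPv4 only.
    """
    if len(frame) < ETH_HEADER_LEN:
        return "other"
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCEv1"
    if ethertype != ETHERTYPE_IPV4 or len(frame) < ETH_HEADER_LEN + 20:
        return "other"
    ihl = (frame[ETH_HEADER_LEN] & 0x0F) * 4   # IPv4 header length in bytes
    protocol = frame[ETH_HEADER_LEN + 9]       # IPv4 protocol field
    udp_offset = ETH_HEADER_LEN + ihl
    if ihl < 20 or protocol != IPPROTO_UDP or len(frame) < udp_offset + 8:
        return "other"
    _src_port, dst_port = struct.unpack_from("!HH", frame, udp_offset)
    return "RoCEv2" if dst_port == ROCE_V2_UDP_PORT else "other"


def ecmp_path(src_ip: str, dst_ip: str, src_port: int, n_paths: int) -> int:
    """Pick one of n_paths for a RoCE v2 flow by hashing its 5-tuple.

    CRC32 modulo n_paths is an illustrative stand-in for a real switch hash.
    """
    key = f"{src_ip}|{dst_ip}|{IPPROTO_UDP}|{src_port}|{ROCE_V2_UDP_PORT}"
    return zlib.crc32(key.encode()) % n_paths


def build_frame(ethertype, ip_protocol=None, dst_port=None, src_port=49152):
    """Build a minimal test frame with zeroed addresses (helper for the demo)."""
    frame = bytes(12) + struct.pack("!H", ethertype)
    if ip_protocol is not None:
        ip_header = bytearray(20)
        ip_header[0] = 0x45          # IPv4, header length 5 * 4 = 20 bytes
        ip_header[9] = ip_protocol
        frame += bytes(ip_header)
    if dst_port is not None:
        frame += struct.pack("!HHHH", src_port, dst_port, 8, 0)
    return frame


if __name__ == "__main__":
    print(classify_frame(build_frame(ROCE_V1_ETHERTYPE)))
    print(classify_frame(build_frame(ETHERTYPE_IPV4, IPPROTO_UDP, 4791)))
    print(classify_frame(build_frame(ETHERTYPE_IPV4, IPPROTO_UDP, 53)))
    for port in (49152, 49153, 49154):
        print(port, ecmp_path("10.0.0.1", "10.0.1.1", port, n_paths=4))
```

Because RoCE v2 traffic always uses the same destination port, the source port is the only field of the 5-tuple that differs between flows of the same host pair, which is why the sketch varies it when choosing a path.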