Understanding Lossless Networks in One Article

According to OpenAI's analysis, the compute required to train state-of-the-art AI models has doubled roughly every 3.4 months since 2012. Over the past five years, GPU computing power has grown by nearly 90 times, while network bandwidth has grown by only 10 times. As AI training clusters scale out and the computing power of single nodes increases, the bottleneck in AI training has gradually shifted from computation to network communication.

On the storage side, SSDs (solid-state drives) offer roughly 100 times the access performance of the HDDs (mechanical hard disks) traditionally used in distributed storage, and SSDs using the NVMe interface protocol can improve access performance by up to 10,000 times. As media latency falls, the network's share of end-to-end storage latency has grown from under 5% to as much as 65%, making network latency a major factor in storage efficiency.

When sending or receiving messages, the traditional TCP/IP stack requires the kernel to perform multiple context switches, each costing about 5-10 µs, along with at least three data copies and CPU cycles for protocol encapsulation. RDMA provides higher single-stream bandwidth than TCP, and its kernel bypass and memory zero-copy features dramatically shorten both the protocol stack's processing latency and the data-movement latency. RDMA lets a user-space application read and write remote memory directly, without the CPU shuffling data through intermediate buffers, and writes data to the network card while bypassing the kernel entirely, achieving high throughput, ultra-low latency, and low CPU overhead.
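As a rough illustration of why kernel bypass and zero-copy matter, the back-of-the-envelope sketch below adds up the per-message costs quoted above; the memory-copy bandwidth is an assumed placeholder value, not a measurement.

```python
# Back-of-the-envelope estimate of kernel TCP/IP path overhead per message,
# using the figures quoted above. ASSUMED_COPY_GBPS is a hypothetical
# placeholder for memcpy bandwidth, not a measured value.

ASSUMED_COPY_GBPS = 10          # assumed memory-copy bandwidth, GB/s (hypothetical)
CONTEXT_SWITCH_US = (5, 10)     # per-switch cost range from the text, microseconds
NUM_COPIES = 3                  # minimum data copies on the kernel path (from the text)

def kernel_path_overhead_us(msg_bytes: int, switches: int = 2) -> tuple[float, float]:
    """Return a (low, high) estimate of per-message overhead in microseconds."""
    copy_us = NUM_COPIES * msg_bytes / (ASSUMED_COPY_GBPS * 1e3)  # bytes / (bytes per us)
    return (switches * CONTEXT_SWITCH_US[0] + copy_us,
            switches * CONTEXT_SWITCH_US[1] + copy_us)

low, high = kernel_path_overhead_us(64 * 1024)   # a 64 KiB message
print(f"~{low:.1f}-{high:.1f} us of overhead that RDMA's kernel bypass and zero-copy avoid")
```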

The latest version of the RDMA-over-Ethernet protocol, RoCEv2, runs over UDP and has no mechanisms such as sliding windows or acknowledgements; once a packet is lost, recovery depends on upper-layer applications to detect the loss and retransmit, which sharply reduces RDMA transmission efficiency. When the network packet loss rate exceeds 10⁻³ (0.1%), effective RDMA throughput drops steeply, and at 2% packet loss RDMA throughput falls to essentially zero. To keep RDMA throughput unaffected, the packet loss rate must stay below 10⁻⁵ (one in 100,000), and ideally be zero. In short, RDMA's high efficiency depends on a lossless network.

Causes of Network Congestion

To achieve a lossless network, we actually need to solve the congestion problem on the network.

The main causes of network congestion are as follows:

1) Asymmetric uplink/downlink design. Data center networks are usually designed with oversubscription: uplink and downlink bandwidth are unequal (the convergence ratio). Taking a switch as an example, when the aggregate rate at which downstream servers send traffic upward exceeds the total uplink bandwidth, the uplink ports become congested.

2) ECMP. Most data centers use a fabric architecture with ECMP to load-balance traffic across multiple equal-cost links: a hash is computed over configured packet fields and used to pick one link for forwarding. This process does not consider whether the chosen link is already congested, so when that link's traffic is saturated, network congestion occurs.

3) TCP incast. When a server sends a request to a group of nodes, all nodes in the cluster receive it at the same time and respond almost simultaneously, producing a "micro-burst". If the buffer on the switch port facing the requesting server is too small to absorb the burst, congestion and packet loss occur (a toy simulation follows below).
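To make the micro-burst effect concrete, here is a toy simulation of incast under assumed buffer, burst, and drain parameters; none of these are real hardware values.

```python
# Toy simulation of TCP incast: N servers respond simultaneously through one
# switch egress port. Buffer size, burst size, and drain rate are illustrative
# assumptions, not real hardware parameters.

EGRESS_BUFFER_KB = 512      # assumed shared buffer behind the egress port
DRAIN_PER_TICK_KB = 1250    # assumed 10 Gb/s egress draining over a 1 ms tick
RESPONSE_KB = 256           # assumed size of each server's response burst

def incast_drops(num_servers: int) -> int:
    """KB dropped when num_servers respond in the same tick."""
    arriving = num_servers * RESPONSE_KB
    # Whatever exceeds (what the port can drain this tick + what fits in buffer) is lost.
    return max(0, arriving - DRAIN_PER_TICK_KB - EGRESS_BUFFER_KB)

for n in (4, 8, 16, 32):
    print(f"{n:2d} synchronized responders -> {incast_drops(n):5d} KB dropped")
```

With 4 responders the burst fits; from 8 upward the synchronized responses overwhelm the buffer even though the average load is modest, which is exactly the micro-burst pattern described above.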

Characteristics of Lossless Networks

"0 packet loss", "low latency" and "high throughput" are the three core features of lossless networks.

These three indicators influence each other:

  • 0 packet loss: avoiding loss requires throttling senders, which suppresses link bandwidth, lowering throughput and increasing the transmission delay of large flows;
  • Low latency: keeping latency low requires short switch queues, which lowers throughput;
  • High throughput: keeping link utilization high causes switch congestion and queuing, producing high latency for small flows.

There are two aspects to building a lossless network: optimizing the network itself, and jointly optimizing the network with application systems. The latter includes network-computing integration and network-storage integration, which are beyond the scope of this article; this article focuses on the former.

The goal of network optimization is to maximize the throughput and minimize the latency of the entire network, which includes three levels:

1) Flow control: used to match the rate between the sender and the receiver to avoid packet loss;

2) Congestion control: used to solve the problem of traffic rate control when the network is congested, so as to achieve full throughput and low latency;

3) Traffic scheduling: used to balance service traffic across network links and to guarantee the service quality of different traffic classes.

Flow Control Technology for Lossless Networks

Flow control is the foundational technique for achieving zero packet loss in the network; the data transmission rate is controlled by the traffic receiver.

The PAUSE frame defined in IEEE 802.3 Annex 31B is the basic mechanism for implementing flow control on Ethernet. When a receiving device's capacity falls below the sending rate of the upstream device, it actively sends a PAUSE frame asking the upstream device to suspend transmission for a period of time. A PAUSE frame can only stop the upstream device from sending ordinary data frames; it cannot stop MAC control frames.

The destination MAC address of a PAUSE frame is the reserved multicast address 01-80-C2-00-00-01, the source MAC is the address of the device sending it, and the MAC Control Opcode field is 0x0001; the PAUSE frame is itself a type of MAC control frame. When congestion occurs on a receiving device, the upstream port typically receives multiple PAUSE frames in succession, and the port keeps sending PAUSE frames for as long as the congestion persists. Although Ethernet PAUSE-based flow control prevents packet loss, it stops all traffic on the link. To solve this problem, IEEE introduced Priority-based Flow Control (PFC) in 802.1Qbb, also known as Per-Priority Pause or CBFC (Class-Based Flow Control).
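The frame layout just described fits in a few lines; the sketch below builds a raw PAUSE frame as bytes (the source MAC is a made-up example, and the real frame would also carry a 4-byte FCS appended by the NIC).

```python
import struct

# Build a minimal IEEE 802.3 PAUSE frame (without FCS) as raw bytes.
# The source MAC used below is a made-up example address.

PAUSE_DST = bytes.fromhex("0180C2000001")   # reserved multicast address from the text
ETHERTYPE_MAC_CONTROL = 0x8808              # MAC Control EtherType
OPCODE_PAUSE = 0x0001                       # PAUSE opcode from the text

def build_pause_frame(src_mac: bytes, pause_quanta: int) -> bytes:
    """pause_quanta: requested pause time in units of 512 bit times (0 resumes traffic)."""
    frame = PAUSE_DST + src_mac
    frame += struct.pack("!HHH", ETHERTYPE_MAC_CONTROL, OPCODE_PAUSE, pause_quanta)
    return frame.ljust(60, b"\x00")          # pad to the 64-byte minimum (60 + 4-byte FCS)

frame = build_pause_frame(bytes.fromhex("AABBCCDDEEFF"), 0xFFFF)
print(frame.hex())
```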

PFC divides an Ethernet link into 8 virtual channels, each assigned a priority, and allows any individual channel to be paused and resumed independently while traffic on the other channels flows uninterrupted. When the buffer occupancy of a queue rises above the PFC threshold, the switch sends a PFC backpressure (pause) message upstream, telling the sender to stop transmitting that priority; once occupancy falls back below the threshold, transmission resumes, ultimately achieving zero-packet-loss transmission.
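A minimal sketch of this per-priority pause logic, assuming illustrative XOFF/XON thresholds in buffer cells; real switches derive these thresholds from headroom calculations.

```python
# Sketch of per-priority PFC backpressure with XOFF/XON hysteresis.
# Threshold values are illustrative assumptions, not vendor defaults.

XOFF_CELLS = 800   # assumed: pause this priority when its buffer use exceeds this
XON_CELLS = 600    # assumed: resume when buffer use falls back below this

class PfcQueue:
    def __init__(self, priority: int):
        self.priority = priority
        self.paused_upstream = False

    def on_buffer_change(self, used_cells: int) -> str | None:
        """Return the PFC action ('XOFF'/'XON') this buffer level triggers, if any."""
        if not self.paused_upstream and used_cells >= XOFF_CELLS:
            self.paused_upstream = True
            return "XOFF"   # send a PFC pause frame for this priority only
        if self.paused_upstream and used_cells <= XON_CELLS:
            self.paused_upstream = False
            return "XON"    # send a pause frame with timer 0: resume this priority
        return None

q3 = PfcQueue(priority=3)
for level in (500, 850, 820, 590):
    action = q3.on_buffer_change(level)
    if action:
        print(f"priority {q3.priority}: buffer at {level} cells -> {action}")
```

The gap between the two thresholds prevents the switch from flapping between pause and resume as the buffer hovers near a single cutoff.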

Congestion Control in Lossless Networks

Congestion control is a method of controlling the total amount of data entering the network to keep network traffic at an acceptable level.

ECN (Explicit Congestion Notification) technology: when the traffic receiver senses congestion on the network (via congestion marks in received packets), it notifies the sender through protocol messages, causing the sender to reduce its sending rate, thereby avoiding congestion-induced packet loss at an early stage and maximizing network performance.

DCQCN (Data Center Quantized Congestion Notification): Currently the most widely used congestion control algorithm in RoCEv2 networks. It combines the QCN algorithm and the DCTCP algorithm and requires data center switches to support WRED and ECN. DCQCN can provide better fairness, achieve high bandwidth utilization, and ensure low queue cache occupancy and less queue cache jitter.

The DCQCN algorithm involves three parties: the switch as Congestion Point (CP), the receiver as Notification Point (NP), and the sender as Reaction Point (RP). When the CP finds that an egress queue exceeds the threshold, it marks forwarded packets with the ECN congestion flag (ECN field set to 11) with a certain probability to signal congestion in the network; this marking is performed by the WRED (Weighted Random Early Detection) function. On receiving ECN-marked packets, the NP sends backpressure information, Congestion Notification Packets (CNPs), to the RP, which reduces its sending rate.

Figure 1: Schematic diagram of the DCQCN algorithm
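The RP side of the algorithm can be sketched as follows. This is a simplified model based on the published DCQCN rate-update rules, with the fast-recovery and rate-increase stages omitted and an illustrative gain value.

```python
# Simplified DCQCN reaction-point (RP) sketch: rate cut on CNP arrival and
# alpha decay while no CNPs arrive. The recovery/increase stages of the full
# algorithm are omitted; G is an illustrative gain value.

G = 1 / 256                # alpha update gain (a common choice in the DCQCN paper)

class ReactionPoint:
    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps   # current sending rate Rc
        self.alpha = 1.0             # congestion estimate

    def on_cnp(self) -> None:
        """A CNP from the NP means our packets were ECN-marked: cut the rate."""
        self.alpha = (1 - G) * self.alpha + G
        self.rate *= 1 - self.alpha / 2

    def on_alpha_timer(self) -> None:
        """No CNP arrived for a full timer period: the congestion estimate decays."""
        self.alpha *= 1 - G

rp = ReactionPoint(line_rate_gbps=100.0)
for _ in range(3):
    rp.on_cnp()
print(f"rate after 3 CNPs: {rp.rate:.1f} Gbps, alpha={rp.alpha:.3f}")
```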

ECN Overlay: applies the ECN function to VXLAN, so that congestion in the overlay network can be detected in a timely manner, allowing the traffic receiver to notify the sender to slow down and alleviate network congestion.

iQCN (intelligent Quantized Congestion Notification): solves the problem of congestion being aggravated when the sender ramps its rate back up because CNP messages fail to reach it in time, by letting the forwarding device intelligently compensate by sending CNP messages itself. A CP (Congestion Point) with iQCN records the CNP messages it receives and maintains a flow table of CNP information and timestamps, while continuously monitoring the congestion level of its own ports. When a port becomes heavily congested, the CP compares the interval between CNP messages with the NIC's rate-increase time: if the interval between CNPs received from the NP is less than the RP NIC's rate-increase time, the NIC can slow down normally, and the CP simply forwards the CNPs; if the interval is greater than the rate-increase time, the NIC cannot slow down in time and risks speeding up, so the CP actively compensates by sending CNP messages itself.
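A sketch of that compensation decision follows, assuming the CP keys its flow table on a flow identifier; NIC_RATE_INCREASE_TIME_US is a hypothetical stand-in for the RP NIC's rate-increase timer, whose real value is NIC-specific.

```python
import time

# Sketch of the iQCN decision at the congestion point (CP): if the gap between
# CNPs for a flow exceeds the RP NIC's rate-increase timer while the port is
# congested, the CP compensates by generating a CNP itself.
# NIC_RATE_INCREASE_TIME_US is a hypothetical stand-in for a NIC-specific timer.

NIC_RATE_INCREASE_TIME_US = 300.0

class IqcnFlowTable:
    def __init__(self):
        self.last_cnp_us: dict[tuple, float] = {}   # flow id -> timestamp of last CNP

    def on_cnp_forwarded(self, flow: tuple) -> None:
        self.last_cnp_us[flow] = time.monotonic() * 1e6

    def needs_compensation(self, flow: tuple, port_congested: bool) -> bool:
        """True if the CP should inject a CNP for this flow itself."""
        if not port_congested or flow not in self.last_cnp_us:
            return False
        gap = time.monotonic() * 1e6 - self.last_cnp_us[flow]
        # CNPs arriving slower than the NIC ramps up -> RP may speed up into congestion.
        return gap > NIC_RATE_INCREASE_TIME_US

table = IqcnFlowTable()
flow = ("10.0.0.1", "10.0.0.2", 4791)    # example RoCEv2 flow key (UDP port 4791)
table.on_cnp_forwarded(flow)
time.sleep(0.001)                        # 1000 us pass with no further CNPs
if table.needs_compensation(flow, port_congested=True):
    print("CP injects a compensating CNP toward the RP")
```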

Traffic Scheduling Technology for Lossless Networks

Traffic scheduling refers to a network node distributing the load (traffic) across multiple links when forwarding. Common load-sharing mechanisms in the network are Equal-Cost Multi-Path routing (ECMP) and Link Aggregation (LAG).

ECMP (Equal-Cost Multi-Path routing) load balancing: provides both equal-cost multi-path load sharing and link backup, and applies where multiple links reach the same destination address. When several routes have the same priority and metric, traffic is load-balanced across them; when their priorities or metrics differ, route backup is achieved: packets to the destination use only one link while the others stay in backup or inactive state, and switching over takes some time in a dynamic-routing environment. For load balancing, the router takes the packet's five-tuple (source address, destination address, source port, destination port, protocol) as the hash factor, generates a HASH-KEY with a hash algorithm, and selects one member link of the load-sharing group to forward the packet. Normally the router forwards data over the primary route; when the primary link fails, the primary route becomes inactive and the router selects the highest-priority backup route instead. When the primary link recovers, the primary route, having the highest priority, is reselected.
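The hashing step can be sketched as follows; real switches use hardware hash functions rather than a software digest, but the flow-to-link mapping is the same idea, and the link names here are illustrative.

```python
import hashlib

# Sketch of five-tuple ECMP path selection: hash the flow identifier and use it
# to index into the set of equal-cost next hops.

def ecmp_pick_link(src_ip: str, dst_ip: str, src_port: int,
                   dst_port: int, protocol: int, links: list[str]) -> str:
    five_tuple = f"{src_ip},{dst_ip},{src_port},{dst_port},{protocol}".encode()
    hash_key = int.from_bytes(hashlib.sha256(five_tuple).digest()[:4], "big")
    return links[hash_key % len(links)]   # same flow -> same link, preserving order

links = ["spine1", "spine2", "spine3", "spine4"]
print(ecmp_pick_link("10.0.0.1", "10.0.1.9", 49152, 4791, 17, links))
# Note: the choice ignores how busy the selected link is - the congestion
# cause described earlier in this article.
```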

LAG (Link Aggregation Group) load balancing: bundles multiple physical interfaces into one logical interface, called an aggregate or Eth-Trunk interface, to increase link bandwidth. Because the aggregation group contains multiple physical links between the two devices, the first frame of a data stream could travel over one physical link and the second over another, causing packets to arrive out of order. Flow-by-flow load balancing avoids this: a hash algorithm over addresses in the data frame produces a HASH-KEY value that maps to an outbound interface in the Eth-Trunk forwarding table. Different MAC or IP addresses yield different HASH-KEYs and thus different outbound interfaces, so all frames of one data stream are forwarded on the same physical link, while traffic is balanced across the links of the aggregation group.
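The same per-flow hashing idea, sketched here for Eth-Trunk member selection over MAC addresses; the interface names are illustrative.

```python
import zlib

# Sketch of flow-by-flow LAG (Eth-Trunk) member selection: hashing the frame's
# MAC addresses keeps all frames of one flow on one member link, avoiding
# reordering. Member interface names are illustrative.

def lag_pick_member(src_mac: str, dst_mac: str, members: list[str]) -> str:
    hash_key = zlib.crc32(f"{src_mac.lower()}-{dst_mac.lower()}".encode())
    return members[hash_key % len(members)]

members = ["GE0/0/1", "GE0/0/2", "GE0/0/3"]
print(lag_pick_member("AA:BB:CC:00:00:01", "AA:BB:CC:00:00:02", members))
```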

Summary

The efficient operation of RDMA depends on lossless networks. "Zero packet loss", "low latency" and "high throughput" are their three core characteristics, and flow control, congestion control and traffic scheduling are their three core technologies. By building lossless networks, computing power, network and storage can be matched to one another and work together to deliver greater value.
