AI adds power: lossless networks lead the way to the next stop

How efficient can network transmission be? Let's explore the secrets of the "lossless network".

"Eastern Data, Western Computing" is gaining momentum, opening a "new path" for the accelerated development of the digital economy

With the launch of the "Implementation Plan for the Computing Hub of the National Integrated Big Data Center Collaborative Innovation System", "Eastern Data, Western Computing" and "Innovative Development of Data Centers" have become hot topics in the industry. "Eastern Data, Western Computing" is a national strategic project to build the national hub nodes of an integrated national computing network. It aims to correct the unbalanced layout of China's digital infrastructure and enable data centers to play a "leading role" in the development of the digital economy.

The "Eastern Data, Western Computing" project will shape a new computing network pattern oriented towards data flow. In this context, a next-generation network is urgently needed to carry this traffic and put the development of the digital economy on a fast and steady road. As a representative of next-generation network development, the "lossless network" has come into view with vigorous innovative momentum.

What is a lossless network?

Lossless, as the name implies, means "zero" loss. Loss here refers to degradation in the key indicators of network transmission: packet forwarding, response time, processing time, and device throughput. A lossless network is therefore a network environment that achieves "zero packet loss, low latency, and high throughput", with the goal that the lower the latency and the higher the efficiency, the better. Compared with a "lossy" network that drops packets and suffers high latency, a lossless network improves congestion control, flow control, packet forwarding, and route selection to meet data centers' needs for massive computing power and efficient storage of massive data, greatly improving the user experience.

  • The two key skills - PFC and ECN

With the rise of cloud computing, big data, artificial intelligence, and 5G, network data has exploded, placing higher demands on data processing performance and data center construction. In business scenarios such as HPC (High Performance Computing), distributed storage, and AI, using the RDMA protocol to reduce CPU processing and latency and to improve application performance has become the development direction of data center networks in the computing era.

An RDMA network achieves lossless transmission by deploying PFC (Priority-based Flow Control) and ECN (Explicit Congestion Notification) in the network.

PFC is a queue-based backpressure technology. It controls the traffic of the RDMA-dedicated queue on a link: when congestion occurs at a switch ingress port, the switch applies backpressure to the upstream device by sending Pause frames for that queue. In a single-tier scenario, PFC can quickly and effectively throttle the server's sending rate so that the network does not lose packets. In a multi-tier network, however, problems such as unfair rate reduction, PFC storms, and PFC deadlock can occur. Therefore, when PFC is enabled in a data center, Pause frames must be strictly monitored and managed to ensure network reliability.

Figure 1: PFC Process
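
The backpressure behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not a switch implementation: the class name, the cell-based occupancy counter, and the XOFF/XON threshold values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PfcIngressQueue:
    """Hypothetical per-priority ingress buffer with pause/resume thresholds."""

    xoff_cells: int  # occupancy at or above which upstream is asked to pause
    xon_cells: int   # occupancy at or below which upstream may resume
    occupancy: int = 0
    upstream_paused: bool = False

    def enqueue(self, cells: int) -> Optional[str]:
        """Buffer arriving traffic; return "PAUSE" when backpressure starts."""
        self.occupancy += cells
        if not self.upstream_paused and self.occupancy >= self.xoff_cells:
            self.upstream_paused = True
            return "PAUSE"
        return None

    def dequeue(self, cells: int) -> Optional[str]:
        """Drain traffic; return "RESUME" once the buffer has emptied enough."""
        self.occupancy = max(0, self.occupancy - cells)
        if self.upstream_paused and self.occupancy <= self.xon_cells:
            self.upstream_paused = False
            return "RESUME"
        return None
```

Keeping the resume threshold below the pause threshold avoids flapping between the two states. The sketch covers a single hop only; the storm and deadlock problems mentioned above arise when such pauses propagate across several tiers.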

ECN is a flow-based, end-to-end congestion control technology. When a switch egress port is congested, the switch ECN-marks the data packets so that the traffic sender is asked to reduce its sending rate. ECN generally works better than PFC, but it has the following problems:

  • ECN relies on the receiving end to generate the congestion feedback message, so the feedback loop is relatively long;
  • Random marking can be unfair across flows;
  • Waterline design is relatively complex and must take the network architecture and business characteristics into account.

Figure 2: ECN process
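
To make "random marking" and "waterline" concrete, here is a minimal Python sketch of a commonly used probabilistic marking curve: no marking below a lower waterline, a linearly increasing probability up to an upper waterline, and marking of every packet beyond it. The parameter names `k_min`, `k_max`, and `p_max` are assumptions for this example and are not taken from any specific device.

```python
import random


def ecn_mark_probability(queue_depth: float, k_min: float,
                         k_max: float, p_max: float) -> float:
    """Marking probability for a given egress queue depth (illustrative)."""
    if queue_depth < k_min:
        return 0.0  # below the lower waterline: never mark
    if queue_depth >= k_max:
        return 1.0  # above the upper waterline: mark every packet
    # Between the waterlines: probability rises linearly from 0 to p_max.
    return p_max * (queue_depth - k_min) / (k_max - k_min)


def should_mark(queue_depth: float, k_min: float, k_max: float,
                p_max: float, rng=random) -> bool:
    """Randomly decide whether to set the ECN mark on one packet."""
    return rng.random() < ecn_mark_probability(queue_depth, k_min, k_max, p_max)
```

Because the decision is random per packet, two flows sharing the same queue can receive different numbers of marks over a short interval, which is one source of the unfairness noted in the list above.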

  • How ECN and PFC form a golden partnership

From the perspective of lossless network design, to make full use of the network's high-performance forwarding when ECN and PFC are configured together, the ECN and PFC buffer waterlines must be tuned, typically based on expert experience, so that ECN is triggered before PFC. That is, the network keeps forwarding data at full speed while the server proactively reduces its sending rate. Only if congestion persists does PFC make the upstream switch pause sending. In that case the throughput of the entire network drops, but no packets are lost.

Figure 3: ECN + PFC combination process
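
The ordering rule "ECN before PFC" can be expressed as a simple sanity check on the configured waterlines. The Python sketch below is a simplification: it assumes all thresholds are measured against one shared buffer counter, whereas real devices track ECN on egress queues and PFC on ingress buffers. All function and parameter names are hypothetical.

```python
def validate_waterlines(ecn_k_min: int, ecn_k_max: int,
                        pfc_xoff: int, buffer_cells: int) -> list:
    """Return a list of problems; an empty list means the ordering is sane."""
    problems = []
    if not 0 < ecn_k_min <= ecn_k_max:
        problems.append("ECN waterlines must satisfy 0 < k_min <= k_max")
    if ecn_k_max >= pfc_xoff:
        problems.append("ECN must reach full marking before the PFC threshold")
    if pfc_xoff >= buffer_cells:
        problems.append("PFC threshold leaves no headroom for in-flight packets")
    return problems


def active_mechanism(queue_depth: int, ecn_k_min: int, pfc_xoff: int) -> str:
    """Which congestion response applies at a given buffer depth."""
    if queue_depth >= pfc_xoff:
        return "pfc_pause"   # last resort: pause the upstream device
    if queue_depth >= ecn_k_min:
        return "ecn_mark"    # first response: ask senders to slow down
    return "forward"         # no congestion: forward at full speed
```

For example, `validate_waterlines(20, 80, 120, 200)` reports no problems, while raising the upper ECN waterline to 150 would be flagged because PFC would then fire before ECN has fully engaged.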

ECN and PFC: "befriending the distant and attacking the near" under the challenges of traffic and rate

In a RoCE network, building a lossless Ethernet requires supporting the following key features:

  • PFC: Provides priority-based flow control on a hop-by-hop basis, enabling multiple types of traffic to run on an Ethernet link without affecting each other.
  • ECN: When congestion occurs on a device, the device marks the ECN field in the IP header of the packets. On receiving marked packets, the receiver sends a CNP (Congestion Notification Packet) to the sender, which then reduces its sending rate. This achieves end-to-end congestion management and slows the spread of congestion.
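
The sender's side of this loop can be sketched as follows. This is a deliberately simplified model, loosely inspired by the multiplicative-decrease step of DCQCN-style schemes. The class name, the constant `g`, and the recovery rule (moving halfway back to line rate) are assumptions for illustration and do not describe any particular NIC.

```python
class CnpRateLimiter:
    """Hypothetical sender-side rate limiter that reacts to CNPs."""

    def __init__(self, line_rate_gbps: float, min_rate_gbps: float = 0.1,
                 g: float = 1 / 16):
        self.line_rate = line_rate_gbps
        self.min_rate = min_rate_gbps
        self.rate = line_rate_gbps  # start at full speed
        self.alpha = 1.0            # running estimate of congestion severity
        self.g = g                  # smoothing factor for alpha

    def on_cnp(self) -> None:
        """A CNP arrived: cut the rate in proportion to alpha."""
        self.rate = max(self.min_rate, self.rate * (1 - self.alpha / 2))
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNP for a while: decay alpha and recover toward line rate."""
        self.alpha = (1 - self.g) * self.alpha
        self.rate = min(self.line_rate, (self.rate + self.line_rate) / 2)
```

Starting from 100 Gbps, two consecutive CNPs bring the rate to 50 and then 25 Gbps in this model, and one quiet period lifts it back to 62.5 Gbps.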

The biggest difficulty with ECN is that setting the waterline is relatively complicated and must take the network architecture and business characteristics into account. Traffic in a live network is complex and changeable, so a static ECN waterline based on expert experience cannot cover all traffic scenarios and cannot guarantee optimal performance for lossless services. AI ECN uses AI algorithms to adjust the waterline of lossless queues. With a traffic model trained by AI, it predicts network traffic trends in real time and dynamically adjusts the ECN waterline, scheduling lossless queues accurately and helping the entire network reach optimal performance.
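
To illustrate what "dynamically adjusting the waterline" means, the Python sketch below replaces the trained model with a hand-written feedback rule. It is not the AI ECN algorithm, only a stand-in that shows the direction of each adjustment. The data fields, thresholds, and scaling factors are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    """Hypothetical telemetry collected for one lossless queue."""

    pfc_pauses: int          # Pause frames sent during the interval
    link_utilization: float  # 0.0 to 1.0
    small_flow_ratio: float  # share of delay-sensitive small flows, 0.0 to 1.0


def adjust_ecn_waterline(k_min: int, sample: QueueSample,
                         floor: int = 10, ceiling: int = 400) -> int:
    """Return the next lower ECN waterline using a simple rule of thumb."""
    if sample.pfc_pauses > 0:
        # PFC fired, so ECN reacted too late: mark earlier next time.
        proposed = k_min * 0.8
    elif sample.link_utilization < 0.9 and sample.small_flow_ratio < 0.5:
        # Link has spare capacity and large flows dominate: allow deeper queues.
        proposed = k_min * 1.1
    else:
        # Small flows dominate or the link is already busy: keep queues short.
        proposed = k_min
    return int(min(ceiling, max(floor, round(proposed))))
```

A learned model would replace the fixed rules and factors with predictions derived from observed traffic, but the trade-off is the same: a lower waterline protects latency and avoids PFC, while a higher one protects throughput.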

H3C AI ECN algorithm: AI-powered performance leads the industry

In this context, H3C launched the AI ECN intelligent lossless algorithm. Based on the characteristics of network traffic (N-to-1 incast value, queue depth, ratio of large to small flows, and other traffic features), it trains a traffic model with a reinforcement learning algorithm, senses and predicts network traffic trends in real time, automatically sets the optimal ECN waterline, and schedules queues accurately. While avoiding triggering PFC flow control as far as possible, it balances the forwarding of delay-sensitive small flows and throughput-sensitive large flows, further ensuring optimal performance across the entire network.

The processing flow of the AI ECN optimization algorithm adopted by the H3C AD-DC SeerFabric solution is as follows:

Figure 4: AI ECN Process

As an important part of the H3C AD-DC SeerFabric lossless network solution, the AI ECN algorithm dynamically adjusts the ECN waterline of egress port queues, so that network devices can achieve low transmission delay and high throughput across different networks and under traffic that changes in real time, improving the flexibility of network congestion control. In actual networking tests, performance indicators improved significantly, achieving the goal of boosting RDMA network performance.

Three unique engines drive the accelerated evolution of intelligent lossless networks

Earlier this year, H3C officially released the AD-DC SeerFabric lossless network solution. Built on a cloud-edge collaborative AI architecture, it optimizes and innovates on the AI ECN tuning algorithms used in the industry and combines them with the local AI Inside capabilities of H3C data center switches. It maximizes throughput and minimizes latency while ensuring zero packet loss, providing accurate forwarding of network services and deterministic service quality. At the same time, refined intelligent operation and maintenance makes the business experience of the RoCE network visible.

The core driving force of the H3C AD-DC SeerFabric lossless network solution comes from three key intelligent components:

  • Intelligent analysis engine: Uses the lossless network and its connected storage and computing resources, together with AI algorithms and expert experience, to analyze and build AI lossless optimization models for different data center traffic scenarios. By learning from and training on live network traffic in real time, it automatically adapts to the characteristics of different business traffic models, dynamically generates optimal network parameters, and achieves lossless forwarding;
  • Intelligent control engine: Automatically delivers the tuning parameters generated by the intelligent analysis engine to the devices so that the lossless network operates at a global optimum;
  • Edge AI engine: The switch embeds a high-performance AI computing module that uses the offline AI traffic model from the intelligent analysis engine to monitor network status in real time. It automatically adjusts the RDMA queue waterline locally based on the traffic characteristics of the live network, optimizes network parameters, and ensures lossless forwarding performance on the local network.

Figure 5: AD-DC SeerFabric lossless network solution architecture

In the intelligent era driven by technologies such as 5G, cloud, and AI, high-throughput, low-latency lossless networks have become a common requirement for the development of network services. In the future, based on "Cloud Native" and "Digital Brain 2021", H3C Group will continue to work with industry partners to actively promote the standardization and application of intelligent lossless networks, provide standard and open products, solutions, and services, continue to empower the ecosystem, and contribute to the construction of new national data centers. From lossless networks to lossless worlds, all kinds of scenes from science fiction movies are about to become reality. H3C will continue to discover more exciting lossless worlds with you.