When the network transmission protocol SRD meets DPU

What?

SRD (Scalable Reliable Datagram) is a network transmission protocol that AWS launched in 2017 to address Amazon's cloud performance challenges. It is a high-throughput, low-latency transport designed specifically for AWS data center networks, implemented on the Nitro chip, and built to improve HPC performance.

SRD does not preserve packet order; instead, it sends packets over as many network paths as possible while avoiding overloading any single path. To minimize jitter and react as quickly as possible to fluctuations in network congestion, SRD is implemented in AWS's in-house Nitro chip.

SRD is used by HPC/ML frameworks on EC2 hosts through the kernel-bypass interface of AWS EFA (Elastic Fabric Adapter).

Features of SRD:

  • Packet order is not preserved; reordering is handed over to the upper messaging layer
  • Packets are sent over as many network paths as possible using standard ECMP; the sender controls ECMP path selection by manipulating the packet encapsulation, achieving multi-path load balancing
  • A proprietary congestion control algorithm based on dynamic per-connection rate limiting, which uses RTT (round-trip time) measurements to detect congestion and recovers quickly from packet loss or link failure (see the sketch after this list)
  • Because delivery is out of order and segmentation is not supported, the number of QPs (queue pairs) required for SRD transmission is significantly reduced
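To make the congestion-control bullet concrete, here is a minimal sketch of dynamic per-connection rate limiting driven by RTT samples. This is not AWS's actual algorithm (which is proprietary); the class name, thresholds, and constants are all illustrative assumptions.

```python
# Minimal sketch of RTT-driven, per-connection dynamic rate limiting.
# NOT AWS's proprietary algorithm; all names and thresholds are illustrative.

class ConnectionRateLimiter:
    def __init__(self, base_rate_pps: float, min_rtt_us: float):
        self.rate_pps = base_rate_pps  # current allowed send rate (packets/s)
        self.min_rtt_us = min_rtt_us   # best-case RTT observed on this path set
        self.srtt_us = min_rtt_us      # smoothed RTT estimate

    def on_ack(self, rtt_sample_us: float) -> None:
        # Exponentially weighted moving average, as in classic RTT estimation.
        self.srtt_us = 0.875 * self.srtt_us + 0.125 * rtt_sample_us
        if self.srtt_us > 1.5 * self.min_rtt_us:
            # RTT inflation signals queue buildup: back off multiplicatively.
            self.rate_pps *= 0.8
        else:
            # Path looks healthy: probe for more bandwidth additively.
            self.rate_pps += 10.0

    def on_loss(self) -> None:
        # Packet loss (or link failure): cut the rate sharply; subsequent
        # healthy ACKs let the connection recover quickly via on_ack().
        self.rate_pps *= 0.5
```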

Why?

Why not TCP?

TCP is the main means of reliable data transmission in IP networks. It has served the Internet well since its birth and is still the best protocol for most communications, but it is poorly suited to latency-sensitive workloads. The best-case TCP round-trip time in a data center is about 25 µs, yet outliers caused by congestion (or link failure) can reach 50 ms or even several seconds, mainly because of TCP's retransmission mechanism after packet loss. In addition, a TCP connection is one-to-one; even if the latency problem were solved, it would still be hard to reconnect quickly in the event of a failure.

TCP is a general-purpose protocol and is not optimized for HPC scenarios. As early as 2020, AWS argued that TCP needed to be removed from these workloads.

Why not RoCE?

InfiniBand is a popular high-throughput, low-latency interconnect for high-performance computing that supports kernel bypass and transport offload. RoCE (RDMA over Converged Ethernet), sometimes described as InfiniBand over Ethernet, allows the InfiniBand transport to run over Ethernet and could in theory have provided an alternative to TCP in AWS data centers.

The EFA host interface is very similar to the InfiniBand/RoCE interface. However, the InfiniBand transport does not meet AWS's scalability requirements. One reason is that RoCE requires PFC (priority flow control), which is not feasible on large networks because it causes head-of-line blocking, congestion spreading, and occasional deadlocks; PFC is better suited to data centers smaller than AWS's. In addition, even with PFC, RoCE would still suffer from ECMP (equal-cost multi-path routing) collisions under congestion (much like TCP) and from suboptimal congestion control.

Why SRD?

SRD is a reliable, high-performance, low-latency network transport designed specifically for AWS, and a major improvement in data center data transmission. Inspired by InfiniBand Reliable Datagram, SRD adds many changes and improvements shaped by the workloads of large-scale cloud computing. SRD leverages the resources and characteristics of the cloud (such as AWS's complex multipath backbone network) to support new transmission strategies and bring value to tightly coupled workloads.

Any real network suffers from packet loss, congestion, blocking, and similar problems. These are not once-a-day events; they happen all the time.

Most protocols (like TCP) deliver packets in order, which means that a single packet loss delays the on-time arrival of every packet queued behind it (an effect known as "head-of-line blocking"). This has a significant impact on packet loss recovery and throughput.

SRD's innovation lies in intentionally sending packets over multiple paths. Although packets usually arrive out of order, AWS reorders them extremely quickly at the receiving end, ultimately reducing transmission latency dramatically while fully utilizing the network's throughput capacity.

SRD can push all the packets that make up a data block onto all available paths at once, which means it is not affected by head-of-line blocking, recovers faster from packet loss, and sustains high throughput.
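As a rough illustration of what receive-side reordering involves, here is a minimal sketch of a reordering buffer that accepts the packets of one data block in any arrival order and releases the payload only once the block is complete. It is a toy under stated assumptions, not EFA's actual reordering engine.

```python
# Toy receive-side reordering buffer; NOT EFA's actual engine.
# Packets carry a sequence number; arrival order does not matter.

class ReorderBuffer:
    def __init__(self, total_packets: int):
        self.total = total_packets
        self.slots: dict[int, bytes] = {}  # seq -> payload

    def on_packet(self, seq: int, payload: bytes) -> bytes | None:
        self.slots[seq] = payload            # store regardless of order
        if len(self.slots) == self.total:    # block complete on all paths
            return b"".join(self.slots[i] for i in range(self.total))
        return None                          # still waiting for stragglers

# Packets 2, 0, 1 arrive out of order; the message is released intact.
buf = ReorderBuffer(total_packets=3)
buf.on_packet(2, b"C")
buf.on_packet(0, b"A")
assert buf.on_packet(1, b"B") == b"ABC"
```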

P99 tail latency is the latency below which 99% of requests complete; only the slowest 1% exceed it. Because that tail reflects the combined effect of all packet loss, retransmission, and congestion in the network, it describes the "real" network situation better than an average does. SRD makes P99 tail latency drop sharply (by roughly 10x).
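To see why even a small congested fraction dominates P99, here is a quick illustration with made-up numbers (the latency values are assumptions for demonstration, not AWS measurements):

```python
import random

# Made-up distribution: 98% of requests are fast (25-50 us), 2% hit
# congestion or retransmission (1 ms - 50 ms, i.e. 1,000 - 50,000 us).
samples = [random.uniform(25, 50) for _ in range(9_800)]
samples += [random.uniform(1_000, 50_000) for _ in range(200)]
samples.sort()

# P99: 99% of requests finish at or below this value. Because 2% of
# requests are slow here, P99 lands deep in the congested tail even
# though the typical (median) request is fast.
p99 = samples[int(len(samples) * 0.99) - 1]
p50 = samples[len(samples) // 2]
print(f"P50 ~ {p50:.0f} us, P99 ~ {p99:.0f} us")
```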

The main functions of SRD include:

  • Out-of-order delivery: removes the in-order message delivery constraint, eliminating head-of-line blocking; AWS implements a packet reordering engine in the EFA user-space software stack
  • Equal-cost multi-path routing (ECMP): there may be hundreds of paths between two EFA instances. By exploiting the consistent flow hashing of large multipath networks and SRD's ability to react quickly to changing network conditions, the most efficient paths for a message can be found. Packet spraying prevents congestion hotspots and allows quick, seamless recovery from network failures (see the sketch after this list).
  • Fast packet loss response: SRD responds to packet loss far faster than any higher-level protocol could. Occasional packet loss, especially in long-running HPC applications, is part of normal network operation, not an anomaly.
  • Scalable transport offload: with SRD, unlike other reliable transports such as InfiniBand Reliable Connection (IBRC), a process can create a single queue pair and use it to communicate with any number of peers.
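The packet-spraying idea in the ECMP bullet can be approximated in user space: data center routers typically hash the UDP 5-tuple to choose among equal-cost paths, so varying the source port per packet steers packets onto different paths. The sketch below is an analogy under that assumption; real SRD manipulates the encapsulation inside the Nitro card, not with one UDP socket per port, and `NUM_PATHS` and `BASE_SRC_PORT` are hypothetical.

```python
import socket

NUM_PATHS = 16          # illustrative: how many ECMP hash buckets to target
BASE_SRC_PORT = 40_000  # hypothetical local port range

def make_path_sockets() -> list[socket.socket]:
    # One UDP socket per source port; each distinct 5-tuple hashes to a
    # (likely) different equal-cost path inside the fabric.
    socks = []
    for i in range(NUM_PATHS):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.bind(("0.0.0.0", BASE_SRC_PORT + i))
        socks.append(s)
    return socks

def spray(socks: list[socket.socket], dest: tuple[str, int],
          packets: list[bytes]) -> None:
    # Round-robin the packets of one flow across all source ports,
    # spreading the load over many paths instead of pinning the whole
    # flow to one path as classic per-flow ECMP would.
    for i, pkt in enumerate(packets):
        socks[i % NUM_PATHS].sendto(pkt, dest)
```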

How?

The key to how SRD actually works is not the protocol but how it is implemented in hardware. In other words, for now, SRD only works with AWS Nitro DPUs.

The packets that SRD delivers out of order must be reordered before the operating system can read them, and the CPU, busy with many other things, cannot be relied on to untangle the scrambled packet stream. Even if the CPU took full responsibility for the SRD protocol and reassembled the stream itself, it would be using a cannon to kill a mosquito: a waste of capability that keeps the system busy with work that should not consume much time, without actually improving performance.

Under SRD's unusual "protocol guarantee," message-order recovery is left to the upper layers when parallelism in the network causes packets to arrive out of order, because those layers have a better understanding of the required ordering semantics. AWS chose to implement the SRD reliability layer in the Nitro card so that SRD sits as close to the physical network layer as possible, avoiding the performance noise injected by the host operating system and hypervisor. This allows network behavior to adapt rapidly: fast retransmissions and quick slowdowns in response to queue buildup.

When AWS says it wants packets reassembled "on the stack," it is really saying it wants the DPU to do the work of putting the pieces back together before handing the packet to the system. The system itself has no idea the packets were out of order; it does not even know how they got there. It only knows that data was sent from somewhere and arrived without error.

The key here is the DPU. AWS SRD works only on systems in AWS that have Nitro configured. Many AWS servers already have this hardware installed and configured, and enabling the feature yields the performance improvement. Users must enable it explicitly on their own servers, and if they communicate with a device that does not have SRD enabled or is not equipped with a Nitro DPU, they will not get the corresponding performance gain.

As for whether SRD will be open-sourced in the future, a question many people care about, we can only wait and see!
