Does Overlay require RDMA and Segment Routing?

In the past few years there has been an ongoing debate: is it Segment Routing over SD-WAN, or SD-WAN over Segment Routing? Let's broaden the question to Overlay: Segment Routing over Overlay, or Overlay over Segment Routing? The former is Ruta and SR over UDP; the latter is SRv6. In essence, this is the application perspective versus the network perspective. Half a month ago, some network-engineering friends complained to me, with some sadness in their words, that network engineers working on cloud native are about to lose their jobs.

With every technological change, if you cannot become the bulldozer or the steamroller, you can only become part of the road, flattened and crushed.

Kong Yiji used to be a network engineer, but he failed the IE exam and did not know how to make a living; he grew poorer and poorer, until he was nearly reduced to begging. Fortunately he was good at crimping RJ45 "crystal heads," so he could make patch cables for others and scrape by. Unfortunately he had a bad habit: he liked to drink and was lazy. After a few days he would disappear, taking the crimping pliers with him. After a few such episodes, no one asked him to make cables anymore.

Kong Yiji knew he could not talk to them, so he had to talk to the children. Once he said to me, "Have you studied networking?" I nodded slightly. He said, "So you have studied networking... let me test you. What is the wire order of the eight conductors in a network cable?" I thought: is a beggar qualified to test me? I turned my head and ignored him. Kong Yiji waited a long while, then said very earnestly, "You can't recite it?... I'll teach you. Remember! You must remember these wire orders; when you become a network engineer you will need them to crimp crystal heads." I thought to myself that I was still far from being a network engineer, and that we network engineers never crimp crystal heads anyway; I found it both funny and tiresome, and answered lazily, "Who needs you to teach me? Isn't it white-orange, orange, white-green, blue, white-blue, green, white-brown, brown?" Kong Yiji looked delighted, tapped the long nails of two fingers on the counter, nodded and said, "Right, right!... There are two wiring standards for crimping crystal heads, do you know them?" I grew more and more impatient and walked away with a pout.

After that I did not see Kong Yiji for a long time. At the end of the year the shopkeeper took down the chalk board and said, "Kong Yiji still owes nineteen coins!" At the Dragon Boat Festival the following year he said it again: "Kong Yiji still owes nineteen coins!" By Mid-Autumn he said nothing, and I did not see him at year's end either. I have not seen him since; perhaps Kong Yiji really is dead.

Over the past thirty years, the network, with its heavy-asset character, has been dominated by a few oligarchs, and major technical missteps are everywhere. The network of thirty years ago was indeed complicated: X.25, Frame Relay, ATM, and even ordinary users had to dial 163 or 169 with AT commands to get online. The network naturally wore a veil of mystery that intimidated application developers. Even introductory books such as "Computer Networks" are hard going for application developers, never mind the complex routing protocols and the broadcast storms or routing loops that one careless change can trigger. These were the technologies network engineers used to boast about. Nowadays Ethernet is everywhere and seemingly needs no complicated configuration, while the SDN that network engineers tinkered with, together with BGP invented decades ago, has caused plenty of real outages. No wonder the first error message when an application fails is "Please check your network."

Driven by cloud computing, the status of network engineers keeps falling. In many companies, cloud-computing resources are controlled by the compute team; network engineers grow ever more unfamiliar with cloud networks, which eventually pushes applications to go cloud native, and network engineers are crushed completely. Although network engineers have spent the past few years desperately learning Python to do DevOps, even writing routing protocols such as BGP in Python, they drift further and further from applications, and the technologies they invent become harder and harder to use. The death of SDN is therefore inevitable. The two x marks in the figure below are both network-related.

As for those working on SD-WAN, many seem not even to follow the development of distributed-database consistency, so they naturally stumble over software complexity.

Does Overlay require RDMA?

Let's start with a simple question. Some people say RDMA is a treasure, yet AWS doesn't care for it. That is essentially a network engineer's way of thinking; the real question hiding behind it is whether Overlay needs RDMA. I wonder whether whoever wrote that has read the SRD driver code; if not, the #include directives at the top alone should tell you enough, shouldn't they?

RDMA already has a fairly mature ecosystem for kernel bypass at the virtual-machine level, which must be taken into account. From this perspective, SRD and eRDMA are essentially the same.

I looked into a series of problems: DDIO, the memory-instruction extensions of compute-in-memory architectures, and the way the coming CXL can operate memory and isolate I/O-induced jitter on various adapters, and I then hit a memory bottleneck in a 400G project. What I wanted was to eliminate the last DMA buffer copy: access NIC memory directly through CXL.cache to relieve the jitter caused by DDIO and the PCIe bus, and add some vectorized instructions for such memory operations. NetDAM therefore builds a programmable multi-machine shared-memory abstraction layer. The foundation is still providing SMC (Shared Memory Communication) to the virtual machine, but RDMA's original QP mechanism is replaced with addressing by IP address + memory address, with some instruction-set extension space reserved, which keeps system capacity scalable.
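To make the "IP address + memory address" idea concrete, here is a minimal sketch of what such a request could look like on the wire. The field layout and opcodes are entirely hypothetical and are not taken from the NetDAM paper; the point is only that a remote memory operation can be named by an address inside a plain datagram, with no queue pair to set up.

```python
import struct

# Hypothetical NetDAM-style request: instead of an RDMA queue pair,
# a remote memory operation is addressed by (IP, memory offset) and
# carried in an ordinary UDP payload. Layout is illustrative only.
OP_READ, OP_WRITE = 0, 1

def encode_request(opcode, address, length, payload=b""):
    # header: 1-byte opcode, 8-byte memory address, 4-byte length
    return struct.pack("!BQI", opcode, address, length) + payload

def decode_request(datagram):
    opcode, address, length = struct.unpack("!BQI", datagram[:13])
    return opcode, address, length, datagram[13:13 + length]

# A write of 4 bytes at offset 0x1000 on some remote host:
req = encode_request(OP_WRITE, 0x1000, 4, b"\xde\xad\xbe\xef")
assert decode_request(req) == (OP_WRITE, 0x1000, 4, b"\xde\xad\xbe\xef")
```

Because the header carries everything needed to interpret the operation, any host that can parse UDP can participate, which is the property the article attributes to SMC-over-UDP.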

So the next question is: is SRD's simple hashing effective?

Under a standard spine-leaf architecture, classic TCP has to rely on flowlet forwarding, and its in-order delivery requirement can cause reordering and jitter. SRD is somewhat like QUIC: it splits communication into smaller blocks and does not require ordering at the transport layer, which is very good.
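The idea can be sketched in a few lines: spray blocks of a message across several equal-cost paths per packet, and let the receiver reassemble by sequence number, so no in-order delivery is needed. This is a toy model, not SRD's actual algorithm.

```python
import random

# Toy model of per-packet multipath spraying (SRD-like): a message is
# split into blocks, each block hashed onto one of several equal-cost
# paths, and the receiver reassembles by sequence number, so the
# transport needs no order preservation. Purely illustrative.
def spray(message, block, paths):
    blocks = [message[i:i + block] for i in range(0, len(message), block)]
    # each packet carries (sequence number, chosen path, data)
    return [(seq, hash((seq, len(b))) % paths, b)
            for seq, b in enumerate(blocks)]

def reassemble(packets):
    # packets may arrive in any order; sort by sequence number
    return b"".join(data for _, _, data in sorted(packets))

pkts = spray(b"hello overlay world", block=4, paths=8)
random.shuffle(pkts)            # simulate out-of-order arrival
assert reassemble(pkts) == b"hello overlay world"
```

Note the contrast with flowlet forwarding, which keeps a whole burst on one path precisely because TCP punishes reordering; once ordering moves out of the transport, every packet is free to take its own path.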

I am glad to see AWS emphasize jitter rather than latency, which is even better: jitter affects reliable transmission more than latency does. However, AWS may have forgotten that many data-processing operations have a clearly chained character, and communication patterns like the following are hard to realize with RDMA:

This is why NVIDIA, after acquiring Mellanox, changed NCCL's earlier ring all-reduce to tree-based all-reduce. In essence, the chain reaction amplifies jitter, and if every hop also traverses the host's PCIe bus, the jitter is amplified further:

But we should also recognize that most of today's fastest supercomputers use 2D-Torus, 3D-Torus, and even 6D-Torus topologies.

Essentially, this is the problem of a network-on-chip (NoC) extended across multiple machines over a bus. RDMA's QP structure means its addressing and chained-operation capabilities are poor. NetDAM essentially removes the QP structure by borrowing the idea of Segment Routing, while providing standard UDP-based SMC communication for Internet endpoints, which will be very useful for future scenarios such as IoT.

As for why Segment Routing is useful here, look at the CHI bus: essentially it comes down to power and wiring constraints, and the problems that exist in a NoC also exist inside the data center.

Conclusion: the demand for RDMA in Overlay and virtual machines comes from SMC and kernel bypass in the existing ecosystem. At this stage, eRDMA or SRD is essentially the only option. The open question is whether the underlying implementation can be optimized further, for example the ordering of memory operations, packet-loss tolerance, consistency, and whether transactions can be supported.

If we really want a comparison, it should be RDMA vs. NetDAM, rather than a manufactured rivalry. NetDAM itself still has to wait for CXL to mature: for example, the Linux drivers for operating NIC memory and GPU memory over CXL are improving gradually, but this will take at least three to five years. Perhaps CXL also needs to define more flexible topologies, as CHI does. If this is unclear, read the NetDAM paper:

https://arxiv.org/abs/2110.14902

Embrace change: does Overlay need SR?

Many network teams hold the same opinion about projects like Ruta: isn't SRv6 good enough? If the SID is too long, it can be compressed; why build yet another protocol? But they forget that the essential difference is whether Segment Routing lives in the overlay or the underlay. For its own benefit, the network team naturally picks the protocol that lets it take over the overlay.

Most of the people who asked me to work on Ruta were application teams, especially those running audio, video, and CDN services, as well as some container-network teams. Doing SR on the Overlay is an inevitable choice, driven by business-traffic scheduling and multi-cloud interconnection.

For example, in hybrid-cloud scenarios we hit many obstacles when deploying SD-WAN for customers. On AWS you must build an IPsec tunnel between your own SD-WAN router and AWS; Azure requires a dedicated NVA node to redistribute BGP; the VPCs of some other clouds are still stuck at static routing. To help one customer move to Alibaba Cloud, Zha wrote a small BGP + aliyun-cloud-shell applet to redistribute routes on both sides.

Therefore, in Zha's paper, using Segment Routing to build a transparent VPC on top of the VPC is more effective than the original Transit VPC technique and more cloud-agnostic, because a large number of cloud-native K8s nodes and container networks are themselves built on a VPC-based Overlay.

Isn't it clear by now why cloud native has left the network team nearly unemployed? The business itself has service-chaining requirements, and these are hard to trigger in a chained fashion on the traditional VPC architecture. For example, how do you run supercomputing workloads and low-latency patterns such as MPI ring all-reduce on the Overlay? Can protocol encoding on the overlay lighten the load on the API gateway? These are all problems where applications need the network team's help, but the network team is still daydreaming about an SR-provided overlay of its own.
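Service chaining on an overlay amounts to carrying a list of segments with the packet and letting each hop consume the next one, which is the core Segment Routing idea. The sketch below models that with plain functions; all the service names are hypothetical.

```python
# Sketch of overlay service chaining: a segment list carried with the
# packet names the services to traverse, and each step pops the next
# segment, exactly as an SR segment list is consumed hop by hop.
def forward(packet, services):
    while packet["segments"]:
        nxt = packet["segments"].pop(0)   # active segment
        packet = services[nxt](packet)    # apply the service, keep going
    return packet

services = {
    "firewall":  lambda p: {**p, "trace": p["trace"] + ["firewall"]},
    "transcode": lambda p: {**p, "trace": p["trace"] + ["transcode"]},
    "cdn-edge":  lambda p: {**p, "trace": p["trace"] + ["cdn-edge"]},
}

pkt = {"segments": ["firewall", "transcode", "cdn-edge"], "trace": []}
out = forward(pkt, services)
assert out["trace"] == ["firewall", "transcode", "cdn-edge"]
```

The point of putting the list in the packet rather than in per-hop VPC route tables is that the application can reorder or shorten the chain per flow without touching cloud routing configuration.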

Conclusion: building Segment Routing for applications on top of the VPC is the key. Let go of your religious attachment to SRv6 and embrace change.

IPv4 over SRv6, SR-MPLS

Another problem operators often run into: after they build an SRv6 or SR-MPLS network, they complain that services have not migrated onto it. The answer is obvious: who has time to change their code for you? You also expect applications to obtain root privileges and forward through the kernel? They finally finished kernel-bypassing their applications, and now you want them back in the kernel...

Especially the small home routers: there are tens of millions of them nationwide. Is it worth replacing them all for SRv6? Equipment vendors certainly have the motivation, but for operators it is simply not feasible. And while so much broadband access equipment remains hard to upgrade, why not think in terms of a 4over6 technique: use Ruta to implement Binding-SID mapping on the access side and easily steer traditional-network traffic into an SRv6 or SR-MPLS network?
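A Binding-SID is a single SID that stands in for an entire SR policy, so the legacy side only needs a classification-and-mapping step. The sketch below is a toy lookup table with made-up prefixes and SIDs, just to show the shape of the mapping; it is not Ruta's actual data model.

```python
import ipaddress

# Toy 4over6-style binding table: legacy IPv4 destinations are matched
# on the access side and mapped to a Binding-SID that represents a whole
# SR policy, so the home router needs no SRv6 support at all.
# Prefixes and SIDs below are hypothetical documentation values.
binding_table = {
    "203.0.113.0/24":  "2001:db8:b::100",
    "198.51.100.0/24": "2001:db8:b::200",
}

def lookup_bsid(dst):
    addr = ipaddress.ip_address(dst)
    for prefix, bsid in binding_table.items():
        if addr in ipaddress.ip_network(prefix):
            return bsid          # encapsulate toward this Binding-SID
    return None                  # no policy: forward natively

assert lookup_bsid("203.0.113.7") == "2001:db8:b::100"
assert lookup_bsid("192.0.2.1") is None
```

Only the access node holds this table; everything behind it keeps speaking plain IPv4, which is what makes the incremental migration credible.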

The reincarnation of technology

In fact, many technologies are like this. Architects must consider the ecosystem and reuse. Even if you can see where the future lies, patience matters more than conviction: you reach the destination by changing and iterating step by step, not by simply starting over. Learn to use every available resource in the ecosystem rather than standing still.
