NVIDIA Network Senior Product Manager Chen Long: Unveiling the Evolution of InfiniBand Network Cluster Architecture

Whether it is the evolution of data communication technology, the innovation of Internet technology, or the upgrade of visual presentation, all of them benefit from more powerful computing, larger and more secure storage, and more efficient networks. To meet these needs, NVIDIA Networking has proposed a cluster architecture based on the InfiniBand network, which not only provides higher-bandwidth network services but also reduces the computing resources consumed by network transmission, lowers latency, and tightly integrates HPC with the data center, laying a cornerstone for the development of the Metaverse.

Recently, at the "Human-Computer Interaction and High-Performance Network" session of the MetaCon Metaverse Technology Conference hosted by 51CTO, NVIDIA Networking Senior Product Manager Chen Long gave an overview of the high-performance network solution behind the Meta cluster, explaining to Metaverse technology enthusiasts what InfiniBand is and how InfiniBand accelerates computing and storage.


1. Background of InfiniBand Network Cluster Architecture

As we all know, we are in an information age, and all information is built on digitalization. The popularity of 5G and IoT has put hundreds of millions of digital terminals into service. Every moment, billions of data sources continuously feed data into the cloud, where technologies such as big data, AI, and blockchain analyze and refine it, mining its potential value and feeding the results back to the terminals to serve society, while the data itself is efficiently stored in the cloud built on data centers.

In recent years, with the continuous improvement of web3.0, VR and AR technologies, the boundaries of the Internet have been continuously broken. Under the leadership of Meta, the era of the metaverse is quietly approaching. While the technological revolution enriches our lives, the cornerstone supporting the information age has not changed. Computing, storage and networking are still the main themes of its technological development.


Whether it is the evolution of data communication technology from 2G to 5G, or the continuous innovation of web1.0, web2.0, and web3.0 Internet technologies, or the transition of visual presentation from pictures to videos to VR and AR, the following three points support its development.

First, more powerful computing is needed: more cores for parallel computing, and heterogeneous computing that breaks through the limitations of the x86 CPU architecture, with computing units such as RISC-V, ARM, GPU, and FPGA meeting the needs of specialized workloads.

Second, larger and more secure storage. With storage capacities ranging from TB to EB, the storage architecture is also more diverse, with centralized, distributed, and parallel storage meeting the needs of large-capacity, high-performance data storage.

Third, as a bridge between data, computing and storage, the network has developed extremely rapidly while computing and storage have evolved, from 100G, 200G, to 400G. At the same time, it has also expanded from the original TCP and UDP network connections to RDMA, continuously improving the performance of network transmission.

On the other hand, data service providers, in order to cope with the trend toward cloud deployment, have proposed the concept of cloud native, redesigning architectures from the application perspective to make services more efficient. At the same time, the data center is facing another change: cloudification has not only changed the business of traditional data centers but also substantially reshaped the demand for HPC. More and more HPC applications, such as simulation and modeling, graphics rendering, AI training, and digital twins, are being deployed in the cloud to meet the development needs of various industries.

In this context, NVIDIA Networks proposed a cluster architecture solution based on InfiniBand network in combination with customer needs to efficiently meet customer needs. Compared with traditional network solutions, InfiniBand not only provides higher bandwidth network services, but also reduces the consumption of computing resources by network transmission load, reduces latency, and perfectly integrates HPC with data centers, building the cornerstone of the development of the Metaverse.

2. What exactly is Meta Cluster?

Meta, the flagship company of the Metaverse, and NVIDIA, a leading company in graphics and AI, jointly released the Meta dedicated cluster based on the DGX SuperPOD architecture, which serves as Meta's main weapon for seizing the technological commanding heights of the Metaverse era and is responsible for AI algorithm and data computing workloads. The first phase, already online, comprises 760 DGX servers with a total of 6,080 A100 GPUs. All GPUs use 200 Gb/s InfiniBand HDR network cards for efficient data transmission, and the overall cluster delivers nearly 1,900 petaflops of TF32 compute.

Such high computing power makes Meta's cluster the highest-performance AI cluster in the industry, and this large-scale deployment is just the beginning of Meta's ambition to build a dominant position in the Metaverse era. The second-phase expansion will bring the cluster to 16,000 GPUs, with overall mixed-precision compute reaching 5 exaflops and storage bandwidth reaching 16 TB/s.

Let's take a look at the cluster architecture diagram released by Meta. The cluster uses 20 800-port InfiniBand chassis switches as the backbone layer, connecting 100 pods below, with 8 40-port InfiniBand HDR switches deployed in each pod, forming a non-blocking CLOS fabric across the entire network. Compared with Facebook's previous AI cluster, this delivers a 20-fold improvement in computing performance, a 9-fold improvement in NCCL parallel communication, and a 3-fold improvement in training large AI models. In addition, the cluster deploys 10 PB of NFS centralized flash storage, 46 PB of cache storage, and 175 PB of bulk block storage, all of which transfer data over the InfiniBand network.
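As a rough sanity check of the fabric just described, the port counts can be verified with a few lines of arithmetic. This is only a sketch based on the figures quoted in the article (20 spines of 800 ports, 100 pods of 8 40-port HDR switches, 200 Gb/s per port):

```python
# Port/bandwidth arithmetic for the fat-tree fabric described above.
SPINE_SWITCHES, SPINE_PORTS = 20, 800
PODS, LEAVES_PER_POD, LEAF_PORTS = 100, 8, 40
LINK_GBPS = 200  # one HDR port

spine_ports_total = SPINE_SWITCHES * SPINE_PORTS        # 16,000
leaf_ports_total = PODS * LEAVES_PER_POD * LEAF_PORTS   # 32,000

# In a non-blocking (1:1) CLOS, each leaf splits its ports evenly
# between hosts (down-links) and spines (up-links).
host_ports = leaf_ports_total // 2                      # 16,000
uplinks = leaf_ports_total // 2                         # 16,000

# The 16,000 leaf up-links exactly match the 16,000 spine ports,
# which is what "non-blocking" requires.
print(spine_ports_total, host_ports, uplinks)
print(f"aggregate host bandwidth ≈ {host_ports * LINK_GBPS / 8 / 1000:.0f} TB/s")
```

The arithmetic also shows why the quoted 16 TB/s of storage bandwidth is comfortably within what such a fabric can carry.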

Why has InfiniBand become the primary choice for the Meta cluster? Looking at InfiniBand's development history, its network cards differ from Ethernet cards; in fact, many designs in the Ethernet NICs we are familiar with today were borrowed from InfiniBand. As the figure shows, InfiniBand shipped 10G cards as early as 20 years ago and had evolved to 40 Gb/s by 2008, with new products arriving roughly every three years since. This year, 400G NDR network cards entered mass production, making InfiniBand the preferred solution for GPU clusters, and the cadence is shortening to one generation every two years: NVIDIA plans to release an 800G XDR network card in 2023 and a 1.6 Tb/s GDR network card in 2025, laying a solid foundation for eliminating data transmission bottlenecks.

3. The Secret of InfiniBand Network Architecture

From the panoramic view of the InfiniBand solution, we can see network cards, switches, cables, and other end-to-end network hardware, as well as DPUs and gateway devices. Together they not only form a complete set of data center network equipment but also connect WAN nodes within the same city, realizing a complete hardware transmission solution. Two points are worth mentioning:

The first is the fixed-configuration ("box") switch. We provide a 1U 40-port 200G switch whose switching capacity is 20% higher than comparable competitors. And for large customers like Meta, we provide the industry's only 20U ultra-large chassis switch, which scales to as many as 800 ports.

Second, InfiniBand offers the industry's new DPU network card concept, which offloads and isolates business workloads, enables end-to-end network management and maintenance, maximizes compatibility with legacy equipment, and lets devices connect seamlessly to the high-performance InfiniBand network. On top of this hardware, we also pioneered the emerging concept of in-network computing, performing computation on the switches themselves and combining it with features such as SHIELD (self-healing networking), SHARP, and GPUDirect RDMA to make the network more intelligent and efficient.


4. How does InfiniBand achieve accelerated computing?

When it comes to computing, people unfamiliar with RDMA may wonder how a network responsible for transmission can accelerate computing at all. The point is that real data transmission is not just a matter for network devices. Take the familiar case of TCP packet forwarding: a large amount of data and protocol processing requires deep CPU involvement, with packet encapsulation, forwarding, and context switching all consuming significant CPU cycles. Under such a mechanism, when traffic stays below 10G the CPU usage is not obvious, but when traffic rises above 100G the overhead grows dramatically; in some scenarios more than 20 CPU cores are consumed just to sustain 100G of transmission. As general-purpose servers enter the 100G era, every CPU cycle spent on transmission is a cycle taken away from computing, so reclaiming those cycles is itself a form of acceleration.
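The "20 cores for 100G" figure can be reproduced with a back-of-envelope calculation, assuming a kernel TCP stack sustains roughly 5 Gb/s per core. That per-core number is a common rule of thumb, not a measurement from the talk; real values vary with NIC offloads and message sizes:

```python
# Back-of-envelope: cores consumed by kernel TCP processing at various
# link speeds, assuming ~5 Gb/s sustained per core (illustrative).
PER_CORE_GBPS = 5

for link_gbps in (10, 100, 400):
    cores = link_gbps / PER_CORE_GBPS
    print(f"{link_gbps}G link -> ~{cores:.0f} cores spent on transport alone")
```

At 10G the cost is negligible; at 100G it already swallows a large fraction of a typical server, which is exactly the motivation for offloading the data path to the NIC.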

RDMA is exactly such a technology: it transfers data directly between the memories of the servers at both ends of the communication, without involving the CPU in the data path. This not only reduces CPU overhead but also keeps the CPU from becoming a transmission bottleneck, allowing data transmission to evolve to 200G, 400G, or even terabit rates.

From the figure, we can see that on an ordinary server without RDMA, because the CPU handles a large amount of protocol processing, 47% of CPU resources work in kernel mode and only about 50% are available for application computation, which limits how far the server's applications can scale. With RDMA, the data plane that consumed CPU resources is offloaded entirely onto the network card: kernel-mode CPU usage can be held to 12%, and user-mode CPU resources are nearly doubled. This not only improves transmission performance but also frees CPU resources to deploy more computing workloads, raising overall bandwidth while increasing business deployment and improving the utilization of the whole server.
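The CPU shares above can be turned into a small utilization sketch. The 47% and 12% kernel shares come from the article; the 3% "other" overhead is an assumption added for illustration:

```python
# CPU budget with and without RDMA offload, using the shares quoted above.
kernel_no_rdma, kernel_rdma = 0.47, 0.12
overhead_other = 0.03  # assumed fixed remainder (interrupts, housekeeping)

user_no_rdma = 1.0 - kernel_no_rdma - overhead_other  # ≈ 0.50
user_rdma = 1.0 - kernel_rdma - overhead_other        # ≈ 0.85

print(f"user-mode share: {user_no_rdma:.0%} -> {user_rdma:.0%}")
print(f"compute capacity gain: {user_rdma / user_no_rdma:.1f}x")
```

Under these assumptions the user-mode share rises from about 50% to about 85%, consistent with the near-doubling of application-usable CPU described above.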

In addition, how does InfiniBand accelerate the GPU?

With the rapid popularization of AI, GPUs are becoming more and more important, and since a GPU has tens of thousands of cores performing calculations, its demand for data transmission is even greater. While CPU servers are generally moving to 100G, 200G networks have become standard for GPU servers, which are now transitioning to 400G and even 800G. The demand GPUs place on network transmission is therefore even more urgent.

In addition to technologies such as RDMA, the solution also needs to further remove constraints on the network data path so the GPU can run at full speed. In the standard GPU server architecture, the GPU connects to the CPU over PCIe; under this architecture, when a GPU transmits data off the server, all data must pass through the CPU.

From the figure above, we can see that with this transmission method, moving data between GPUs on different servers requires five copy steps. First, the GPU transfers its data over the PCIe bus into local CPU memory; second, local CPU memory copies the data into the dedicated RDMA transmission buffer; third, RDMA moves the data from this server's buffer to the buffer on the other server; fourth, the remote server copies it into the memory region that exchanges data with its local GPU; and finally the data is copied into that GPU's memory. With five copies, the operation becomes very complicated, and the CPU memory in the middle becomes a forwarding bottleneck.

GPUDirect RDMA solves this problem. It lets the network card and the GPU bypass the CPU and exchange data directly, so only one copy is needed: data on the sending GPU jumps straight from its video memory to the video memory of the destination GPU. This simplifies the process, reduces latency, and accelerates GPU applications.
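The two data paths can be contrasted in a short sketch. The hop descriptions paraphrase the five steps above, and the per-copy cost is an illustrative assumption, not a measured value:

```python
# Toy model of the two cross-server GPU transfer paths described above.
staged_path = [  # classic path: every hop staged through CPU memory
    "src GPU memory -> src CPU memory (PCIe)",
    "src CPU memory -> src RDMA buffer",
    "src RDMA buffer -> dst RDMA buffer (network)",
    "dst RDMA buffer -> dst CPU memory",
    "dst CPU memory -> dst GPU memory (PCIe)",
]
gpudirect_path = [  # GPUDirect RDMA: NIC reads/writes GPU memory directly
    "src GPU memory -> dst GPU memory (NIC DMA over network)",
]

ASSUMED_COPY_COST_US = 50  # illustrative per-copy cost for a small message
print(f"staged path: {len(staged_path)} copies, "
      f"~{len(staged_path) * ASSUMED_COPY_COST_US} us")
print(f"GPUDirect  : {len(gpudirect_path)} copy, "
      f"~{len(gpudirect_path) * ASSUMED_COPY_COST_US} us")
```

Whatever the actual per-copy cost, collapsing five staged copies into one direct DMA is where the latency saving comes from.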

After adopting GPUDirect RDMA, AI clusters see up to 90% latency savings, and I/O bandwidth for messages larger than 4 KB improves roughly tenfold. With such a significant improvement in network performance, parallel computing tasks in AI clusters run more than twice as fast, greatly improving cluster efficiency and return on investment. It is for this reason that Meta chose the InfiniBand network for the industry's largest AI cluster of the Metaverse era, confirming InfiniBand's effect in accelerating GPU computing.

Above, we explained from the network card's perspective how InfiniBand accelerates CPU and GPU computing. But what about the switch, the most critical device in the network: how does it accelerate computation? Here we need to mention InfiniBand's SHARP feature.

We know that AI training involves a great deal of AllReduce: each GPU in a distributed computation must synchronize its results with all the other GPUs at the same time. Under this framework, data must be transmitted repeatedly over the network to keep every GPU in sync, yet the AllReduce operations themselves are nothing more than simple but frequent reductions such as summation, XOR, and finding the maximum. Knowing this pattern, we can turn the switch into a computing node: gather all the GPU data at the switch, perform the reduction there, and distribute the result uniformly to every GPU. Since a switch's forwarding bandwidth is far larger than a server's, this architecture has no data transmission bottleneck, and only one trip across the network is needed to complete the whole computation, which greatly simplifies the process, reduces latency, and eliminates bottlenecks.
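The in-switch reduction just described can be modeled in a few lines, assuming a simple elementwise sum. Names such as `switch_allreduce` are illustrative only, not a real SHARP API:

```python
# Toy model of SHARP-style in-network AllReduce: each GPU sends its
# vector once to the "switch", which reduces them and broadcasts one
# result back to every GPU.
from functools import reduce

def switch_allreduce(gpu_buffers):
    """Sum all buffers elementwise at the switch, then return the
    reduced vector as delivered to each GPU."""
    summed = reduce(lambda a, b: [x + y for x, y in zip(a, b)], gpu_buffers)
    return [list(summed) for _ in gpu_buffers]  # one broadcast copy per GPU

gpus = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(switch_allreduce(gpus)[0])  # [16, 20] on every GPU

# Traffic comparison: with N GPUs, in-network reduction moves each vector
# across the fabric twice (up + down) regardless of N, whereas a naive
# all-to-all exchange moves it N-1 times per GPU.
N = len(gpus)
print("network trips per vector: switch =", 2, " all-to-all =", N - 1)
```

The traffic comparison at the end is why the speedup grows with cluster size: the switch's cost stays constant while the naive exchange scales with the number of GPUs.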

As the figure above shows, after enabling in-network computing on a cluster of dozens of DGX servers, overall training performance increased by 18%. In other words, in an InfiniBand cluster the switches not only complete high-performance data transmission but also take on nearly 20% of the computing work, improving performance for customers while saving substantial server investment.


5. How does InfiniBand achieve accelerated storage?

As we all know, computing and storage are the two most important components of any cluster. Although storage servers are significantly fewer than computing servers, dedicated storage servers are in fact only a small part of data storage: in a broad sense, storage is spread throughout every corner of the cluster.

Here, we classify and arrange these common storage devices according to the following four dimensions.

1. Bandwidth for data storage

2. Data access latency

3. Storage device capacity

4. Storage cost per unit capacity

It is not difficult to see that when the memory pool, SSD resource pool, hard-disk resource pool, and tape resource pool line up along the diagonal of these dimensions, the cluster's storage is the most cost-effective and the storage solution the most reasonably configured.

However, if mechanical and solid-state drives exist only as single devices, the storage solution cannot achieve this diagonal arrangement. The reason is simple: taking mechanical hard disks as an example, a single disk's bandwidth limits it to low I/O and modest capacity. When distributed storage emerged, pooling solved this problem perfectly, greatly increasing both the bandwidth and the capacity of hard-disk storage. Today, with the rise of solid-state drives, bandwidth has grown by one or two orders of magnitude, yet it is still slow compared with memory, and single-device capacity is still limited. Pooling SSDs over the network is therefore an inevitable trend, and the network must then carry hundreds of gigabits of traffic.
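Why pooling is unavoidable can be seen with some order-of-magnitude arithmetic. The device figures below are assumptions chosen for illustration, not vendor specifications:

```python
# Illustrative pooled-storage arithmetic: bandwidth and capacity of a
# single enclosure of identical devices, converted to network load.
DEVICES = {               # name: (bandwidth MB/s, capacity TB) per device
    "hdd": (250, 18),     # assumed 7200 rpm mechanical disk
    "ssd": (7000, 8),     # assumed PCIe 4.0 NVMe-class drive
}
POOL_SIZE = 24            # assumed devices per pooled enclosure

for name, (bw_mbps, cap_tb) in DEVICES.items():
    pool_gbps = POOL_SIZE * bw_mbps * 8 / 1000  # MB/s -> Gb/s
    print(f"{name}: pool = {POOL_SIZE * cap_tb} TB, "
          f"~{pool_gbps:.0f} Gb/s offered to the network")
```

A pooled HDD shelf produces tens of gigabits, easily carried by ordinary networks; a pooled NVMe shelf produces over a terabit, which is exactly the "hundreds of G" of traffic pressure that pushes SSD pooling onto high-bandwidth fabrics like InfiniBand.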

Therefore, for storage, InfiniBand's acceleration is essentially achieved by pooling storage devices in parallel, thereby improving data performance and achieving acceleration effects.

The cluster is reconstructed through the InfiniBand network, and the computing units and storage units are separated into pools. InfiniBand is used as the backplane bus of the entire cluster to efficiently interconnect them, laying the hardware foundation for software-defined clusters. In this way, the high-performance cluster becomes an ultra-high-performance server that can flexibly configure computing and storage resources according to the different load characteristics of various tasks, maximizing efficiency while achieving higher performance. In addition, when the cluster is expanded in the future, the required resources can be expanded in a targeted manner according to the actual situation, thereby improving the elasticity of the cluster. All of this needs to be built on a highly reliable, high-bandwidth, and low-latency network.

To learn more about the Metaverse network and computing, please visit the official website of the MetaCon Metaverse Technology Conference at: https://metacon..com/




