With the rapid development of artificial intelligence, the continuous expansion of model scale has placed unprecedented demands on the underlying computing power. To support large-scale training on massive datasets in the AIGC field, vast numbers of servers are interconnected over high-speed networks into large computing clusters that jointly complete training tasks. However, as cluster size grows, communication overhead surges and becomes a key factor limiting computing efficiency. During model training, GPUs frequently switch between computing and waiting for data synchronization, leaving valuable computing resources idle. Only by continuously improving communication efficiency and minimizing communication costs can computing resources be fully utilized. Therefore, to fully exploit the computing power of GPU resources, a new high-performance network foundation is needed, one that uses the large bandwidth of high-speed networks to boost the efficiency of the entire cluster.

In 2023, Tencent Cloud publicly demonstrated its self-developed Xingmai high-performance computing network for the first time, comprehensively improving the training efficiency of large models for enterprises and accelerating the iteration and application of large-model technology on the cloud. A year later, Xingmai was fully upgraded. Xingmai Network 2.0 is equipped with fully self-developed network equipment and AI computing network cards, supports networking at a scale of more than 100,000 GPUs, and delivers 60% higher network communication efficiency than the previous generation, increasing large-model training efficiency by 20%.

Wang Yachen, vice president of Tencent Cloud, compared training AI large models to an F1 race.
Tencent Cloud designed the Xingmai high-performance computing network as the "track" and developed the TiTa protocol and TCCL communication library as the "traffic control system and pit crew," so that the powerful F1 car, the Tencent Cloud High-Performance Computing Cluster (HCC) GPU server, can deliver its maximum computing performance and help customers stay ahead in the AI large-model race. A professional repair team is also on standby to quickly locate and fix any fault so the race can resume as soon as possible.

This time, Xingmai Network 2.0 comprehensively upgrades four key components: self-developed network equipment, the communication protocol, the communication library, and the operation system.

Track upgrade: self-developed network hardware

With self-developed network hardware, Xingmai's "track" has been completely rebuilt. The capacity of the self-developed switch has been upgraded from 25.6T to 51.2T, and the optical-module rate from 200G to 400G, reducing network latency by 40% and doubling the overall networking scale; a single training cluster can now support more than 100,000 GPUs. The switches also support pluggable control cards, reducing power consumption and operation and maintenance costs.

Notably, Xingmai Network 2.0 is equipped with Tencent's self-developed computing network card, CNIC, the first network card in the public-cloud industry designed specifically for AI training. The card uses the latest generation of FPGA chips, reaches 400Gbps of bandwidth per card, and delivers 3.2T of whole-machine communication bandwidth, the highest in the industry.

Command center upgrade: self-developed communication protocol TiTa

The self-developed TiTa protocol acts as a command center, allocating traffic across lanes to avoid congestion in any single lane and lifting the speed limit for each car.
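The article does not disclose TiTa's internals. As a rough illustration of the general idea behind active, end-side congestion control, the sketch below shows an AIMD-style sender that adjusts its rate on early congestion signals (such as ECN marks) rather than waiting for packet loss; all names and parameters here are hypothetical and are not Tencent's actual algorithm.

```python
class RateController:
    """Toy end-side congestion controller (hypothetical, AIMD-style).

    Instead of reacting only after packets are dropped (passive),
    the sender responds to early congestion signals each round and
    proactively adjusts its sending rate.
    """

    def __init__(self, rate_gbps=50.0, min_rate=1.0, max_rate=400.0):
        self.rate = rate_gbps      # current sending rate in Gbps
        self.min_rate = min_rate
        self.max_rate = max_rate

    def on_feedback(self, congestion_signal: bool) -> float:
        if congestion_signal:
            # Multiplicative decrease: back off before queues overflow.
            self.rate = max(self.min_rate, self.rate * 0.5)
        else:
            # Additive increase: probe for spare bandwidth.
            self.rate = min(self.max_rate, self.rate + 5.0)
        return self.rate


ctl = RateController()
print(ctl.on_feedback(False))  # no congestion -> 55.0
print(ctl.on_feedback(True))   # congestion signal -> 27.5
```

Real datacenter schemes in this family (for example DCQCN on RDMA networks) use much finer-grained feedback and per-flow state, but the core loop, increase while the path is clear and cut quickly on a congestion signal, has this shape.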
Compared with the previous generation, TiTa 2.0 has moved from the switch into the network card on the end side, and its algorithm has been upgraded from passive congestion handling to a more intelligent active congestion control scheme that proactively adjusts the packet-sending rate to avoid network congestion; through intelligent congestion scheduling, the network can also quickly self-heal from congestion. This improves network communication performance under MoE training by 30% over version 1.0, bringing a 10% increase in training efficiency.

Pit crew upgrade: communication library TCCL

The TCCL communication library in Xingmai Network 1.0 was like an intelligent navigation system that shortened the route. In Xingmai Network 2.0, TCCL is more like a professional pit crew: where it previously only added navigation to the car, it can now tune the car itself for different scenarios to keep it at peak performance at all times. TCCL 2.0 adds capabilities such as NVLINK+NET heterogeneous parallel communication and the Auto-Tune Network Expert adaptive algorithm. Under MoE model training, these bring Xingmai a 30% improvement in communication efficiency and a 10% improvement in model training efficiency.

Repair team upgrade: operation system GOM&GOA

The operation system is the repair team. The full-stack network operation system ensures the availability of the track and repairs it immediately when an abnormality occurs, so the network can resume training as soon as possible.
Operation system 2.0 adds a simulation platform that collects log records and GPU-related information during training, reconstructs the spatial relationships of training tasks and the timing relationships of communications through simulation, and locates hangs and performance-jitter faults in large-model training, shortening fault localization from the traditional several days to less than 10 minutes.

Facing the surge in GPU performance, the network has become a bottleneck for cluster computing power. Tencent is planning an open and flexible ETH-X supernode system based on Ethernet technology to break through this bottleneck, reduce cluster costs, and provide stronger support for the further development of AI technology.
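The simulation platform's internals are proprietary, but the core fault-localization idea it describes, replaying per-rank logs to find where training stalls, can be illustrated with a toy example. In a collective operation every rank must participate, so a single lagging GPU stalls the whole job; comparing per-rank "last seen" timestamps narrows a hang down to the culprit. The log format and function below are hypothetical, purely for illustration.

```python
def find_stalled_ranks(last_seen: dict, now: float, timeout: float = 30.0) -> list:
    """Return ranks whose last communication event is older than `timeout`.

    last_seen: hypothetical log extract mapping rank -> timestamp (seconds)
               of that rank's last collective (e.g. all-reduce) event.
    """
    return [r for r, t in sorted(last_seen.items()) if now - t > timeout]


# Ranks 0, 1 and 3 communicated ~5 s ago; rank 2 went quiet 44 s ago.
last_seen = {0: 100.0, 1: 100.2, 2: 61.0, 3: 100.1}
print(find_stalled_ranks(last_seen, now=105.0))  # [2]
```

A production system would correlate far more signals (GPU utilization, NIC counters, switch telemetry) and replay the full communication timeline, but the principle of reducing a cluster-wide hang to the first rank that stopped participating is the same.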