Tencent releases Xingmai Network 2.0 (StarNet), increasing AI large model training efficiency by 20%


Amid the continued boom and rapid iteration of large models, AI infrastructure is increasingly becoming one of the core competitive advantages of cloud vendors.

On July 1, Tencent announced a full upgrade of its self-developed Xingmai high-performance computing network. The upgraded Xingmai Network 2.0 is equipped with self-developed network equipment and AI computing network cards and supports large-scale networking of more than 100,000 GPU cards. Its network communication efficiency is 60% higher than the previous generation, which raises large-model training efficiency by 20%. In other words, a result synchronization that previously took 100 seconds during training now completes in 40 seconds, and a model that previously took 50 days to train now takes only 40 days.
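The arithmetic behind these figures can be checked directly. The sketch below is an illustration of one reading of the announcement, taking "X% higher efficiency" to mean an X% reduction in wall-clock time (an interpretation the announcement does not spell out):

```python
# Sanity-check of the quoted figures, reading "X% higher efficiency"
# as an X% reduction in wall-clock time (an assumption; the press
# release does not define its terms precisely).

def reduced_time(original: float, reduction: float) -> float:
    """Time remaining after cutting `reduction` fraction of it."""
    return original * (1.0 - reduction)

comm_after = reduced_time(100.0, 0.60)   # 100 s sync step, comms 60% faster
train_after = reduced_time(50.0, 0.20)   # 50-day run, training 20% faster

print(round(comm_after, 1))   # 40.0 seconds
print(round(train_after, 1))  # 40.0 days
```

Under this reading, both quoted before/after pairs are internally consistent.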

Training an AI large model is like an F1 race. Tencent Cloud has purpose-built the Xingmai high-performance computing network as the "track" and developed the TiTa protocol and TCCL library as the "race control center and pit crew", so that the Tencent Cloud High-Performance Computing Cluster (HCC) GPU server, a powerful "F1 car", can deliver its maximum computing performance, helping customers stay ahead in the large-model race.

The AIGC boom has driven AI model parameter counts from billions to trillions, and the growing parameter scale and architecture upgrades place new demands on the underlying network.

To support large-scale training on massive AIGC datasets, large numbers of servers are connected over high-speed networks into large computing clusters that jointly complete training tasks.

However, the larger the cluster, the higher the communication overhead. AI training traffic also differs significantly from traditional traffic patterns, and different large-model architectures communicate in different ways; for some large models, communication accounts for up to 50% of training time. Moreover, in distributed training a single point of failure can make the entire cluster unavailable, so failures must be located quickly and training resumed promptly to minimize losses.

How to improve communication efficiency and reduce the share of time spent on communication at large networking scale, keep training stable and highly available, and thereby raise GPU utilization and model training efficiency is the core problem AI networks need to solve.

Data shows that during large-model training, Xingmai Network 2.0 can reduce the share of network communication (communication time as a proportion of total time) to 6%, far below the industry level of 10%. Its communication load rate reaches 90%, on par with InfiniBand (IB) and 60% higher than standard Ethernet, putting its overall capability at the top of the industry.
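As a back-of-the-envelope illustration (my own, not from the announcement): if compute time per step is fixed and communication is pure overhead, the communication share alone determines how much faster one cluster finishes than another.

```python
# Illustration: end-to-end effect of the communication share, assuming
# fixed compute time and communication as the only overhead.

def total_time(compute: float, comm_share: float) -> float:
    """Total step time when `comm_share` of the total is communication."""
    return compute / (1.0 - comm_share)

industry = total_time(1.0, 0.10)   # communication is 10% of total time
xingmai = total_time(1.0, 0.06)    # communication is 6% of total time

speedup = industry / xingmai
print(f"{speedup:.3f}")  # ~1.044: about 4.4% faster end to end
```

This also shows why the communication share, rather than raw link speed, is the headline metric: each percentage point shaved off it converts directly into GPU time spent computing.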

Four major components have been fully upgraded to help speed up AI training

Tencent's self-developed Xingmai network is a high-performance network system integrating software and hardware. It comprises four key components: self-developed network equipment, communication protocols, communication libraries, and an operations system. Each component adopts Tencent core technology that is an industry first.

Wang Yachen, Vice President of Tencent Cloud

On the hardware side, Tencent's Xingmai network is the industry's first high-performance network built entirely on self-developed network equipment, including self-developed switches, optical modules, and network cards. The self-developed switch capacity has been upgraded from 25.6T to 51.2T, and the network is the first in the industry to introduce 400G silicon photonics optical modules, doubling the speed, cutting network latency by 40%, and supporting large-scale networking of more than 100,000 cards.

Notably, Xingmai Network 2.0 supports Tencent's new self-developed computing network card, the public cloud industry's first network card designed specifically for AI training. The card uses the latest generation of FPGA chips, offering up to 400 Gbps of bandwidth and the industry's highest whole-machine communication bandwidth of 3.2T. It runs the new generation of Tencent's self-developed TiTa communication protocol and carries Tencent's unique active congestion-control algorithm.

Compared with the previous generation, which was deployed on the switch, the TiTa 2.0 protocol has moved down to the end-side network card. It has been upgraded from a passive congestion algorithm to a more intelligent active congestion-control algorithm that proactively adjusts the packet sending rate to avoid congestion, and its intelligent congestion scheduling enables rapid self-healing when congestion does occur. Under Mixture-of-Experts (MoE) model training, this improves network communication performance by 30% over TiTa 1.0 and brings a 10% improvement in training efficiency.
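Tencent's active congestion-control algorithm is proprietary and not detailed in the announcement. As a generic sketch of the idea of sender-side active control, the toy loop below probes the sending rate upward while measured delay stays low and backs off before packets are actually lost; all names and thresholds are illustrative, not Tencent's:

```python
# Generic delay-based active congestion control (illustrative only):
# probe upward while queuing delay is low, back off multiplicatively
# before loss occurs.

def adjust_rate(rate: float, rtt: float, base_rtt: float,
                delay_threshold: float = 1.5, probe_step: float = 10.0,
                backoff: float = 0.8) -> float:
    """Return the next sending rate (Gbps) from one RTT measurement."""
    if rtt > base_rtt * delay_threshold:
        return rate * backoff          # queues building: back off early
    return rate + probe_step           # headroom available: probe upward

rate = 100.0        # starting rate in Gbps
base_rtt = 10.0     # microseconds on an idle path
for rtt in [10, 11, 12, 25, 14, 11]:   # the 25 us sample signals congestion
    rate = adjust_rate(rate, rtt, base_rtt)
print(round(rate, 1))  # 124.0
```

The contrast with passive schemes is the trigger: a passive algorithm reacts after loss or an ECN mark arrives, while an active one acts on early signals such as rising delay, which is why congestion can be avoided rather than merely recovered from.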

The high-performance collective communication library TCCL, designed specifically for the Xingmai network, has also been upgraded. Through innovations such as NVLINK+NET heterogeneous parallel communication and the Auto-Tune Network Expert adaptive algorithm, TCCL delivers a 30% improvement in communication efficiency and a 10% improvement in model training efficiency under MoE model training.
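TCCL's internals are not public. As background on what a collective communication library does, the sketch below simulates the classic ring all-reduce that such libraries are built around: each of N ranks ends up with the element-wise sum of all ranks' buffers while only ever exchanging data with its ring neighbours.

```python
# Simulated ring all-reduce. buffers[r] is rank r's length-n vector,
# where element i plays the role of chunk i (n ranks, n chunks).

def ring_allreduce(buffers):
    n = len(buffers)
    chunks = [list(b) for b in buffers]
    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) % n
    # to rank (r + 1) % n, which adds it into its own copy.
    for s in range(n - 1):
        for r in range(n):
            c = (r - s) % n
            chunks[(r + 1) % n][c] += chunks[r][c]
    # After phase 1, rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2: all-gather. At step s, rank r forwards chunk (r + 1 - s) % n.
    for s in range(n - 1):
        for r in range(n):
            c = (r + 1 - s) % n
            chunks[(r + 1) % n][c] = chunks[r][c]
    return chunks

ranks = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(ring_allreduce(ranks))  # every rank ends with [12.0, 15.0, 18.0]
```

The appeal of the ring is that per-rank traffic is independent of cluster size, which is what makes collectives workable at the 10,000-card scale discussed here; library-level upgrades like TCCL's tune how such schedules map onto the actual NVLINK and network topology.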

TCCL's external interface is identical to that of the native communication library, so mainstream AI large-model customers need no additional adaptation: simply swapping in the library unlocks the Xingmai network's capabilities.

Together, the upgrades to the TiTa communication protocol and the TCCL communication library have increased the Xingmai network's communication efficiency by 60% and MoE large-model training efficiency by 20%.

A network failure, or any single point of failure, can make the entire cluster unavailable and pause model training, so the network's high availability and stability are equally important. To ensure this, Tencent Cloud has developed an end-to-end full-stack network operations system, the fourth key component of the Xingmai network.

Version 2.0 of the operations system adds Tencent's exclusive Lingjing simulation platform, which can locate problems at the GPU-node level rather than only at the network level, and can pinpoint slow nodes within minutes for training failures at the 10,000-card scale. It provides all-around monitoring of the Xingmai network, detecting and locating problems faster, greatly shortening overall troubleshooting time, and resuming training as soon as possible after a failure.

Building the best cloud for large models

Currently, Tencent Cloud has launched full-link cloud services for AIGC scenarios, including the HCC computing cluster, AIGC storage solutions, a vector database, industry large-model MaaS services, and the Tianyu AIGC content-security solution. More than 80% of the leading large-model companies use Tencent Cloud services.

The large-model training cluster HCC uses high-performance cloud servers as nodes, fully equipped with the latest generation of GPUs. Nodes are interconnected through the self-developed Xingmai network, delivering an integrated high-performance computing product with high performance, high bandwidth, and low latency.

Tencent Cloud's AIGC cloud storage solution is the first in China with a fully self-developed storage engine. It can double the efficiency of large-model data cleaning and training, halving the time required.

Tencent Cloud VectorDB supports more than 370 billion vector search requests every day, storing hundreds of billions of vectors with millions of QPS and millisecond query latency. It is suitable for large-model training and inference, RAG scenarios, AI applications, and search and recommendation services, making enterprise data access for AI 10 times more efficient than traditional solutions.
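VectorDB's own API is not shown in the article. As a generic illustration of what a vector search does in a RAG pipeline, the sketch below performs brute-force cosine-similarity retrieval over a toy corpus; production engines reach millisecond latency at this scale with approximate indexes (e.g. HNSW) rather than a linear scan, and all names here are made up:

```python
# Toy nearest-neighbour retrieval by cosine similarity, the operation
# a vector database performs for RAG-style lookups.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, corpus, k=2):
    """Return the ids of the k corpus vectors most similar to `query`."""
    scored = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

corpus = {
    "doc_a": [1.0, 0.0, 0.0],   # embeddings would normally come from a model
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
print(top_k([1.0, 0.05, 0.0], corpus))  # ['doc_a', 'doc_b']
```

In a RAG application the returned document ids map back to text passages, which are then stuffed into the large model's prompt as grounding context.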

Tencent Cloud has also created the Tianyu AIGC full-link content-security solution, providing five service systems, including data services, security experts, machine review, copyright protection, and customer experience management, to protect enterprises' content security from model training through post-launch operation.

At the same time, supported by this AI infrastructure, Tencent's self-developed general-purpose large model, the Hunyuan large model, also continues to iterate.

With self-developed underlying technologies such as the HCC training cluster built on the Xingmai network and the Angel machine-learning platform, Tencent has built the Wanka (10,000-card) AI training cluster, which can train larger models with fewer resources at 2.6 times the training speed of mainstream frameworks; inference costs are 70% lower than mainstream industry frameworks, and mainstream domestic hardware is supported.

Tencent Hunyuan has expanded to the trillion-parameter scale and adopts a Mixture-of-Experts (MoE) structure, placing it among the leaders of mainstream domestic large models in both general foundational capabilities and professional application capabilities. Enterprise customers and individual developers alike can call Tencent Hunyuan directly through the API on Tencent Cloud for more convenient intelligent upgrades. Tencent has also joined with ecosystem partners to combine large-model technology with more than 20 industries, providing large-model solutions for more than 50 industry sectors.

The era of large models will usher in the next generation of cloud services. Tencent Cloud is committed to building "the cloud most suitable for large models" and will continue to upgrade its underlying AI infrastructure to help enterprises seize the opportunities of the AI era.
