NVIDIA Ethernet Networking Accelerates the World's Largest AI Supercomputer, Built by xAI

Oct. 28, 2024: NVIDIA announced that xAI’s Colossus supercomputer cluster in Memphis, Tennessee, has reached a scale of 100,000 NVIDIA® Hopper GPUs. The cluster's RDMA (Remote Direct Memory Access) network is built on the NVIDIA Spectrum-X™ Ethernet networking platform, which is designed to deliver high performance for multi-tenant, hyperscale AI factories.

Colossus is the world’s largest AI supercomputer. It is currently used to train xAI’s Grok family of large language models, which power the chatbot offered as a feature to X Premium subscribers. xAI is in the process of doubling the size of Colossus to 200,000 NVIDIA Hopper GPUs.

xAI and NVIDIA built the supporting facility and the supercomputer itself in just 122 days, and only 19 days passed between the first rack arriving on site and the start of training. A system of this scale usually takes many months or even years to build.

While training the very large Grok models, Colossus has achieved unprecedented network performance. Across all three tiers of the network fabric, the system has experienced no application latency degradation and no packet loss due to flow collisions. With Spectrum-X congestion control, it has maintained 95% data throughput.

This level of performance cannot be achieved at scale with standard Ethernet, which delivers only 60% data throughput when thousands of flows collide.
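As a rough illustration of what the gap between 95% and 60% data throughput means in practice, the sketch below converts each figure into effective bandwidth and transfer time. The 400 Gb/s link rate and the 10 GB payload are illustrative assumptions, not figures from the announcement, and the function names are hypothetical.

```python
def effective_gbps(nominal_gbps: float, utilization: float) -> float:
    """Usable bandwidth for a link running at the given fraction of line rate."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return nominal_gbps * utilization


def transfer_seconds(payload_gigabytes: float, nominal_gbps: float,
                     utilization: float) -> float:
    """Time to move a payload, ignoring protocol overhead and latency."""
    payload_gigabits = payload_gigabytes * 8
    return payload_gigabits / effective_gbps(nominal_gbps, utilization)


NOMINAL_GBPS = 400.0  # assumed link rate, for illustration only
PAYLOAD_GB = 10.0     # assumed payload size, for illustration only

spectrum_x = transfer_seconds(PAYLOAD_GB, NOMINAL_GBPS, 0.95)
standard = transfer_seconds(PAYLOAD_GB, NOMINAL_GBPS, 0.60)

print(f"95% throughput: {spectrum_x:.3f} s")
print(f"60% throughput: {standard:.3f} s")
print(f"speed-up: {standard / spectrum_x:.2f}x")
```

Whatever link rate and payload are assumed, the ratio between the two transfer times is 0.95 / 0.60, or roughly 1.58: the same transfer takes about 58% longer on the network running at 60% throughput.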

“AI is becoming increasingly critical, placing greater demands on performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is purpose-built to enable innovators like xAI to process, analyze and execute AI workloads faster, accelerating the development, deployment and time to market of AI solutions.”

Elon Musk said on X: “Colossus is the most powerful training system in the world. Well done to the xAI team, NVIDIA, and our many partners and suppliers.”

“xAI has built the world’s largest and most powerful supercomputer,” said an xAI spokesperson. “With NVIDIA Hopper GPUs and Spectrum-X, we are able to push the boundaries of large-scale AI model training and build an AI factory that is super-accelerated and optimized based on Ethernet standards.”

At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800 Gb/s and is based on the Spectrum-4 switch ASIC. xAI uses an end-to-end solution that pairs the SN5600 switch with NVIDIA BlueField-3® SuperNICs to achieve unprecedented performance.

Spectrum-X Ethernet networking for AI brings advanced features that deliver efficient, scalable bandwidth with low latency and short tail latency, capabilities that were previously exclusive to InfiniBand networks. These features include adaptive routing based on NVIDIA DDP (Direct Data Placement) technology, congestion control, and enhanced visibility and performance isolation for AI networks, all of which are key requirements for multi-tenant generative AI clouds and large-scale enterprise environments.
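To give a feel for why routing that reacts to load helps avoid flow collisions, here is a toy model. It is not Spectrum-X's adaptive routing, which the announcement does not describe in detail. It assumes equal-size flows and compares pinning each flow to a pseudo-randomly chosen link (a stand-in for static header hashing) with placing each flow on the least-loaded link. All names and sizes are hypothetical.

```python
import random


def static_hash_placement(num_flows: int, num_links: int, seed: int = 0) -> list[int]:
    """Pin each flow to one link, chosen pseudo-randomly as a stand-in for a header hash."""
    rng = random.Random(seed)
    load = [0] * num_links
    for _ in range(num_flows):
        load[rng.randrange(num_links)] += 1
    return load


def least_loaded_placement(num_flows: int, num_links: int) -> list[int]:
    """Place each flow on whichever link currently carries the fewest flows."""
    load = [0] * num_links
    for _ in range(num_flows):
        target = load.index(min(load))
        load[target] += 1
    return load


NUM_FLOWS, NUM_LINKS = 64, 16  # hypothetical sizes, for illustration only

static_load = static_hash_placement(NUM_FLOWS, NUM_LINKS)
adaptive_load = least_loaded_placement(NUM_FLOWS, NUM_LINKS)

print("static   max flows per link:", max(static_load))
print("adaptive max flows per link:", max(adaptive_load))
```

In this model the least-loaded policy spreads the 64 flows evenly, four per link, whereas static placement can leave some links carrying more flows than others. The busiest link is where flows collide and throughput suffers.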
