What exactly is Spine-Leaf?


Today's story begins 67 years ago.

In 1953, a researcher named Charles Clos at Bell Labs published an article titled "A Study of Non-blocking Switching Networks", which introduced a method of "using multi-stage equipment to achieve non-blocking telephone switching."

Since the invention of the telephone in 1876, the telephone exchange network has gone through several stages, including manual switches, step-by-step switches, and crossbar switches. In the 1950s, the crossbar switch was at its peak.

The core of the crossbar switch is the crossbar connector, as shown in the following figure:

Crossbar

Schematic diagram of crossbar crosspoints

This switching architecture is a switch matrix, where each crosspoint is a switch. Forwarding from an input to an output is completed by opening and closing these crosspoint switches.

Switch matrix (number of crosspoints = N²)

As you can see, the switch matrix closely resembles the weave of a piece of cloth. For this reason, the internal architecture of a switch is called the Switch Fabric ("fabric" originally means woven cloth).

I believe all core-network and data-communication engineers are familiar with the word "Fabric"; concepts such as "Fabric plane" and "Fabric bus" come up constantly in daily work.

As the number of telephone users increased dramatically and the network scale expanded rapidly, crossbar-based switches could no longer meet the requirements in terms of capacity and cost. This is the context in which Charles Clos published the paper mentioned at the beginning of this article.


Charles Clos (first from right)

The core idea of the network model Charles Clos proposed is to build a complex, large-scale network out of many small-scale, low-cost units, as in the following figure:

The rectangles in the figure are all low-cost forwarding units. As the number of inputs and outputs grows, the number of crosspoints in the middle does not need to grow nearly as fast.

This is the CLOS network model, which would go on to have a profound impact.

In the 1980s, with the rise of computer networks, various network topologies began to emerge, such as star, chain, ring, and tree.

Tree networks have gradually become mainstream and everyone is very familiar with them.

Tree Network

In traditional tree-type networks, bandwidth converges level by level. What is convergence? If all physical ports have the same bandwidth, then two downlinks sharing a single uplink gives a 2:1 oversubscription (convergence) ratio.
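The convergence arithmetic above can be sketched in a few lines. This is a minimal illustration with made-up port counts, not a model of any specific switch:

```python
# Sketch of oversubscription ("convergence") in a tree network.
# Port speeds and counts below are illustrative assumptions.

def oversubscription_ratio(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of total downstream bandwidth to total upstream bandwidth."""
    return downlink_gbps / uplink_gbps

# Two 10G downlinks sharing one 10G uplink -> 2:1 convergence.
ratio = oversubscription_ratio(downlink_gbps=2 * 10, uplink_gbps=1 * 10)
print(ratio)  # 2.0
```

A ratio of 1.0 would mean a non-blocking (non-oversubscribed) layer; anything larger means traffic can be bottlenecked at the uplink.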

After 2000, the Internet recovered from the dot-com crash, and Internet giants represented by Google and Amazon began to rise. They promoted cloud computing technology and built large numbers of data centers (IDCs), even super-scale data centers.

Faced with the increasingly large scale of computing, the traditional tree network is definitely not enough. Therefore, an improved tree network began to appear, which is the Fat-Tree architecture.

Fat-Tree is a type of CLOS network architecture.

Compared with the traditional tree type, the fat tree is more like a real tree, where the branches become thicker towards the root. The network bandwidth does not converge from the leaves to the root.

The basic idea of the fat-tree architecture is to use a large number of low-performance switches to build a large-scale non-blocking network: for any traffic pattern, there is always a set of paths that lets the communication bandwidth reach the full network-card bandwidth.

After the fat-tree architecture was introduced into the data center, the data center network took on the now-traditional three-tier structure:

Access layer: connects all computing nodes, usually in the form of top-of-rack (ToR) switches.

Aggregation layer: interconnects the access layer and serves as the L2/L3 boundary of its aggregation zone. Services such as firewalls and load balancers are also deployed here.

Core layer: interconnects the aggregation layer and provides Layer-3 connectivity between the entire data center and external networks.

For a long time, the three-layer network structure was very popular in data centers. In this architecture, copper cable wiring is the main wiring method, with a usage rate of 80%, while optical cables only account for 20%.

As time went by, people discovered that the traditional three-tier architecture had many shortcomings.

First, it is a waste of resources.

In the traditional three-layer structure, a lower-layer switch is interconnected with two upper-layer switches through two links.

Because the Spanning Tree Protocol (STP) is used, only one of these links actually carries traffic; the other uplink is blocked (kept only as a backup), which wastes bandwidth.

Secondly, the fault domain is relatively large.

Due to its algorithm, STP must reconverge whenever the network topology changes. This process is failure-prone and can affect the entire VLAN.

The third and most important point is that over time, the traffic trends in data centers have changed dramatically.

After 2010, in order to improve the utilization of computing and storage resources, all data centers began to adopt virtualization technology. A large number of virtual machines (VMs) began to appear on the network.

At the same time, microservice architecture became popular, and many software began to promote functional decoupling. A single service became multiple services, deployed on different virtual machines. The traffic between virtual machines increased significantly.

We call this kind of data flow between horizontal devices "east-west traffic."

Correspondingly, the vertical data flow between layers is called "north-south traffic". The names follow map orientation: north is up, south is down, west is left, east is right.

East-west traffic is actually a kind of "internal traffic". This substantial increase in data traffic has brought great trouble to the traditional three-tier architecture - because the communication between servers needs to go through access switches, aggregation switches and core switches.

Data flow example

This means the load on core and aggregation switches keeps growing. Supporting a large-scale network requires aggregation- and core-layer devices with top performance and the highest port density, and such devices are very expensive.

Therefore, network engineers proposed the spine-leaf architecture, which is our protagonist today: the leaf-spine network (sometimes also called the spine-leaf network), named for its spine switches and leaf switches.

The leaf-spine network architecture, like the fat-tree structure, belongs to the CLOS network model.

Compared with the three-layer architecture of traditional networks, the leaf-spine network is flattened into a two-layer architecture, as shown in the following figure:

The leaf switch is equivalent to the access switch in the traditional three-tier architecture and connects directly to physical servers as a ToR (Top of Rack) switch. Above the leaf switch is a Layer-3 network; below it is an independent L2 broadcast domain. If servers under two different leaf switches need to communicate, the traffic is forwarded through a spine switch.

The spine switch is equivalent to the core switch. Multiple paths are dynamically selected between the leaf and spine switches through ECMP (Equal Cost Multi Path).
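The idea behind ECMP path selection can be sketched as follows: hash a flow's 5-tuple and take the result modulo the number of equal-cost paths. Real switches do this in hardware with vendor-specific hash functions; this is only an illustration of the principle, with made-up addresses:

```python
# Illustrative ECMP path selection: hash the flow's 5-tuple, then pick
# one of the equal-cost spine paths with a modulo. Because the hash is
# deterministic, all packets of one flow take the same path, preserving
# per-flow packet order while spreading different flows across spines.
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths

# One flow between two hypothetical servers, four spine switches:
path = ecmp_path("10.0.1.5", "10.0.2.7", 49152, 443, "tcp", num_paths=4)
print(path)  # an index in 0..3, stable for this flow
```

Because selection is per-flow rather than per-packet, a single elephant flow cannot be split across spines; this is a well-known trade-off of hash-based ECMP.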

The number of downlink ports on the spine switch determines the number of leaf switches, while the number of uplink ports on the leaf switch determines the number of spine switches. Together, they determine the scale of the leaf-spine network.

The advantages of leaf-spine networks are obvious:

1. High bandwidth utilization

The uplink of each leaf switch works in a load balancing manner to fully utilize the bandwidth.

2. Network delay is predictable

In this model, the path between any two leaf switches is deterministic: traffic crosses exactly one spine switch, so east-west latency is predictable.

3. Good scalability

When bandwidth is insufficient, adding spine switches expands bandwidth horizontally. When the number of servers increases, adding leaf switches expands the scale of the data center. In short, planning and expansion are very convenient.

4. Reduce the requirements for switches

North-south traffic can go out from leaf nodes or spine nodes. East-west traffic is distributed on multiple paths. This eliminates the need for expensive high-performance, high-bandwidth switches.

5. High security and availability

Traditional networks use STP, which must reconverge when a device fails, degrading performance or even causing outages. In the leaf-spine architecture, a device failure requires no reconvergence: traffic simply continues over the remaining paths. Connectivity is unaffected, and total bandwidth drops by only one path's worth, with minimal performance impact.

Cisco's Nexus 9396PX is suitable as a leaf switch

Let's analyze the support capabilities of the leaf-spine network using a case model.

Assume a resource condition like this:

Number of spine switches: 16

Uplink ports per spine switch: 8 x 100G

Downlink ports per spine switch: 48 x 25G

Number of leaf switches: 48

Uplink ports of each leaf switch: 16 × 25G

Downlink ports of each leaf switch: 64 × 10G

Ideally, the total number of servers such a leaf-spine network can support is 48 × 64 = 3072. (Note that a leaf switch's total uplink bandwidth is generally less than its total downlink bandwidth; an oversubscription ratio of 3:1 is common. In this example it is 640G:400G, i.e. 1.6:1, which is rather generous.)
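The figures in the example above can be checked with a short calculation; all numbers come straight from the list of assumed port counts:

```python
# Recomputing the example leaf-spine fabric from its port counts.
leaf_count = 48
leaf_uplinks, leaf_uplink_gbps = 16, 25    # per leaf switch
leaf_downlinks, leaf_downlink_gbps = 64, 10

# Each leaf's 16 uplinks reach the 16 spines; each spine's 48
# downlinks reach the 48 leaves -- consistent with the figures above.
servers = leaf_count * leaf_downlinks
print(servers)  # 3072

# Per-leaf oversubscription: total downlink vs total uplink bandwidth.
north_gbps = leaf_uplinks * leaf_uplink_gbps      # 400 Gbps
south_gbps = leaf_downlinks * leaf_downlink_gbps  # 640 Gbps
print(south_gbps / north_gbps)  # 1.6
```

Swapping in different port counts makes it easy to see how spine count, leaf count, and oversubscription trade off against each other.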

This example also shows that leaf-spine networks bring about a trend, which is a significant increase in the demand for the number of optical modules.

The figure below is a comparison of the number of optical modules used in the traditional three-layer architecture and the leaf-spine architecture. The difference may be as much as 15-30 times.

(From Guotai Junan Securities Research)

Because of this, the capital market pays close attention to leaf-spine networks, hoping to use them to drive the growth of the optical module market, especially high-speed optical modules such as 100G and 400G.

Leaf-spine topology networks began to appear around 2013 and have developed at an astonishing speed. They quickly replaced a large number of traditional three-layer network architectures and became the new favorite of modern data centers.

The most representative is the data center architecture Facebook disclosed in 2014, which uses a five-stage CLOS architecture, effectively a three-dimensional one. You can study it if you are interested.

Facebook Data Center Architecture

In addition to Facebook, Google's fifth-generation data center architecture Jupiter also uses leaf-spine networks on a large scale, and the network bandwidth it can support has reached the Pbps level. Each of the 100,000 servers in Google's data center can communicate with each other at a speed of 10 gigabits per second in any mode.


Google Data Center

Well, that’s all for today’s introduction to leaf-spine networks.
