Content Delivery Network (CDN) System Design

A CDN is a group of geographically distributed proxy servers. A proxy server is an intermediary server between the client and the origin server. These proxy servers are located at the edge of the network, close to the end-user. The placement of proxy servers helps in fast delivery of content to the end-user by reducing latency and saving bandwidth. CDN also has additional intelligence for optimizing traffic routing and implementing rules to protect against DDoS attacks or other abnormal network events.

Essentially, CDN solves two problems:

  • High latency. If your service is deployed in the United States, users in Asia will see higher latency because of the physical distance to the serving data center.
  • Data-intensive applications. They transfer large amounts of data, and over long distances problems arise because multiple Internet service providers sit on the path. Some of them may have smaller links, congestion, packet loss, or other issues. The longer the distance, the more providers there are on the path, and the higher the chance that one of them has a problem.

Let's start with the high-level component design and main workflow.

High-level architecture

  • The routing system directs the client to the closest or optimal CDN location. To perform this function effectively, this component receives input from various systems to understand where the request is coming from, where the content is located, how busy each data center is, and so on. The two most popular routing systems are DNS with load balancing and Anycast. We will discuss them later in the video.
  • Scrubber servers are used to separate good traffic from malicious traffic to protect against DDoS attacks. Scrubber servers are typically used only when an attack is detected. Today, scrubber servers are very sophisticated, allowing clients to push very fine-grained firewall rules and have those rules applied across all data centers in real time.
  • Proxy or edge proxy servers serve content to end users. They typically cache content and provide fast retrieval from RAM.
  • The content distribution system is responsible for distributing content to all edge proxy servers in different CDN facilities. Usually a tree distribution model is used. This will be explained in detail later in the video.
  • Origin servers are user infrastructure that hosts the original content distributed on the CDN.
  • The data control system is used to observe resource usage and statistics. This component measures metrics such as latency, downtime, packet loss, server load, etc. This is then fed back to the routing system for optimal routing.
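The feedback loop between the data control system and the routing system can be sketched in a few lines. This is a minimal illustration, not any real CDN's algorithm: the metric names and the weighting (latency scaled up by load) are assumptions chosen to show how routing can steer traffic away from a busy data center.

```python
# Sketch of a routing decision fed by data-control metrics.
# Metric names and the scoring formula are illustrative assumptions.

def pick_location(location_metrics):
    """Return the CDN location with the best weighted score.

    location_metrics maps location name -> dict with measured
    latency_ms (from the data control system) and load (0..1).
    """
    def score(m):
        # Lower is better: latency dominates, and load steers
        # traffic away from busy data centers.
        return m["latency_ms"] * (1.0 + m["load"])

    return min(location_metrics, key=lambda loc: score(location_metrics[loc]))

metrics = {
    "us-east": {"latency_ms": 20, "load": 0.9},   # close but busy
    "us-west": {"latency_ms": 35, "load": 0.2},   # farther but idle
    "eu-west": {"latency_ms": 90, "load": 0.1},
}
print(pick_location(metrics))  # "us-east" (20*1.9 = 38 still beats 35*1.2 = 42)
```

With different weights, the same inputs could send the client to the idle data center instead; the point is only that routing consumes live measurements rather than static geography.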

Let's go through the main workflow:

  1. We start with the origin server delegating a specific DNS domain or DNS name to the CDN. This tells the CDN that all requests to a specific URL should be proxied through it.
  2. The origin server publishes the content to the distribution system, which is responsible for distributing the content on a set of edge proxy servers. Both "push" and "pull" models are commonly used, and often both are used.
  3. The distribution system sends eligible content to the proxy server, while keeping track of which content is cached on which proxy server. It also understands which content is static and dynamic, the TTL of data that needs to be refreshed, content leases, and more.
  4. The client requests the appropriate proxy server IP from the routing system, or uses Anycast IP to route to the nearest location.
  5. Client requests go through the Scrubber server.
  6. The scrubber server forwards good traffic to the edge proxy.
  7. The edge proxy server serves the client requests and periodically forwards its health information to the data control system. If the content is not available on the proxy, the request is forwarded to the origin server.
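Step 7, the cache-or-origin decision on the edge proxy, can be sketched as a tiny TTL cache. This is a hypothetical, minimal model: the class name, the `fetch_origin` callable, and the TTL value are all invented for illustration, and real edge proxies add eviction, leases, and dynamic-content rules on top (as step 3 describes).

```python
import time

class EdgeProxy:
    """Minimal sketch of step 7: serve from an in-memory cache,
    fall back to the origin on a miss or an expired TTL."""

    def __init__(self, fetch_origin, ttl=60):
        self.fetch_origin = fetch_origin  # callable: url -> content
        self.ttl = ttl                    # seconds before a refresh
        self.cache = {}                   # url -> (content, expires_at)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        hit = self.cache.get(url)
        if hit and hit[1] > now:
            return hit[0], "HIT"
        content = self.fetch_origin(url)            # miss: go to origin
        self.cache[url] = (content, now + self.ttl)
        return content, "MISS"

proxy = EdgeProxy(lambda u: f"body-of-{u}", ttl=60)
print(proxy.get("/logo.png", now=0))    # ('body-of-/logo.png', 'MISS')
print(proxy.get("/logo.png", now=30))   # ('body-of-/logo.png', 'HIT')
print(proxy.get("/logo.png", now=100))  # TTL expired -> 'MISS' again
```

The `now` parameter exists only so the TTL behavior is easy to demonstrate without waiting; in production the clock would come from `time.time()` directly.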

Now imagine that you have a single website and need to distribute content to 20 regions, each holding 20 replicas on its proxy servers. 20 regions × 20 replicas means the origin has to transfer the data to the CDN 400 times, which is very inefficient. To solve this problem, you can use a tree-like replication model.

Tree-like content distribution

Data is sent to a regional edge proxy server, which is then replicated to a child proxy server in the same region using the CDN's internal network. This way, we only need to replicate content once per region or geographic area. Depending on the scale, a region can be a specific data center or a larger geographic area, where we have two levels of parent proxy servers.
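The saving from the tree model is easy to quantify with the numbers from the example above. A quick back-of-the-envelope sketch (the function names are made up for illustration):

```python
# Compare origin-side transfer counts: flat fan-out vs. tree replication.

def direct_transfers(regions, replicas_per_region):
    # Flat model: the origin pushes every replica itself.
    return regions * replicas_per_region

def tree_transfers(regions, replicas_per_region):
    # Tree model: the origin pushes once per regional parent proxy;
    # each parent fans out to its children over the CDN's internal network.
    origin_pushes = regions
    internal_pushes = regions * (replicas_per_region - 1)
    return origin_pushes, internal_pushes

print(direct_transfers(20, 20))  # 400 transfers from the origin
print(tree_transfers(20, 20))    # (20, 380): only 20 from the origin
```

The total copy count is the same, but 380 of the 400 copies now travel over the CDN's fast internal links instead of crossing the public Internet from the origin.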

It is crucial for users to get data from the nearest proxy server, because the goal of a CDN is to reduce latency by bringing data close to the user. There are two routing models that CDN companies usually use. The first one is based on DNS with load balancing, which has been the most popular historically. I think the newer and more effective one is the Anycast model, which delegates routing and load balancing to the Internet's BGP protocol. Let's take a look at them.

In a typical DNS resolution, we use the DNS system to get the IP corresponding to a readable name. In our case, we will use DNS to return another DNS name to the client. This is called DNS redirection, and content providers use it to send clients to a specific CDN region. For example, if a customer tries to resolve company.com, the authoritative DNS server provides another URL (such as cdn.us-east.company.com). The client performs another DNS resolution and obtains the IP address of a suitable CDN proxy server in the US-East region. Depending on the location of the user, the DNS response will be different.
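The two-step resolution above can be modeled in a few lines. The domain names `company.com` and `cdn.us-east.company.com` come from the example in the text; the other region names, the country-to-region mapping, and the IP addresses (drawn from the TEST-NET range) are invented for illustration:

```python
# Toy model of DNS redirection: the authoritative server answers with a
# region-specific CNAME, and a second lookup yields a proxy IP.

REGION_CNAME = {   # hypothetical geo mapping on the authoritative server
    "US": "cdn.us-east.company.com",
    "DE": "cdn.eu-central.company.com",
    "JP": "cdn.ap-northeast.company.com",
}
PROXY_IPS = {      # illustrative TEST-NET addresses, not real ones
    "cdn.us-east.company.com": "192.0.2.10",
    "cdn.eu-central.company.com": "192.0.2.20",
    "cdn.ap-northeast.company.com": "192.0.2.30",
}

def resolve(name, client_country):
    if name == "company.com":                 # step 1: CNAME chosen by geography
        return resolve(REGION_CNAME[client_country], client_country)
    return PROXY_IPS[name]                    # step 2: proxy IP in that region

print(resolve("company.com", "US"))  # 192.0.2.10
print(resolve("company.com", "DE"))  # 192.0.2.20
```

Two clients asking the same question get different answers depending on where they ask from, which is exactly the behavior the paragraph describes.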

So first the client is mapped to the appropriate data center based on its location. In the second step, it reaches one of the load balancers that distribute load across the proxy servers. To move clients from one region to another, DNS changes must be made to remove the load balancer IPs of the affected region. For this to work, the DNS TTL must be set as low as possible so that clients pick up the change quickly.

But there will still be some traffic going through, and if that zone goes down, traffic will be impacted. I discuss similar issues in another video about scalable API gateways and edge design. I'll put a link to the video in the description.

A more effective approach is the Anycast design.

Anycast is a routing method in which all edge servers located in multiple locations share the same single IP address. It uses the Border Gateway Protocol, or BGP, to route clients along the natural network paths of the Internet. CDNs use the Anycast routing model to deliver traffic to the nearest data center, improving response times and preventing any single data center from being overloaded during traffic spikes such as DDoS attacks.

When a request is sent to an Anycast IP address, the router will direct it to the nearest machine on the network. If an entire data center fails or undergoes maintenance, the Anycast network can respond to the failure similar to how a load balancer splits traffic across multiple servers or regions; data will be transferred from the failed location to another data center that is still online and functioning properly.
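The failover behavior described above can be sketched with a toy model. Real BGP path selection is far richer (local preference, AS-path length, MED, and more); here hop count simply stands in for "nearest on the network," and the data center names and hop counts are invented:

```python
# Toy Anycast model: every data center advertises the same IP, and the
# network delivers each packet to the nearest reachable advertiser.

def anycast_route(hops_to_dc, online):
    """Pick the reachable data center with the fewest hops."""
    candidates = {dc: h for dc, h in hops_to_dc.items() if dc in online}
    if not candidates:
        raise RuntimeError("no data center reachable")
    return min(candidates, key=candidates.get)

hops = {"fra": 2, "ams": 3, "lhr": 5}  # illustrative hop counts for one client
print(anycast_route(hops, online={"fra", "ams", "lhr"}))  # fra

# Take fra offline for maintenance: traffic shifts automatically,
# with no DNS change and no TTL to wait out.
print(anycast_route(hops, online={"ams", "lhr"}))         # ams
```

Contrast this with the DNS approach, where the same failover required editing DNS records and waiting for client caches to expire.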

Anycast Reliability

Most of the Internet works via the Unicast routing model, where each node on the network has its own unique IP address. The DNS-plus-load-balancer design above is a Unicast design: one machine, one IP.

Anycast is the opposite: many machines, one IP.

While Unicast is the simplest way to run a network, it is not the only way. An Anycast network can be very resilient: because traffic always finds the best path, we can take entire data centers offline and traffic will automatically flow to the next closest data center.

A final benefit of Anycast is that it also helps mitigate DDoS attacks. In most DDoS attacks, many compromised "zombie" computers form what is known as a botnet. These machines can be spread across the network and generate so much traffic that they overwhelm a typical Unicast-connected machine. An Anycast network inherently increases the surface area available for absorbing such attacks: each data center with spare capacity absorbs the portion of the distributed botnet's denial-of-service traffic that is closest to it.
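A quick back-of-the-envelope sketch of that absorption effect. The bot traffic volumes and the bot-to-data-center assignment are made-up numbers; the point is only that Anycast splits a load that would otherwise land on a single Unicast host:

```python
# Each bot's traffic is delivered to the data center nearest to that bot,
# so no single site sees the full attack.

def absorbed_load(bot_traffic_gbps, nearest_dc):
    """Sum each bot's traffic at the data center closest to it."""
    load = {}
    for bot, gbps in bot_traffic_gbps.items():
        dc = nearest_dc[bot]
        load[dc] = load.get(dc, 0) + gbps
    return load

bots = {"bot1": 40, "bot2": 40, "bot3": 40}  # 120 Gbps of attack traffic
nearest = {"bot1": "us-east", "bot2": "eu-west", "bot3": "ap-south"}
print(absorbed_load(bots, nearest))
# Each site absorbs 40 Gbps instead of one host taking all 120.
```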

A real-world example is Cloudflare, which has built a global proxy network with hundreds of locations around the world. It claims to place them within about 50 milliseconds of 95% of the world's Internet-connected population. Since the network is also built on Anycast IP, it offers a total capacity of over 170 Tbps. This means that not only can they serve a large number of customers, they can also handle the largest DDoS attacks by spreading malicious traffic across multiple locations.
