In order to understand the principle of CDN, I have gone bald.

This article is reprinted from the WeChat public account "Java Tutorial Class", written by JayceH. To reprint it, please contact the Java Tutorial Class public account.

CDN Overview

CDN stands for Content Delivery Network. Its basic idea is to route around the bottlenecks and unstable links on the Internet that slow down or disrupt data transmission, so that content is delivered faster and more reliably.

A CDN works by caching the origin site's resources on its nodes. When a request hits a node's cache, the resource is returned to the client immediately instead of being fetched from the origin site on every request. This avoids network congestion, relieves pressure on the origin site, and keeps resource access fast for users.

Here is a real-life analogy. When we buy goods on JD.com, the courier can often deliver them the same day. This works because JD.com has built local warehouses across the country: when a user places an order, the intelligent warehousing and distribution system ships from the nearest warehouse, which shortens delivery time.

Inventory is distributed along the path factory (origin site) -> regional warehouse (second-level cache) -> local warehouse (first-level cache).

A content delivery network, like the warehouse network above, addresses the access latency caused by geographic distribution, limited bandwidth, and server performance. It suits scenarios such as site acceleration, video on demand, and live streaming. By letting users fetch content from a nearby node, it eases Internet congestion and improves both the response time and the success rate of website access.

The Birth of CDN

CDN was born more than 20 years ago. In 1995, to relieve the excessive pressure on content origin servers and backbone networks, MIT applied mathematics professor Tom Leighton, taking up a challenge posed by World Wide Web inventor Tim Berners-Lee, led graduate student Danny Lewin and several other top researchers in an attempt to solve network congestion with mathematics.

They developed mathematical algorithms for dynamic content routing, which eventually solved the problem that had plagued Internet users. Later, Jonathan Seelig, an MBA student at the MIT Sloan School of Management, joined Leighton's team. They then began to carry out their business plan and formally founded the company, named Akamai, on August 20, 1998. Through intelligent Internet content distribution, Akamai helped end the frustration of the "World Wide Wait".

In 1998, ChinaCache, China's first CDN company, was established.

How CDN works

Connect to CDN

Before a site is connected to a CDN, resolving its domain name returns the IP address of the real (origin) server directly.

To accelerate a website, we register the acceleration domain name and the origin domain name with the CDN provider, then open the DNS configuration of our own domain and replace its A record with a CNAME record that points to the name assigned by the provider. Alibaba Cloud's acceleration setup follows this procedure, for example.

CDN access process

1. When a user requests an image, the domain name is first resolved by the local DNS (LDNS). If the LDNS has the answer cached, it returns it to the user directly.

2. On an LDNS miss, the LDNS forwards the query to the authoritative DNS.

3. The authoritative DNS returns the CNAME picwebws.pstatp.com.wsglb0.com., which resolves to the IP address of the CDN's DNS scheduling system.

4. The resolution request is sent to the DNS scheduling system, which assigns the best node IP address for the request.

5. The resolved IP address is returned to the user.

6. The user sends the request to the cache server at that address, and the cache server responds with the requested content.
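
The six steps above can be sketched as a toy resolver. This is only an illustration under stated assumptions: `img.example.com`, the region names, and the IP addresses (taken from documentation ranges) are hypothetical, and `schedule` merely stands in for the CDN's DNS scheduling system.

```python
# Toy DNS data. Only the CNAME target is quoted from the article; the
# accelerated domain, regions, and IP addresses are hypothetical.
AUTHORITATIVE = {
    # The site owner's zone: the accelerated domain is a CNAME, not an A record.
    "img.example.com": ("CNAME", "picwebws.pstatp.com.wsglb0.com"),
}

EDGE_NODES = {
    "beijing": "203.0.113.10",
    "guangzhou": "198.51.100.20",
}


def schedule(cname: str, client_region: str) -> str:
    """Stand-in for the CDN's DNS scheduling system (step 4)."""
    return EDGE_NODES.get(client_region, EDGE_NODES["beijing"])


def resolve(domain: str, client_region: str, ldns_cache: dict) -> str:
    # Step 1: answer from the local DNS cache when possible.
    if domain in ldns_cache:
        return ldns_cache[domain]
    # Steps 2-3: ask the authoritative DNS, which returns a CNAME.
    record_type, value = AUTHORITATIVE[domain]
    if record_type == "CNAME":
        # Steps 4-5: the scheduling system picks an edge node IP.
        ip = schedule(value, client_region)
    else:
        ip = value
    ldns_cache[domain] = ip  # later lookups through this LDNS hit the cache
    return ip
```

Calling `resolve("img.example.com", "guangzhou", {})` walks steps 2 to 5; a second call with the same cache dictionary returns at step 1. Step 6, the HTTP request to the returned address, is outside the scope of this sketch.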

Figure: Schematic diagram of Huawei Cloud full-site acceleration

​​

What problems does CDN solve?

Backbone network is under too much pressure

In 1995, Tom Leighton led a team that tried to solve network congestion with mathematics, and with it the excessive pressure on the backbone network. As more and more people came online, the throughput of the backbone's core nodes could no longer keep up with the growth in Internet users. By serving content from nearby nodes, a CDN keeps much of the user traffic off the backbone network.

The backbone network is the global network that ties regional networks together. Tier 1 Internet Service Providers (ISPs) interconnect their high-speed fiber optic networks to form the backbone of the Internet, enabling efficient transmission of traffic between different geographical areas.

1. Local Area Network

A local area network (LAN) is a group of computers interconnected within a limited area. For example, when I was in college, Internet access was cut off after midnight, but we could still play CS and Warcraft through the router. That worked because the computers were interconnected over the LAN, which is enough for sharing data and communicating.

2. Backbone network

Take China Telecom's network architecture as an example. The backbone network can be loosely understood as a nationwide version of such a network: core nodes exchange traffic with one another, which interconnects the whole network. This is why we call it the Internet.

Beijing, Shanghai, and Guangzhou are the super core nodes of ChinaNet. Besides these, ChinaNet also has ordinary core nodes such as Tianjin, Xi'an, Nanjing, Hangzhou, Wuhan, and Chengdu.

Middle Mile

A network request usually travels "three miles":

  • The first mile runs from the origin site to its ISP access point.
  • The second mile (the middle mile) runs from the origin site's ISP access point to the ISP access point of the visiting user.
  • The third mile (the last mile) runs from the user's ISP access point to the user's client.

The CDN network layer mainly accelerates the second mile (the middle mile).

A CDN's infrastructure usually accelerates traffic with two levels of servers:

  • L1 (lower layer): placed as close to the users as possible. It usually caches cacheable static data and covers the last mile.
  • L2 (upper layer): placed as close to the origin site as possible; this layer covers the first mile. When L1 misses the cache or the content is not cacheable, L1 passes the request through to L2. If L2 also misses or the content is not cacheable, the request is passed on to L2's upstream (which may be the origin site or an L3 layer). L2 also aggregates the traffic and requests coming from many L1 nodes, which reduces back-to-source volume for cacheable content and relieves pressure on the origin site.
  • The path between L1 and L2 is the CDN's "internal network", known as the middle mile.

CDN components

Global Load Balance (GLB)

​​

  • When a user visits a website that uses a CDN service, the domain name resolution request is ultimately handled by the "smart scheduling DNS".
  • Using a set of predefined policies, it gives the user the address of the node that is closest at that moment, so that the user gets fast service.
  • It also has to stay in contact with the CDN nodes distributed across regions and track each node's health, capacity, and other status, so that user requests are assigned to the nearest available node.
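
A minimal sketch of such a policy is shown below: pick the closest node that is healthy and still has spare capacity. The `Node` fields, the `DISTANCE` table, and the 0.9 load limit are all assumptions made for illustration, not how any particular CDN implements its scheduler.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    region: str
    healthy: bool
    load: float  # fraction of capacity in use, 0.0 to 1.0


# Hypothetical distance table between regions (smaller means closer).
DISTANCE = {
    ("guangzhou", "guangzhou"): 0,
    ("guangzhou", "wuhan"): 1,
    ("guangzhou", "beijing"): 2,
}


def pick_node(client_region: str, nodes: List[Node], max_load: float = 0.9) -> Node:
    """Return the closest node that is healthy and below the load limit."""
    candidates = [n for n in nodes if n.healthy and n.load < max_load]
    if not candidates:
        raise RuntimeError("no available CDN node")
    # Unknown region pairs get a large distance; ties are broken by load.
    return min(
        candidates,
        key=lambda n: (DISTANCE.get((client_region, n.region), 99), n.load),
    )
```

If the Guangzhou node is marked unhealthy, a Guangzhou client is sent to the next closest node in the table instead, which is the "nearest available node" behaviour described above.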

Cache Server

The main job of the cache server is to cache hot data. The data types include static resources (HTML, JS, CSS, etc.), multimedia resources (images, MP3, MP4, etc.), and dynamic data (edge rendering).

Well-known open source software used in CDNs includes:

  • Squid
  • Varnish
  • Nginx
  • OpenResty
  • ATS (Apache Traffic Server)
  • HAProxy

For specific comparison, please refer to: https://blog.csdn.net/joeyon1985/article/details/46573281

CDN layered architecture

​​

Origin

The origin site is the original site where content is published. Files on the website are added, deleted, and modified on the origin site, and every object the cache servers fetch ultimately comes from it.

CDN Scheduling Strategy

DNS Scheduling

The CDN's DNS schedules requests based on the egress IP of the requester's local DNS, from which it infers the user's location and carrier.

Problems with DNS scheduling:

  • Cached DNS records are not refreshed until their TTL expires. When a node fails, automatic rescheduling therefore takes effect with a long delay, which directly affects online business access.
  • Many local DNS servers do not support the EDNS Client Subnet extension, so the CDN cannot learn the client's real IP. Most of the time it can only decide based on the local DNS IP, and users are often scheduled to a node in the wrong region.
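
The second problem can be illustrated with a small sketch. It assumes a hypothetical prefix-to-region table (`REGION_BY_PREFIX`, built from documentation address ranges) and shows how the decision changes depending on whether the resolver forwards the client's subnet.

```python
import ipaddress
from typing import Optional

# Hypothetical prefix-to-region table; the prefixes are documentation ranges.
REGION_BY_PREFIX = {
    ipaddress.ip_network("203.0.113.0/24"): "beijing",
    ipaddress.ip_network("198.51.100.0/24"): "guangzhou",
}


def region_of(ip: str) -> str:
    """Map an IP address to a region using the table above."""
    addr = ipaddress.ip_address(ip)
    for prefix, region in REGION_BY_PREFIX.items():
        if addr in prefix:
            return region
    return "unknown"


def scheduling_region(ldns_ip: str, client_subnet: Optional[str] = None) -> str:
    """Use the client subnet when the resolver forwards it, else the LDNS IP."""
    if client_subnet is not None:
        network = ipaddress.ip_network(client_subnet)
        return region_of(str(network.network_address))
    return region_of(ldns_ip)
```

A Guangzhou user whose local DNS sits in Beijing is scheduled to Beijing when only the LDNS address `203.0.113.53` is visible, but to Guangzhou once the subnet `198.51.100.0/24` is forwarded.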

HTTP DNS Scheduling

The client sends an HTTP request to a fixed HTTP DNS address and takes the resolution result from the response. This improves resolution accuracy, because the server sees the client's own IP rather than only the local DNS IP, and it effectively avoids DNS hijacking and similar problems.

This model has drawbacks as well. For example, every URL the client loads may trigger an HTTP DNS query, which places high demands on the performance of the service and on network access.
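
One common way to soften that cost is to cache answers on the client for their TTL. The sketch below assumes a hypothetical `query` callable that performs the actual HTTP DNS request and returns an `(ip, ttl_seconds)` pair; no real HTTP DNS service or SDK is implied.

```python
import time
from typing import Callable, Dict, Tuple


class HttpDnsClient:
    """Caches HTTP DNS answers so that not every URL load triggers a query."""

    def __init__(
        self,
        query: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ):
        self._query = query  # hypothetical: performs the real HTTP DNS request
        self._clock = clock  # injectable so the cache can be tested
        self._cache: Dict[str, Tuple[str, float]] = {}  # host -> (ip, expiry)

    def resolve(self, host: str) -> str:
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]  # still fresh: no network round trip
        ip, ttl = self._query(host)
        self._cache[host] = (ip, now + ttl)
        return ip
```

The trade-off is the same as with ordinary DNS caching: a longer TTL means fewer queries but slower reaction when the scheduler wants to move traffic.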

302 Scheduling

Traffic is scheduled in real time by a 302 scheduling cluster based on the client IP.

Let's look at an example:

  • The client requests a URL and the request reaches the scheduling cluster. The client information available there includes the client's egress IP (which is the same in most cases). The algorithm can then be the same as in DNS-based scheduling, except that the decision is based on the client's egress IP instead of the local DNS egress IP. The cluster answers with a 302 response whose Location header points to the chosen node.
  • The browser receives the 302 response and follows the URL in the Location header with a new HTTP request. This time the target IP is a CDN edge node, which responds with the actual file content.
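
The scheduling side of this exchange might look like the sketch below. The edge host names, the region names, and the toy `region_of` lookup are hypothetical; a real scheduler would consult an IP geolocation database and node status.

```python
from typing import Dict, Tuple
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from client region to an edge node host name.
EDGE_BY_REGION = {
    "south": "edge-gz.cdn.example.net",
    "north": "edge-bj.cdn.example.net",
}


def region_of(client_ip: str) -> str:
    """Toy lookup standing in for an IP geolocation database."""
    return "south" if client_ip.startswith("198.51.100.") else "north"


def schedule_302(url: str, client_ip: str) -> Tuple[int, Dict[str, str]]:
    """Return the status code and headers of the redirect response."""
    parts = urlsplit(url)
    edge_host = EDGE_BY_REGION[region_of(client_ip)]
    # Keep scheme, path, and query; swap only the host for the edge node.
    location = urlunsplit((parts.scheme, edge_host, parts.path, parts.query, ""))
    return 302, {"Location": location}
```

The browser then repeats the request against the host in `Location`, which is the second bullet above.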

Advantages of 302 scheduling:

  • Real-time: because no local DNS cache is involved, scheduling changes take effect immediately. This suits peak shaving on a CDN, which matters a great deal for cost control.
  • High accuracy: the scheduler sees the client's egress IP directly.

Disadvantages of 302 scheduling:

  • Every request needs an extra redirect, which is unfriendly to latency-sensitive services. It is generally suitable only for large files.

AnyCast BGP routing scheduling

With BGP anycast routing, the CDN exposes only a small number of external IP addresses, and the routing policy can be adjusted quickly.

Cloudflare, for example, schedules traffic at the routing level in this way.

This method helps absorb DDoS attacks and reduces network congestion.

However, it is costly and complex to design, so CDNs in China currently still use unicast.

Some concepts

How the CDN cache is stored

Each node's local cache maps URLs to cached content as key-value pairs. The storage structure resembles a Map and is organized as a hash table combined with a linked list.
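
As an illustration of a hash table combined with a linked list, the sketch below uses Python's `OrderedDict`, which keeps a dictionary together with an ordering of its keys. The least-recently-used eviction policy is an assumption added here to give the ordering a purpose; the article does not say which eviction policy a CDN node uses.

```python
from collections import OrderedDict


class LruCache:
    """URL -> content cache with least-recently-used eviction (assumed policy)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()  # hash lookup plus usage order

    def get(self, url: str):
        if url not in self._items:
            return None  # cache miss
        self._items.move_to_end(url)  # mark as most recently used
        return self._items[url]

    def put(self, url: str, content: bytes) -> None:
        self._items[url] = content
        self._items.move_to_end(url)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used
```

Lookups stay constant-time through the hash table, while the ordering makes it cheap to find which entry to evict when the node runs out of space.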

​​

CDN hit rate

The hit rate is a core metric of CDN service quality. When the resource a user requests is already in the cache system, it is returned directly, which counts as a CDN hit. If the resource is not in the CDN cache, a back-to-source request is triggered.
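
Expressed as a number, the request hit rate is simply hits divided by total requests. The helper below is a minimal sketch; the function name and the choice to report 0.0 when there has been no traffic are assumptions.

```python
def hit_rate(hits: int, misses: int) -> float:
    """Fraction of requests served from cache; 0.0 when there is no traffic."""
    total = hits + misses
    return hits / total if total else 0.0
```

With 950 hits and 50 misses, 95% of requests are served from the cache and the remaining 5% go back to the source.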

CDN back to source

When the CDN's local cache misses, a back-to-source request is triggered:

  • The first-level cache asks the second-level cache for the data. If the second-level cache has it, the data is returned to the first-level cache.
  • If the second-level cache also misses, it sends a back-to-source request to the origin site, stores the result in its local cache, and returns the data to the first-level cache.
  • The first-level cache receives the data, caches it locally, and returns it to the user.
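
The chain above can be sketched with a small `Tier` class in which each layer forwards misses to its upstream and caches the answer on the way back. The class, the node names, and the `origin` function are hypothetical and ignore expiry, cacheability rules, and concurrency.

```python
class Tier:
    """One cache layer; `upstream` is the next layer's get or the origin fetcher."""

    def __init__(self, name, upstream):
        self.name = name
        self.upstream = upstream  # callable: url -> content
        self.store = {}

    def get(self, url):
        if url in self.store:
            return self.store[url]    # hit: serve locally
        content = self.upstream(url)  # miss: go back toward the source
        self.store[url] = content     # cache on the way back
        return content


origin_requests = []


def origin(url):
    """Hypothetical origin site that records every request it receives."""
    origin_requests.append(url)
    return f"content of {url}"


l2 = Tier("L2", origin)
l1_a = Tier("L1-a", l2.get)  # two edge nodes share one second-level cache
l1_b = Tier("L1-b", l2.get)
```

If both `l1_a.get("/logo.png")` and `l1_b.get("/logo.png")` are called, the origin sees a single request: the second edge node's miss is absorbed by L2.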

CDN warm-up data

The access patterns described so far are all pull-based: user requests decide which hot data ends up in the CDN cache. For big sales promotions, we often need to warm up the edge nodes (L1) with campaign resources in advance, so that the surge of users at the start of the promotion does not overload the origin site. This is done in push mode.
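
A push-style warm-up can be sketched as follows. Edge caches are modelled as plain dictionaries and `fetch_from_origin` is a hypothetical callable, so this shows only the idea of filling the caches before users arrive, not any provider's prefetch API.

```python
def warm_up(urls, edge_caches, fetch_from_origin):
    """Push each URL's content into every edge cache ahead of the event."""
    for url in urls:
        content = fetch_from_origin(url)  # one origin fetch per URL
        for cache in edge_caches:
            cache[url] = content          # every edge node starts with a hit
```

The origin is contacted once per resource during the warm-up, rather than once per edge node after the promotion has started.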

Summary of CDN Features

1. Resource access acceleration: caching content on nearby nodes speeds up access to enterprise sites (especially sites with many images and static pages) and greatly improves their stability.

2. Removing inter-carrier bottlenecks: mirroring content inside each carrier's network removes the impact of interconnection bottlenecks between carriers, achieves cross-carrier acceleration, and ensures good access quality for users on different networks.

3. Remote acceleration: DNS load balancing automatically directs remote users to the fastest cache server, which speeds up remote access.

4. Bandwidth optimization: remote mirror cache servers are generated for the origin server automatically. Remote users read data from a cache server instead of the origin, which reduces long-distance bandwidth use, spreads network traffic, and lightens the load on the origin site's web server.

5. Resistance to attacks: widely distributed CDN nodes, together with intelligent redundancy between nodes, help defend against intrusion and reduce the impact of DDoS attacks on the website while maintaining good service quality.
