What is CDN? Is using CDN definitely faster than not using it?

For developers, the term CDN is both familiar and unfamiliar.

I rarely need to touch it in day-to-day development, but I always hear others mention it.

We've all heard that it can speed things up, and we roughly know why. But what if we dig a little deeper?

Is using CDN definitely faster than not using it?

It gets a little confusing. But that's fine: today we will look at CDN from a different angle and understand it afresh.

What is CDN

For numeric and textual data, such as names and phone numbers, we need a place to store it.

We usually store it in a MySQL database.

Text data is stored in MySQL

When we need to retrieve this data, we need to read the MySQL database.

However, because MySQL data is stored on disk, a read throughput of about 5,000 QPS is already very good for a single instance.

That seems fine, but it is a bit worrying for a slightly larger system.

To improve performance, we add an in-memory cache layer in front of MySQL, such as the well-known Redis. Data is read from memory first, and only when it is not found there is it read from MySQL, which greatly reduces the number of MySQL reads. With this combination, read performance can easily reach tens of thousands of QPS.

MySQL and Redis

So far, we have covered the scenarios we most often meet in day-to-day development.

But now the data I want to handle is no longer the text data mentioned above, but image data.

For example, I have a handsome photo. It's the one below.

Every time I hear someone covering Tanya Chua's "Letting Go" on TikTok, I can't help wanting to post this picture.

And caption it "I still can't forget it".

So here comes the question.

Where should this image data be stored? And where should it be read from?

If we look back at the scenarios of MySQL and Redis, it is nothing more than a storage layer plus a cache layer.

Storage and caching layers

For file objects such as images, MySQL is unlikely to be used as the storage layer. Instead, a dedicated object storage service should be used, such as Amazon's S3 (Amazon Simple Storage Service; it is called S3 because the name starts with three S's) or Alibaba Cloud's OSS (Object Storage Service). In what follows, we will use OSS as the example.

As for the cache layer, Redis is no longer suitable, and it needs to be replaced with a CDN (Content Delivery Network).

CDN can be simply understood as the cache layer corresponding to object storage.

CDN and OSS

Now we can answer the question above. The image data is stored in object storage, and when users need it, it is read from the CDN.

How CDN works

Now that we have CDN and object storage, let's take a look at how they work together.

We can right-click an image on any web page and copy its URL to take a look.

You will find that the URL of the image looks like this.

 https://cdn.xiaobaidebug.top/1667106197000.png

The cdn.xiaobaidebug.top part at the front is the CDN domain name, and the 1667106197000.png part at the end is the path of the image.

When we enter this URL in the browser, an HTTP GET request is initiated, and then the following process takes place.

CDN query process

Stage 1: Your computer first obtains the IP address for the domain name cdn.xiaobaidebug.top through the DNS protocol.

• Step 1 and step 2: The browser cache is checked first, then the operating system (its DNS cache and the /etc/hosts file). If neither has an entry, the nearest DNS server (such as your home router) is queried. If the nearest DNS server has the answer cached, it returns a response.

• Step 3: If the nearest DNS server has no cached answer, it queries the root, first-level, second-level, and third-level domain servers in turn.

• Step 4: The nearest DNS server then gets the alias (CNAME) of the cdn.xiaobaidebug.top domain name, such as cdn.xiaobaidebug.top.w.kunlunaq.com.

• kunlunaq.com is the DNS scheduling system dedicated to Alibaba Cloud CDN.

• Step 5 to step 7: The nearest DNS server now queries kunlunaq.com, which returns the IP address "closest" to you.

Stage 2: This corresponds to step 8 in the figure above. The browser uses this IP to access the CDN node, and the CDN node returns the data.

In the first stage above, several new terms appear, such as CNAME, root domain, and first-level domain. They are described in detail in the previous article "What excellent designs in DNS are worth learning". If you are not familiar with them, take a look.

We know that the purpose of DNS is to obtain the IP address through the domain name.

But that’s just one of its many functions.

There are many types of DNS records. An A record maps a domain name to its IP address, while a CNAME record maps a domain name to an alias, which is itself another domain name.

For an ordinary domain name, DNS resolution usually returns the IP address directly (this is an A record; A stands for Address).

For example, below I use the dig command to issue a DNS request and print the resolution process.

$ dig +trace xiaobaidebug.top
;; ANSWER SECTION:
xiaobaidebug.top. 600 IN A 47.102.221.141

As you can see, xiaobaidebug.top resolves directly to the IP address 47.102.221.141.

But for the CDN domain name, after a round of queries, the first thing you get is a CNAME record pointing to xx.kunlunaq.com. You then dig xx.kunlunaq.com to get the corresponding IP addresses.

$ dig +trace cdn.xiaobaidebug.top
cdn.xiaobaidebug.top. 600 IN CNAME cdn.xiaobaidebug.top.w.kunlunaq.com.

$ dig +trace cdn.xiaobaidebug.top.w.kunlunaq.com
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.243
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.241
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.244
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.249
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.248
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.242
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.250
cdn.xiaobaidebug.top.w.kunlunaq.com. 300 IN A 122.228.7.251
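The two-step lookup in the dig output can be imitated with a toy resolver. This is only a sketch: the names and addresses are copied from the output above, while the `RECORDS` table layout and the `resolve` function are assumptions made for illustration, and no real DNS query is sent.

```python
# Toy record table. Names and addresses come from the dig output above;
# the table layout itself is an assumption made for illustration.
RECORDS = {
    "xiaobaidebug.top.": ("A", ["47.102.221.141"]),
    "cdn.xiaobaidebug.top.": ("CNAME", ["cdn.xiaobaidebug.top.w.kunlunaq.com."]),
    "cdn.xiaobaidebug.top.w.kunlunaq.com.": ("A", ["122.228.7.243", "122.228.7.241"]),
}


def resolve(name: str, max_hops: int = 8) -> list[str]:
    """Follow CNAME aliases until an A record is found, then return its IPs."""
    for _ in range(max_hops):
        record = RECORDS.get(name)
        if record is None:
            return []                 # no such name in the toy table
        record_type, values = record
        if record_type == "A":
            return values             # reached an address record
        name = values[0]              # CNAME: restart the lookup with the alias
    raise RuntimeError("too many CNAME hops")
```

Resolving `xiaobaidebug.top.` returns its address in one step, while `cdn.xiaobaidebug.top.` first yields the alias and only then the addresses behind it.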

Seeing this, another question comes up.

Why go to the trouble of adding a CNAME?

The CNAME actually points to the CDN-specific DNS server. It is just one small DNS server in the whole DNS system and looks like any other DNS server. DNS requests reach it in the normal way.

But once a request actually hits it, its special feature shows. An ordinary DNS server only needs to return the IP addresses registered for the domain name, but the CDN-specific DNS server returns the server IP "closest" to the caller.

The CDN-specific DNS resolution server will return the nearest CDN node IP

How does it know which server IP is closest to the caller?

You may have noticed that the word "closest" is enclosed in double quotes.

The CDN-specific DNS server is provided by the CDN provider. Alibaba Cloud, for example, certainly knows which CDN nodes it has, as well as the current load, response delay, and even weight of each of those servers. It also knows the IP address of the caller, from which it can infer the caller's network operator and approximate location, and it can then select the most suitable CDN server based on all of these conditions. This is what "closest" really means.

For example, if the CDN server closest to you is handling more traffic and responding slowly, while a server farther away can serve the current request better, the farther server may well be chosen.

In other words, the selected server may not be the geographically closest one, but it is the one judged most suitable at that moment.
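To make the idea of "most suitable" concrete, here is a small sketch of how such a scheduler might rank nodes. Everything in it is hypothetical: the `Node` fields, the penalty weights, and the sample addresses are invented for illustration and are not how any particular CDN provider does it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    ip: str
    region: str          # rough location of the node
    isp: str             # network operator the node sits in
    load: float          # current load, 0.0 (idle) to 1.0 (saturated)
    latency_ms: float    # recent response delay


def score(node: Node, caller_region: str, caller_isp: str) -> float:
    """Lower is better. The weights are made-up numbers for illustration."""
    penalty = node.latency_ms
    penalty += 100.0 * node.load                 # busy nodes are penalised
    if node.region != caller_region:
        penalty += 30.0                          # farther away
    if node.isp != caller_isp:
        penalty += 20.0                          # cross-operator traffic
    return penalty


def pick_node(nodes: list[Node], caller_region: str, caller_isp: str) -> Node:
    """Return the most suitable node for this caller, not merely the nearest."""
    return min(nodes, key=lambda n: score(n, caller_region, caller_isp))
```

With these weights, a nearly saturated node in the caller's own city can lose to a lightly loaded node in a neighbouring city, which matches the example in the text.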

What is back-to-source?

The image URL above has the format https://<CDN domain name>/<image path>.png.

In other words, this picture is obtained by accessing the CDN.

So, can we directly access object storage to obtain and display image data?

For example, like below.

https://<OSS domain name>/<image path>.png

This is like asking whether text data can be read and displayed directly from MySQL without going through Redis.

Of course it can.

That is how I used to serve the pictures on my blog.

But this costs more. The cost here can mean performance cost or usage fees. See the figure below.

You can see that requesting OSS directly costs almost twice as much as requesting OSS through the CDN. With my tight budget in mind, and to make the blog load pictures faster, I connected it to a CDN.

But seeing this, another question comes up.

In the screenshot above, the term "Back to Source" appears in the red box.

What is back-to-source?

When we visit https://<CDN domain name>/<image path>.png, the request lands on the CDN server.

But the CDN server is essentially a layer of cache, not a data source. Object storage is the data source.

When you access the CDN for the first time to get a picture, the CDN most likely does not have the data for it yet, so it has to go back to the data source, fetch the picture data, and store it on the CDN. This is back-to-source. The next time you access the CDN, as long as the cached copy has not expired, the request hits the cache and is answered directly, with no need to go back to the source.

So the access process becomes as follows.

So in what other situations does back-to-source occur?

Besides the case above, where the CDN does not have the data, an expired cache entry on the CDN also causes back-to-source.

In addition, even if the data is cached and has not expired, you can actively trigger back-to-source through the open API provided by the CDN, though we rarely get the chance to use it.

Users are actually unaware that back-to-source is happening, because when they read the image, all they can tell is whether the read succeeded.

Either way the data is read. The only difference is whether it comes directly from the CDN cache or is returned after the CDN goes back to the source and reads the object storage.

The difference between direct return with cache and return to source without cache
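The cases above (no cached copy, an expired copy, and an actively triggered refresh) can be modelled with a toy CDN node. This is a sketch under stated assumptions: `EdgeCache`, its `refresh` method, and the injected `clock` are hypothetical names, and a real CDN node is far more involved.

```python
import time
from typing import Callable


class EdgeCache:
    """A toy CDN node: serves from memory, goes back to the source on a miss."""

    def __init__(self, fetch_from_source: Callable[[str], bytes], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_from_source
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[bytes, float]] = {}   # path -> (data, expires_at)

    def get(self, path: str) -> tuple[bytes, str]:
        """Return (data, status) where status is "HIT" or "MISS"."""
        entry = self._store.get(path)
        now = self._clock()
        if entry is not None and now < entry[1]:
            return entry[0], "HIT"                 # cached and not expired
        data = self._fetch(path)                   # missing or expired: back to source
        self._store[path] = (data, now + self._ttl)
        return data, "MISS"

    def refresh(self, path: str) -> None:
        """Drop a cached object so the next request goes back to the source."""
        self._store.pop(path, None)
```

The caller receives the same bytes in both cases; only the status tells whether the source was contacted.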

So, is there a way for us to tell whether back-to-source has occurred?

Yes. Let's continue reading.

How to determine whether back-to-source occurs

Let’s take the object storage and CDN of a certain cloud as an example.

Suppose I want to request the following picture https://cdn.xiaobaidebug.top/image/image-20220404094549469.png

To view the HTTP headers of the response more conveniently, we can use Postman.

Use the GET method to request image data.

Then switch to the following tab to view the response header information.

View the response header

Back to source

At this point, the value of X-Cache in the response headers is MISS TCP_MISS. This means the cache was not hit, so the CDN went back to the source, fetched the data from OSS, and then returned it.

By now, the CDN should have this image cached. We can send the GET request again.

The value of X-Cache becomes HIT TCP_MEM_HIT, which means a cache hit.

This is how one particular cloud does it. Others, such as Tencent Cloud, are basically the same: you can almost always find the relevant information in the response headers.
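The same check can be scripted instead of done by hand in Postman. The sketch below uses Python's standard `urllib.request`; the helper names `is_cache_hit` and `fetch_x_cache` are made up for this example, and it assumes the provider reports the result in an `X-Cache` header whose value starts with HIT or MISS, as in the two responses above. Other providers may use a different header or format.

```python
import urllib.request


def is_cache_hit(x_cache: str) -> bool:
    """Interpret an X-Cache value such as "HIT TCP_MEM_HIT" or "MISS TCP_MISS"."""
    return x_cache.strip().upper().startswith("HIT")


def fetch_x_cache(url: str) -> str:
    """Send a GET request and return the X-Cache response header ("" if absent)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.headers.get("X-Cache", "")
```

For example, `is_cache_hit(fetch_x_cache(url))` called twice in a row on a freshly uploaded image would be expected to report a miss first and a hit afterwards, provided the header is present.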

Is using CDN definitely faster than not using it?

Having come this far, we can answer the question from the beginning of the article.

If you do not use a CDN and access the source site directly, the process is as follows.

Direct access to the source site

However, if a CDN is used and the CDN has no cached data, back-to-source is triggered.

Access through the CDN with back-to-source

This amounts to adding an extra CDN hop to the original process.

That is, when using a CDN, if a cache miss causes back-to-source, the request will be slower than without the CDN.
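A rough way to put numbers on this is to average the cost of hits and misses. The function below is a simplified model, not a measurement: `edge_ms` (time to reach the CDN node) and `source_ms` (time for the CDN node to fetch from the source) are hypothetical parameters, and the model ignores everything else.

```python
def expected_latency_ms(hit_ratio: float, edge_ms: float, source_ms: float) -> float:
    """Average latency through a CDN.

    A hit costs only the trip to the CDN node; a miss costs that trip plus
    the CDN's own trip back to the source.
    """
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * edge_ms + miss_ratio * (edge_ms + source_ms)
```

With made-up figures of 20 ms to the CDN node, 80 ms from the node to the source, and 80 ms for direct access, a request that misses costs 100 ms, which is worse than going direct, while a request that hits costs only 20 ms.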

A cache miss may mean that the data does not exist in the CDN at all, or that the data existed but later expired.

Both situations are normal and most of the time no action is required.

But in a few scenarios, we may need some optimizations. For example, suppose your source site has a major version update that changes the CDN domain name. The moment it goes live, all users request images through the new CDN domain name, and the new CDN nodes go back to the source for practically 100% of requests. In serious cases this may even bring down the object storage. In that situation, you may need to pick out the hot data in advance, pre-request it with a tool, and let the CDN load the hot data into its cache. For example, the CDN on a certain cloud has such a "refresh and preheat" function.

CDN refresh and preheat
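Picking out the hot data and pre-requesting it could look roughly like this. The sketch assumes you already have a list of requested paths from an access log; `hottest_paths`, `prewarm`, and the `request` callable are hypothetical names, and a provider's own preheat feature would normally replace the manual loop.

```python
from collections import Counter
from typing import Callable, Iterable


def hottest_paths(requested_paths: Iterable[str], top_n: int) -> list[str]:
    """Return the top_n most frequently requested paths, hottest first."""
    return [path for path, _ in Counter(requested_paths).most_common(top_n)]


def prewarm(paths: Iterable[str], request: Callable[[str], object]) -> int:
    """Request each path once through the new CDN domain so it gets cached."""
    warmed = 0
    for path in paths:
        request(path)      # e.g. an HTTP GET against the new CDN domain
        warmed += 1
    return warmed
```

Passing the request function in as a parameter keeps the sketch independent of any HTTP library and makes it easy to try out with a stub.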

Of course, you can also use a grayscale (canary) release: let a small number of users try the new version first so that their requests warm up the CDN, and then gradually open up the traffic.

The other possibility is that the data once existed in the cache but later expired. For hot data, the CDN cache time can be increased appropriately.

When should you not use a CDN?

From the description above, the biggest advantages of a CDN are that it can assign a nearby CDN node to users from all over the world, and that it accelerates repeated requests for the same file through caching.

This is perfect for scenarios like web pages and pictures. And because the underlying layer is object storage, any file object, such as a video, can be connected to a CDN and accelerated in the same way. The short videos we watch on some apps, for example, are served this way.

Now think about it the other way around, and a question arises.

When should you not use a CDN?

If you have a company intranet service, and the images and other files it requests are unlikely to be requested repeatedly, there is no need to use a CDN.

Note the two key points here: the service runs on the intranet, and the files are unlikely to be requested repeatedly.

  • An intranet service means you know where the requests come from and that the service has read permission on the object storage. If your object storage is also internal, it is probably in the same data center as your service, which is already very close. Connecting to a CDN brings none of the benefit of "assigning a nearby CDN node".
  • The images or other files are unlikely to be requested repeatedly. If you connect to a CDN, then almost every time you ask the CDN for an image, the CDN node will not have the data you want, so it has to go back to the object storage every time. Connecting to a CDN then just adds a layer of proxy, and the more proxy layers there are, the longer each request takes.

Regarding the second point, if you need a clear metric to convince yourself, here is one. As described above, the X-Cache field in the HTTP headers of the CDN response shows whether a request triggered back-to-source. Count the requests that did and divide by the total number of requests to get the back-to-source ratio. If that ratio is as high as 90%, for example, then why connect to a CDN at all?
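The ratio described above is simple to compute once the X-Cache values have been collected. This sketch assumes the values look like the ones shown earlier (starting with MISS or HIT); the function name `back_to_source_ratio` is made up for the example.

```python
def back_to_source_ratio(x_cache_values: list[str]) -> float:
    """Fraction of requests whose X-Cache header reports a miss."""
    if not x_cache_values:
        return 0.0
    misses = sum(1 for value in x_cache_values
                 if value.strip().upper().startswith("MISS"))
    return misses / len(x_cache_values)
```

Three misses out of four sampled responses give a ratio of 0.75, which would already be a strong hint that the CDN is mostly acting as an extra hop.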

Summary

  • For text data, we are used to using MySQL for storage and Redis for caching. But for file data, such as videos and pictures, we use object storage such as OSS for storage and a CDN for caching.
  • If back-to-source occurs when using a CDN, the request will actually be slower than without the CDN.
  • The biggest advantages of a CDN are that it can assign a nearby CDN node to users from all over the world, and that it accelerates repeated requests for the same file through caching. If your services and object storage are all on the intranet, and the file data is unlikely to be requested repeatedly, there is no need to connect to a CDN.
