So far we have looked at how to handle traffic pressure from the perspective of architectural design. User traffic for services such as live streaming is notoriously hard to estimate. Once traffic grows beyond what a single data center can carry, we have to schedule it dynamically, that is, distribute users sensibly across multiple data centers. As traffic keeps growing, the chance of network instability grows with it, and users only get a good experience if they reach a data center close to them.

With that in mind, this lesson focuses on the key technologies behind traffic scheduling and data distribution, so that you understand how to switch traffic between multiple data centers properly.

Live streaming traffic falls into two types: access to static files, and the live streams themselves. Both can be distributed through a CDN (Content Delivery Network), which effectively reduces the pressure on our own servers. For a read-heavy, write-heavy service such as live streaming, dynamic traffic scheduling and cached data distribution are the foundation for supporting online interaction among a large number of users. Their functions overlap with those of DNS (Domain Name System), however, so the two have to work together. For that reason, the discussion below introduces CDN where relevant.

DNS domain name resolution and caching

Switching service traffic is not as simple as it sounds, because we run into one big problem: the DNS cache.
DNS is the first step of every request we make. If DNS resolves slowly or incorrectly, it seriously degrades interaction in a read-heavy, write-heavy system.

So why is DNS slow to refresh? To answer that, we first need to understand the DNS resolution process, illustrated in the following figure:

[Figure: DNS resolution process]

When a client or browser initiates a request, the first service it contacts is DNS. Resolution proceeds in three steps:

1. The client queries the DNS resolution service provided by its ISP (Internet Service Provider), and the ISP's DNS service first queries a root DNS server;
2. The root DNS server points it to the DNS server for the top-level domain (.org in this example);
3. The top-level domain server points it to the domain's primary name server (the authoritative DNS).

Once the primary name server has been found, the domain name itself can be resolved. The primary name server is generally provided by the provider that hosts our domain; the resolution rules and the TTL (time to live) are configured in that provider's management system.

When the primary name server receives the query, it returns the entry IP address of the data center where our servers are located, together with a recommended cache TTL. Only then is the DNS query complete.

The ISP's DNS service then caches the result locally for the duration given by the TTL, and returns it to the client.
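The caching behavior described above can be illustrated with a minimal Python sketch. It is not how any particular ISP resolver is implemented; `CachingResolver`, `authoritative_lookup`, and the injected `clock` are hypothetical names introduced only for this illustration. The sketch keeps each answer until its TTL expires, which is also why a changed record stays invisible to clients until then.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class CachedAnswer:
    ip: str
    expires_at: float  # clock value from which the entry counts as stale


class CachingResolver:
    """Toy stand-in for an ISP resolver that honors the authoritative TTL."""

    def __init__(
        self,
        authoritative_lookup: Callable[[str], Tuple[str, float]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # authoritative_lookup(domain) -> (ip, ttl_seconds); hypothetical hook
        # standing in for the root -> TLD -> authoritative query chain.
        self._lookup = authoritative_lookup
        self._clock = clock
        self._cache: Dict[str, CachedAnswer] = {}
        self.upstream_queries = 0

    def resolve(self, domain: str) -> str:
        now = self._clock()
        cached = self._cache.get(domain)
        if cached is not None and now < cached.expires_at:
            # Still inside the TTL: answer from the cache, no upstream query.
            return cached.ip
        # Cache miss or expired entry: ask upstream and remember the answer.
        self.upstream_queries += 1
        ip, ttl = self._lookup(domain)
        self._cache[domain] = CachedAnswer(ip=ip, expires_at=now + ttl)
        return ip
```

With a TTL of 1800 seconds, a record that is changed at the authoritative server right after the first lookup is not seen through this cache until 1800 seconds have passed, even though the change itself was instant.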
Within the TTL of the ISP's DNS cache, any request for the same domain name is answered directly from that cache. Clients usually cache DNS results as well, and in practice many clients do not honor the TTL recommended by DNS but use a cache duration from their own configuration. The ISPs along the transmission path keep their own caches too.

If we change a domain's resolution, the fastest an update can take effect is the time the provider needs to refresh its own servers (usually about 3 minutes) plus the TTL. In practice it is often worse, because the client and ISP caches described above add further delay.

For this reason, many domain name resolution services recommend setting the TTL to less than 30 minutes, and many large Internet companies deliberately shorten the cache time on the client side. If the TTL is too short, however, resolution refreshes faster but service requests become unstable.

Thirty minutes is the ideal case. Experience shows that after a domain name is modified it normally takes about 48 hours for DNS caches nationwide to be updated, and 72 hours for caches worldwide. So do not change the primary domain name resolution unless there is really no alternative.

If you urgently need to refresh the DNS cache, you can buy a forced-push resolution service to refresh the DNS caches of the backbone ISPs. Be aware that this service is expensive and covers only the main lines of major cities; in some areas the refresh is still relatively slow.
The actual refresh speed depends on the broadband provider, but in general the service does speed up DNS cache refreshes to some extent.

Slow DNS refresh has caused us a great deal of trouble. If we perform a failover, for example, the switchover takes three days to complete, which is devastating for system availability. Fortunately, several technologies now compensate for this problem, such as CDN, GTM, and HttpDNS. Let's look at them one by one.

CDN Full-Site Acceleration

You may be wondering what speeding up DNS cache refresh has to do with CDN. Before explaining how CDN acceleration is implemented, let's first look at what CDN and full-site acceleration are.

Website acceleration matters a great deal for read-heavy, write-heavy systems. A typical CDN provides static file acceleration, as shown in the following figure.

When a user sends a request to the CDN, the CDN first tries to return the static resource from its local cache. If the resource is not cached locally, or if it is dynamic content such as an API endpoint, the CDN goes back to the source, that is, it fetches the resource from our server. The CDN then refreshes its local cache according to the expiry time our server returns with the resource.

This greatly reduces the load on our data center when serving static data and saves considerable investment in bandwidth and hardware.

Besides accelerating static resources, CDN also provides regional network acceleration through local CDN nodes, as shown in the figure below.
A CDN deploys acceleration data centers in the major provinces and cities and interconnects them with high-speed dedicated lines.

When a client sends a domain name resolution request to DNS, the DNS service in the client's province or city uses GSLB (Global Server Load Balancing) to return the IP address of the CDN data center closest to the user. This greatly reduces the number of network hops between users and data centers, which speeds up network responses and lowers the chance of requests being intercepted.

The path a client request takes is shown in the following figure.

When a user calls a dynamic interface of a site with full-site acceleration, the CDN node forwards the request over the CDN's internal network to the servers in our data center through the shortest and fastest link. Compared with a request that crosses several ISPs in other provinces before it reaches our server, forwarding through CDN nodes copes far better with slow networks and gives clients a better experience.

Once full-site acceleration is in place, all user requests are forwarded by the CDN: every domain name the client requests points to the CDN, which then passes the requests on to our servers.

If the data center changes the IP address it exposes to the CDN, we can speed up the DNS cache refresh by using the CDN's internal DNS service (provided by the CDN vendor) to refresh the DNS cache inside the CDN.
This way, DNS resolution on the client side does not change, there is no need to wait the usual 48 hours, and refreshing the domain becomes much more convenient. Because a cache refresh takes 48 hours, most Internet companies do not fail over between data centers by changing their DNS resolution configuration; they rely on CDN to achieve a similar effect.

However, a failure of the CDN entrance also has a large impact on the site. Outside China, anycast is used alongside CDN to reduce that risk. With anycast, the service entrances of several data centers share the same IP address, so when one entrance fails, the carrier routes the traffic to another data center. In China, anycast is not supported, for security reasons.

Entrance failure is not the only risk. If a request reaches the CDN, the content is not cached locally, and the origin site has failed at the same time, the back-to-source request cannot switch automatically between multiple data centers. To improve availability, we can therefore add GTM (Global Traffic Management) behind the CDN.

GTM Global Traffic Management

Before looking at how GTM and CDN are combined, let's go through how GTM works and what it does. The following diagram shows its working principle:

[Figure: how GTM works]

When a client requests a service domain name, it first asks the DNS service to resolve that domain name.
When the client asks the primary domain's DNS service to resolve the name, it is actually querying the intelligent-resolution DNS of the GTM service. Compared with traditional DNS resolution, GTM adds three functions: service health monitoring, multi-line optimization, and traffic load balancing.

Service health monitoring comes first. GTM monitors the status of the servers, and when it detects that a data center has stopped responding, it automatically switches traffic to a healthy one. On top of this, GTM provides failover: it moves part of the user traffic to other data centers according to their capacity and configured weights.

Next is multi-line optimization. China has several broadband carriers, such as China Mobile, China Unicom, China Telecom, and Education Broadband. Users usually get the best performance when they reach an entry IP on their own carrier's network; crossing carriers adds latency because requests have to be forwarded between networks. GTM can therefore choose a faster access path according to the CDN source of each data center.

Last is traffic load balancing. GTM distributes traffic according to monitored service traffic and request latency, so that client traffic is scheduled intelligently.

Combining GTM with CDN full-site acceleration works even better. The combination is shown in the figure below:

[Figure: GTM combined with CDN full-site acceleration]

HttpDNS Service

HttpDNS lets us bypass the DNS service provided by the local ISP. This effectively prevents DNS hijacking and avoids the DNS resolution refresh problem.
HttpDNS likewise provides GSLB (Global Server Load Balancing). It also supports custom resolution, which lets us run grayscale releases or A/B tests.

Normally, HttpDNS only solves service scheduling on the App side. When the client program uses HttpDNS, you also need a fallback prepared in advance, in case the HttpDNS service itself fails and resolution breaks. A reasonable fallback order is: HttpDNS first; if that does not work, a DNS service at a specified IP address; and finally the DNS service provided by the local ISP.

This arrangement greatly improves the security of client-side DNS. We can also enable DNSSEC to strengthen DNS security further. All of these options have to be weighed against our actual budget and the time and effort we can invest.

Note that HttpDNS is not free, and it is especially costly for large enterprises, because many HttpDNS providers charge by the number of resolution requests. To save costs we should reduce the number of requests: the App can cache DNS results using the client's network IP and hotspot name (Wi-Fi, 5G, 4G, and so on) as the cache key.

Self-implemented traffic scheduling

1. Limitations of HttpDNS Service and Business Scheduling

Although HttpDNS solves the problem of DNS pollution, it cannot take part in our business scheduling, so it offers only limited support when we need to manage and schedule traffic according to business needs.

2. Traffic scheduling service based on HttpDNS principle and common implementation methods

To give users a better experience, Internet companies have built their own traffic scheduling on the principles of HttpDNS. Many live streaming services, whose user traffic is hard to control, run a scheduling service of this kind. The usual implementation is simple: the client sends a request to the scheduling service, and the scheduling service assigns the client to a nearby data center.

3. Failover of Scheduling Services and How to Improve Their Availability

The scheduling service can also fail over between data centers. When a server cluster fails, the client's requests to that data center fail, hang, or slow down, and the client then actively queries the scheduling service. If the scheduling service has received an instruction to switch data centers, it returns the IP address of a healthy data center, which improves the availability of the service.

The scheduling service itself must also be highly available. It is deployed in multiple data centers, and these deployments synchronize users' scheduling results and policies through the strongly consistent Raft protocol.
For example, if a user asks the scheduling service in Data Center A and is assigned to the Beijing data center, then for a short period the user is still assigned to Beijing even when asking the scheduling service in Data Center B. The assignment changes, consistently everywhere, only when the client switches networks or our service data center fails.

4. Support of auxiliary data for dispatching service decision-making and related situations

To improve the client experience, we need to schedule each client to a data center that is nearby and responds best. The scheduling service needs auxiliary data to make that assignment: IP address, GPS location, network carrier, ping latency, actual playback quality, and so on.

The client collects this data periodically and reports it to the big data center for analysis. The results serve as recommendations for the scheduling service and help it decide which data center and which line a client should connect to.

Doing this amounts to implementing GSLB ourselves. The data behind a self-implemented GSLB is not perfectly accurate, however, because DNS resolution results often differ between provinces and cities. If the client cannot connect, it has to try the recommended IPs one by one to keep the service highly available.

5. Scheduling stability verification and live broadcast and video related scheduling functions

To verify that scheduling is stable, we can store the scheduling result on the client and send the current result in a request header with every request.
In this way, the server side can monitor whether clients are mistakenly sending requests to other data centers. When such a request is detected, the data center gateway can forward it by reverse proxy, much as CDN full-site acceleration does, so that the client stays stable.

Live streaming and video services need similar scheduling. If playback or a live stream is detected to stutter, the client should automatically switch the video source and report the incident to the big data center for logging and analysis. If stuttering is widespread, the big data center alerts our operations and R&D colleagues.

[Figure]

Domain names are the key entry point to our services. The first step of any request to a domain name is to resolve it to an IP address through DNS. Querying DNS too often slows down the service, so many clients and ISPs cache DNS results, and it is exactly this multi-level caching that makes a resolution change so hard to propagate.

Even if we pay to refresh the caches of several broadband carriers, some areas still need at least 48 hours before the cache is refreshed for most users. If we are forced to switch IPs for a special reason such as a site failure, the impact is catastrophic.

Fortunately, in recent years CDN, GTM, HttpDNS, and similar technologies have strengthened our ability to schedule traffic across multiple data centers. Note, however, that CDN and GTM mainly schedule at the data center level, and their scheduling is transparent to the business side.
Therefore, in scenarios that care more about user experience and have high concurrency, we often build a dedicated scheduling system ourselves. Its ideas are quite similar to those of HttpDNS and GSLB; the difference is that those are only basic services, whereas a system we build ourselves can also schedule user traffic quickly and effectively.

Letting users switch data centers and video streams in the HttpDNS style is undoubtedly convenient and simple: we only need to change the IP address in the request URL inside the App's request wrapper, and the data center switch is complete without affecting the business.
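To make the idea of a self-built scheduling system more concrete, here is a minimal Python sketch under simplifying assumptions. The names `DataCenter`, `rank_data_centers`, and `connect_with_fallback` are hypothetical, the IP addresses are documentation placeholders, and the ranking uses only region, carrier, and a health flag. A real scheduler would also weigh the ping, playback, and capacity data discussed earlier.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class DataCenter:
    name: str
    ip: str
    region: str
    carrier: str
    healthy: bool = True  # assumed to be maintained by health checks


def rank_data_centers(
    centers: Iterable[DataCenter], client_region: str, client_carrier: str
) -> List[DataCenter]:
    """Scheduler side: return healthy data centers, best match first."""

    def score(dc: DataCenter):
        # Tuples sort element by element and False sorts before True, so a
        # matching region ranks first, then a matching carrier; the name is
        # only a tie-breaker that keeps the order deterministic.
        return (dc.region != client_region, dc.carrier != client_carrier, dc.name)

    return sorted((dc for dc in centers if dc.healthy), key=score)


def connect_with_fallback(
    candidates: Iterable[DataCenter], try_connect: Callable[[str], bool]
) -> Optional[DataCenter]:
    """Client side: try the recommended IPs one by one, keep the first that works."""
    for dc in candidates:
        if try_connect(dc.ip):
            return dc
    return None  # every recommended data center was unreachable
```

In this sketch the client asks the scheduler for a ranked list once, then walks the list itself when a connection fails, so a data center switch only changes the IP the App puts into its request URL.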