vivo HTTPDNS end-to-end experience optimization practice

As mobile application usage grows day by day, DNS resolution, a key step in connecting to the Internet, faces higher demands. HTTPDNS has gradually become the mainstream solution in the industry thanks to its resistance to hijacking, precise scheduling, and resolution changes that take effect in real time. We built an end-to-end integrated vivo HTTPDNS solution. By optimizing the capabilities and architecture of its four major modules (the HTTPDNS SDK, the HTTPDNS server, the unified scheduling gateway, and full-link monitoring), we significantly improved the access experience of end-side services and supported efficient, stable business growth.

1. vivo HTTPDNS technical background

1.1 Why build an end-to-end integrated vivo HTTPDNS solution?

As the business grows rapidly, more and more applications are being integrated, and users expect an ever better experience from mobile applications. When accessing Internet services, however, users may run into several problems: requests are hijacked to illegitimate resources; resources fail to display properly; resources open slowly and users wait a long time; and the operator's DNS returns inaccurate resolution results. The common industry solution to these problems is HTTPDNS.

In this context, vivo began to explore and use HTTPDNS in 2017; however, it encountered the following problems during implementation:

  • First, when HTTPDNS is used as a backup DNS, it cannot solve the problem of domain names being blocked by mistake;
  • Second, there is no unified access solution across services, so the access cost is relatively high;
  • Third, the clients' network frameworks are not unified, making business adaptation difficult;
  • Fourth, commercial HTTPDNS offerings in the industry are relatively expensive, so services cannot use them at large scale;
  • Fifth, there is no unified full-link monitoring of user access, so end-side DNS resolution cannot be monitored in real time.

To solve these problems, we needed to build an end-to-end integrated vivo HTTPDNS solution.

1.2 How does HTTPDNS solve the core problems?

We built an integrated HTTPDNS solution to solve the core problems above and support business growth, focusing on three areas:

  • The first is unified SDK capabilities, which solve blocking issues, reduce service access costs, and improve user experience; the core capabilities are domain name resolution optimization, service connection optimization, and a unified access solution;
  • The second is HTTPDNS server capabilities, which reduce the cost of using HTTPDNS for the business and reduce DNS access latency; the core capabilities are server-side intelligent scheduling and multi-level caching;
  • The third is full-link monitoring capabilities, which provide unified, end-to-end monitoring of user access.

With these ideas in mind, let's look at the specific implementation.

2. vivo HTTPDNS platform practice

2.1 vivo HTTPDNS technical architecture

The vivo HTTPDNS platform provides an integrated HTTPDNS solution for the business. The overall architecture mainly includes four modules: HTTPDNS SDK, HTTPDNS server, unified scheduling gateway and full-link monitoring.

  • HTTPDNS SDK mainly provides DNS resolution and business connection optimization;
  • HTTPDNS server provides HTTPDNS high-performance API, cache library, proxy gateway and other capabilities;
  • HTTPDNS scheduling gateway provides DNS resolution strategy and DNS scheduling capabilities;
  • Unified monitoring provides capabilities such as client monitoring, server monitoring, quality monitoring, and regional monitoring.

On top of the platform's capabilities, the core flow is as follows. The HTTPDNS SDK is integrated into vivo mobile apps. The SDK first requests the HTTPDNS scheduling gateway to obtain the DNS policy and other configuration, then sends the domain name resolution request to the preferred DNS. If the preferred DNS runs into resolution failure, connection failure, domain name blocking, or domain name hijacking, the SDK sends the resolution request to the alternative DNS to obtain a correct resolution result and then initiates the connection. The SDK also caches resolution results and optimizes connections.

2.2 HTTPDNS SDK Optimization

The SDK carries the core capabilities of our integrated HTTPDNS solution. Architecturally, the underlying protocol layer supports the HTTP1.X, HTTP2.0, and QUIC transport protocols and the TLS1.1, TLS1.2, and TLS1.3 encryption protocols, along with TLS-based optimizations such as Session Ticket. The service layer provides DNS resolution, DNS caching, business connection establishment, DNS policy management, and other functions. The application layer provides API request, download, upload, and other capabilities. The SDK also provides full-link network monitoring.

Based on the capabilities and usage scenarios of the SDK, we have identified three key optimization directions for the SDK:

  • The first is domain name resolution optimization: optimizing the resolution strategy and the cache improves the DNS resolution success rate and reduces DNS resolution latency;
  • The second is service connection optimization: network diagnosis, network speed detection, long connection optimization, optimal routing, and QUIC connection racing improve the success rate of service access and reduce access latency, improving the experience of vivo mobile phone users;
  • The third is a unified access solution: the scheduling gateway and adaptation to multiple network frameworks reduce the cost of business access.

2.2.1 Domain name resolution optimization

We start with resolution strategy optimization. DNS resolution is the first step when users access vivo's Internet services. Traditional DNS resolution relies on the operator's LocalDNS, where resolution failures and inaccurate resolution results often occur. To address both problems, we divided DNS resolution optimization into three stages:

  • The first stage is the retry resolution strategy. If DNS resolution fails, an alternative DNS is used to retry resolution.
  • The second stage is the adaptive resolution strategy, which supports adaptive retry resolution when DNS resolution fails or connection fails.
  • The third stage is the IP priority/backup strategy, which is an IP priority or backup resolution strategy based on business scenarios.

First, the retry resolution strategy. Its core logic is to switch to another DNS and retry after a DNS resolution failure. vivo's current preferred DNS is the operator's LocalDNS; if LocalDNS resolution fails or times out, HTTPDNS is used to retry. This improves the resolution success rate while greatly reducing the cost of using HTTPDNS.
However, this strategy does not cover connection failures, and it is inflexible: changing the strategy requires the business to upgrade the SDK.
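
The retry logic described above can be sketched in a few lines. This is a minimal illustration, not the SDK's code: `resolve_with_retry`, `ResolveError`, and the two stand-in resolvers are hypothetical names, and the IP address comes from a documentation range.

```python
from typing import Callable, List, Sequence


class ResolveError(Exception):
    """Raised when a lookup fails or times out (hypothetical error type)."""


Resolver = Callable[[str], List[str]]


def resolve_with_retry(host: str, resolvers: Sequence[Resolver]) -> List[str]:
    """Try each resolver in order and return the first non-empty answer."""
    last_error = ResolveError(f"no resolver configured for {host}")
    for resolver in resolvers:
        try:
            ips = resolver(host)
        except ResolveError as exc:
            last_error = exc  # failed or timed out: move on to the next DNS
            continue
        if ips:
            return ips
        last_error = ResolveError(f"empty answer for {host}")
    raise last_error


# Stand-ins for the operator's LocalDNS and the HTTPDNS service.
def failing_local_dns(host: str) -> List[str]:
    raise ResolveError("LocalDNS timed out")


def fake_http_dns(host: str) -> List[str]:
    return ["203.0.113.10"]


ips = resolve_with_retry("example.com", [failing_local_dns, fake_http_dns])
print(ips)
```

Because the order of resolvers is fixed in code here, changing the preferred DNS would mean shipping a new build, which is exactly the inflexibility noted above.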

To solve these problems, we launched the adaptive resolution strategy. Building on the retry strategy, it also switches DNS and retries both resolution and connection when connection establishment fails. It also supports dynamic configuration of DNS policies, so the preferred and alternative DNS can be adjusted dynamically. The adaptive strategy further improves the user's access success rate. After the retry and adaptive resolution strategies were launched, the DNS resolution success rate increased by 2%.
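
As a rough sketch of the adaptive behavior, the function below treats a connection failure the same way as a resolution failure and takes the DNS order from configuration rather than from code. The names `adaptive_connect`, `dns_order`, and the resolver keys are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Optional, Sequence

Resolver = Callable[[str], List[str]]  # returns [] when resolution fails
Connector = Callable[[str], bool]      # returns True when the connection succeeds


def adaptive_connect(
    host: str,
    dns_order: Sequence[str],
    resolvers: Dict[str, Resolver],
    connect: Connector,
) -> Optional[str]:
    """Return the first IP that connects, switching DNS on any failure."""
    for name in dns_order:  # the order comes from dynamic configuration
        resolver = resolvers.get(name)
        if resolver is None:
            continue  # unknown DNS name in the policy: skip it
        for ip in resolver(host):
            if connect(ip):
                return ip
        # Resolution failed or every IP failed to connect: try the next DNS.
    return None


resolvers = {
    "localdns": lambda host: ["198.51.100.7"],  # resolves, but is unreachable
    "httpdns": lambda host: ["203.0.113.10"],
}
reachable = {"203.0.113.10"}
chosen = adaptive_connect(
    "example.com", ["localdns", "httpdns"], resolvers, lambda ip: ip in reachable
)
print(chosen)
```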

Can the resolution strategy be optimized further? What should happen if resolution or connection through the alternative DNS also fails? Can the DNS resolution process be shortened for scenarios with strict access latency requirements?

To answer these questions, we optimized the resolution strategy further. For cold start scenarios, where service access latency should be as short as possible, the IP priority strategy dynamically issues IP addresses to clients in advance so that they can establish connections directly, streamlining the DNS resolution process and further shortening access latency. For scenarios focused on success rate, the IP backup strategy takes over after the alternative DNS fails: IP addresses are dynamically issued to the service to establish connections, further improving the access success rate.

Based on the above optimization solution, the DNS resolution time in the IP priority scenario is reduced by 80%, and the business success rate in the IP backup scenario is further improved by 0.2%~0.4%.
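
The difference between the two modes can be shown with a small sketch. It assumes the issued IP addresses arrive as a per-domain mapping and that `mode` is a per-scenario setting; `candidate_ips`, `issued`, and the mode names are hypothetical.

```python
from typing import Callable, Dict, List


def candidate_ips(
    host: str,
    mode: str,
    issued: Dict[str, List[str]],
    resolve: Callable[[str], List[str]],
) -> List[str]:
    """Pick connection candidates under the 'priority' or 'backup' mode."""
    preset = issued.get(host, [])
    if mode == "priority" and preset:
        return list(preset)      # cold start: skip DNS resolution entirely
    resolved = resolve(host)     # normal path, including any DNS retries
    if resolved:
        return resolved
    if mode == "backup":
        return list(preset)      # every DNS failed: fall back to issued IPs
    return []


issued = {"example.com": ["203.0.113.10"]}
dns_down = lambda host: []
dns_ok = lambda host: ["198.51.100.7"]

print(candidate_ips("example.com", "priority", issued, dns_ok))
print(candidate_ips("example.com", "backup", issued, dns_down))
```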

That covers the optimization of the resolution strategy. Next is our exploration of cache strategy optimization. Caching is a core way to improve user experience, but a poorly used cache can have negative effects, so we made several attempts at optimizing the cache strategy, in three stages:

  • The first stage is to use a fixed cache strategy to directly cache the DNS resolved IP address, and the cache time is also a fixed value set in advance;
  • The second stage is the dynamic caching strategy, which is based on the caching strategy of domain name and network information. Different networks cache different addresses, and the caching time supports dynamic configuration.
  • The third stage is the optimistic caching strategy. For the resolution result, if the cache has expired, the result is first returned to the client to initiate a connection, and then DNS resolution is performed asynchronously, and the result of the asynchronous resolution is cached again.

The core strategy of the fixed cache strategy is to cache the results of DNS resolution. The next time the client resolves, it will directly use the cached address to initiate a connection, so as to achieve the optimization effect of reducing the DNS resolution delay. The cache is set with a fixed expiration time. After the cache expires, DNS resolution is performed. If the resolution is successful, it is cached again.
This strategy has the following main problems:

  • First, when a user switches networks, results cached on one network are used on another network;
  • Second, the cache time is fixed and the cache cannot be intervened in promptly, so when an exception occurs the failure lasts longer;
  • Third, cached entries for abnormal IP addresses cannot be handled, which also prolongs the failure.

Based on the above problems and pain points, we implemented a dynamic cache strategy; the core strategy is to optimize cross-network cache, cache timeliness and abnormal IP cache:

  • For cross-network caching, the user's network identification information is recorded in the cache. A cache entry is used only when both the domain name and the network identification match, which avoids cross-network cache hits;
  • For timeliness, each cache entry has an expiration time that supports dynamic configuration, which improves the flexibility of the cache;
  • For abnormal IP caching, when a connection attempt fails, the corresponding cache entry is cleared and the SDK switches DNS to retry resolution.
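
The three points above map naturally onto a small cache keyed by domain name plus network identifier. The following is a minimal sketch under stated assumptions: the class name, the `network_id` strings, and the injectable clock are illustrative and not taken from the SDK.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class DynamicDnsCache:
    """DNS cache keyed by (host, network id) with a configurable TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds  # can be updated from remote configuration
        self._clock = clock
        self._entries: Dict[Tuple[str, str], Tuple[List[str], float]] = {}

    def put(self, host: str, network_id: str, ips: List[str]) -> None:
        expires_at = self._clock() + self.ttl_seconds
        self._entries[(host, network_id)] = (list(ips), expires_at)

    def get(self, host: str, network_id: str) -> Optional[List[str]]:
        key = (host, network_id)
        entry = self._entries.get(key)
        if entry is None:
            return None  # never cached on this network
        ips, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: force a fresh resolution
            return None
        return ips

    def invalidate(self, host: str, network_id: str) -> None:
        """Clear the entry after a failed connection attempt."""
        self._entries.pop((host, network_id), None)


now = [0.0]  # fake clock so the example is deterministic
cache = DynamicDnsCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("example.com", "wifi:home", ["203.0.113.10"])
print(cache.get("example.com", "wifi:home"))     # hit on the same network
print(cache.get("example.com", "cellular:op1"))  # miss after a network switch
```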

Based on the above optimization strategies, the overall DNS resolution delay has dropped by more than 30%, with significant optimization effects.

Based on the dynamic caching strategy, we have made further optimizations and provided an optimistic caching strategy. In general, cached results can continue to be used after they expire. For the resolution results, if the cache has expired, we optimistically judge that the cache is valid, first return the cache to the client to establish a connection, then asynchronously initiate DNS resolution and re-cache. This strategy can further reduce the client's DNS resolution latency and improve the user access experience.
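
Optimistic caching resembles the familiar stale-while-revalidate pattern. The sketch below is one possible shape, assuming a synchronous `resolve` callable and a pluggable `run_async` hook (a background thread by default); all names are hypothetical.

```python
import threading
import time
from typing import Callable, Dict, List, Optional, Tuple


class OptimisticDnsCache:
    """Serve expired entries immediately and refresh them in the background."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
        run_async: Optional[Callable[[Callable[[], object]], None]] = None,
    ):
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._run_async = run_async or self._spawn_thread
        self._entries: Dict[str, Tuple[List[str], float]] = {}
        self._lock = threading.Lock()

    @staticmethod
    def _spawn_thread(task: Callable[[], object]) -> None:
        threading.Thread(target=task, daemon=True).start()

    def _refresh(self, host: str) -> List[str]:
        ips = self._resolve(host)
        if ips:  # keep the old entry if the refresh fails
            with self._lock:
                self._entries[host] = (ips, self._clock() + self._ttl)
        return ips

    def lookup(self, host: str) -> List[str]:
        with self._lock:
            entry = self._entries.get(host)
        if entry is None:
            return self._refresh(host)  # cold miss: resolve synchronously
        ips, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: return the stale answer now, re-resolve asynchronously.
            self._run_async(lambda: self._refresh(host))
        return ips


# Deterministic demo: a fake clock, and refresh tasks collected in a list.
now = [0.0]
answers = [["192.0.2.1"], ["192.0.2.2"]]
pending = []
cache = OptimisticDnsCache(lambda host: answers.pop(0), 60, lambda: now[0], pending.append)
first = cache.lookup("example.com")   # resolved synchronously
now[0] = 61.0
stale = cache.lookup("example.com")   # expired, but still served
pending[0]()                          # run the queued background refresh
fresh = cache.lookup("example.com")
```

The trade-off is that a client may briefly connect to an outdated address, so this pattern fits best alongside the abnormal IP handling described earlier.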

The above is our exploration and practice on domain name resolution strategy and domain name caching logic in domain name resolution optimization.

2.2.2 Business connection optimization

Next, we introduce the optimization of business connection establishment, starting with network diagnosis. The network connection quality detection module provides checks for network connectivity, WiFi authentication, signal strength, system networking policy, DNS, ping, and more. It diagnoses whether the user currently has network access, the network type, the signal strength, whether the network is restricted, and whether the accessed domain name can be pinged, providing data to support connection optimization in upper-layer applications.

The main usage scenarios of network diagnosis are to provide network detection functions for video playback and browser web page access; to diagnose the reasons why users have network but cannot connect; to provide user prompts and problem repairs based on the cause of the problem, and to improve the user experience of vivo mobile terminal users.

The second connection optimization capability is network speed detection. In data reading scenarios, the SDK collects statistics for each request and calculates both the global network speed and the speed of a single request in order to monitor network quality. For example, to obtain the global speed over the past x seconds, the SDK sorts the collected network quality records in reverse chronological order, sums the data transferred by the requests that fall within that period, and divides the total by the elapsed time to get the user's global network speed for the period.
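
The global speed calculation can be illustrated with a short function. It is a simplified sketch: it assumes each sample is a `(timestamp_seconds, bytes_read)` pair recorded when data was read, and it attributes all of a sample's bytes to that timestamp. The function name and sample format are assumptions.

```python
from typing import Iterable, Tuple


def global_speed(
    samples: Iterable[Tuple[float, int]], now: float, window_seconds: float
) -> float:
    """Average bytes per second across all requests in the last window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    window_start = now - window_seconds
    total_bytes = 0
    # Walk the samples newest first and stop once we leave the window.
    for timestamp, nbytes in sorted(samples, reverse=True):
        if timestamp > now:
            continue
        if timestamp < window_start:
            break
        total_bytes += nbytes
    return total_bytes / window_seconds


samples = [(1.0, 1000), (8.0, 2000), (9.5, 3000)]
speed = global_speed(samples, now=10.0, window_seconds=5.0)
print(speed)  # 1000.0 bytes per second: (2000 + 3000) / 5
```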

In video-on-demand scenarios, the player can use the detected network speed to switch video definition intelligently, keeping playback smooth and reducing the freeze rate.

The third connection optimization is the DNS optimal routing strategy, which selects the best address among the IP addresses resolved by multiple DNS strategies and uses it to initiate the connection. The SDK aggregates the DNS resolution results from the operator's LocalDNS, HTTPDNS, public DNS, IP direct connection, and other strategies; obtains the IP addresses for the current network state and combines them with records for the same network ID in the historical behavior library; ranks the IPs with an algorithm based on access success rate and access time; and then tries the ranked IP addresses in turn to establish a connection.
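
The source does not spell out the ranking algorithm, so the sketch below uses one simple assumption: sort by success rate first and by average latency second, with never-seen IPs placed last. `IpStats` and `rank_ips` are hypothetical names, and `history` stands for the records of the current network ID.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class IpStats:
    successes: int = 0
    failures: int = 0
    total_latency_ms: float = 0.0  # summed over successful attempts

    @property
    def success_rate(self) -> float:
        attempts = self.successes + self.failures
        return self.successes / attempts if attempts else 0.0

    @property
    def avg_latency_ms(self) -> float:
        if self.successes == 0:
            return float("inf")
        return self.total_latency_ms / self.successes


def rank_ips(candidates: Iterable[str], history: Dict[str, IpStats]) -> List[str]:
    """Order candidate IPs by success rate (high first), then latency (low first)."""
    unique = list(dict.fromkeys(candidates))  # merge duplicates across DNS sources

    def sort_key(ip: str):
        stats = history.get(ip, IpStats())
        return (-stats.success_rate, stats.avg_latency_ms)

    return sorted(unique, key=sort_key)


history = {
    "192.0.2.1": IpStats(successes=9, failures=1, total_latency_ms=900.0),
    "192.0.2.2": IpStats(successes=9, failures=1, total_latency_ms=450.0),
    "192.0.2.3": IpStats(successes=5, failures=5, total_latency_ms=100.0),
}
aggregated = ["192.0.2.3", "192.0.2.1", "192.0.2.2", "192.0.2.1", "192.0.2.4"]
ranked = rank_ips(aggregated, history)
print(ranked)
```

Ranking unseen IPs last is a conservative choice; a real implementation might instead probe them occasionally so that new addresses get a chance to build history.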

Using the optimal routing model can improve the success rate of short video playback and browser web page opening.

The fourth connection optimization is HTTP2 long connection optimization. Connection reuse in HTTP2 improves network performance and reduces latency, but in practice we also found some shortcomings. For example, if a connection is idle for a long time, a device along the network path may discard it, and some devices do not notify the client that the connection has been closed as the protocol standard requires. The client then reuses the connection for its next request, and because an intermediate device or the server has already discarded it, the request fails with a timeout exception.

To address the above issues, we have implemented the following optimization strategies:

  • The first is passive detection of long connection reuse. When a timeout exception occurs on a long connection, the SDK forcibly removes the connection from the connection pool, so that when the business retries, the SDK creates a new connection and avoids another timeout;
  • The second is active detection of long connection reuse. When a connection is reused, a ping frame is sent at the same time and the result is checked after 2 seconds. If no ping response has been received and no data has been transmitted on the connection within those 2 seconds, the SDK actively closes the connection and retries internally;
  • The third is end-side intelligent prediction. Passive detection only takes effect after a timeout exception has occurred, and active detection costs 2 seconds of probing, so a more intelligent way to decide whether to reuse a connection is needed to reduce the impact on business latency. The SDK predicts when a long connection will become unavailable and decides in advance whether to create a new one.

The core of end-side intelligent prediction is as follows: collect historical data on user requests within the current life cycle, including network type, request time, requested domain name, connection idle time, and exception information; use the historical request data to continuously narrow the data interval in which reuse timeouts may occur; clean the collected data, discard abnormal records, and extract features to form a data set; then make a combined judgment from the data set on whether to reuse the current connection or create a new one.
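
The actual prediction model is not described in detail, so the following is only a toy illustration of the "narrowing interval" idea. It assumes idle time is the single feature: it tracks the longest idle time after which reuse worked and the shortest idle time after which reuse timed out, and splits the uncertain range at its midpoint. `ReusePredictor` and the midpoint rule are assumptions.

```python
class ReusePredictor:
    """Decide whether an idle connection is likely still usable."""

    def __init__(self) -> None:
        self.max_ok_idle = 0.0            # longest idle time that reused fine
        self.min_bad_idle = float("inf")  # shortest idle time that timed out

    def record(self, idle_seconds: float, timed_out: bool) -> None:
        """Feed one observed reuse attempt; each one narrows the interval."""
        if timed_out:
            self.min_bad_idle = min(self.min_bad_idle, idle_seconds)
        else:
            self.max_ok_idle = max(self.max_ok_idle, idle_seconds)

    def should_reuse(self, idle_seconds: float) -> bool:
        if idle_seconds <= self.max_ok_idle:
            return True   # reuse has worked after an idle period this long
        if idle_seconds >= self.min_bad_idle:
            return False  # reuse has failed after an idle period this short
        # Uncertain range between the two bounds: split it at the midpoint.
        midpoint = (self.max_ok_idle + self.min_bad_idle) / 2
        return idle_seconds < midpoint


predictor = ReusePredictor()
predictor.record(idle_seconds=30.0, timed_out=False)
predictor.record(idle_seconds=120.0, timed_out=True)
print(predictor.should_reuse(60.0), predictor.should_reuse(90.0))
```

In practice one such predictor would likely be kept per network type or per domain, since middlebox idle timeouts differ between networks.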

The fifth connection optimization is QUIC connection racing, which the SDK supports. When racing is enabled and a user initiates access, the SDK attempts QUIC and HTTP connections and uses whichever wins the race: if QUIC wins, the QUIC connection is used; if HTTP wins, the HTTP connection is used. This improves the end-side access success rate.
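
Connection racing can be sketched with Python's standard `concurrent.futures` module. The connection attempts here are simulated with sleeps; `race_connections` and the stand-in functions are hypothetical, and a real client would also cancel or close the losing attempt.

```python
import concurrent.futures as cf
import time
from typing import Callable, Dict, Optional, Tuple


def race_connections(
    attempts: Dict[str, Callable[[], object]], timeout: float = 5.0
) -> Optional[Tuple[str, object]]:
    """Start all attempts in parallel and return the first one that succeeds."""
    if not attempts:
        return None
    pool = cf.ThreadPoolExecutor(max_workers=len(attempts))
    futures = {pool.submit(fn): name for name, fn in attempts.items()}
    try:
        for future in cf.as_completed(futures, timeout=timeout):
            if future.exception() is None:
                return futures[future], future.result()
        return None  # every attempt failed
    except cf.TimeoutError:
        return None  # nothing succeeded within the time limit
    finally:
        pool.shutdown(wait=False)  # do not block on the slower attempt


# Simulated connection attempts.
def fast_quic() -> str:
    time.sleep(0.01)
    return "quic-conn"


def slow_http() -> str:
    time.sleep(0.2)
    return "http-conn"


def blocked_quic() -> str:
    raise ConnectionError("UDP blocked on this network")


winner = race_connections({"quic": fast_quic, "http": slow_http})
fallback = race_connections({"quic": blocked_quic, "http": slow_http})
```

The second call shows why racing helps the success rate: when QUIC cannot connect at all, the HTTP attempt is already in flight instead of starting only after a QUIC timeout.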

In video playback scenarios, QUIC connection racing significantly improved the playback failure rate, the playback freeze rate, and slow-start cases; in weak network scenarios, the performance advantage of the QUIC protocol is particularly obvious.

The above is vivo’s exploration and practice in connection optimization strategy.

2.2.3 Unified access solution

Next, we introduce the unified access solution. The first part is the HTTPDNS scheduling gateway. All SDK configuration, including the DNS resolution strategy, cache strategy, and connection strategy, is managed and issued through the scheduling gateway, so changes to SDK configuration and policies do not require a new client release. This greatly improves the flexibility of the SDK. The scheduling gateway is accessed by domain name, which also avoids the situation where its IP is blocked.

The second part is network framework adaptation. vivo mobile applications use a variety of network frameworks, including OkHttp, Volley, HttpURLConnection, and Glide. The vivo HTTPDNS SDK has been adapted to these frameworks to meet the access requirements of various businesses and reduce the cost of business access.

2.3 HTTPDNS server optimization

Next, we introduce the architecture of the vivo HTTPDNS server. The HTTPDNS server mainly provides high-performance APIs, cache libraries, proxy gateways, and other capabilities; high-performance APIs provide intelligent resolution, authentication, cache query, and other capabilities; cache libraries provide multi-level cache, lazy update, and other capabilities; proxy gateways provide EDNS, intelligent scheduling, IP detection, and other capabilities. Through these capabilities, vivo users are provided with highly available, low-latency HTTPDNS resolution services. At the same time, an HTTPDNS management backend is also provided, supporting DNS management, system management, scheduling strategy management, authentication management, access management, and other capabilities.

The core capabilities of the server are intelligent scheduling and multi-level caching. With intelligent scheduling, the server obtains resolution results from multiple partners and caches the best results using asynchronous IP detection and other strategies; the SDK then obtains the best IP address from the server for business access. With multi-level caching, cache refresh changed from synchronous refresh to automatic asynchronous refresh of the first-level and second-level caches based on TTL expiration, which greatly improves server performance and also reduces the cost of using HTTPDNS.
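
The intelligent scheduling step can be pictured as "merge, probe, keep the best". The sketch below is a simplified, synchronous version under stated assumptions: `providers` are callables returning IP lists, `probe` returns a round-trip time in milliseconds or `None` when the IP is unreachable, and the `limit` of two results is arbitrary. In the server described above, detection runs asynchronously and the result is written to the cache.

```python
from typing import Callable, Dict, Iterable, List, Optional


def pick_best_ips(
    host: str,
    providers: Iterable[Callable[[str], List[str]]],
    probe: Callable[[str], Optional[float]],
    limit: int = 2,
) -> List[str]:
    """Merge answers from several providers and keep the fastest reachable IPs."""
    rtts: Dict[str, Optional[float]] = {}
    for provider in providers:
        for ip in provider(host):
            if ip not in rtts:       # probe each distinct IP only once
                rtts[ip] = probe(ip)
    reachable = [ip for ip, rtt in rtts.items() if rtt is not None]
    return sorted(reachable, key=lambda ip: rtts[ip])[:limit]


# Stand-in providers and probe results.
provider_a = lambda host: ["192.0.2.1", "192.0.2.2"]
provider_b = lambda host: ["192.0.2.2", "192.0.2.3"]
probe_results = {"192.0.2.1": 40.0, "192.0.2.2": None, "192.0.2.3": 15.0}

best = pick_best_ips("example.com", [provider_a, provider_b], probe_results.get)
print(best)
```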

2.4 HTTPDNS Visual Monitoring

The vivo HTTPDNS platform provides full-link visual monitoring. It tracks the timing and request status of user access from DNS resolution to completion of the entire request, which makes it possible to locate anomalies efficiently at each stage of a network request. It also provides regional monitoring and anomaly alerting at the province and operator level, closing the previous gap in monitoring of business access links. Based on this regional monitoring, optimization plans can be tailored to the network environments of different regions, and alerting allows anomalies to be detected and handled promptly.
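
Regional alerting at the province and operator level amounts to grouping reported results and comparing each group's success rate with a threshold. The sketch below assumes each record is a `(province, operator, success)` tuple; the threshold and minimum sample size are illustrative defaults, not values from the platform.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, bool]  # (province, operator, resolution succeeded)


def regional_alerts(
    records: Iterable[Record], threshold: float = 0.95, min_samples: int = 100
) -> Dict[Tuple[str, str], float]:
    """Return the success rate of every region/operator group below threshold."""
    counts = defaultdict(lambda: [0, 0])  # group -> [successes, total]
    for province, operator, ok in records:
        group = counts[(province, operator)]
        group[0] += 1 if ok else 0
        group[1] += 1
    alerts = {}
    for key, (successes, total) in counts.items():
        rate = successes / total
        # Ignore groups with too few samples to avoid noisy alerts.
        if total >= min_samples and rate < threshold:
            alerts[key] = rate
    return alerts


records = [
    ("province-a", "operator-1", True),
    ("province-a", "operator-1", True),
    ("province-a", "operator-1", False),
    ("province-b", "operator-2", True),
    ("province-b", "operator-2", True),
]
alerts = regional_alerts(records, threshold=0.9, min_samples=2)
print(alerts)
```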

2.5 HTTPDNS Business Effect

After these optimizations, the vivo HTTPDNS platform now covers more than 100 vivo mobile phone services and handles 1.5 billion HTTPDNS resolutions per day. Average client-side resolution latency dropped from 180 ms to 115 ms, a decrease of 36%. The server-side resolution success rate reached 99.5%, providing stable and reliable resolution services for the business. Server-side response time is about 4 ms, an industry-leading level. The server-side cache hit rate reached 90%, which reduces the cost of HTTPDNS and shortens DNS resolution response time.

In terms of success rate, the DNS resolution success rate rose from 97% before optimization to 99.85% after optimization, resolving nearly all DNS-related problems; the client access success rate rose from 97% to 99%. The user experience of vivo terminal applications improved significantly.

In terms of anti-hijacking, the iMusic domain name was hijacked in a certain region in February 2023, and monitoring showed that it was being resolved to an overseas address. Monitoring on the vivo HTTPDNS platform showed that the domain name resolved normally over HTTPDNS, with normal success and connectivity rates; CDN monitoring showed normal business traffic with no anomalies.

In terms of accidental domain blocking, in April 2023 the .com.cn root domain was mistakenly blocked on the telecom and mobile networks of a certain region, and DNS resolution returned 127.0.0.1. The vivo browser, short video, and other services connected to the vivo HTTPDNS platform were not affected, which protected the availability, brand image, and reputation of vivo services.

The above is the exploration and practice of the vivo HTTPDNS platform, as well as its specific performance after business access.

3. Summary and Outlook of vivo HTTPDNS

3.1 Summary of vivo HTTPDNS construction

Many practices of the vivo HTTPDNS platform are not described in detail here, and optimization for the business will continue. To summarize:

  • In terms of domain name resolution optimization, thanks to the optimization of DNS adaptive resolution strategy and IP priority/backup strategy, the DNS resolution success rate has been significantly improved.
  • In terms of cache optimization, the dynamic caching and optimistic caching strategies further reduce DNS resolution time and link access latency, cutting overall DNS resolution latency by more than 30%.
  • In terms of connection optimization, end-side network speed detection, optimal routing, the long connection prediction model, QUIC connection racing, and other strategies raised the client access success rate by 2 percentage points, from 97% to 99%.
  • The unified scheduling gateway serves as a bridge between the vivo HTTPDNS SDK and the server, connecting the end-to-end optimization strategies together.
  • Intelligent scheduling on the server side improves the resolution success rate of vivo HTTPDNS, and the server-side resolution success rate reaches 99.9%, further enhancing the access experience of end-side services.
  • Full-link monitoring provides end-to-end full-link monitoring capabilities for the vivo HTTPDNS platform, providing data support for business development.

3.2 Future Prospects of vivo HTTPDNS

Finally, a look ahead. We will continue to follow cutting-edge technologies for end-side optimization and, in the area of network acceleration, explore multi-channel acceleration and end-side scheduling optimization.

For multi-channel acceleration, we will explore acceleration over dual mobile networks, dual WiFi, or mobile network plus WiFi.

For end-side scheduling, we will explore proximity scheduling, intelligent addressing, and dedicated low-latency network acceleration to further improve the end-side user experience.

Experience optimization is the core embodiment of vivo's user orientation. Improving the user access success rate requires our continuous investment and continuous optimization. We will continue to explore new technologies and solutions for experience optimization together with the industry.
