The Internet is like this: Designing the distributed domain name resolution system for Bank G's data center

As financial institutions move toward multi-data-center architectures, the ability to fail over and recover quickly in a disaster becomes critical. A key industry requirement is to dispatch traffic between data centers, while keeping production data and storage consistent, so that business keeps running and recovers quickly. As the core support for this requirement, the domain name resolution system (DNS) has evolved from a simple mapping of domain names to IP addresses into a key hub for data center traffic management and scheduling. In recent years, failures of intranet domain name resolution systems in data centers have occurred repeatedly, causing large-scale network outages and business interruptions. How to prevent and respond to such failures, and how to design a high-availability architecture, disaster recovery mechanism, and emergency plan for the domain name resolution system, has become a focus for data center technology teams. This article discusses the design ideas behind the domain name resolution system of Bank G's data center.

This article moves step by step from system architecture to policy configuration to show how the architecture of Bank G's domain name resolution system took shape. Some DNS terminology is involved. For background, refer to the two introductory articles "A Brief Discussion on DNS Domain Name Resolution System" and "A Brief Discussion on DNS Domain Name Resolution System - Local DNS"; those basics are not repeated here.

1. Architecture Design

As the saying goes, “planning is the key to success”. When building a network-wide domain name resolution system whose failure can affect the operation of the entire data center, the first consideration is overall availability: a partial failure must not affect the system's ability to serve. This requires an architecture with low coupling and high redundancy, and ideally one that supports heterogeneous product deployment.

The DNS protocol standard is very flexible: root, authoritative, and recursive server roles can be deployed independently or on the same server. For large data centers, however, robustness and capacity call for decoupling the roles layer by layer and deploying each one independently and heterogeneously, so that the characteristics of each role are fully used. Industry research confirms this: large enterprise data centers mostly deploy root, authoritative, and recursive servers as separate layers. An example of the layered, decoupled architecture is as follows:

Figure: Root, authoritative, and recursive servers form separate layers, and each role is deployed across multiple data centers in a distributed manner.

01 Root Server

The root servers are deployed in a highly redundant distributed manner, with at least two root servers deployed in each data center to cope with the failure of a single device in the local data center, while providing data center-level redundancy capabilities.

02 Authoritative Server (Name Server)

The authoritative layer uses multi-level delegation (authorization). The root delegates to the authoritative servers of the first-level domain, which in turn delegate to the authoritative servers of the second-level domains, and so on. Each level can be scaled out horizontally by adding delegations. Second-level domains can be delegated to systems that need to manage their own names, such as the full-stack cloud platform, domain controllers, the intranet CDN platform, and subsidiaries.
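As a rough illustration of multi-level delegation, the sketch below walks a name from the root down to the most specific delegated zone. The zone names (`corp-internal.` and its children) and the `delegation_chain` helper are hypothetical and chosen only for this example; the article does not disclose Bank G's real domains.

```python
# Hypothetical zones; each entry stands for a delegation to a set of authoritative servers.
DELEGATIONS = {
    ".": "root servers",
    "corp-internal.": "first-level authoritative servers",
    "cloud.corp-internal.": "full-stack cloud platform DNS",
    "cdn.corp-internal.": "intranet CDN platform DNS",
}


def delegation_chain(qname: str) -> list[str]:
    """Return the zones visited from the root down to the most specific delegation."""
    name = qname.rstrip(".")
    chain = ["."]
    if not name:
        return chain
    labels = name.split(".")
    # Walk from the rightmost label towards the left, one level at a time.
    for i in range(len(labels) - 1, -1, -1):
        zone = ".".join(labels[i:]) + "."
        if zone in DELEGATIONS:
            chain.append(zone)
    return chain
```

For a name such as `app1.cloud.corp-internal`, the chain is the root, the first-level zone, and then the cloud platform's own zone, which is the point where that platform manages its names itself.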

03 Recursive Server (Local DNS Server)

Because office and production environments have different characteristics, recursive servers are deployed separately for each, providing domain name resolution for office terminals and production servers respectively. The office recursive servers mix different products, trusted and non-trusted, in a heterogeneous deployment.

Each recursive server node at Bank G is deployed as a load-balanced cluster. This design has several advantages:

  • Load sharing: Recursive servers face clients directly and carry heavy query load. Load balancing spreads that load and delivers high resolution performance.
  • Horizontal expansion: Clients must be configured with the recursive server IP directly, so even a short interruption while a recursive server is expanded, upgraded, or replaced can have a global impact. Publishing a load-balanced virtual address hides changes in the backend server cluster and simplifies later operation and maintenance.
  • Health check: The load balancer runs DNS-based health checks, detects resolution anomalies on recursive servers in real time, and isolates faulty servers automatically.
  • Security protection: Some load balancing products have DNS protection capabilities that can mitigate DDoS attacks and intercept abnormal resolution requests.
  • Heterogeneous deployment: The resource pool can combine products from different vendors to avoid product-level failures.
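The load sharing and health check points above can be sketched as a small pool that rotates over recursive servers and skips any member that fails a probe. This is a minimal sketch, not how any particular load balancing product works: `ResolverPool`, its `probe` callback (a stand-in for a DNS-based health check), and the member addresses are all assumptions made for illustration.

```python
from itertools import cycle
from typing import Callable, Iterator


class ResolverPool:
    """Round-robin over recursive servers, skipping members that fail a probe."""

    def __init__(self, members: list[str], probe: Callable[[str], bool]):
        self.members = members
        self.probe = probe  # stand-in for a DNS-based health check
        self._ring: Iterator[str] = cycle(members)

    def pick(self) -> str:
        # Try each member at most once per call so a fully failed pool raises.
        for _ in range(len(self.members)):
            candidate = next(self._ring)
            if self.probe(candidate):
                return candidate
        raise RuntimeError("no healthy recursive server behind the virtual address")
```

Clients only ever see the virtual address, so members can be added to or removed from `members` without any client-side change.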

2. Domain Name Planning

Domain name planning may be the key to the long-term sustainability of the domain name resolution system. A good domain name plan gives clear parent-child relationships between domains, separates management responsibilities, and allows capacity to be expanded and split flexibly. Three factors matter most in supporting the business needs of multiple data centers: separating dynamic from static names, separating internal from external names, and isolating risk by tenant.

01 Differentiation Between Static and Dynamic

"Dynamic" and "static" refer to dynamic and static domain names. In a high-availability architecture, the high availability of an application system is usually achieved through the domain name resolution system. The domain name server may return different IP addresses depending on where the client request comes from, rotate through multiple IP addresses (round robin) to balance load across active-active service IPs, or switch the resolved address to perform an active/standby switchover of the application. A domain name that maps to multiple IP addresses through the resolution system's health checks and intelligent resolution algorithms is called a dynamic domain name. A domain name that is simply bound to a fixed IP address is called a static domain name.

Dynamic domain names consume more resources on the domain name resolution system, including real-time health checks and intelligent resolution algorithms such as load balancing and topology-based selection. Architecturally, dynamic domain names may become a future performance bottleneck. To allow them to be deployed independently, dynamic domain names should use their own subdomains, for example for the intranet CDN system or the multi-AZ full-stack cloud system.

02 Distinguish Between Inside and Outside

"Inside and outside" refers to intranet domain names and Internet domain names. Servers in the Internet DMZ usually need to access both the intranet and the Internet. To avoid confusion between internal and external names, the intranet and the Internet should not share the same first-level domain name. A pure intranet environment should be completely isolated from the Internet and should not be able to resolve Internet domain names.

03 Risk Isolation and Tenant Differentiation

It is recommended that independent institutions, such as branches, the credit card business, or subsidiaries, use their own subdomains. If a single institution's business grows too fast, or the management structure is later adjusted, its domain can then be split off easily.

In summary, the overall domain name plan is as follows: delegation starts from the root domain ("."); a single unified first-level domain is used within the data center, with a name distinct from Internet domain names; second-level domains are split by organization; third-level domains are split by dynamic versus static use; and further subdomains can be split by function.
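To make the level-by-level plan concrete, the sketch below splits a fully qualified name into first-level domain, organization, usage, and function. Everything specific here is hypothetical: the labels `dyn` and `sta`, the example name, and the `classify` helper are assumptions for illustration, since the article does not publish Bank G's real naming.

```python
# Hypothetical usage labels for the third level of the plan.
USAGE_LABELS = {"dyn": "dynamic", "sta": "static"}


def classify(fqdn: str) -> dict[str, str]:
    """Split a name into first-level domain, organization, usage, and function."""
    labels = fqdn.rstrip(".").split(".")
    if len(labels) < 4:
        raise ValueError("expected at least <function>.<usage>.<organization>.<first-level>")
    first_level, organization, usage = labels[-1], labels[-2], labels[-3]
    if usage not in USAGE_LABELS:
        raise ValueError(f"unknown usage label: {usage}")
    return {
        "first_level": first_level,
        "organization": organization,
        "usage": USAGE_LABELS[usage],
        "function": ".".join(labels[:-3]),
    }
```

Under these assumed labels, `pay.dyn.card.gbank` would be a dynamic name for the `pay` function of the `card` organization.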

3. Strategy Design

01 Intelligent Resolution Mechanism

Intelligent resolution is the basic capability that lets data centers run services in multi-active mode. Put simply, it is the ability of the domain name resolution system to return the appropriate service IP address based on attributes of the user, combined with monitoring of service status. A typical usage scenario is "near-source access":

An application system provides the same service in two data centers, and each data center publishes its own service address. The requirement is to publish a single domain name. When users access this domain name, they reach the service node in the data center closest to their physical location. If that node fails, they are automatically directed to the service node in the other data center. Users are unaware of which data center serves them.

A typical HTTP/HTTPS application of "near-source access" is the content delivery network (CDN). "Near-source access" greatly reduces the bandwidth needed for communication between data centers and so reduces cost. Because the access path is short, it also speeds up user access and improves user experience. In this scenario, the domain name resolution system decides which data center's service address to return based on the source address of the user's request. At the same time, it continuously probes the service port of the application system through the health check mechanism. If a probe fails, the address is automatically isolated, and the next available address in the resolution order is returned, which achieves automatic switchover.
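The selection logic described above can be sketched as follows: match the request's source address to a site, return that site's service address if it is healthy, and otherwise fall back to the next site in the resolution order. The address ranges, site names, and the `resolve` function are hypothetical, and the set of healthy sites is passed in directly rather than produced by a real health check.

```python
import ipaddress

# Hypothetical source ranges and service addresses for two data centers.
SITE_BY_SOURCE = {
    ipaddress.ip_network("10.1.0.0/16"): "dc-a",
    ipaddress.ip_network("10.2.0.0/16"): "dc-b",
}
SERVICE_ADDR = {"dc-a": "10.1.100.10", "dc-b": "10.2.100.10"}


def resolve(source_ip: str, healthy: set[str]) -> str:
    """Prefer the site matching the source address, then any other healthy site."""
    src = ipaddress.ip_address(source_ip)
    preferred = [site for net, site in SITE_BY_SOURCE.items() if src in net]
    # Resolution order: the nearby site first, then the remaining sites.
    order = preferred + [s for s in SERVICE_ADDR if s not in preferred]
    for site in order:
        if site in healthy:
            return SERVICE_ADDR[site]
    raise LookupError("no healthy service address")
```

A request from `10.2.3.4` gets the `dc-b` address while `dc-b` is healthy and the `dc-a` address once `dc-b` has been isolated.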

Intelligent resolution therefore has two key points: (1) deciding which service address to return based on the source address, and (2) the health check mechanism. The health check mechanism is relatively simple and is not described in detail here. The main question is how to obtain the user's address. Readers familiar with how DNS works may spot a problem: when the roles of the domain name resolution system are deployed separately, the authoritative server sees only the address of the recursive server, not the real address of the user.

Technically, this problem can be solved in two ways. The first is to use EDNS (Extension Mechanisms for DNS), an extension of the original DNS protocol that enhances its functionality and flexibility. EDNS lets a requester advertise a UDP payload size larger than the original 512-byte limit (4096 bytes is a commonly configured value) and introduces additional option fields, so DNS messages can carry more kinds of information. Here we mainly use the ECS (EDNS Client Subnet) option. ECS is not a fixed field of EDNS; it is carried as one of the options inside the EDNS OPT resource record. When the recursive resolver forwards a request to the authoritative server, the ECS option carries address information about the client that initiated the request, so the authoritative server can learn where the user really is. The second solution is the common method currently used on the Internet: build an independent recursive server in each data center, so that the address of the recursive server represents the user's geographical location. The authoritative server then knows where a request comes from based on the recursive server's address.
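For readers who want to see what the ECS option looks like on the wire, here is a minimal sketch that encodes it following the layout in RFC 7871: option code 8, option length, address family, source prefix length, scope prefix length, and the address truncated to the prefix. The `encode_ecs` helper is written for this article and is not taken from any DNS product; it builds only the option bytes, not a full DNS message.

```python
import ipaddress
import struct

ECS_OPTION_CODE = 8  # EDNS Client Subnet option code (RFC 7871)


def encode_ecs(client_ip: str, source_prefix_len: int) -> bytes:
    """Encode an ECS option as carried in the RDATA of the OPT record of a query."""
    addr = ipaddress.ip_address(client_ip)
    family = 1 if addr.version == 4 else 2  # address family numbers: 1 = IPv4, 2 = IPv6
    # Keep only the network part, then only as many bytes as the prefix needs.
    network = ipaddress.ip_network(f"{client_ip}/{source_prefix_len}", strict=False)
    n_bytes = (source_prefix_len + 7) // 8
    addr_bytes = network.network_address.packed[:n_bytes]
    scope_prefix_len = 0  # set to 0 in queries; the responder fills it in
    payload = struct.pack("!HBB", family, source_prefix_len, scope_prefix_len) + addr_bytes
    return struct.pack("!HH", ECS_OPTION_CODE, len(payload)) + payload
```

Note that the option carries a subnet prefix rather than necessarily the full client address; with a `/24` prefix, the host part of `10.2.3.4` is dropped before the option is sent.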

From the perspective of architecture design, and considering how the products are deployed, Bank G currently adopts the second solution. The first solution has some advantages: it costs less, does not need many authoritative servers, and is not complicated to configure. The authoritative server can also see the user's address directly, and its resolution logs become a valuable data source for analyzing user behavior. But this solution has a fatal problem. Once source address insertion is enabled, the recursive server's cache largely stops helping: requests carry different source addresses, so an answer cached for one client cannot simply be reused for another, and the authoritative server has to make a separate decision for each one, otherwise nearby resolution is impossible. Losing the cache may increase the load on the authoritative server by hundreds or thousands of times. With intelligent resolution enabled at the same time, this can directly exhaust the authoritative server's performance and cause it to fail. The second solution is mature, simple, and easy to implement, with large-scale deployments on the Internet. Note, however, that every recursive server needs network access to all root and authoritative servers. In terms of cost, recursive servers must be deployed in multiple locations; low-end servers or virtual machines can be used where there are few users, and no recursive server is needed where intelligent resolution is not required.
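The caching cost of the first solution can be illustrated by counting how many requests have to go upstream within one TTL window. The sketch assumes, purely for illustration, that with source address insertion enabled the recursive server can reuse a cached answer only for the same client subnet; the `upstream_queries` function and the sample names are hypothetical.

```python
def upstream_queries(requests: list[tuple[str, str]], key_on_subnet: bool) -> int:
    """Count cache misses within one TTL window for (client_subnet, qname) requests."""
    seen = set()
    misses = 0
    for subnet, qname in requests:
        # With source insertion, an answer is only reusable for the same subnet.
        key = (subnet, qname) if key_on_subnet else qname
        if key not in seen:
            seen.add(key)
            misses += 1
    return misses
```

With 100 client subnets asking for the same name, the plain cache sends one query upstream per TTL window, while the subnet-keyed cache sends 100. The finer the source granularity, the closer the authoritative server gets to seeing every request.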

02 Cache Mechanism

The discussion of intelligent resolution mentioned the cache of the recursive server. In the architecture of a domain name resolution system, the cache strategy directly affects overall performance. Once caching is enabled, the recursive server no longer forwards a request to the authoritative server while the cached record is valid; it returns the cached result to the user directly. This greatly relieves the load on the authoritative server and also speeds up resolution.

Enabling caching is clearly necessary, but the cache time must be chosen carefully. If it is too short, it does little to reduce the load on the authoritative server. If it is too long, address changes propagate slowly and the intelligent resolution mechanism cannot take effect quickly. It is recommended to control the cache time through the TTL configured on the authoritative server. Different types of domain names can have different TTL policies. For important business domain names that need fast address switching, a lower TTL is recommended, and it can even be set to 0 for special applications. Business domain names that can tolerate some impact can use the general-purpose value.
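A TTL-driven cache of the kind described here can be sketched in a few lines. This is an illustrative model, not the behavior of any specific recursive server: `TtlCache` and its injected `clock` function are assumptions that make the expiry logic easy to follow and test.

```python
from typing import Callable, Optional


class TtlCache:
    """Cache answers until the TTL supplied with each record runs out."""

    def __init__(self, clock: Callable[[], float]):
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, qname: str, address: str, ttl_s: float) -> None:
        # The expiry time is fixed when the answer arrives from the authoritative server.
        self._entries[qname] = (address, self._clock() + ttl_s)

    def get(self, qname: str) -> Optional[str]:
        entry = self._entries.get(qname)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[qname]  # expired: the caller must query upstream again
            return None
        return address
```

In this model a record stored with a TTL of 0 is never served from the cache, which is why a TTL of 0 gives the fastest switching at the price of sending every query to the authoritative server.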

Considering the overall performance of the domain name resolution system and the requirements of most application systems, defining a general-purpose TTL reduces the cost of negotiating requirements with each application. Taking a health check failure timeout of 30 seconds and a TTL of 60 seconds as an example, the possible business impact time works out to 60 to 90 seconds. If that is less than the RTO of the business, the TTL is acceptable.
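The upper end of that estimate can be checked with simple arithmetic. The sketch assumes the worst case is a client that cached the failed address just before the health check isolated it, so the impact is the detection timeout plus one full TTL; the function names are hypothetical, and client-side caches are ignored.

```python
def worst_case_impact_seconds(detect_timeout_s: float, ttl_s: float) -> float:
    """Detection time plus one full TTL for a record cached just before isolation."""
    return detect_timeout_s + ttl_s


def ttl_is_acceptable(detect_timeout_s: float, ttl_s: float, rto_s: float) -> bool:
    """The TTL is acceptable when the worst-case impact stays below the business RTO."""
    return worst_case_impact_seconds(detect_timeout_s, ttl_s) < rto_s
```

With a 30-second detection timeout and a 60-second TTL, the worst case is 90 seconds, so a business with an RTO of 120 seconds can accept this TTL while one with an RTO of 60 seconds cannot.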

In addition, other caches on the client side can prevent a domain name switchover from taking effect, including caches in the operating system, middleware, Java components, and browsers. Administrators should evaluate every component that may cache domain names when deploying an application. Any such component should be configured either not to cache or to honor the domain name's TTL.

03 Domain Name Resolution Latency

Domain name resolution is the first step of network access and inevitably adds some latency. For server-to-server access, the added latency may affect latency-sensitive application systems, which is a common concern for application teams before they adopt domain name access. There are two main scenarios: (1) If the recursive server has the record cached, it returns the result directly. Because recursive servers are deployed close to clients, resolution latency is usually around 1-2 ms. (2) If the recursive server has no cached record, or the cached record has outlived its TTL, the recursive server starts a new iterative query. Depending on the number of iterations, latency may range from a few milliseconds to tens of milliseconds. With a TTL of 60 seconds, one slower resolution may occur every 60 seconds.

For long-connection applications, resolution latency matters only for the first resolution. Once the TCP connection is established, DNS is not queried again until the connection is dropped, so no further latency is added and the impact is relatively low. For short-connection, high-frequency applications, clients in every location use their local recursive server, which is deployed close to them. Latency rises only briefly after the cache expires, and average latency does not increase significantly.
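The claim that average latency barely moves can be checked with a small calculation. The sketch assumes one cache miss per TTL window and cache hits for every other query in that window; the function name and the sample latencies are illustrative values, not measurements from Bank G.

```python
def average_latency_ms(queries_per_ttl: int, hit_ms: float, miss_ms: float) -> float:
    """Average resolution latency when one query per TTL window misses the cache."""
    if queries_per_ttl < 1:
        raise ValueError("need at least one query per TTL window")
    return (miss_ms + (queries_per_ttl - 1) * hit_ms) / queries_per_ttl
```

For example, a client issuing 10 queries per second sends 600 queries per 60-second TTL window. With 1 ms hits and a 31 ms miss, the average is (31 + 599 × 1) / 600 = 1.05 ms, only 0.05 ms above the cached case.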

For latency-sensitive services, it is recommended to enable the "cache refresh" function on the recursive server. When a cached record expires, the recursive server proactively re-queries the authoritative server and updates the record, so that clients always get their answer from the cache. If the DNS product does not support cache refresh, you can add a health check for the domain name on the front-end load balancer and let its probes keep the cache refreshed.
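The idea behind cache refresh can be sketched as a periodic scan that picks out the records due for renewal, so they can be re-queried before a client has to wait. Actual products implement this differently; `names_to_refresh` and its `lead_s` parameter (how far ahead of expiry to refresh) are assumptions made for this example.

```python
def names_to_refresh(expiry_by_name: dict[str, float], now: float, lead_s: float = 0.0) -> list[str]:
    """Return cached names that have expired or will expire within lead_s seconds."""
    return sorted(
        name for name, expires_at in expiry_by_name.items() if expires_at - now <= lead_s
    )
```

A background task would call this on a timer and re-query the authoritative server for each returned name, storing the fresh answer with its new expiry time.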

For client/server (C/S) access, if resolution latency or resolution failure has a significant impact on the application, consider using HTTPDNS to obtain resolution results directly from the authoritative server over HTTP. This requires some modification on the client side.

04 Disaster Recovery Strategy

The disaster recovery strategy of the overall domain name resolution system is mainly considered from the two aspects of architecture and performance.

In terms of architecture, the root and authoritative servers are deployed in a distributed manner, so the failure of any single node does not affect the system as a whole. Recursive servers sit in a resource pool behind a load-balanced cluster. The load balancer uses domain name resolution probes as its health check, so a single unavailable device is isolated automatically. The load balancers themselves are deployed as multiple clusters to keep the load balancing layer available. Heterogeneous deployment is considered for the root, authoritative, and recursive servers to avoid product-level failures. The currently popular "dual-plane" deployment uses two sets of heterogeneous products to build two fully equivalent environments, one active and one standby, and switches directly to the standby environment when the active one has problems. Under a distributed architecture this is not very meaningful. If there are two heterogeneous environments, both can serve at the same time without wasting resources on a hot standby. Moreover, if the two environments have equal performance and capacity, the standby environment is just as helpless when the active one runs into a performance problem. It is better to use both at once and double the overall performance.

In terms of performance, the system should be designed with enough capacity to meet development needs for the next five years. Enabling the recursive server cache greatly reduces the load on the authoritative server and ensures that the domain name resolution system has sufficient performance headroom. When the load on the authoritative server becomes too high, first identify the domain name causing the surge, then raise that domain name's TTL so that it stays longer in the recursive server (LDNS) cache, which relieves the authoritative server. In an emergency, consider turning off intelligent resolution altogether to free the server resources it consumes, and serve domain names through static resolution.
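Why raising the TTL relieves the authoritative server can be seen from a rough upper bound. The sketch assumes each recursive cache re-queries a given name at most once per TTL and never more often than clients ask for it; `authoritative_qps` and the sample figures are hypothetical.

```python
def authoritative_qps(client_qps: float, recursive_caches: int, ttl_s: float) -> float:
    """Rough upper bound on queries per second reaching the authoritative server for one name."""
    if ttl_s <= 0:
        return client_qps  # nothing is cached, every client query goes upstream
    # Each cache refreshes the name at most once per TTL, and never more often than clients ask.
    return min(client_qps, recursive_caches / ttl_s)
```

With 40 recursive caches and 10,000 client queries per second for one name, a 60-second TTL bounds the upstream load at about 0.67 queries per second, and raising the TTL to 300 seconds cuts that bound by a factor of five.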

4. Domain Name Security Design

01 Security Threat Defense

In the face of security threats such as DNS flood attacks, DNS pollution and DNS covert tunnels, we have taken a series of monitoring and protection measures to ensure the security of the system.

  • A DNS flood attack is a DDoS attack that sends a large number of DNS requests in a short time to exhaust the performance of the domain name resolution system. It is relatively rare on the intranet, where a flood is more likely caused by a misconfigured application generating a burst of queries when it goes live. Such cases can be handled in the same way as performance reaching its limit. The domain name resolution system should monitor both resolution logs and resolution traffic, and a domain name resolution monitoring mechanism should be established. When query volume for a domain name spikes, the querying client address and the queried domain name can be located promptly, and the issue can then be handled by shutting down the client or adjusting the domain name's TTL.

  • DNS pollution, also known as DNS cache poisoning or DNS poisoning, works by delivering a forged authoritative response before the genuine one arrives, tampering with the resolved IP address so that the domain name requested by the user resolves to an address chosen by the attacker. This type of attack can be prevented by enabling the anti-poisoning function on the recursive server. When enabled, the recursive server randomizes the domain name strings in its iterative queries, making it difficult for attackers to forge matching response messages.
  • A DNS covert tunnel is not an attack on the domain name resolution system itself, but a technique that abuses DNS queries to transmit intranet data to the Internet. Because the intranet is completely isolated from the Internet and cannot resolve Internet domain names, it carries no DNS tunnel risk; this type of attack mainly occurs at the Internet boundary. Protection mainly relies on full-traffic security devices or dedicated DNS protection devices that monitor DNS request traffic. Analyzing DNS resolution logs can also help locate risky clients.
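The log-based monitoring mentioned in the flood attack item above can be sketched as a simple count over resolution log entries, which is enough to surface the client and the domain name behind a sudden spike. The log format (a list of client address and queried name pairs) and the `top_talkers` helper are assumptions; real resolution logs carry more fields and would be processed as a stream.

```python
from collections import Counter


def top_talkers(log: list[tuple[str, str]], n: int = 3):
    """Return the n busiest client addresses and the n most queried names."""
    clients = Counter(client for client, _ in log)
    names = Counter(qname for _, qname in log)
    return clients.most_common(n), names.most_common(n)
```

Running this over a short time window and comparing against a baseline gives the two facts the response needs: which client to shut down and which domain name's TTL to adjust.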

02 Analysis of Other Control Measures

Do I need to close DNS TCP port 53?

It is recommended that firewalls between zones block TCP 53. The DNS protocol specifies both TCP 53 and UDP 53 as DNS service ports. Under normal circumstances DNS communicates over UDP; when a DNS message is too large for UDP, as in a zone transfer or a truncated response that the client retries, it is sent over TCP. Day-to-day DNS use, however, consists only of domain name requests and address responses, with no need for large messages. By contrast, abnormal behaviors such as DNS covert tunnels, malicious domain name resolution, and stealing domain name information through zone transfers do require larger messages. Therefore, if domain names are only used in the normal way, it is recommended that firewalls block TCP 53.

Do I need to enable DNSSEC?

It is not recommended. DNSSEC adds digital signatures to DNS records so that resolvers can verify that responses are authentic and unmodified, which improves DNS security and effectively prevents DNS hijacking and DNS pollution. Note that DNSSEC signs responses but does not encrypt them. The signing and validation work reduces the overall performance of DNS and makes it more prone to performance exhaustion, and signed responses are larger, which adds bandwidth overhead. It may be necessary to open TCP 53 to carry the larger messages, which introduces other security risks. Because queries and responses remain readable, traffic analysis products and packet capture still work; the concern that covert tunnel monitoring and packet-capture troubleshooting go blind applies to encrypted DNS transports such as DNS over TLS or DNS over HTTPS rather than to DNSSEC itself. Signed zones also bring key management and signature maintenance work. In short, enabling DNSSEC introduces more risk and management difficulty than it removes in this environment, so it is not recommended.

5. Conclusion

The domain name resolution system of Bank G's data center supports efficient traffic scheduling and active-active services among multiple data centers through the principles of low coupling and high redundancy. With the promotion of full-stack cloud, big data, and artificial intelligence technologies, the domain name resolution system will face the need for larger-scale domain name requests and traffic scheduling. Bank G's domain name resolution system will also keep pace with the times, meet the challenges brought by these emerging technologies through continuous technology updates and architecture upgrades, and achieve continuous and secure operations.


Author | Zhang Lin

We focus on the research and application of network layer 4 to 7 technologies and related security technologies. We are the best at applications in the network field and the best at networks in the application field. We strive to shine for another 20 years.