7.2 Our computer room is disconnected from the Internet! What should I do?

1. Background

At 10:04 on July 2, 2024, the public-network physical optical cable serving Room A of our site was cut, making Room A unreachable from the public Internet. This article analyzes the problems exposed by this failure, and the resulting governance and optimization measures, from the perspective of DCDN architecture and multi-active governance.

2. Mitigation Process

After the fault occurred, SREs and network engineers received a large number of dedicated-line interruption and public-network probe alarms, and quickly convened an online meeting to coordinate fault localization and mitigation.

During this period, core services (such as home-page recommendations and playback) were unaffected, because automatic room-level disaster recovery for the origin site had already been configured on the DCDN side.

We first located heavy packet loss on a single carrier's line, and as a priority redirected that carrier's user traffic to CDN dedicated-line nodes that return to the origin over a dedicated line. Traffic for those users recovered, but the business as a whole was not yet fully restored.

We then found that Room A was entirely unreachable from the public network, while the core business scenarios in Room B were carrying increased traffic because automatic disaster recovery had taken effect, and the observed business SLOs were normal. We therefore decided to switch the traffic of all site-wide multi-active businesses to Room B. At this point the multi-active businesses had been mitigated, but non-multi-active businesses were still impaired.

Finally, we downgraded non-multi-active business traffic and redirected it to the CDN dedicated-line nodes for back-to-source. At this point the non-multi-active traffic had also been mitigated.

3. Problem Analysis

Figure 1: North-South traffic architecture diagram/0702 fault logic diagram

Figure 2: B2-CDN ring network diagram

Let's first briefly introduce B station's origin-site architecture. As Figure 1 shows, B station's online business has two core computer rooms, each with two Internet access points (public-network POPs), and the two access points of each room are located in different provinces and cities. The core idea of this design is to decouple network access (hereinafter "POP") from the compute centers (hereinafter "computer room"), so that access-layer failures can be tolerated.

At the same time, as Figure 2 shows, to improve the stability and efficiency of back-to-source traffic from our self-built CDN nodes to the core origin rooms, we designed and built the B2-CDN ring network, which lets the edge L1 & L2 self-built CDN nodes return to the origin over the ring and enriches the paths by which business can fetch data from edge nodes to the core origin. The original intent of the B2-CDN ring was to give each L1 & L2 self-built CDN node more options when handling cold, warm, and hot flows at the edge, and to explore edge scheduling methods better suited to B station's business characteristics. At the bottom layer, the ring achieves a full mesh among nodes via Layer-2 MPLS-VPN; on top of that, each node interconnects with the core origin rooms through Layer-3 routing protocols (OSPF, BGP). Each business also retains the ability to return to the core rooms over the public network, as a backup back-to-source path in case the B2-CDN ring itself suffers an extreme failure.

B station's interface requests are mainly accelerated back to the origin through DCDN. DCDN nodes come in two types: public-network nodes that return to the origin over the public Internet, and dedicated-line nodes that return over intranet dedicated lines. Under normal conditions, public-network nodes can reach the origin through either of the dual public POPs, while dedicated-line nodes use the intranet lines. DCDN also health-checks the origin and automatically removes origin IPs that probe as abnormal: for example, if a node's back-to-source request through POP A fails, it retries through POP B. Because public-network nodes can cross between the dual POPs, the disaster-recovery plan takes effect automatically for packet loss or an interruption toward a single origin POP, with almost no business impact.
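The health-check-and-retry behavior described above can be sketched as follows. This is a minimal illustration, not B station's actual implementation; the names (`OriginPOP`, `pick_origin`) are made up:

```python
class OriginPOP:
    """One public-network entrance (POP) toward an origin computer room."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy  # result of the latest health-check probe

def pick_origin(pops):
    """Return the first POP that passes health checks, emulating automatic
    removal of abnormal origin IPs; raise if every POP toward the room is down,
    which is the dual-POP failure this incident hit."""
    for pop in pops:
        if pop.healthy:
            return pop
    raise RuntimeError("all origin POPs unreachable - room-level failover required")

pops = [OriginPOP("POP-A", healthy=False), OriginPOP("POP-B")]
print(pick_origin(pops).name)  # POP-B: a single-POP failure is absorbed by retry
```

With one POP down the retry absorbs the failure transparently; only when both POPs fail (the `RuntimeError` branch) does mitigation have to move up to room-level traffic switching or the dedicated-line escape path.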

In this failure, however, both POPs toward Room A failed, which is equivalent to Room A being disconnected from the public network. Unlike a single-POP failure, the usual mutual disaster recovery between the dual POPs could not take effect. Apart from the few core business scenarios protected by the pre-configured room-level disaster-recovery strategy, multi-active services without automatic disaster recovery had to perform room-level traffic switching to stop the loss. The DCDN dedicated-line nodes, which return to the origin over the B2-CDN ring and were unaffected by this failure, ultimately became the escape route for non-multi-active services.

Looking back at the whole mitigation process, we found the following problems:

  • Under an extreme computer-room network failure, fault demarcation was slow and the emergency plan was incomplete;
  • Some multi-active businesses still required manual traffic switching to stop the loss. Can this be made faster, or even automatic?
  • How can non-multi-active businesses proactively escape when a computer room's entrance fails?

4. Optimization measures

In response to the problems encountered in this failure, we re-evaluated the contingency plans and improvement measures for single-room failures. The overall mitigation plan for multi-active services is consistent and focuses on the efficiency of automatic disaster recovery and manual traffic switching, while non-multi-active services need multiple escape paths: returning to the origin through DCDN intranet dedicated-line nodes, or cross-room forwarding through the API gateway.

Contingency plan for extreme computer-room network failures

As mentioned above, the origin site has two public-network POPs plus a dedicated line, so logically, even if any two of these entrances fail, there is still a chance to keep the business available. We therefore took the following measures:

  • Expand the compute capacity and scale of the DCDN dedicated-line nodes, to improve their carrying capacity in extreme situations;
  • Build a dispatch plan for dual-public-POP egress failures that groups domain names by DCDN node type and supports quickly switching non-multi-active domain names to dedicated-line nodes. Since multi-active domains can stop the loss by switching rooms, they are not dispatched to dedicated-line nodes, so as not to add load to those nodes;
  • Improve fault-demarcation efficiency: the reporting links of important monitoring have been optimized, decoupled from the business links, and deployed with disaster recovery on the public cloud. The network topology panel has also been improved to clearly display the status of each link, together with better alarming and display, to speed up problem localization.
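The grouping logic of the dispatch plan above can be sketched as follows. The domain names and the plan structure are hypothetical, purely for illustration:

```python
# Dual-POP-failure dispatch plan: multi-active domains are mitigated by
# room-level traffic switching and stay on public-network nodes; only
# non-multi-active domains are rescheduled onto DCDN dedicated-line nodes,
# keeping the load on those nodes as small as possible.

MULTI_ACTIVE = {"app.example.com"}          # can fail over between rooms
NON_MULTI_ACTIVE = {"legacy.example.com"}   # served from a single room only

def dispatch_on_dual_pop_failure(domain):
    if domain in MULTI_ACTIVE:
        # mitigate via room switch; no need to consume dedicated-line capacity
        return {"node_type": "public", "action": "switch_to_room_B"}
    # escape through the B2-CDN ring: dedicated-line back-to-source
    return {"node_type": "dedicated_line", "action": "reschedule"}

print(dispatch_on_dual_pop_failure("legacy.example.com"))
print(dispatch_on_dual_pop_failure("app.example.com"))
```

The key design choice is that dedicated-line capacity is reserved for traffic that has no other escape path.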

Figure 3: DCDN traffic scheduling architecture: daily state/disaster recovery state

Continuing multi-active construction and regular drills

Figure 4: Simplified diagram of the same-city multi-active architecture

At present, our site's business mainly uses a same-city multi-active architecture. As shown in Figure 4, we logically divide multiple computer rooms into two availability zones, each carrying 50% of daily traffic. The multi-active architecture is layered as follows:

  • Access layer:
    • DCDN: north-south traffic control; hash-routes requests based on user-dimension information to origin rooms in different availability zones, with automatic zone-level disaster recovery;
    • Layer-7 load balancing / API gateway: north-south traffic control, supporting interface-level routing, timeout control, same-zone/cross-zone retries, circuit breaking, rate limiting, client-side flow control, etc.;
  • Service discovery / service governance components: fine-grained east-west traffic control; the framework SDK prefers calls within the same availability zone and supports service- and interface-level traffic scheduling;
  • Cache layer: mainly Redis Cluster and Memcache, accessed through Proxy components; cross-zone synchronization is not supported, so independent deployments are required in both zones; eventual consistency is maintained by subscribing to database binlogs, and pure-cache scenarios need transformation;
  • Message layer: production and consumption are in principle closed within an availability zone; Topic-level cross-zone message synchronization is supported, and the three consumption modes Local/Global/None fit different business scenarios;
  • Data layer: mainly MySQL and KV storage in master-slave synchronization mode; Proxy components are provided for business access, supporting multi-zone reads, nearest-zone reads, routing writes to the master, forced master reads, etc.;
  • Management and control layer: the Invoker multi-active control platform, supporting multi-active metadata management, north-south/east-west traffic switching, DNS switching, plan management, and multi-active risk inspection.
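The access-layer idea of pinning users to a zone by hashing, with automatic zone-level fallback, can be sketched as below. The hashing key and zone names are assumptions for illustration, not the production scheme:

```python
import hashlib

ZONES = ["AZ-1", "AZ-2"]  # two availability zones, each ~50% of daily traffic

def route_zone(user_key, healthy_zones=ZONES):
    """Stable hash routing: the same user-dimension key always lands in the
    same zone on the daily path; if that zone is unhealthy, fall back to a
    surviving zone (the automatic zone-level disaster recovery)."""
    h = int(hashlib.md5(user_key.encode()).hexdigest(), 16)
    preferred = ZONES[h % len(ZONES)]
    return preferred if preferred in healthy_zones else healthy_zones[0]

print(route_zone("user:12345"))                          # daily state: stable zone
print(route_zone("user:12345", healthy_zones=["AZ-2"]))  # disaster-recovery state
```

Stability of the hash matters: it keeps a user's session and cache locality inside one zone day to day, while the `healthy_zones` fallback expresses the disaster-recovery state in Figure 5.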

For businesses that have completed the multi-active transformation, we built a multi-active control platform that uniformly maintains their multi-active metadata and supports north-south and east-west traffic-switching control. The platform maintains traffic-switching plans and supports rapid switching for a single business, multiple businesses, or the whole site. It also provides multi-active risk inspection, routinely checking risks in multi-active traffic ratio, business capacity, component configuration, and cross-room calls, and supports the governance and operation of those risks.

With plans maintained and risks governed in advance, we regularly run north-south traffic-switching drills for single businesses and business combinations, to verify the resource load of the service itself, the components it depends on, and its downstream dependencies (capacity, rate limits, and so on), and to routinely ensure that the multi-active setup remains effective, switchable, and disaster-tolerant.

Room-level automatic disaster recovery

For core services in scenarios with strong user awareness, a room-level disaster-recovery strategy for the origin site is configured on the DCDN side: when a single origin room's entrance fails, traffic is automatically routed to the other room to stop the loss.

Automatic disaster recovery was not fully configured for all multi-active services by default: it prioritized main scenarios such as home-page recommendations and playback, with the remaining scenarios switched according to resource-pool water levels. At present, the average CPU utilization of our resource pool exceeds 35%, and the average peak CPU utilization of online services approaches 50%. We have sorted out the single-room resource requirements for switching the entire site's business. Multi-active switching will also be linked with the platform to adjust HPA strategies, and a rapid elastic-scaling plan for the resource pool has been prepared to keep resources healthy at scale. Going forward, we will configure automatic disaster-recovery strategies for more user-sensitive scenarios such as community interaction, search, and user spaces, so that room-level failures can be mitigated directly by disaster recovery, without manual intervention.
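The capacity reasoning behind those utilization figures can be made explicit with a back-of-the-envelope check: if each room carries half the traffic, a full-site switch roughly doubles the surviving room's load. The safety ceiling below is an illustrative threshold, not our real policy:

```python
def can_absorb_full_site(peak_cpu_util, safety_ceiling=0.85):
    """Rough check before a full-site room switch: doubling traffic roughly
    doubles CPU, so projected load must stay under a safety ceiling.
    Returns False when scale-out (HPA / elastic plan) is needed first."""
    projected = peak_cpu_util * 2
    return projected <= safety_ceiling

print(can_absorb_full_site(0.50))  # False: at ~50% peak, must scale out first
print(can_absorb_full_site(0.35))  # True: at ~35% average, headroom suffices
```

This is why the switch is linked with HPA adjustment and a rapid elastic plan: at ~50% peak utilization, the surviving room cannot absorb full-site traffic without scaling out.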

Figure 5: Multi-active business north-south traffic architecture: daily state/disaster recovery state

Non-multi-active traffic escape

Some services are not yet deployed multi-actively across rooms; only one room can serve their traffic. In the original design, such non-multi-active traffic could only return to Room A, and could not cope with a failure of Room A's public-network entrance. As in this incident, non-multi-active traffic could not be mitigated by switching rooms, and had to rely on downgrading to CDN dedicated-line nodes.

To cope with failures of a single room's public-network entrance, Layer-4 load balancing, or Layer-7 load balancing, we plan to configure origin-level automatic disaster recovery for non-multi-active rules on the DCDN side, and to merge and unify the routing configuration of multiple rooms and clusters on the Layer-7 SLB, ensuring that during a failure, non-multi-active traffic can enter Room B and be routed to the API gateway. On the API gateway side, we determine whether an interface is multi-active and forward non-multi-active traffic over the intranet dedicated line to achieve traffic escape.
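The gateway-side escape decision can be sketched as follows. The interface paths and return values are hypothetical, for illustration only:

```python
# During a Room A entrance failure, all traffic enters Room B. The gateway
# serves multi-active interfaces locally, and forwards non-multi-active ones
# back to their home room over the intranet dedicated line (traffic escape).

MULTI_ACTIVE_APIS = {"/x/feed", "/x/playurl"}  # hypothetical interface list

def route_request(path, local_room="B", home_room="A"):
    if path in MULTI_ACTIVE_APIS:
        return f"serve in room {local_room}"  # multi-active: handle locally
    # non-multi-active: only the home room can serve it; cross over intranet
    return f"forward to room {home_room} via intranet dedicated line"

print(route_request("/x/feed"))
print(route_request("/x/legacy"))
```

The per-interface check is what lets a single unified entrance in Room B serve both classes of traffic during the disaster-recovery state in Figure 6.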

Figure 6: North-South traffic architecture for non-multi-active services: daily state/disaster recovery state

5. Summary

Failures at the level of a single computer room are a severe test of the completeness and effectiveness of the multi-active transformation, and fault drills are necessary to verify it. In the second half of the year, we will continue to focus on multi-active risk governance, and beyond regular traffic-switching drills we will also launch north-south and east-west network-disconnection drills.
