What the frequent failures of Internet data centers can teach us

Internet outages have come one after another recently, with several in May alone. NetEase's backbone network was attacked, which seriously affected its gaming business. Alipay was down for 2 hours after a transmission fiber was dug up. Then Ctrip's network was down for nearly 12 hours because of a personnel error. These incidents share a pattern: each was caused by human or external environmental factors, not by the data center equipment itself. Earlier statistics point the same way, attributing 80% of data center failures to human factors. Many failures can therefore be avoided by managing people better rather than by improving the technology, and most of the causes come from outside the data center rather than inside it. To keep a data center running stably and without interruption, operators need to anticipate the failures that can arise in its external environment and use them to test the data center's response measures.

Every data center is ultimately connected to an operator's backbone network. The data center does not control that network, so it has to consider how backbone failures will affect it. Backbone networks are generally built as rings: if one side of the ring is cut, service switches immediately to the other side. In practice, however, both sides can fail at once, for example because of a router failure at a backbone node, a broadcast storm, or a network attack. Whatever the cause, the result is that the link to the data center is cut, and the data center itself can do nothing about it. Without a good backup system it can only wait. Depending on how quickly the operator handles the fault, the outage may last for hours, which seriously affects the data center's business. This is especially true for large data centers such as Alipay's, where transaction volume runs to tens of millions per hour. The loss from an outage of that length is severe, and it can often cause the data center's rating to drop by two levels, because a high-reliability data center is expected to have no hour-long business interruption all year.

To address this, some people have proposed off-site data centers: data centers are established in multiple locations so that they can back up one another's services. When the data center carrying a service fails, the service is switched directly to another data center. This keeps the business running, and the switch can even be automatic and invisible at the business level. The switch itself is simple and is done mostly by adjusting routing: traffic destined for the original data center is routed to a healthy, reachable data center instead. The switch can be triggered automatically by protocol detection, so that when detection fails, the route changes. Technical measures can keep packet loss during the switch at zero or very low, so the business level does not notice it.

Many large data centers are now building disaster recovery centers so that even a disaster-level accident does not affect the business. Disaster recovery centers and off-site data centers serve a similar function: both exist mainly to back up the business of the existing data center. Building a data center purely for disaster recovery is very expensive, however, and the facility sits idle most of the time, so only data centers with deep pockets do it. Multi-data center backup is more common: data centers in several locations all work at the same time, and when one fails, its business is switched to the others and shared among them. As long as the original design includes a certain amount of redundancy, the remaining data centers can keep the business unaffected for a short period. Once the faulty data center is repaired, the business is switched back.

In recent years, disaster recovery centers, remote data centers, and multi-data center solutions have all received widespread attention. None of them is technically difficult to implement, yet few are actually deployed, mainly because of cost. A disaster recovery center may not be used even once a year, which makes it a huge waste of investment. The multi-data center solution is easier to accept: several remote data centers share the business of the faulty one, so each only needs to reserve a certain amount of processing capacity.
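
The multi-data center approach can be illustrated with a small Python sketch. It is only a model under stated assumptions, not a description of how any real data center implements failover: the `Site` class, the three-missed-probes threshold, and the rule of sharing load in proportion to spare capacity are all hypothetical choices made for illustration. The sketch marks a site as down after repeated failed health probes, then checks whether the surviving sites have reserved enough capacity to absorb its traffic.

```python
from dataclasses import dataclass

# Assumption: a site is declared down after this many consecutive failed probes.
FAIL_THRESHOLD = 3


@dataclass
class Site:
    name: str
    capacity: float          # hypothetical maximum load the site can process
    load: float              # load the site is currently carrying
    missed_probes: int = 0   # consecutive failed health probes


def record_probe(site: Site, ok: bool) -> None:
    """Update the failure counter from one health-probe result."""
    site.missed_probes = 0 if ok else site.missed_probes + 1


def is_up(site: Site) -> bool:
    return site.missed_probes < FAIL_THRESHOLD


def redistribute(sites: list[Site]) -> dict[str, float]:
    """Return the per-site load after moving traffic off failed sites.

    The orphaned load is shared among healthy sites in proportion to
    their spare capacity. Raises RuntimeError if the reserve is too small.
    """
    healthy = [s for s in sites if is_up(s)]
    failed = [s for s in sites if not is_up(s)]
    orphaned = sum(s.load for s in failed)
    spare = {s.name: s.capacity - s.load for s in healthy}
    total_spare = sum(spare.values())
    if orphaned > total_spare:
        raise RuntimeError("not enough reserved capacity to absorb the failure")
    plan = {s.name: 0.0 for s in failed}
    for s in healthy:
        share = orphaned * spare[s.name] / total_spare if total_spare else 0.0
        plan[s.name] = s.load + share
    return plan


if __name__ == "__main__":
    sites = [Site("A", 150, 100), Site("B", 150, 100), Site("C", 150, 100)]
    for _ in range(FAIL_THRESHOLD):
        record_probe(sites[0], ok=False)   # site A stops answering probes
    print(redistribute(sites))
```

In this example each of three equal sites runs at two thirds of its capacity, which is exactly enough for the other two to absorb one failure. More generally, with N equally sized sites, each has to stay at or below (N-1)/N of its capacity to survive the loss of any single site.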

We put great effort into data center design, but we often overlook human factors. A data center's business systems are very complex, and the more complex a system is, the more likely people are to cause failures in it. Whether intentional or not, human-caused failures should be avoided as far as possible, and the best way to do that is to tighten control over operators. While the data center is running, equipment parameters should not be adjusted or changed lightly. Access control lists give people at different levels different access to different devices. Commands typed on a device must also pass command-line authorization, and commands that are not authorized cannot be executed.

Even with these management measures, human-caused failures cannot be eliminated. Business adjustments, for example, often make it unavoidable to change the data center's operating parameters, and how well this goes depends on the technical ability of the operator. Some operators understand the business well and can quickly issue the correct instructions, while others make mistakes and cause business abnormalities. Such unintentional failures can be largely reduced by training operators more thoroughly or by assigning the work to experienced personnel.

Intentional failures are more subtle and harder to avoid, since outsiders rarely know what a person is thinking. A data center holds a large amount of private data, and someone who obtains it can even profit illegally. Others may simply want revenge. Failures caused by such deliberate actions are hard to prevent even with management systems in place. It is like those air crashes that were eventually confirmed to have been caused by the captain: because we cannot know what was going on in the captain's mind, this kind of failure is the hardest to detect and avoid. To reduce it, we can only appeal to staff on a personal level, raise their professional ethics, and remind them regularly of the risks. Data center failures rarely cause casualties, but they can cause serious financial losses, often far more than a single staff member could ever compensate. Making staff aware of the serious consequences of such actions helps reduce intentional failures.
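
The per-level access control and command authorization described earlier in this section can be sketched in a few lines of Python. This is a minimal illustration, not the mechanism of any particular device or vendor: the role names, the command keywords in `PERMISSIONS`, and the `CommandGate` class are all hypothetical. Recording every attempt in an audit log is an addition made by this sketch and is not something the article itself describes.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from operator level to the command keywords it may run.
PERMISSIONS = {
    "viewer": {"show"},
    "operator": {"show", "ping"},
    "admin": {"show", "ping", "configure", "reload"},
}


@dataclass
class CommandGate:
    """Checks each typed command against the operator's level before it runs."""

    audit_log: list = field(default_factory=list)

    def authorize(self, user: str, role: str, command: str) -> bool:
        words = command.split()
        # Only the first keyword is checked; unknown roles get no permissions.
        allowed = bool(words) and words[0] in PERMISSIONS.get(role, set())
        # Assumption of this sketch: every attempt is recorded, allowed or not.
        self.audit_log.append((user, role, command, allowed))
        return allowed


if __name__ == "__main__":
    gate = CommandGate()
    print(gate.authorize("alice", "operator", "show interfaces"))
    print(gate.authorize("alice", "operator", "configure terminal"))
    print(gate.audit_log)
```

Here an operator-level user may inspect the device but is refused when attempting a configuration change, which is the behavior the authorization rule above is meant to enforce.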

Failures weigh heavily on data centers, but they are inevitable: every data center has experienced problems large and small. The key is to take preventive and remedial measures that limit the impact. What a data center should fear is not failure itself but failure with no repair mechanism, or failure that is not repaired in time, because then even a small fault can grow into a major accident. Stay cautious about data center failures and keep the alarm bell ringing.
