The ancestral motto of data center operation and maintenance is "no trouble, no failure"

The saying "no trouble, no failure" is crude, but it makes sense, especially in operation and maintenance. According to statistics from consulting agencies, 70% of data center failures are man-made, that is, directly tied to human activity, which shows how much risk people pose to a data center. Man-made failures can be divided into intentional and unintentional. Intentional failures are caused by people who know that an operation will bring down the data center but carry it out anyway, usually hoping to achieve some ulterior motive by paralyzing the data center. This type of failure accounts for 80% of man-made failures, and the rest are unintentional.

A data center is a large, complex system, and operation and maintenance personnel cannot be proficient in every technical detail. When they work in an unfamiliar area, their operations can easily produce unexpected results. Many devices also run low-quality software, and repeatedly operating on them or issuing commands to them can trigger software problems that interrupt business. This situation is not uncommon. A data center holds tens of thousands of devices, and touching them invites problems. A stable data center should therefore not be changed lightly; it should be left to run in its best state.

As we all know, around major festivals and events, large data centers freeze the network and suspend all operations and changes in order to reduce failures, the risk of human error, and the risk of triggering bugs. The method is effective: apart from some hardware failures, other types of problems rarely occur during these periods.
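The practice of stopping all operations around major events can be sketched as a simple calendar check that a change tool runs before doing anything. This is only an illustration: the window dates and the names `FREEZE_WINDOWS`, `is_frozen`, and `guard_change` are assumptions, not part of any real tool.

```python
from datetime import date

# Hypothetical freeze windows as inclusive (start, end) pairs; the dates are placeholders.
FREEZE_WINDOWS = [
    (date(2025, 1, 28), date(2025, 2, 4)),
    (date(2025, 10, 1), date(2025, 10, 7)),
]


def is_frozen(day, windows=FREEZE_WINDOWS):
    """Return True if `day` falls inside any freeze window."""
    return any(start <= day <= end for start, end in windows)


def guard_change(day, description, windows=FREEZE_WINDOWS):
    """Refuse a planned change that lands inside a freeze window."""
    if is_frozen(day, windows):
        raise RuntimeError(f"freeze in effect on {day}: {description!r} rejected")
    return f"{description} allowed on {day}"
```

Calling `guard_change` at the top of every change script turns the freeze from a memo into something that is actually enforced.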

Turtles are famously long-lived, and folk wisdom credits this to how rarely and how slowly they move. Data center operation and maintenance likewise favors stillness over motion: moving less, and more carefully, minimizes failures. Data centers in the banking industry have very high reliability requirements. To avoid failures, banks have established strict operating rules: all operations must follow unified specifications, and every command that is issued or changed must be reviewed by the bank in advance and even verified in a simulated environment before it is applied to the live network. Banking data center operations are among the most standardized, which is why their reliability is among the best.

However, to respond quickly to business needs and improve resource utilization, operation and maintenance work has to be done frequently; avoiding it is basically impossible. A data center may have changes scheduled every night, along with equipment software upgrades, configuration optimization, equipment replacement, and so on. The changes never end, and they inevitably introduce new problems, so the data center never fully stabilizes and the business is frequently affected. This runs directly against the ancestral motto of operation and maintenance.

A data center demands a vast amount of technical knowledge, spanning dozens of disciplines. No one can master all of them, and fully mastering even one is difficult. When an operation is planned, limited knowledge always leaves some oversights, and any omission may cause problems during execution. No one can be absolutely sure about a change. Accidents can happen in anything, just as in surgery: even the smallest operation carries risk, and family members must sign a consent form that releases the surgeon from liability if an accident occurs.

Since we cannot avoid the trouble, we should find ways to keep it from causing failures.

First, divide and conquer. This means separating high-risk from low-risk, important from less important, simple from complex, and frequently changing from rarely changing. In the final analysis, it comes down to two things: encapsulating complexity and isolating change. Divide and conquer at the operation and maintenance architecture layer is already common in the industry: application servers are separated from database servers, transaction databases from user databases, and production environments from test environments. A data center is made up of many small systems, which should be loosely coupled and isolated from one another, so that when one small system fails, the impact stays local and does not spread to the whole.
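The effect of isolation can be illustrated by computing which systems a single failure reaches. The sketch below assumes a hypothetical map from each system to the systems that depend on it; the system names and the function `blast_radius` are invented for illustration.

```python
from collections import deque


def blast_radius(failed, dependents):
    """Return every system that directly or transitively depends on `failed`."""
    affected = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in affected:
                affected.add(nxt)
                queue.append(nxt)
    return affected


# Tightly coupled: both applications share one database.
coupled = {"shared_db": ["trading_app", "user_app"]}

# Divided: the transaction database and the user database are separate.
isolated = {"trading_db": ["trading_app"], "user_db": ["user_app"]}

print(sorted(blast_radius("shared_db", coupled)))    # ['shared_db', 'trading_app', 'user_app']
print(sorted(blast_radius("trading_db", isolated)))  # ['trading_app', 'trading_db']
```

In the coupled layout one database failure takes down both applications; in the divided layout the same kind of failure stays within the transaction side.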

Second, manage people. To reduce failures caused by human intervention, the constraints on and management of people must be strengthened. People at different technical levels should have different operating permissions, and a novice who wants to operate on the live network must be guided by an experienced engineer. Detailed personnel rules and regulations are needed to constrain, assess, monitor, and manage operation and maintenance personnel, to strengthen their sense of responsibility, and to reward and penalize them accordingly. Data centers generally have to provide service around the clock, all year round, so staff must be given enough rest: they should start and finish work on time and avoid long, fatiguing shifts, which lowers the probability of error.
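The rule that permissions follow technical level, with novices supervised on the live network, could be encoded roughly as follows. The three levels and the name `may_operate_online` are assumptions made for this sketch; a real data center would define its own grades and map them onto its access-control system.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical skill grades; a higher value means more experience.
    NOVICE = 1
    ENGINEER = 2
    SENIOR = 3


def may_operate_online(operator, supervisor=None):
    """Engineers and above may act alone; a novice needs a SENIOR supervisor."""
    if operator >= Level.ENGINEER:
        return True
    return supervisor is not None and supervisor >= Level.SENIOR


print(may_operate_online(Level.NOVICE))                # False
print(may_operate_online(Level.NOVICE, Level.SENIOR))  # True
print(may_operate_online(Level.ENGINEER))              # True
```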

Third, manage things. When a change or optimization is needed, the operation and maintenance team should discuss it as a whole, analyze the foreseeable risks, and make sure the operation will not affect running business. Each change is then a decision reached by the entire technical team rather than one individual's act, which minimizes failures caused by technical mistakes. A rollback plan should be prepared so that any abnormal situation can be rolled back immediately; the cause should then be analyzed before a second attempt. After all, operation and maintenance personnel are not equipment specialists and may not fully understand how the equipment processes things internally. For major changes, the equipment manufacturer's technical staff can be invited to participate and provide support, reducing the risk of operational errors. Every operation must be fully prepared, with the necessary simulation drills, advance business migration, and emergency channels in place, to reduce the risk of failure.
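A change that always carries its rollback plan with it can be sketched as a small wrapper: apply the change, check that the business is still healthy, and roll back immediately if anything looks abnormal. The function `run_change` and the toy configuration are hypothetical; in practice `apply`, `verify`, and `rollback` would call the real tooling.

```python
def run_change(apply, verify, rollback):
    """Apply a change; roll back at once if it raises or verification fails."""
    try:
        apply()
        healthy = verify()
    except Exception:
        healthy = False
    if not healthy:
        rollback()
        return "rolled_back"
    return "applied"


# Toy example: an MTU change that fails its health check.
config = {"mtu": 1500}
previous = dict(config)  # snapshot taken before the change

result = run_change(
    apply=lambda: config.update(mtu=9000),
    verify=lambda: False,  # pretend monitoring reported an anomaly
    rollback=lambda: config.update(previous),
)
print(result, config)  # rolled_back {'mtu': 1500}
```

The wrapper only restores service. Analyzing the cause before a second attempt remains a task for the team.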

"No trouble, no failure" is a golden saying. It sounds reasonable, but it is difficult to do in reality. The data center is a place where data flows at high speed. Business needs are changing all the time. In order to meet the needs of business deployment and development, it is impossible not to change and trouble the data center. "No trouble" is just an ideal state. However, it is indeed necessary to actively reduce the frequency of data center operations as much as possible and move as little as possible, which can greatly reduce the probability of failure. People are the most important factor in data center activities. Without human participation, there would be no data center. However, people also bring growth troubles to the data center. People still play a vital role in the operation and maintenance process. As the operator of the data center, you must always keep in mind the teachings of your ancestors.
