The saying "no trouble, no failure" is crude, but it makes sense, especially in operations and maintenance. According to statistics from consulting agencies, 70% of data center failures are man-made, that is, directly tied to human activity, which shows how dangerous people can be to a data center. Man-made failures can be divided into intentional and unintentional ones. Intentional failures are caused by people who know that an operation will bring down the data center but insist on doing it anyway, often hoping to achieve some ulterior motive by paralyzing the data center. This type of failure accounts for 80% of man-made failures, and the rest are unintentional.
A data center is a huge and complex system, and no operations engineer can be proficient in every technical detail. When engineers work in areas they are unfamiliar with, their operations are likely to produce unexpected results. Many devices also run low-quality software, and repeatedly issuing operations and configuration to them can trigger software defects that interrupt business. This is not uncommon in data centers. A data center holds tens of thousands of devices, and with that many devices, problems tend to arise whenever something is touched. A stable data center should therefore not be changed lightly; it should be left to run in its best state.
As is well known, around major festivals and events, large data centers freeze the network and suspend all change operations in order to reduce failures, lower the risk of human error, and avoid triggering bugs. The method works: apart from some hardware failures, other types of problems rarely occur during a freeze. Turtles are known for their long lifespan, and the popular explanation is that they rarely move and move slowly. Data center operations likewise favor stillness over motion: moving less, and moving more carefully, keeps failures to a minimum.
Data centers in the banking and financial industry have very high reliability requirements. To avoid failures, banks have established strict operating rules. All operations must follow unified specifications, and every command issued and every change made must be reviewed by the bank in advance, and even verified in a simulated environment, before it is applied to the live network. Data center operations in the banking industry are the most standardized, which is why their reliability is the best.
However, to respond quickly to business needs and improve resource utilization, operations work has to be done frequently, and avoiding it entirely is basically impossible. A data center may have changes scheduled every night: equipment software upgrades, configuration optimization, equipment replacement, and so on. The stream of changes never ends, and it inevitably introduces new problems along the way, so the data center never fully stabilizes and business is frequently affected. This runs directly against the old maxim of operations and maintenance. The technical knowledge required in a data center is vast, spanning dozens of disciplines. No one can master all of it, and it is hard to fully master even one area. When an operation plan is drawn up, limited knowledge always leaves something overlooked, and any omission can cause a problem during the operation. No one can be absolutely sure about a change. Accidents can happen in anything, just as in surgery: even the smallest procedure carries risk, and the family must sign a waiver exempting the surgeon from liability in case of an accident.
Since we cannot avoid the trouble, we should find ways to keep it from causing problems. First, divide and conquer. Divide and conquer means separating high-risk from low-risk, important from unimportant, simple from complex, and frequently changing from rarely changing. In the final analysis, we are doing two things: encapsulating complexity and isolating change. Divide and conquer at the operations architecture layer is already very common in the industry, for example separating application servers from database servers, separating transaction databases from user databases, and isolating production environments from test environments.
A data center is composed of many small systems, which should be loosely coupled and isolated from one another. That way, if one small system fails, the impact stays local and does not spread to the whole.
Second, manage people. To reduce failures caused by human intervention, we must strengthen the constraints on and management of people. People at different technical levels should have different operating permissions, and a novice who needs to operate on the live network must be supervised by an experienced engineer. Detailed and strict personnel rules are needed to constrain, assess, monitor, and manage operations staff, to strengthen their sense of responsibility, and to reward and penalize them accordingly. Data centers generally have to provide service 24 hours a day, all year round, so staff must be given enough rest: they should start and finish work on time and avoid long hours and fatigue, which reduces the probability of mistakes.
Third, manage things. When the data center needs a change or an optimization, the operations team should discuss it as a whole, analyze the foreseeable risks, and make sure the operation will not affect running business. Each change should be a decision reached through discussion by the entire technical team rather than one individual's act, so that failures caused by technical human error are minimized. A rollback plan should be prepared, and the change should be rolled back immediately if anything abnormal occurs. The cause should then be analyzed before a second attempt is made. Keep in mind that operations staff are not equipment specialists and may not fully understand a device's internal processing and implementation.
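The change procedure described above (team review, a prepared rollback plan, and immediate rollback on any abnormality) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical illustration rather than a real tool: the `Change` structure, the `run_change` function, and the default of two approvals are assumptions made only to show the control flow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Change:
    """One reviewed change bundled with its rollback plan (hypothetical structure)."""
    name: str
    apply: Callable[[], None]      # performs the change
    verify: Callable[[], bool]     # returns True if business is still healthy
    rollback: Callable[[], None]   # restores the previous state
    approvers: List[str] = field(default_factory=list)


def run_change(change: Change, required_approvals: int = 2) -> str:
    """Apply a change only after team review; roll back on any abnormality.

    The threshold of two approvals is an assumption, not a rule from the article.
    """
    # A change is a team decision: refuse to run without enough distinct reviewers.
    if len(set(change.approvers)) < required_approvals:
        return "rejected"

    try:
        change.apply()
        healthy = change.verify()
    except Exception:
        # Any error while applying or verifying counts as an abnormal situation.
        healthy = False

    if healthy:
        return "applied"

    # Roll back immediately; root-cause analysis happens before a second attempt.
    change.rollback()
    return "rolled_back"
```

For example, a change whose `verify` step returns `False` leaves the system in its original state and reports `"rolled_back"`, while a change with only one approver is never applied at all.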
For major changes, the equipment vendor's technical staff can be invited to take part and provide support, which reduces the risk of operational mistakes. Every operation must be fully prepared: the necessary simulation drills, migrating business away in advance, and preparing emergency channels all reduce the risk of failure.
"No trouble, no failure" is a golden saying. It sounds reasonable, but it is hard to follow in practice. The data center is a place where data flows at high speed and business needs change all the time. To keep up with business deployment and development, it is impossible not to change and trouble the data center. "No trouble" is only an ideal state. Even so, it is genuinely necessary to reduce the frequency of data center operations as far as possible and to touch things as little as possible, which can greatly reduce the probability of failure. People are the most important factor in data center activity: without human participation there would be no data center, and people still play a vital role in operations and maintenance. But people also bring the data center its growing pains. As the operator of a data center, you must always keep the old maxim in mind.