The unwritten rules in data center operation and maintenance

Data centers are the hubs of information processing. The equipment inside carries many critical services that demand continuous, stable operation, yet those services ultimately depend on thousands of electronic devices running reliably. To keep these devices running without problems, or at least to keep problems invisible at the business level, operation and maintenance technicians have devised many practices. Some have gradually become de facto industry standards that many data centers follow. Sometimes technicians have no choice but to follow these unwritten rules, because the fundamental goal is to keep data center services running continuously and stably. A service interruption is a serious matter for a data center: the resulting losses are often counted by the second, and every rule exists to protect the data center against them. Let's take a look at some of the interesting unwritten rules of operation and maintenance work.

Freeze the network during major holidays

Whenever a major holiday approaches, the data centers of major network operators, key industry enterprises, and similar organizations freeze their networks. A network freeze (sometimes translated as "network closing") means halting all manual operations and business changes in the data center and letting the equipment run on its own without human intervention. It does not mean reducing the number of staff on duty. On the contrary, on-duty staffing is strengthened so that the data center keeps running smoothly and any problem that does occur is handled promptly. A freeze cuts down on human-induced failures. Keep in mind that 80% of failures are caused by human operations, so the safest course is not to touch anything. No one wants their data center to fail at a critical moment and attract attention. For example, ahead of the upcoming 19th National Congress of the Communist Party of China, all mainstream data centers have frozen their networks and no longer allow any network changes (except to deal with equipment failures). Some data center machine rooms have even been locked so that no one can enter. This practice grew out of operation and maintenance experience: historically, as long as human intervention is reduced and the equipment is left to run on its own, the probability of problems drops considerably. During critical periods, therefore, no changes are made, the data center is left to run by itself, and the probability of failure is kept to a minimum.
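The rule above (no changes during a critical period, except to deal with equipment failures) can be sketched as a simple gate that a change-management tool might consult. This is only a minimal illustration: the `FREEZE_WINDOWS` list, its dates, and the `change_allowed` function are hypothetical names and values, not taken from any real system.

```python
from datetime import date

# Hypothetical freeze windows as (start, end) pairs, both inclusive.
# The dates are illustrative only.
FREEZE_WINDOWS = [
    (date(2017, 10, 1), date(2017, 10, 8)),
]


def change_allowed(day, is_fault_repair=False, windows=FREEZE_WINDOWS):
    """Return True if a change may be carried out on the given day."""
    in_freeze = any(start <= day <= end for start, end in windows)
    # Handling an equipment failure is the one exception the freeze permits.
    return is_fault_repair or not in_freeze


print(change_allowed(date(2017, 10, 3)))                        # routine change inside the window
print(change_allowed(date(2017, 10, 3), is_fault_repair=True))  # fault repair inside the window
```

Keeping the fault-repair exception as an explicit flag, rather than letting operators bypass the check, means every exception can still be logged and reviewed.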

Restart equipment regularly

A mobile phone that has been in use for a long time slows down, and restarting it makes it noticeably more responsive. The same is true of data center equipment. It runs without interruption all year round, and after a long period of operation, accumulated memory garbage and latent software bugs tend to surface, raising the risk of equipment problems. Restarting equipment regularly helps reduce failures and extend service life. If the services on a device have no backup, however, restarting it may affect them, so assess the impact carefully before any planned restart. If the interruption caused by a single restart is acceptable to the business, you can restart the device proactively on a regular schedule, such as every six months or once a year. If the device is running an older software version, the restart is also an opportunity to upgrade it. Don't think of restarting a device as something shameful. A horse that has pulled a cart for a long time needs a rest too. Some data centers conduct fault simulation drills once or twice a year, which include restarting equipment to check the stability and redundancy of the data center system. Such drills are valuable: the equipment gets a brief rest, and weaknesses in the operation of the data center can be found and fixed in time. Do not wait until a serious problem occurs before considering a restart as a recovery measure, because by then the business has often already suffered serious losses.
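The assessment described above can be sketched as a small filter: a device is a candidate for a planned restart only if it has been up for a long time and the interruption of one restart is acceptable to the business. The `Device` fields, the 180-day threshold, and the sample values are assumptions made for illustration, not figures from any real data center.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    uptime_days: int          # days since the last restart
    restart_outage_s: int     # estimated service interruption of one restart
    tolerated_outage_s: int   # interruption the business is willing to accept


def due_for_restart(devices, max_uptime_days=180):
    """Return names of devices that are overdue and safe to restart proactively."""
    return [
        d.name
        for d in devices
        if d.uptime_days >= max_uptime_days
        and d.restart_outage_s <= d.tolerated_outage_s
    ]


inventory = [
    Device("access-switch-7", uptime_days=400, restart_outage_s=0, tolerated_outage_s=30),
    Device("core-router-1", uptime_days=500, restart_outage_s=300, tolerated_outage_s=0),
    Device("server-42", uptime_days=20, restart_outage_s=0, tolerated_outage_s=60),
]
print(due_for_restart(inventory))
```

In this sketch the core router is overdue but is left out, because its restart would interrupt service for longer than the business tolerates. Such a device needs a backup path first.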

Strengthen equipment operation management

A data center contains many devices from different manufacturers, each serving different functions. The people who operate them must be strictly managed so that nobody unfamiliar with a device operates it incorrectly; failures caused this way are countless. Access rights to equipment therefore need to be controlled: different devices are managed by different people, each controlled by those who know it best. Before a change operation, assess in advance whether the configuration complies with specifications and whether there are known risks, and involve the equipment manufacturer in the change so that it behaves as expected. Data centers manage device logins very strictly and set different permission requirements for different personnel. Anyone who needs higher access rights must apply to senior leaders and explain the reason for the operation. This is an important part of data center operation and maintenance management.
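As a rough sketch of the access rule described above, each device can be mapped to the operators who are familiar with it, and anyone else needs explicit approval from a senior leader. The `DEVICE_OWNERS` table, the user and device names, and the `may_operate` function are all hypothetical.

```python
# Hypothetical mapping of each device to the operators familiar with it.
DEVICE_OWNERS = {
    "core-switch-1": {"alice"},
    "storage-array-1": {"bob"},
}


def may_operate(user, device, approved_by_senior=False, owners=DEVICE_OWNERS):
    """Owners may operate their own devices; anyone else needs senior approval."""
    return user in owners.get(device, set()) or approved_by_senior


print(may_operate("alice", "core-switch-1"))
print(may_operate("bob", "core-switch-1"))
print(may_operate("bob", "core-switch-1", approved_by_senior=True))
```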

The three standard moves: isolate, take offline, restart

When a fault occurs in a running data center, the first priority is to restore service; locating the root cause is secondary. When handling a fault, operation and maintenance personnel first try to pin down where the fault is, and if that cannot be fully determined quickly, they still have to try to restore service. This is where the three standard moves come in: isolate, take offline, and restart. All three target specific equipment, because failures during stable operation are almost always caused by problems with one or a few devices. Isolation means moving traffic away from the faulty device's port or VLAN, based on the scope of the service failure, and onto other healthy paths. If the scope of the fault cannot be determined, consider taking the whole device offline and switching all of its services to other devices. For example, if a server's services behave abnormally, migrate the virtual machines on that server to other servers to restore service as quickly as possible. Sometimes there is no backup between devices and taking one offline is impractical. For some core network devices, for instance, going offline requires a great deal of service switching work. In that case, consider restarting the device to see whether that restores it. An abnormal device can generally be brought back by a restart and will run normally for a while, which buys valuable time for analyzing the cause. Analysis of the cause continues on one side while the data center business keeps running normally on the other. Once the cause is found, the underlying hazard is remedied.
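The order of preference described above (isolate when the fault scope is known, take the device offline when it is not but a backup exists, restart as the last resort) can be written as a small decision function. This is a simplified sketch: the function name and its two boolean inputs are assumptions, and a real incident involves far more context than two flags.

```python
def choose_recovery_action(fault_scope_known, has_backup_device):
    """Pick the first recovery step, following the order described in the text."""
    if fault_scope_known:
        # Move the affected port, VLAN, or traffic onto a healthy path.
        return "isolate"
    if has_backup_device:
        # Switch all services of the device to another device.
        return "offline"
    # No backup to fall back on: restart and analyze the cause afterwards.
    return "restart"


print(choose_recovery_action(fault_scope_known=True, has_backup_device=True))
print(choose_recovery_action(fault_scope_known=False, has_backup_device=True))
print(choose_recovery_action(fault_scope_known=False, has_backup_device=False))
```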

Operation and maintenance personnel have accumulated a great deal of experience in their daily work. These are hard-won lessons and a valuable asset of the data center. Some of the rules have no deep technical basis, but they are very practical, and they are the solutions that operation and maintenance personnel reach for when a data center fails. As the saying goes, "the words may be rough, but the reasoning is sound": these unwritten rules may look simple, yet they prove very useful at critical moments.
