The difficulty of operation and maintenance has reached a new level - it does not exist!

What is the data center most afraid of?

Power outage, network damage...

What do data center operators fear most?

Downtime, unusual failures, upgrades and expansions...

As data centers keep growing in scale and new technologies keep arriving, the networks that carry data center services have become extremely complex. To keep pace with those services, data center networks are constantly being updated and changed, which makes operation and maintenance much harder. Downtime incidents are hard to avoid entirely. They add to the workload of operation and maintenance personnel and, more importantly, cause huge losses for the data center. Even world-renowned Internet giants regularly receive this "treatment".

Internet giants are experiencing constant downtime, and operation and maintenance work has become a problem

In the early morning of March 3, Alibaba Cloud suffered a system outage that left the websites and apps of companies relying on Alibaba Cloud services unable to function normally. A large number of programmers and operation and maintenance personnel had to get out of bed and go to work. Shen Jian, a senior architect at 58, said that the incident lasted about 3 hours and was followed by 2 hours of observation.

Starting at 3:43 a.m. on May 3, Microsoft Azure experienced a large-scale outage around the world, which lasted for nearly 2 hours and was not fully restored until 5:30 a.m. Affected by the Azure outage, Microsoft's major services including Microsoft 365, Dynamics and DevOps all had usage problems.

Starting at 2:58 a.m. on June 3, Google suffered a massive worldwide outage that affected many services built on Google Cloud, including Gmail, YouTube and Google Drive. Users received various error alerts and were unable to access email, upload YouTube videos, and so on.

On June 25, Amazon confirmed on its official website that its cloud computing services had gone down, affecting network connectivity for some users and multiple AWS regions. The failure originated in AWS US East 1. A total of 33 services were affected, 9 of which were completely out of service.

Frequent downtime incidents make operation and maintenance more difficult

Time and again, downtime incidents have shown how important data center operation and maintenance is, yet they still seem unavoidable. With the advance of technology and the arrival of the Internet of Everything, data centers have become critical infrastructure. Although data centers have been developing in China for only a little over ten years, they have evolved from ordinary computer rooms containing just UPS units, air conditioners and IT equipment into facilities with tens of thousands of cabinets that support the Internet, big data, AI and cloud services. New technologies such as natural cooling, wind walls, underwater data centers and liquid-cooled servers are constantly being developed and applied. As a result, operation and maintenance management faces greater challenges, and its difficulty has also "reached a new level."

First, ultra-large-scale data centers have changed personnel, organization and efficiency. In the past, a manual inspection of a data center under 10,000 square meters took 2-4 hours. Now, with sites covering hundreds of thousands of square meters, more operation and maintenance personnel must be spread across different areas of responsibility, which increases the difficulty and cost of management. Second, voltage levels have risen and so have safety risks. Operation and maintenance personnel used to work with low-voltage equipment, but power supply equipment, generators and chillers now run on high voltage, so maintenance safety requirements are stricter. Third, concentrating scale also concentrates risk and magnifies the impact of an accident. The outages described above interrupted services and applications around the world and caused heavy losses, so the pressure on operation and maintenance management has grown accordingly.

Reduce human errors and improve professional skills of operation and maintenance management

According to survey data, 70% of data center downtime incidents are caused by human error. As data centers continue to grow, operation and maintenance personnel therefore need to raise their skills and professional level so that they can cope with unexpected events:

  • Establish a complete skill evaluation system that assesses operation and maintenance personnel from multiple angles, helping them improve their skills and encouraging them to keep learning on their own initiative.
  • Share operation and maintenance experience online: build an experience database and an online platform for sharing and exchange, and provide channels for online practice and study of operation and maintenance knowledge.
  • Simulate the practical operating environment online, so that personnel can practice in an environment that isolates operational risk and can quickly raise their practical level.
  • Assess theoretical skills online, drawing on a large question bank covering IT cloud platform components, with regular assessments and randomly selected questions, so that theoretical ability is evaluated automatically and in real time.
  • Assess practical skills online by building a lightweight online operation and maintenance and programming environment, so that operation and maintenance skills and R&D skills are evaluated automatically and in real time.
  • Improve efficiency through automatic assessment: evaluate theoretical and practical skills online in a systematic, automated way, which speeds up assessment and reflects ability objectively and fairly.

To make up for the lack of manual operation and maintenance, intelligent operation and maintenance came into being

The digital age has arrived. The scale and capacity of data centers are growing exponentially, and the complexity and difficulty of operation and maintenance management are growing with them. Operation and maintenance has moved from scripts to tools to platforms, manpower has reached its limit, and intelligent operation and maintenance has emerged in response. Companies such as Tencent, Huawei and JD.com have stepped up their R&D investment in intelligent operation and maintenance. They combine artificial intelligence with operation and maintenance, applying machine learning to existing operational data (logs, monitoring information, application information and so on) to improve efficiency and gradually replace manual work. I believe that data centers will become more and more intelligent in the future.
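As a rough illustration of what analyzing monitoring data automatically can look like, the sketch below flags unusual readings in a metric series using a rolling z-score. This is a simple statistical baseline, not a trained machine learning model and not the approach of any company named above. The function name, the window size, the threshold and the sample CPU load values are all assumptions made up for this example.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the preceding window.

    Hypothetical helper for illustration: each sample is compared with the
    mean and standard deviation of the `window` samples before it.
    """
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as an anomaly.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up CPU load readings (percent) with one obvious spike at index 7.
cpu_load = [41, 43, 42, 40, 44, 42, 43, 95, 42, 41]
print(find_anomalies(cpu_load))  # prints [7]
```

In practice the flagged indices would feed an alerting or ticketing step, which is where the reduction in manual inspection work would come from. A production system would also have to handle seasonality, missing data and many metrics at once, which this sketch ignores.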
