In 60 days, a computer room caught fire and four major cloud giants suffered outages. How can operations and maintenance avoid downtime?

On the morning of the day before yesterday, Tencent's cloud computing service suffered an outage in Guangdong: some users could not access their resources, console logins failed, and websites became unreachable. The failure lasted for 3 hours. The resulting losses and the question of user compensation have not yet been determined. The incident was reportedly caused by a break in a carrier's optical cable. As of 11:40 yesterday, Tencent Cloud said that service had been restored.

However, this is not an isolated case. Over the past month, a string of alarming incidents has occurred one after another:

  • In early June, a diesel engine caught fire in a data center room in Yizhuang, Beijing;
  • On June 28, the Alibaba Cloud official website console and some product features malfunctioned;
  • On July 17, the AWS Management Console failed intermittently;
  • On July 18, Google Cloud Platform's global load balancing service was interrupted.

Take the Alibaba Cloud failure as an example. The company's subsequent statement attributed it to an operational error made during operation and maintenance. As a result, many Alibaba Cloud products were unavailable for about an hour. As some users put it: half of China's Internet was in shock for a whole hour!

It seems that in the summer heat, data centers and cloud computing are also facing a severe operation and maintenance test.

Intelligent and automated operation and maintenance does not eliminate human work, and it also draws on artificial intelligence

Operation and maintenance is no small matter, especially for data centers, which are a key part of infrastructure, and there is no room for slack. Over the past decade, data centers have evolved from ordinary computer rooms holding only UPS units, air conditioners and IT equipment into facilities that incorporate a wide range of new technologies and applications. As scale grows, risk becomes concentrated, so data center operation and maintenance management faces greater challenges and has become correspondingly harder. As data centers continue to expand and upgrade, the safe and stable operation of their infrastructure matters more and more.

The data center industry has a saying: "30% technology, 70% management". Reducing the opportunities for human involvement and controlling human actions in a disciplined way is therefore the top priority of current operation and maintenance work, and the new generation of data centers built and commissioned in recent years often have the most to say on this. One of them, China Telecom Kepler (Foshan) Data Center, which was put into operation at the end of August, has been actively exploring intelligent and automated operation and maintenance.

Today, technologies such as big data, the Internet of Things, automation, and machine learning have changed the traditional model of data center operation and maintenance management. A new-generation data center cannot be operated without the support of information systems, and building a highly intelligent information system is the key to improving efficiency and making operation and maintenance intelligent and automated.

Among these systems, the operation monitoring platform is the foundation of the operation and maintenance management system. To keep the data center safe, operators need comprehensive, real-time monitoring of temperature, humidity, power, water flow and air volume so that potential problems are found early. At the Kepler data center, the monitoring center covers infrared temperature monitoring, power quality monitoring, ultrasonic water flow monitoring, air volume monitoring and other resource monitoring, plus dedicated monitoring of key equipment. Alarms are displayed directly in the monitoring center so that operation and maintenance staff see key-equipment alarms as soon as they occur, which shortens troubleshooting and improves efficiency. The same data also helps prevent problems before they happen and provides a reliable basis for emergency measures and energy-saving measures.
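The kind of threshold alarm described above can be illustrated with a minimal Python sketch. It is not the Kepler monitoring platform: the metric names, the limits in `THRESHOLDS`, and the functions `check_reading` and `scan` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Threshold:
    low: float
    high: float


# Hypothetical limits for illustration only; real limits depend on the site.
THRESHOLDS: Dict[str, Threshold] = {
    "temperature_c": Threshold(18.0, 27.0),
    "humidity_pct": Threshold(20.0, 80.0),
    "water_flow_m3h": Threshold(5.0, 50.0),
}


def check_reading(metric: str, value: float) -> Optional[str]:
    """Return an alarm message if the value is out of range, else None."""
    limit = THRESHOLDS.get(metric)
    if limit is None:
        # An unmonitored metric is itself worth flagging.
        return f"ALARM {metric}: no threshold configured"
    if value < limit.low:
        return f"ALARM {metric}: {value} below {limit.low}"
    if value > limit.high:
        return f"ALARM {metric}: {value} above {limit.high}"
    return None


def scan(readings: Dict[str, float]) -> List[str]:
    """Check one batch of sensor readings and collect all alarms."""
    alarms = []
    for metric, value in readings.items():
        message = check_reading(metric, value)
        if message is not None:
            alarms.append(message)
    return alarms
```

Calling `scan({"temperature_c": 22.0, "humidity_pct": 10.0})` would return a single alarm for humidity. A real platform would push such alarms to the monitoring center display rather than return them in a list.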

Relying on artificial intelligence technology, the monitoring center adopts unified data source standards covering coding, name, data type, unit and precision, update frequency, and storage requirements. The operating status of every resource and piece of equipment in the data center is visible at a glance, which improves operation and maintenance efficiency and largely avoids problems such as local hotspots and uneven heating and cooling in the computer room.
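To make the idea of a unified data source standard concrete, here is a small Python sketch. The article only names the categories the standard covers, so the class `PointDefinition`, its field names, and the simple mean-plus-margin hotspot rule in `find_hotspots` are assumptions made for illustration. The rule is plain arithmetic and does not represent the artificial intelligence technology the article mentions.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class PointDefinition:
    """One monitored data point described by a shared standard (hypothetical fields)."""
    code: str             # unified point code
    name: str             # human-readable name
    data_type: str        # e.g. "float"
    unit: str             # e.g. "degC"
    precision: int        # number of decimal places to keep
    update_seconds: int   # update frequency
    retention_days: int   # storage requirement


def normalize(definition: PointDefinition, raw_value: str) -> float:
    """Convert a raw reading to the precision the standard prescribes."""
    return round(float(raw_value), definition.precision)


def find_hotspots(rack_temps: Dict[str, float], margin_c: float = 5.0) -> List[str]:
    """Return racks whose temperature exceeds the room average by more than margin_c."""
    average = mean(rack_temps.values())
    return sorted(rack for rack, temp in rack_temps.items() if temp - average > margin_c)
```

Because every point shares the same unit and precision rules, readings from different racks can be compared directly, which is what makes a simple room-wide check like `find_hotspots` possible.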

A monitoring platform alone is far from enough. More refined management also requires an intelligent management platform, which is where management through a PC client combined with a mobile app comes in. Kepler Data Center reportedly pioneered a fully automated QR code inspection system in the industry: it supports custom inspection routes, generates inspection tasks automatically, and delivers them to staff through the mobile app. It can produce inspection reports with one click and automatically evaluate inspection health, automating the inspection process while improving the data center's security and overall operating efficiency.
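The inspection workflow just described (routes, generated tasks, scans, a report with a health score) can be sketched in a few lines of Python. The article does not describe how the real system works internally, so this is a hedged illustration: `InspectionTask`, `generate_tasks`, the checkpoint codes, and the scoring rule (share of checkpoints scanned and passed) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InspectionTask:
    route_name: str
    checkpoints: List[str]                      # QR code payloads, in route order
    results: Dict[str, bool] = field(default_factory=dict)

    def record_scan(self, qr_payload: str, passed: bool) -> None:
        """Record the result of scanning one checkpoint's QR code."""
        if qr_payload not in self.checkpoints:
            raise ValueError(f"{qr_payload} is not on route {self.route_name}")
        self.results[qr_payload] = passed

    def report(self) -> Dict[str, object]:
        """Summarize the task: missed and failed checkpoints plus a health score."""
        missed = [c for c in self.checkpoints if c not in self.results]
        failed = [c for c in self.checkpoints if self.results.get(c) is False]
        passed_count = len(self.checkpoints) - len(missed) - len(failed)
        # Assumed scoring rule: percentage of checkpoints scanned and passed.
        score = round(100 * passed_count / len(self.checkpoints)) if self.checkpoints else 0
        return {
            "route": self.route_name,
            "missed": missed,
            "failed": failed,
            "health_score": score,
        }


def generate_tasks(routes: Dict[str, List[str]]) -> List[InspectionTask]:
    """Create one inspection task per configured route."""
    return [InspectionTask(name, list(points)) for name, points in routes.items()]
```

For example, a route with four checkpoints where one scan fails and one checkpoint is never scanned would report a health score of 50 under this assumed rule.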

It should be pointed out that intelligent and automated operation and maintenance does not mean people are no longer needed. Rather, about 30%-40% of operation and maintenance work is standardized and needs no human intervention: once the parameters and steps are set, the task can be handled automatically. Still, data centers have a long way to go before they fully adopt artificial intelligence, according to experts from Schneider Electric, a data center equipment manufacturer.

The other 60%-70% of the work still requires manual intervention because it is non-standardized, and it tests the professionalism of the operation and maintenance team. With a strictly enforced 7*24-hour on-duty system, monthly maintenance of facility equipment, and quarterly maintenance by equipment manufacturers, Kepler Data Center provides complete, efficient, and reliable data operation and network services. Kepler Data Center will reportedly put its first batch of 774 racks into operation at the end of August. They are located in the module rooms on the 2nd and 3rd floors, with an average of 20A per cabinet, while the 4th to 7th floors can be customized for customers.

Operation and maintenance management, technology and service strength complement each other, and none can be missing

Of course, even the most careful planning leaves gaps. Data center resources are clearly becoming more centralized, so once a failure occurs or a vulnerability is exploited, the result may be large-scale data loss or even equipment downtime. Even a few minutes of downtime can be catastrophic for an enterprise, which makes disaster recovery and emergency plans crucial to stable operation. The Kepler data center, for example, is built for high reliability: it has true dual-circuit mains power and a 2N UPS power supply system. Its diesel generators have a fuel supply for no less than 8 hours, and the chilled water and cooling water of the refrigeration system run through highly reliable dual-loop piping. On top of this technical foundation, it strictly follows rules such as fire drills twice a year, diesel generator load runs twice a year, and a computer room emergency drill once a year, so customers can use its data hosting services without worry.
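Two of the figures above lend themselves to a quick sanity check: a 2N design means each of two independent power paths can carry the full load on its own, and the fuel requirement is a runtime of at least 8 hours. The Python sketch below shows both checks. The function names are hypothetical, and the article gives no load, capacity, or fuel figures for Kepler, so any numbers passed in are placeholders.

```python
MIN_FUEL_HOURS = 8.0  # the "no less than 8 hours" figure from the article


def meets_2n(load_kw: float, path_a_kw: float, path_b_kw: float) -> bool:
    """2N redundancy: either path alone must be able to carry the full load."""
    return path_a_kw >= load_kw and path_b_kw >= load_kw


def fuel_runtime_hours(usable_fuel_l: float, burn_rate_l_per_h: float) -> float:
    """Hours of generator operation for a given usable fuel volume and burn rate."""
    if burn_rate_l_per_h <= 0:
        raise ValueError("burn rate must be positive")
    return usable_fuel_l / burn_rate_l_per_h


def meets_fuel_requirement(usable_fuel_l: float, burn_rate_l_per_h: float) -> bool:
    """True if the stored fuel covers at least MIN_FUEL_HOURS at the given burn rate."""
    return fuel_runtime_hours(usable_fuel_l, burn_rate_l_per_h) >= MIN_FUEL_HOURS
```

With placeholder values, `meets_2n(800, 1000, 600)` is false because the second path cannot carry the 800 kW load alone, which is exactly the situation a 2N design rules out.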

The importance of intelligent and automated operation and maintenance is self-evident, and efficient, intelligent information-based operation and maintenance management systems will play an increasingly important role. Such a system does not work in isolation, however. Only in combination with a sound design concept, a reasonable structural layout, and strong technical service capabilities can it deliver intelligent, efficient, and safe operation and maintenance.

Kepler Data Center is a model of the new generation of data centers, strong in both software and hardware. It relies on the cloud-network integration strategy of its partner China Telecom, connects directly to the international exits of the 163 backbone, and is backed by the safe and reliable power resources of its shareholder Foshan Electric Power Construction Group. Its design applies the concepts of "green", "energy saving" and "environmental protection": it has an independent diesel generator building with efficient ventilation and noise reduction, uses fluid dynamics to assist in laying out the computer room load, reserves an interface for future connection to the trigeneration of cooling, heating and electricity in the Funeng Park, and recycles air conditioning condensate. It is built to China Telecom's five-star and T3+ computer room standards. The aim is to become a new-generation high-tech, information-based, green and environmentally friendly data center in the Pearl River Delta region and an important backbone network node, providing a full range of data services to the public, government and enterprises throughout the province, in Hong Kong, Macao and Taiwan, and even across the whole country and Southeast Asia.

The value of the operation and maintenance market is coming to the fore, and data centers are seizing the chance to overtake on the curve

In fact, operation and maintenance is often the most important work in a data center, yet it is frequently overlooked, mainly because it produces no visible results in the short term and is singled out for blame only when a failure occurs. With the development of big data technology, and especially the steady arrival of new servers, demands on the infrastructure layer keep rising. Meeting only the basic requirements of safe, stable, reliable and green operation has long ceased to satisfy users. Operators should follow the trend, actively expand their business scope, and innovate in their operation and maintenance management models.

According to the 2018 China Enterprise IT Operation and Maintenance Management Market Report, China's data center operation and maintenance service market is expected to reach 274.47 billion yuan by 2020, a compound annual growth rate of 16.4%. Intelligence will undoubtedly be the trend in China's data center operation and maintenance management, which will also shift from passive response to active defense and turn from an IT cost center into an IT service center and IT value center. During this period, data centers strong in both software and hardware will pull far ahead and quickly seize the market.
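For readers who want to work with the growth figure, compound annual growth follows end value = start value × (1 + rate)^years. The Python sketch below applies that formula in both directions. The article does not state the base year or the number of years the 16.4% rate covers, so the `years` argument is an assumption the reader must supply, and the function names are hypothetical.

```python
def project(start_value: float, rate: float, years: int) -> float:
    """Value after compounding start_value at the given annual rate for some years."""
    return start_value * (1.0 + rate) ** years


def implied_start(end_value: float, rate: float, years: int) -> float:
    """Back out the starting value implied by an end value and an annual rate."""
    return end_value / (1.0 + rate) ** years


# Figures quoted in the article; the period length is NOT given there.
MARKET_2020_BILLION_YUAN = 274.47
CAGR = 0.164
```

For instance, `implied_start(MARKET_2020_BILLION_YUAN, CAGR, years=2)` gives the market size two years earlier if, and only if, the quoted rate applies over exactly those two years.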
