What does a data center fear most? Power outages and network damage. What do data center operators fear most? Downtime, unexpected failures, upgrades, and expansions. As data center construction keeps scaling up and new technologies keep arriving, the network that carries data center services has become extremely complex. To keep pace with those services, data center networks are constantly being updated and changed, which makes operation and maintenance work much harder. Downtime incidents are inevitable. They not only add to the workload of operation and maintenance personnel but, more importantly, cause huge losses for the data center. Even world-renowned Internet giants often enjoy such "treatment".

Internet giants keep going down, and operation and maintenance has become a problem

In the early morning of March 3, Alibaba Cloud suffered a system outage that left the websites and apps of companies running on Alibaba Cloud services unable to function normally. A large number of programmers and operation and maintenance personnel had to get out of bed and go to work. Shen Jian, a senior architect at 58.com, said the incident lasted about 3 hours and was followed by about 2 hours of monitoring.

Starting at 3:43 a.m. on May 3, Microsoft Azure suffered a large-scale global outage that lasted nearly 2 hours; service was not fully restored until 5:30 a.m. As a result, major Microsoft services including Microsoft 365, Dynamics, and DevOps all ran into problems.

Starting at 2:58 a.m. on June 3, Google suffered a massive worldwide outage that affected many services built on Google Cloud, including Gmail, YouTube, and Google Drive.
Users accessing Google services received various error alerts and were unable to read email, upload YouTube videos, and so on.

On June 25, Amazon confirmed on its official website that its cloud computing services had gone down, affecting the network connections of some users and multiple AWS regions. The fault was located in AWS US East 1, and a total of 33 services were affected, 9 of which were completely out of service.

Frequent downtime incidents make operation and maintenance more difficult

Time and again, downtime incidents have proven the importance of data center operation and maintenance, yet downtime still seems unavoidable. Today, with the advance of technology and the arrival of the Internet of Everything era, data centers play a key role as critical infrastructure. Although data centers have been developing in China for little more than ten years, they have evolved from ordinary computer rooms containing only UPS units, air conditioners, and IT equipment into facilities with tens of thousands of cabinets that provide all-round services such as the Internet, big data, AI, and cloud services. New technologies such as free cooling, fan walls, underwater data centers, and liquid-cooled servers are constantly being created and applied. As a result, operation and maintenance management faces greater challenges, and its difficulty has also "reached a new level."

First, ultra-large-scale data centers have changed staffing, organization, and efficiency. In the past, a manual inspection of a data center of up to 10,000 square meters took 2-4 hours. Now, with sites covering hundreds of thousands of square meters, more operation and maintenance personnel must be spread across different areas of responsibility, which increases the difficulty and cost of management.

Second, voltage levels have risen, and so have the safety risks.
In the past, operation and maintenance personnel worked with low voltage, but now power supply equipment, generators, and chillers all run on high voltage, so the safety requirements for maintenance are stricter.

In addition, concentrating scale also concentrates risk and magnifies the impact of accidents. The data center outages described above, for example, interrupted services and applications on a large scale around the world and caused heavy losses, so the pressure on operation and maintenance management has grown.

Reduce human error and improve the professional skills of operation and maintenance staff

According to survey data, 70% of data center downtime incidents are caused by human error. Therefore, as data centers continue to grow, operation and maintenance personnel must improve their skills and professional level to cope with unexpected events in the data center.
To make up for the limits of manual operation and maintenance, intelligent operation and maintenance has emerged

The digital age has arrived. The scale and capacity of data centers are growing exponentially, and the complexity and difficulty of operation and maintenance management are growing with them. After moving from script-based to tool-based to platform-based operation and maintenance, manpower has reached its limit, and intelligent operation and maintenance has emerged. More data center companies, such as Tencent, Huawei, and JD.com, have begun to step up R&D and join the wave of intelligent operation and maintenance. They combine artificial intelligence with operation and maintenance, applying machine learning to existing operation and maintenance data (logs, monitoring information, application information, etc.) to improve efficiency and gradually replace manual work. I believe that data centers will become more and more intelligent in the future.