Lessons from data center outages: Focus on infrastructure

The majority of downtime incidents over the past year had known causes and were preventable through strong design and processes.

According to findings published by the Uptime Institute in summer 2018, nearly a third of data centers experienced an outage in the past year, up from 25% in 2017. But the increase wasn’t caused by some deadly new malware.

Instead, the top three causes of downtime were power outages (33%), network failures (30%), and IT or software errors (28%).

Most importantly, 80% of data center managers say these downtime events are preventable.

There is no way to prevent a lightning strike (such as the one that took down a Microsoft Azure data center in San Antonio in September 2018) or a zero-day malware attack. However, with proper planning and data center design, downtime due to unexpected weather events, attacks, routine human errors, or unscheduled system failures can be minimized.

Getting a data center up and running quickly after an outage is equally important. According to a report this year by Information Technology Intelligence Consulting, one hour of downtime costs data center operators an average of $260,000, while five minutes of downtime costs only $2,600.

Infrastructure redundancy still works

At the most basic level, data center systems need to be backed up. That means backing up the power supply, the main cooling system, the data, or even the entire data center.

Uptime Institute says that many enterprises require data centers with 2N cooling and power architectures, in other words, a fully redundant mirrored system. Of the users with 2N architectures, 22% experienced outages in the last year. That is one-third fewer than among those who adopted the cheaper, less-redundant "N+1" approach, of whom 33% reported downtime incidents.
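The "one-third fewer" figure follows directly from the two survey percentages. A minimal sketch of the arithmetic, in which the function name `relative_reduction` is illustrative and not taken from the Uptime report:

```python
def relative_reduction(baseline_rate: float, improved_rate: float) -> float:
    """Return the fraction by which improved_rate is lower than baseline_rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - improved_rate) / baseline_rate


# Outage rates quoted in the article.
n_plus_1_rate = 0.33  # share of N+1 users reporting downtime incidents
two_n_rate = 0.22     # share of 2N users reporting outages

reduction = relative_reduction(n_plus_1_rate, two_n_rate)
print(f"2N users reported about {reduction:.0%} fewer outages than N+1 users")
```

Note that this is a relative difference between two survey groups. It says nothing about the cost of the extra redundancy, which the survey figures quoted here do not cover.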

Backing up the entire data center can provide even higher reliability. According to Uptime survey data, 40% of data center managers said they would replicate workloads and data in two or more data centers.

"If you have one data center and there's a lightning strike, you're going to go down," said Markku Rossi, CTO of SSH Communications Security. "You should have a secondary data center where there's physical separation between them so they're not dependent on the same power source."

He added that no data center is immune to the problem, citing the example of a lightning strike at Microsoft Corp.'s South Central U.S. data center.

“If you have a second data center, you can fail over immediately,” he said.
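The failover idea Rossi describes can be sketched in a few lines. This is only an illustration under stated assumptions: the `Site` type, the `power_feed` label, and the site names are hypothetical, and a real deployment would rely on health checks, DNS or load-balancer changes, and data replication rather than a single boolean flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Site:
    name: str
    power_feed: str  # hypothetical label for the utility feed the site depends on
    healthy: bool    # outcome of the most recent health check


def shares_power(a: Site, b: Site) -> bool:
    """True if both sites depend on the same power source."""
    return a.power_feed == b.power_feed


def pick_active(primary: Site, secondary: Site) -> Optional[Site]:
    """Prefer the primary site; fail over to the secondary if the primary is down."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    return None  # both sites are down, nothing to fail over to


# Hypothetical sites: the primary has just lost power after a lightning strike.
primary = Site("dc-primary", power_feed="grid-a", healthy=False)
secondary = Site("dc-secondary", power_feed="grid-b", healthy=True)

assert not shares_power(primary, secondary), "sites must not share a power source"
active = pick_active(primary, secondary)
print(active.name if active else "no site available")
```

The `shares_power` check mirrors the point about physical separation: a secondary site on the same feed would likely fail along with the primary.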

Rossi added that planning and testing are key regardless of where backup systems are located, and that planning needs to take into account the complexity of today's data centers, where some problems can trigger others.

He gave the example of a recent outage that occurred during maintenance at GitHub's data center. They fixed the physical problem in minutes, but it took 24 hours for the data to sync correctly.

Data center managers need to pinpoint potential problem areas and then have tools and processes in place for when something goes wrong.

“Focus on building processes and a mindset that allows you to prepare for failure,” Rossi said.

Strengthen security, and not only around the perimeter

One of the biggest lessons data center managers should take away from recent malware-related outages is that it is no longer enough to have a hardened perimeter. Attackers can get through.

Healthcare companies, government agencies, educational institutions, and major manufacturers were hit in 2018, even though everyone should have been on high alert after the previous year's record-breaking breaches.

Obviously, it is critical to keep defenses up to date to prevent malware from getting in in the first place. But data center managers must be prepared in case perimeter defenses fail and have secondary protections in place.

These include malicious traffic detection mechanisms, network defenses such as segmentation, and least-privilege access and communication methods.

These could help prevent malware from spreading once it enters a network, or at least slow it down enough to give security teams a chance to respond, said Igor Livshitz, director of product management at Israel-based cybersecurity company GuardiCore.

WannaCry specifically exploited a vulnerability in the Server Message Block (SMB) protocol. He said data centers should do more to reduce lateral communications.

"In many cases like the WannaCry ransomware over the past year, the primary driver for the widespread impact of the attack was the ease with which these worms could spread once they gained a foothold within a data center," Livshitz said. "In fact, SMB traffic between servers is not necessary at all. If it had been blocked, the spread of the attack and the damage to the data center could have been greatly reduced, and the attack detected at an earlier stage before it could cause so much damage."
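Segmentation with least-privilege communication amounts to a default-deny rule for server-to-server traffic: a flow is permitted only if it has been explicitly allowed. The sketch below illustrates that idea under stated assumptions. The server names, the allowlist entries, and the `Flow` type are hypothetical; the only fact relied on is that SMB runs over TCP port 445.

```python
from dataclasses import dataclass

SMB_PORT = 445  # SMB over TCP


@dataclass(frozen=True)
class Flow:
    src: str   # source server (hypothetical names below)
    dst: str   # destination server
    port: int  # destination TCP port


# Hypothetical allowlist of the only server-to-server flows the workloads need.
ALLOWED_FLOWS = {
    ("web-01", "app-01", 8443),
    ("app-01", "db-01", 5432),
}


def is_allowed(flow: Flow, allowed=ALLOWED_FLOWS) -> bool:
    """Default deny: permit a flow only if it is explicitly listed."""
    return (flow.src, flow.dst, flow.port) in allowed


# A worm on web-01 trying to reach app-01 over SMB is denied,
# while the application's own traffic still passes.
worm_attempt = Flow("web-01", "app-01", SMB_PORT)
normal_call = Flow("web-01", "app-01", 8443)

print("SMB lateral attempt allowed:", is_allowed(worm_attempt))
print("Application call allowed:", is_allowed(normal_call))
```

Because nothing is permitted unless listed, SMB between servers is blocked without needing a dedicated rule, and each denied attempt is a signal that a security team could alert on.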

The lesson from the breaches of 2018 is not that data center managers must confront a new threat. It is that they need to get back to basics.

Nearly all data center outages are the result of poor planning and investment decisions, combined with poor processes or the inability to follow them, Andy Lawrence, executive director of research at Uptime Institute, wrote in a June 2018 survey report. “Almost all outages reported or studied by Uptime Institute have occurred before and are often well documented.”

Lightning strikes and new types of malware may dominate industry headlines, but when it comes to resiliency, the security of your data center infrastructure remains paramount.
