Lessons from data center outages: Focus on infrastructure

The majority of downtime incidents over the past year had known causes and were preventable through strong design and processes.

According to findings published by the Uptime Institute in summer 2018, nearly a third of data centers experienced an outage in the past year, up from 25% in 2017. But the increase wasn’t caused by some deadly new malware.

Instead, the top three causes of downtime were power outages (33%), network failures (30%), and IT or software errors (28%).

Most importantly, 80% of data center managers say these downtime events are preventable.

There is no way to prevent a lightning strike (such as the one that knocked a Microsoft Azure data center in San Antonio offline in September 2018) or a zero-day malware attack. However, with proper planning and data center design, downtime due to unexpected weather events, attacks, routine human errors, or unplanned system failures can be minimized.

Getting a data center up and running quickly after an outage is equally important. According to a report this year by Information Technology Intelligence Consulting, one hour of downtime costs data center operators an average of $260,000, while five minutes of downtime costs only $2,600.
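To make the comparison concrete, the two figures quoted above can be converted into a cost per minute. The short calculation below is only an illustration that uses the averages cited in this article; the variable and function names are made up for the example, and actual costs vary widely by operator.

```python
# Average downtime costs as quoted in this article (USD).
COST_ONE_HOUR = 260_000   # one hour of downtime
COST_FIVE_MIN = 2_600     # five minutes of downtime


def cost_per_minute(total_cost: float, minutes: float) -> float:
    """Average cost of each minute of an outage of the given length."""
    return total_cost / minutes


hour_rate = cost_per_minute(COST_ONE_HOUR, 60)   # about 4,333 USD per minute
short_rate = cost_per_minute(COST_FIVE_MIN, 5)   # 520 USD per minute
ratio = hour_rate / short_rate                   # how much costlier each minute becomes

print(f"1-hour outage:   ${hour_rate:,.0f} per minute")
print(f"5-minute outage: ${short_rate:,.0f} per minute")
print(f"Each minute of the long outage costs about {ratio:.1f}x as much")
```

On these figures, a minute of a one-hour outage costs roughly eight times as much as a minute of a five-minute outage, which is why fast recovery matters as much as prevention.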

Infrastructure redundancy still works

At the most basic level, data center systems need backups: a backup power supply, a backup for the main cooling system, backups of the data, or even a backup of the entire data center.

Uptime Institute says that many enterprises require data centers with 2N cooling and power architectures, in other words, fully redundant, mirrored systems. Of the operators using a 2N architecture, 22% experienced an outage in the last year. That is one-third fewer than among those who adopted the cheaper, less redundant "N+1" approach, 33% of whom reported downtime incidents.
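The difference between the two approaches can be sketched with a deliberately simplified capacity model. Everything in the snippet is an assumption made for illustration: the load of four units is hypothetical, and a real 2N design also means two independent distribution paths, not just more units. The last lines check the "one-third fewer" figure against the survey percentages quoted above.

```python
def survives(units_installed: int, units_needed: int, units_failed: int) -> bool:
    """True if the units still working can carry the full load."""
    return units_installed - units_failed >= units_needed


N = 4                 # hypothetical: units needed to carry the full load
n_plus_1 = N + 1      # N+1: one spare unit
two_n = 2 * N         # 2N: a complete second set of units

print(survives(n_plus_1, N, units_failed=1))   # one failure is tolerated
print(survives(n_plus_1, N, units_failed=2))   # a second failure is not
print(survives(two_n, N, units_failed=N))      # 2N can lose an entire set

# Survey figures quoted above: 22% (2N) versus 33% (N+1) reported outages.
relative_reduction = (33 - 22) / 33
print(f"Relative reduction: {relative_reduction:.0%}")
```

In this model N+1 covers a single failed unit, while 2N keeps running even if a whole set is lost, which is consistent with the lower outage rate reported for 2N sites.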

Backing up the entire data center can provide even higher reliability. According to Uptime survey data, 40% of data center managers said they replicate workloads and data across two or more data centers.

"If you have one data center and there's a lightning strike, you're going to go down," said Markku Rossi, CTO of SSH Communications Security. "You should have a secondary data center where there's physical separation between them so they're not dependent on the same power source."

He added that no data center is immune to the problem, citing the example of a lightning strike at Microsoft Corp.'s South Central U.S. data center.

“If you have a second data center, you can fail over immediately,” he said.

Rossi added that planning and testing are key regardless of where backup systems are located, and that planning needs to take into account the complexity of today's data centers, where some problems can trigger others.

He gave the example of a recent outage that occurred during maintenance at GitHub's data center. The physical problem was fixed in minutes, but it took 24 hours for the data to sync correctly.

Data center managers need to pinpoint potential problem areas and then have tools and processes in place when something happens.

“Focus on building processes and a mindset that allows you to prepare for failure,” Rossi said.

Strengthening security beyond the perimeter

One of the biggest lessons data center managers should take away from recent malware-related outages is that a hardened perimeter is no longer enough. Attackers can and do get in.

Healthcare companies, government agencies, educational institutions, and major manufacturers were hit in 2018, though everyone should have been on high alert after last year’s record-breaking breaches.

Obviously, it is critical to keep defenses up to date to prevent malware from getting in in the first place. But data center managers must be prepared in case perimeter defenses fail and have secondary protections in place.

These include malicious traffic detection mechanisms, network defenses such as segmentation, and least-privilege access and communication methods.

These measures can help prevent malware from spreading once it enters a network, or at least slow it down enough to give security teams a chance to respond, said Igor Livshitz, director of product management at GuardiCore, an Israel-based cybersecurity company.

WannaCry specifically exploited a vulnerability in the Server Message Block (SMB) protocol. Livshitz said data centers should do more to restrict lateral communications.

"In many cases like the WannaCry ransomware over the past year, the primary driver for the widespread impact of the attack was the ease with which these worms could spread once they gain a foothold within a data center," Livshitz said. "In fact, SMB traffic between servers is not necessary at all. If it had been blocked, the spread of the attack and the damage to the data center could have been greatly reduced, and the attack detected at an earlier stage before it could cause so much damage."
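The idea of restricting server-to-server traffic can be sketched as a default-deny allowlist: a flow inside the data center is permitted only if it is explicitly listed. The tier names, ports, and rules below are hypothetical and exist only for this example; in practice such a policy would be enforced by firewalls or a segmentation product, not by application code.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src_tier: str   # hypothetical tier name, e.g. "web"
    dst_tier: str
    port: int


SMB_PORT = 445  # TCP port used by SMB

# Least privilege: only the flows the application needs are listed.
# SMB between servers is absent, so it is denied by default.
ALLOWED_FLOWS = {
    ("web", "app", 8443),
    ("app", "db", 5432),
}


def is_allowed(flow: Flow) -> bool:
    """Default deny: a flow passes only if it is explicitly allowed."""
    return (flow.src_tier, flow.dst_tier, flow.port) in ALLOWED_FLOWS


print(is_allowed(Flow("web", "app", 8443)))      # listed, so permitted
print(is_allowed(Flow("app", "app", SMB_PORT)))  # lateral SMB, so blocked
```

Under a policy like this, a worm that lands on one application server cannot reach its neighbors over SMB, because that path was never opened.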

The lesson from the breaches of 2018 is not that data center managers must confront a new threat. It is that they need to get back to basics.

Nearly all data center outages are the result of poor planning and investment decisions, combined with poor processes or a failure to follow them, Andy Lawrence, executive director of research at Uptime Institute, wrote in a June 2018 survey report. “Almost all outages reported to or studied by Uptime Institute have occurred before and are often well documented.”

Lightning strikes and new types of malware may dominate industry headlines, but when it comes to resiliency, the security of your data center infrastructure remains paramount.
