Demystifying the resilient data center


When it comes to data centers, "resilience" can be defined as "the ability to maintain ICT services in the face of environmental extremes as well as human error or vandalism", and higher levels of resilience can be designed into the mechanical and electrical infrastructure, but only at a premium.

The Uptime Institute's data center rating standard is a widely used way to measure the resilience of data center infrastructure. However, research shows that "human error" is the main cause of data center outages, accounting for at least 70% of them. Even so, reliability can be improved through redundant design: a dual-bus power system with a UPS on each bus can largely protect dual-corded loads from power failures, human error, and malicious damage. But even then, caution is still required.

Numbers mislead users

Of course, data center users want higher reliability and availability, and value for money. So how should the availability of a data center be understood? Two somewhat interrelated "indicators" are commonly used:

  • "Type" of "Uptime Institute (I-IV)" or "TIA-942" (I-IV), "Rating" of BICSI, and "Availability Class" of EN50600
  • Availability percentage, such as 99.999% (the so-called "five nines")

Setting aside the facts that only the Uptime Institute can award a Tier rating, that TIA-942 and BICSI are ANSI standards most applicable to North America, and that EN 50600 is not yet widely used, all of these standards describe four levels of capability built around "concurrent maintainability" and "fault tolerance". The principles are clear. Concurrent maintainability answers the question: what is the point of building a very reliable (and possibly resilient) data center that has to be shut down once a year for maintenance? A fault-tolerant system, meanwhile, can have any component, path or space "fail" without affecting ICT services.


However, the most abused metric is the availability percentage. It is easy to calculate, but it can fool non-specialist buyers and users and lead to misunderstanding. In fact, only two numbers are needed to express availability clearly: MTBF (mean time between failures, in hours) and MTTR (mean time to repair, in hours). Divide MTBF by the total time (MTBF + MTTR), multiply by 100%, and you have the true availability.
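
As a minimal sketch of that arithmetic (the figures are illustrative, not taken from any particular vendor or standard):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), expressed as a percentage."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# Illustrative only: a component that fails on average every 50,000 hours
# and takes 8 hours to repair.
print(f"{availability(50_000, 8):.4f}%")  # ~99.9840%
```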

Therefore, a very long MTBF combined with a very short MTTR can produce a very high availability. Unfortunately, MTBF and MTTR are numbers that a marketing department can bend to suit its interpretation. For example, a vendor can quote 99.999% availability for a UPS by assuming that the client has experienced staff and spare parts on site and can repair the UPS in 20 minutes. In reality, a service engineer has to be called out, spare parts have to be waited for, and the unit has to be tested before it goes back into service, which usually takes a day or more. With an assumed MTBF of 100,000 hours (just under 12 years) and an MTTR of anywhere from 20 minutes to 12 hours, one can produce almost any result one wants.
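
To see how elastic the quoted figure really is, here is a small sketch using the numbers above (a 100,000-hour MTBF with the MTTR swung between 20 minutes and 12 hours; the 4-hour point is an added illustrative midpoint):

```python
MTBF = 100_000  # hours, as assumed in the example above

for mttr_hours in (20 / 60, 4, 12):  # 20 minutes, 4 hours, 12 hours
    a = MTBF / (MTBF + mttr_hours) * 100
    print(f"MTTR {mttr_hours:>5.2f} h -> availability {a:.5f}%")

# MTTR  0.33 h -> availability 99.99967%  (comfortably "five nines")
# MTTR  4.00 h -> availability 99.99600%
# MTTR 12.00 h -> availability 99.98801%
```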

The second problem is combining the number of failure events (the sum of multiple MTTRs) with MTBF. An old, now-obsolete Uptime Institute white paper attempted to correlate availability with the four Tier levels, but did not define the measurement period. This led to the odd situation where a Tier I facility could be allowed 53 minutes of offline time per year, while the highest Tier IV facility could allow only 5.3 minutes. That is strange, because if a failure of that length occurs once a year, it is a disaster for any data center, whether Tier I or Tier IV.
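
The 53-minute and 5.3-minute figures appear to come from simply converting 99.99% and 99.999% availability into minutes per year; the sketch below shows the conversion (the mapping of those percentages to Tier levels follows the article's description of the obsolete white paper, not the current standard):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for pct in (99.99, 99.999):
    downtime_min = (1 - pct / 100) * MINUTES_PER_YEAR
    print(f"{pct}% availability -> {downtime_min:.1f} minutes offline per year")

# 99.99%  -> ~52.6 minutes per year
# 99.999% -> ~5.3 minutes per year
```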

In any case, don't focus only on a single number; think about how events combine. This especially matters when there are many very short failures. The simplest illustration is the human heartbeat. A heart that is 99.9% "available" sounds fine, but there are 31,536,000 seconds in a year, and the missing 0.1% amounts to roughly 30,000 heartbeats. If they are lost in one long stretch, the result is life-threatening; if they are spread evenly across the year, the person may merely feel unwell. In data center terms, consider the voltage delivered to the load at the power input. Many modern servers cannot ride through a 10 ms power interruption (some not even 6 ms), yet even a power system with 99.9999999% availability could still deliver three 10 ms interruptions per year.
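
A sketch of the same arithmetic applied to very short interruptions (the 10 ms ride-through figure and the "nine nines" availability are the ones quoted above):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

pct = 99.9999999                       # "nine nines", as in the example above
budget_s = (1 - pct / 100) * SECONDS_PER_YEAR
outage_s = 0.010                       # a 10 ms interruption many servers cannot ride through

print(f"Annual downtime budget: {budget_s * 1000:.1f} ms")
print(f"10 ms interruptions that still fit: {budget_s // outage_s:.0f}")

# Annual downtime budget: ~31.5 ms -> about three 10 ms interruptions per year
```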

So what should be done? Availability is just a metric, and there is nothing wrong with it as long as it is expressed clearly. For example, "99.99% availability measured over 10 years, with a single failure lasting no more than 10 hours" is a clear statement of MTBF (10 years) and MTTR (10 hours). Anyone who runs the numbers will find the availability works out to 99.98859%. But this brings out the point that MTBF matters more than availability, since MTBF is needed to calculate availability in the first place, and "single failure" avoids the problem of summing multiple events.
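
Anyone who wants to check that figure can reproduce it directly from the definition (taking 10 years as 87,600 hours):

```python
MTBF = 10 * 365 * 24  # 87,600 hours: the 10-year observation window
MTTR = 10             # a single failure lasting no more than 10 hours

availability = MTBF / (MTBF + MTTR) * 100
print(f"{availability:.5f}%")  # 99.98859%
```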

Of course, the ultimate "failure" of a resilient data center may be the easiest to achieve: not by hacking the UPS through the Internet, but by human factors or failures turning off the power, raising the server inlet temperature and causing it to crash.

Resilience is critical to managing data center infrastructure and preventing downtime, and even the best designs and operations can fail. Data center engineers therefore design and test to meet the needs of owners and operators, reducing the fear of downtime, improving how staff manage and maintain the facility, and increasing confidence in availability.
