Demystifying the resilient data center


When it comes to data centers, "resilience" can be defined as "the ability to maintain ICT services in the face of environmental extremes as well as human error or vandalism", and higher levels of resilience can be designed into the mechanical and electrical infrastructure, but only at a premium.

The Uptime Institute's data center rating standard is a widely used way to measure the resilience of data center infrastructure. However, research shows that "human error" is the main cause of data center outages, accounting for at least 70% of them. Even so, reliability can be improved through redundant design: a dual-bus power system with a UPS on each bus can largely protect dual-corded loads from power failures, human error, and malicious damage. But even then, caution is still required.

Numbers mislead users

Of course, data center users want higher reliability and availability, and value for money. So how should the availability of a data center be understood? Two somewhat interrelated "indicators" are commonly used:

  • "Type" of "Uptime Institute (I-IV)" or "TIA-942" (I-IV), "Rating" of BICSI, and "Availability Class" of EN50600
  • Availability percentage, such as 99.999% (the so-called "five nines")

Setting aside the facts that only the Uptime Institute can award a Tier rating, that TIA-942 and BICSI are ANSI standards most applicable to North America, and that EN 50600 is not yet widely used, all of these standards describe four levels of capability built around "concurrent maintainability" and "fault tolerance". The principles are clear. Concurrent maintainability answers the question: what is the point of building a very reliable (and possibly resilient) data center that has to be shut down once a year for maintenance? A fault-tolerant system, meanwhile, can have any component, path or space "fail" without affecting ICT services.


However, the most abused metric is the availability percentage. It is easy to calculate, but it can fool non-specialist buyers and users and lead to misunderstanding. In fact, only two numbers are needed to express availability clearly: MTBF (mean time between failures, in hours) and MTTR (mean time to repair, in hours). Divide MTBF by the total time (MTBF + MTTR), multiply by 100%, and you have the true availability.
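
As a minimal sketch of that arithmetic (the figures are illustrative, not taken from any particular vendor or standard):

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), expressed as a percentage."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# Illustrative only: a component that fails on average every 50,000 hours
# and takes 8 hours to repair.
print(f"{availability(50_000, 8):.4f}%")  # ~99.9840%
```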

Therefore, a very long MTBF combined with a very short MTTR can produce a very high availability. Unfortunately, MTBF and MTTR are numbers that a marketing department can bend to suit its interpretation. For example, a vendor can quote 99.999% availability for a UPS by assuming that the client has experienced staff and spare parts on site and can repair the UPS in 20 minutes. In reality, a service engineer has to be called out, spare parts have to be waited for, and the unit has to be tested before it goes back into service, which usually takes a day or more. With an assumed MTBF of 100,000 hours (just under 12 years) and an MTTR of anywhere from 20 minutes to 12 hours, one can produce almost any result one wants.
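
To see how elastic the quoted figure really is, here is a small sketch using the numbers above (a 100,000-hour MTBF with the MTTR swung between 20 minutes and 12 hours; the 4-hour point is an added illustrative midpoint):

```python
MTBF = 100_000  # hours, as assumed in the example above

for mttr_hours in (20 / 60, 4, 12):  # 20 minutes, 4 hours, 12 hours
    a = MTBF / (MTBF + mttr_hours) * 100
    print(f"MTTR {mttr_hours:>5.2f} h -> availability {a:.5f}%")

# MTTR  0.33 h -> availability 99.99967%  (comfortably "five nines")
# MTTR  4.00 h -> availability 99.99600%
# MTTR 12.00 h -> availability 99.98801%
```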

The second problem is combining the number of failure events (the sum of multiple MTTRs) with MTBF. An old, now-obsolete Uptime Institute white paper attempted to correlate availability with the four Tier levels, but did not define the measurement period. This led to the odd situation where a Tier I facility could be allowed 53 minutes of offline time per year, while the highest Tier IV facility could allow only 5.3 minutes. That is strange, because if a failure of that length occurs once a year, it is a disaster for any data center, whether Tier I or Tier IV.
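
The 53-minute and 5.3-minute figures appear to come from simply converting 99.99% and 99.999% availability into minutes per year; the sketch below shows the conversion (the mapping of those percentages to Tier levels follows the article's description of the obsolete white paper, not the current standard):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for pct in (99.99, 99.999):
    downtime_min = (1 - pct / 100) * MINUTES_PER_YEAR
    print(f"{pct}% availability -> {downtime_min:.1f} minutes offline per year")

# 99.99%  -> ~52.6 minutes per year
# 99.999% -> ~5.3 minutes per year
```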

In any case, don't focus only on a single number; think about how events combine. This especially matters when there are many very short failures. The simplest illustration is the human heartbeat. A heart that is 99.9% "available" sounds fine, but there are 31,536,000 seconds in a year, and the missing 0.1% amounts to roughly 30,000 heartbeats. If they are lost in one long stretch, the result is life-threatening; if they are spread evenly across the year, the person may merely feel unwell. In data center terms, consider the voltage delivered to the load at the power input. Many modern servers cannot ride through a 10 ms power interruption (some not even 6 ms), yet even a power system with 99.9999999% availability could still deliver three 10 ms interruptions per year.
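
A sketch of the same arithmetic applied to very short interruptions (the 10 ms ride-through figure and the "nine nines" availability are the ones quoted above):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

pct = 99.9999999                       # "nine nines", as in the example above
budget_s = (1 - pct / 100) * SECONDS_PER_YEAR
outage_s = 0.010                       # a 10 ms interruption many servers cannot ride through

print(f"Annual downtime budget: {budget_s * 1000:.1f} ms")
print(f"10 ms interruptions that still fit: {budget_s // outage_s:.0f}")

# Annual downtime budget: ~31.5 ms -> about three 10 ms interruptions per year
```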

So what should be done? Availability is just a metric, and there is nothing wrong with it as long as it is expressed clearly. For example, "99.99% availability measured over 10 years, with a single failure lasting no more than 10 hours" is a clear statement of MTBF (10 years) and MTTR (10 hours). Anyone who runs the numbers will find the availability works out to 99.98859%. But this brings out the point that MTBF matters more than availability, since MTBF is needed to calculate availability in the first place, and "single failure" avoids the problem of summing multiple events.
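
Anyone who wants to check that figure can reproduce it directly from the definition (taking 10 years as 87,600 hours):

```python
MTBF = 10 * 365 * 24  # 87,600 hours: the 10-year observation window
MTTR = 10             # a single failure lasting no more than 10 hours

availability = MTBF / (MTBF + MTTR) * 100
print(f"{availability:.5f}%")  # 99.98859%
```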

Of course, the ultimate "failure" of a resilient data center may be the easiest to achieve: not by hacking the UPS through the Internet, but by human factors or failures turning off the power, raising the server inlet temperature and causing it to crash.

Resilience is critical to managing data center infrastructure and preventing downtime, and even the best designs and operations can fail. Data center engineers therefore design and test to meet the needs of owners and operators, reducing the fear of downtime, improving how staff manage and maintain the facility, and increasing confidence in availability.
