Demystifying the elastic data center

In the data center context, "resilience" can be defined as "the ability to maintain ICT services in the face of environmental extremes as well as human error or vandalism". Higher levels of resilience can be designed into the mechanical and electrical infrastructure, but at a cost premium.

The Uptime Institute's Tier rating is a widely used method of measuring the resilience of data center infrastructure. According to research, however, human error is the main cause of data center outages, accounting for at least 70%. Even so, redundant design improves reliability: a dual-bus power system with a UPS on each bus largely protects dual-corded loads from power failures, human error, and malicious damage. It is still no substitute for operational caution.

Numbers mislead users

Data center users naturally want high reliability and availability, and value for money. So how should the availability of a data center be understood? It is usually expressed through two somewhat interrelated indicators:

  • A class rating: the Uptime Institute "Tiers" (I-IV), TIA-942 ratings (1-4), BICSI "Classes", or the EN 50600 "Availability Classes"
  • An availability percentage, such as 99.999% (the so-called "five nines")

Setting aside the facts that only the Uptime Institute can award a Tier certification, that TIA-942 and BICSI are ANSI standards most applicable in North America, and that EN 50600 is not yet in widespread use, all of these standards can be summarized as describing four levels of capability in terms of "concurrent maintainability" and "fault tolerance". The principles are clear. Concurrent maintainability answers the question: what is the point of building a very reliable (and possibly resilient) data center that has to be shut down once a year for maintenance? Fault tolerance means that any single component, path, or space can "fail" without affecting ICT services.


The most abused indicator, however, is the availability percentage: it is easy to calculate, but it can fool non-specialist buyers and users and cause misunderstandings. In fact, only two numbers are needed to express availability clearly: MTBF (mean time between failures, in hours) and MTTR (mean time to repair, in hours). Divide MTBF by the total time (MTBF + MTTR) and multiply by 100%, and the result is the true availability.
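The formula is a one-liner. As a minimal sketch (Python is used here purely for illustration, and the 100,000-hour MTBF is just an example value), it also shows how demanding "five nines" really is: at that MTBF, every repair must be completed within about an hour.

```python
def availability_pct(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), expressed as a percentage."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

# With an MTBF of 100,000 hours, "five nines" already requires
# every repair to finish in roughly one hour:
print(round(availability_pct(100_000, 1), 3))  # 99.999
```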

A very long MTBF combined with a very short MTTR therefore yields very high availability. Unfortunately, MTBF and MTTR are numbers that marketing departments can bend to suit their interpretation. For example, a company can quote 99.999% availability for a UPS by assuming that the client has experienced staff and spare parts on site and can repair the UPS in 20 minutes. In reality, a service engineer must be called out, a spare part awaited, and the unit tested before it goes back into service, which usually takes a day or more. With an assumed MTBF of 100,000 hours (just under 12 years), choosing an MTTR anywhere between 20 minutes and 12 hours can produce almost any availability figure one wants.
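How sensitive the headline number is to the repair-time assumption can be seen directly, using the values from the example above (a sketch, not vendor data):

```python
def availability_pct(mtbf_hours: float, mttr_hours: float) -> float:
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100

MTBF = 100_000  # hours, just under 12 years

# Vendor assumption: trained staff and spares on site, 20-minute repair.
optimistic = availability_pct(MTBF, 20 / 60)
# Field reality: call-out, wait for a spare, retest -- 12 hours or more.
realistic = availability_pct(MTBF, 12)

print(f"{optimistic:.5f}%")  # 99.99967%
print(f"{realistic:.5f}%")   # 99.98800%
```

Same hardware, same MTBF; only the repair assumption changed, and nearly an order of magnitude of "downtime" appeared or vanished.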

The second problem is combining the number of failure events (the sum of multiple MTTRs) with MTBF. An old Uptime Institute white paper (now withdrawn) attempted to correlate availability with the four Tier levels, but did not define the measurement period. This led to the strange result that a Tier I facility could allow 53 minutes of downtime per year while the highest Tier IV facility could allow only 5.3 minutes. Strange indeed, yet a failure of either duration occurring once a year would be a disaster for any data center, Tier I through Tier IV.

In any case, don't focus only on the total; think about how it is composed, especially when it comprises many very short failures. The simplest illustration is the human heartbeat. A heart that is 99.9% "available" sounds healthy, but there are 31,536,000 seconds in a year, and 0.1% of that is more than 31,000 seconds. At roughly one beat per second, that is over 31,000 missed heartbeats in a year: one long gap is life-threatening, but the same number spread evenly through the year might merely feel uncomfortable. In data center terms, consider the voltage delivered to the load. Many modern servers cannot ride through a 10 ms power interruption, yet a power system with 99.9999999% ("nine nines") availability could still deliver three 10 ms failures per year.
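The arithmetic behind that "nine nines" claim is easy to check: convert an availability percentage into an annual downtime budget and see how many 10 ms dropouts fit inside it (a sketch for illustration only):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def downtime_budget_ms(availability_pct: float) -> float:
    """Milliseconds of downtime per year permitted at a given availability."""
    return (1 - availability_pct / 100) * SECONDS_PER_YEAR * 1000

# "Nine nines" still leaves about 31.5 ms of downtime a year --
# room for three 10 ms dropouts, each fatal to a server that
# cannot ride through a 10 ms interruption.
budget = downtime_budget_ms(99.9999999)
print(round(budget, 1))   # 31.5
print(int(budget // 10))  # 3
```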

So what to do? Availability is just a metric, and there is nothing wrong with it as long as it is clearly stated. For example, "99.99% availability measured over 10 years, with a single failure lasting no more than 10 hours" is a clear statement of MTBF (10 years) and MTTR (10 hours). Anyone who works it through will find the availability comes out at 99.98859%. The real point may be that MTBF matters more than availability, since MTBF is needed to calculate availability in the first place; and specifying "a single failure" avoids the problem of summing multiple events.
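That worked example is straightforward to verify from the two stated numbers:

```python
HOURS_PER_YEAR = 8760

mtbf = 10 * HOURS_PER_YEAR  # one failure in ten years = 87,600 hours
mttr = 10                   # a single failure of at most 10 hours

availability = mtbf / (mtbf + mttr) * 100
print(f"{availability:.5f}%")  # 99.98859%
```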

Of course, the ultimate "failure" of a resilient data center may be the easiest to bring about: not by hacking the UPS over the Internet, but by human error or equipment faults cutting the power, raising the server inlet temperature, and crashing the machines.

Resilience is critical to data center infrastructure management and to preventing downtime, yet even the best designs and operations can fail. Data center engineers therefore design and test to meet operators' needs, reduce the fear of downtime, help staff manage and maintain the facility, and build confidence in availability.
