What are the main measures and methods to deal with data center downtime?

In theory, data centers are designed not to fail, but failures do happen, and when they do, data center operators, especially colocation providers, face serious consequences.

Recent events show how serious the consequences of power outages and business interruptions in colocation data centers can be. For example, British Telecom, one of the world's largest communications and colocation data center providers, has experienced two outages in its data centers this year. According to reports, voice and data traffic in and around London fell by 10% as a result, and the incident lasted more than four hours.

Although data centers are designed and operated to avoid outages and incidents, colocation facilities are not immune to them, and unexpected outages, whether short or long, can be costly. The provider may face financial penalties for failing to meet service level agreements (SLAs), and if customers choose to leave, the outage can also cause long-term damage to the provider's brand and revenue.

From the data center's perspective, it is easy to say what should or should not be done to prevent outages. If you are the data owner and your data center solution fails, however, the picture looks very different. If you have made a strategic decision to place your data in an external data center and have done a risk analysis, are you really prepared for the worst? What should you do if you find yourself in this situation?

The best way to prepare for the worst-case scenario is to keep addressing the possibility. If a failure does occur, the organization's preparation and familiarity with the process will give it the resources and tools to limit the damage. If your company has not yet considered this, or has not prepared for it, evaluate your situation from the following aspects.

1. Diversify your risk

First, when you develop your data center strategy, avoid putting all your data in one place, as this increases the risk factor. Likewise, avoid putting all your critical applications in the same place. Consider putting your primary data in one place and your backup data in another. Then walk through each scenario and determine what the impact of any level of failure would be. Repeat this process once a year.
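The scenario walk-through described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the application names, site names, and the `impact_of_site_failure` function are hypothetical and would be replaced by your own inventory.

```python
from typing import Dict, List

# Hypothetical inventory: where each application's primary and backup copies live.
PLACEMENT: Dict[str, Dict[str, str]] = {
    "billing": {"primary": "site-a", "backup": "site-b"},
    "email": {"primary": "site-a", "backup": "site-a"},  # both copies in one place
    "web": {"primary": "site-b", "backup": "site-c"},
}


def impact_of_site_failure(
    placement: Dict[str, Dict[str, str]], failed_site: str
) -> Dict[str, List[str]]:
    """Classify each application by what it loses if failed_site goes down."""
    report: Dict[str, List[str]] = {
        "unaffected": [],   # neither copy is at the failed site
        "failover": [],     # primary lost, backup still available
        "backup_lost": [],  # primary fine, but no backup until the site returns
        "total_loss": [],   # primary and backup both at the failed site
    }
    for app, sites in sorted(placement.items()):
        primary_down = sites["primary"] == failed_site
        backup_down = sites["backup"] == failed_site
        if primary_down and backup_down:
            report["total_loss"].append(app)
        elif primary_down:
            report["failover"].append(app)
        elif backup_down:
            report["backup_lost"].append(app)
        else:
            report["unaffected"].append(app)
    return report


if __name__ == "__main__":
    all_sites = sorted({s for sites in PLACEMENT.values() for s in sites.values()})
    for site in all_sites:
        print(site, impact_of_site_failure(PLACEMENT, site))
```

Any application that lands in the `total_loss` bucket for some site has its primary and backup data in the same place, which is exactly the concentration of risk this step is meant to remove.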

2. Trust but verify

Get audit records from your service provider and, more importantly, review them carefully. In many cases, colocation data centers must be audited for compliance with regulations and standards such as HIPAA, SOX, and PCI. Sometimes, however, the review is done by someone who does not fully understand how IT or the data center operates. Companies should therefore arrange for the audit to be conducted by professionals who understand what reliable data center operation requires. Such third-party audits are usually much easier than trying to identify risks on your own, and they provide more information. In most cases, the cost of mitigating risk through audits and verification is small compared with the cost of a disruption.

3. Sign a written agreement

You need to know how your data center hosting provider will handle outages. When signing a contract with a vendor, insist on a written agreement in which both parties agree on what constitutes an outage. This is critical: data owners have sometimes found that the agreement does not cover what they had in mind. Also get written assurance of the service the vendor will provide during an outage and a commitment to restore service within an acceptable time.
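When negotiating what counts as "an acceptable time", it can help to translate an availability percentage into a downtime budget. The following is a minimal sketch of that arithmetic; the 99.9% and 99.99% targets and the function names are assumptions chosen for illustration, not figures from any provider's contract.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(
    availability_percent: float, period_hours: float = HOURS_PER_YEAR
) -> float:
    """Return the downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_hours * (1 - availability_percent / 100)


def meets_target(
    outage_hours: float,
    availability_percent: float,
    period_hours: float = HOURS_PER_YEAR,
) -> bool:
    """Check whether total outage time stays within the downtime budget."""
    return outage_hours <= allowed_downtime_hours(availability_percent, period_hours)


if __name__ == "__main__":
    for target in (99.9, 99.99):  # illustrative targets only
        budget = allowed_downtime_hours(target)
        print(f"{target}% per year allows about {budget:.2f} hours of downtime")
        print("  a 4-hour outage fits the budget:", meets_target(4, target))
```

Under these assumed targets, a four-hour outage fits within a 99.9% yearly budget (about 8.76 hours) but exceeds a 99.99% budget (under one hour), which is why the exact figure and the measurement period belong in the written agreement.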

4. Backup strategy

Companies need to understand the risks to their business and prepare for the worst. Most colocation data centers have an alternative site that can handle basic disaster recovery so that customers see little or no impact on operations. Most companies are still working toward deploying active-active databases across data centers (colocation, cloud, or on-premises). While some active-active deployments come close to succeeding, outages are painful when a company has to fall back on disaster recovery backups: the backup database is often less complete than the company expects, and data loss or application disruption is likely during failover.

5. Understand (and document) the process

When an incident occurs, all parties go into crisis mode. It is important to understand (and document) how your hosting provider handles events such as natural disasters and component failures: what steps are taken, and in what order? An important question is who has access during an outage. Other organizations may gain access to your servers after an incident. You need to know exactly whether they have access, who they are, and what actions they are allowed to perform. Also find out what additional security measures will be taken to protect your data during the repair period.

An important part of this process is the communication protocol. Open communication is essential for managing the situation effectively and keeping your management informed. You need to know who the primary contact is, whom to contact for updates, and how often updates will be issued. Verify the contact's name and phone number regularly. If a phone number on the call list is no longer in service, or the contact has left the company, a bad situation will only get worse.
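The regular verification of the call list can be made routine with a small check like the one below. This is a sketch under assumptions: the `Contact` record, the 90-day threshold, and the sample entries are hypothetical and not prescribed by any provider.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Contact:
    name: str
    phone: str
    last_verified: date  # last time someone confirmed this entry still works


def stale_contacts(
    contacts: List[Contact], today: date, max_age_days: int = 90
) -> List[Contact]:
    """Return contacts whose details were last verified before the cutoff date."""
    cutoff = today - timedelta(days=max_age_days)
    return [c for c in contacts if c.last_verified < cutoff]


if __name__ == "__main__":
    # Hypothetical call list; names and numbers are placeholders.
    call_list = [
        Contact("Primary on-call", "+00 0000 0001", date(2024, 5, 20)),
        Contact("Escalation manager", "+00 0000 0002", date(2024, 1, 15)),
    ]
    for contact in stale_contacts(call_list, today=date(2024, 6, 1)):
        print(f"Re-verify {contact.name} (last checked {contact.last_verified})")
```

Running a check like this on a schedule turns "verify regularly" into a concrete list of entries to call before an outage forces the issue.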

6. Keep records

Documentation matters not only for colocation data centers but for every company involved in the data center business. One survey found that many customers did not document their daily operational processes and procedures, and those that did rarely kept the documents up to date. Documentation is essential to being prepared for a disaster: it tells you where applications are running, which ones are most affected by an outage, who needs to be informed of changes, and so on.

7. Learn about failure cases

During the evaluation process, most hosting providers will tell you how their systems are built to prevent service interruptions. They will also give you testimonials and references from satisfied customers. What they usually will not tell you about is their failures.

Organizations therefore need to learn about a colocation provider's failures and ask whether it has had an incident in the past year. If so, find out the details of the incident, how it was corrected, and what steps were taken to prevent it from happening again. These cases reveal a great deal about a colocation data center and how it handles problems. How a partner handles a crisis is the real test of its competence.

8. Understand the disclaimer

In case you lose confidence in your hosting partner, be sure you understand the disclaimer clauses in the contract; this will help you end the partnership smoothly. Make sure the contract is not vague, and avoid being bound by unreasonable terms.

9. Know your options

Most colocation contracts run for several years, during which the colocation market will expand and new vendors will enter it. Even if you are not looking to move to a new colocation facility right now, you should keep evaluating other providers or review your options with an advisor or broker. If a failure occurs, you must know your options for moving to a new solution. In some cases, if the failure is severe or lasts too long, the consequences may force the colocation facility to cease operations, leaving your organization with lost business.

10. Become a data center expert

In the British Telecom case, the cause of the problem was a circuit breaker failure. One would expect critical facilities to be free of single points of failure, but the evidence suggests otherwise. Today, organizations that run data-driven businesses must become data center experts. They must not only understand how data centers work but also follow market trends.

By asking questions and reading reports, you can understand every aspect of your data center solution. Most importantly, you need to know the potential failure points and understand which situations may cause an outage. We all hope that an outage or failure will never occur, but if one does, you must be prepared for it and ready to guide your team. The best advice is to have a plan in place for these failures and to follow it step by step. Communication is critical to the plan's success: people may grow impatient when failures occur, but they must still follow the plan through. By regularly reviewing these areas, you will gain the knowledge and experience to respond effectively to outages and failures.
