Today, the operations and maintenance (O&M) of critical data center facilities is considered as important as the engineering and design of these complex sites. As the robustness of critical infrastructure increases through improved fault tolerance and concurrent maintenance capabilities, so does its complexity, and the importance of establishing strong O&M practices to manage data center facilities is becoming increasingly apparent. Studies have shown that 60% or more of the "disruptive events" that affect critical missions are related to human activity. This activity includes routine switching and reconfiguration of critical systems, maintenance tasks, and, of course, human error. The staff and processes needed to support the ongoing operation of a data center must be in place on the first day of operation and must remain in place until the last day of critical business operations. Efforts to establish these processes must therefore begin before the facility goes live, starting with the site planning and requirements definition phase.
Data Center Design Considerations

Improving the availability of critical data center facilities often requires complex redundancy schemes, such as 2N, 2(N+1), or even 2(N+1)/3 configurations. Sufficient redundancy is required to support uninterrupted operations even if critical equipment or systems fail. But if the infrastructure lacks adequate means to isolate the failed equipment, so that it cannot be accessed, repaired, or replaced while operations continue, an outage will still occur. The requirements for maintaining critical operations throughout the life of the facility must therefore be built into the design and construction before operations begin. This is known as design for maintainability.

Construction, Start-up, and Commissioning

A data center facility that has been thoroughly planned and designed will not necessarily be built exactly as designed. The construction process requires strict supervision and quality control, including frequent on-site progress inspections. In addition, qualified technicians must perform comprehensive start-up and formal acceptance testing before the equipment can be certified as ready for critical operations. This process is called commissioning. It also includes ensuring that the project is properly staffed, that staff receive site-specific training, and that accurate on-site documentation is available. Formal commissioning begins during the design phase (if not earlier) with reviews of constructability and maintainability, and it ensures that the design intent (as captured in the design documentation) meets the owner's requirements and expectations for equipment performance.
Commissioning also includes several levels of testing and verification: factory acceptance testing, shipping and receiving requirements, on-site progress inspections, functional and functional performance testing, and finally integrated system testing. On-site O&M personnel should be involved in the commissioning process throughout construction, start-up, and acceptance testing. This gives them valuable, and sometimes unique, opportunities to take part in activities that teach them what they will be responsible for once critical operations begin. There is no better opportunity for hands-on training and for gaining a deep understanding of the nuances of a specific site.

Operation and Maintenance Personnel and Organization

The staffing of critical facilities deserves the same foresight, consideration, and attention as any other aspect of the project. O&M staff should be identified, organized, and trained before the site goes live. Important questions include: What skills are needed to operate and maintain the site? To whom should the O&M department report? Which tasks will staff perform, which will be outsourced, and under what service level agreements? One of the first questions should be: "Will the O&M organization serve only critical infrastructure, or will it cover all critical and non-critical O&M activities?" Ideally, separate staff are assigned to critical and to non-critical infrastructure. Continuous operations demand constant vigilance and focus on the systems that must run 24/7.
Non-critical incidents may seem urgent, especially when they occur in a highly visible location, but they can distract staff who should be fully focused on critical operations. Likewise, the critical O&M budget should not have to compete for scarce resources with expenses such as office supplies, landscaping, and other necessary costs.

Operation and Maintenance Process

Operating and maintaining critical facilities takes more than a set of procedures. It is a strategy that should include clear goals and objectives, clear roles and responsibilities, an organization focused on ongoing operations, and adequate resources to achieve the objectives. When is the data center most vulnerable? Is it at night and on weekends, when contractors, suppliers, and parts are hard to reach? Or is it during the work week, when an outage would have the greatest impact? The answer depends on the mission of the data center. If the data center truly supports more valuable business activities during normal working hours, the work week may be the answer. On the other hand, if the data center has a true 24/7 mission, then 9 am on Monday is no more important than 9 pm on Saturday.

The answers to these questions lead to further questions. For example, where will operators store critical spare parts? Will the parts require environmental conditioning or routine maintenance? Will the data center need specialists to manage complex monitoring and control systems, and what will operating those systems require? Which spare parts will be considered critical and kept on-site? What tools, equipment, and inventory will be required? Will a computerized maintenance management system be used, and if so, who will build and configure it?

Maintenance programs for data center facilities also vary widely, with critical facilities tending toward the high end. Most data center facilities have some level of planned maintenance.
Routine tasks performed at fixed time intervals or frequencies are called preventive maintenance. For example, a particular piece of equipment may have monthly inspections, semi-annual conveyor inspections and adjustments, filter changes every six months, and annual internal cleaning, calibration checks, and sensor calibrations. The disadvantage is that the tasks are performed regardless of actual operating conditions. The schedule can be refined based on actual equipment operating hours, but that still does not account for actual operating conditions.

One improvement is condition-based monitoring, in which maintenance is performed based on actual operating conditions. A simple example is using a differential pressure sensor to monitor a filter. As the filter loads, the differential pressure (delta-P) across it rises, and the filter is replaced once the delta-P reaches a set limit. When these condition monitoring technologies are used and the data is trended, operators can predict in advance when maintenance will be required. This is known as predictive maintenance. Thresholds can be assigned to alert and alarm conditions, and trend analysis makes it possible to predict when a threshold will be exceeded or even when a failure will occur. Condition monitoring technologies include vibration analysis, tribology (lubricant analysis), and infrared thermal scanning. These technologies reveal the health of equipment while it is online, without downtime or interruption for maintenance.

Conclusion

All aspects of facility operation and maintenance must be considered early in the development of site requirements. Otherwise, the opportunity to embed necessary O&M requirements into the design and construction of the facility may be lost.
Given the enormous capital investment required to design, build, and bring a critical facility online today, and given the importance of the missions these facilities support, the staff, programs, and resources entrusted with operating and maintaining a data center over its intended lifespan are of great significance.
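As an illustration of the predictive maintenance idea discussed in the maintenance section (trending a filter's differential pressure and predicting when it will cross a threshold), here is a minimal sketch. It is not drawn from any particular monitoring product. The function name `forecast_threshold_crossing`, the `TrendForecast` type, the pascal units, the sample readings, and the 250 Pa limit are all hypothetical, and a simple least-squares straight line is assumed as the trend model. Real filter loading may not be linear.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TrendForecast:
    """Result of trending delta-P readings (hypothetical helper type)."""
    slope_pa_per_day: float             # fitted rate of change of delta-P
    days_to_threshold: Optional[float]  # None if the trend is flat or falling


def forecast_threshold_crossing(
    days: Sequence[float],
    delta_p_pa: Sequence[float],
    threshold_pa: float,
) -> TrendForecast:
    """Fit a least-squares line to (day, delta-P) readings and estimate how
    many days after the last reading the fitted line reaches the threshold.

    Assumption: loading is roughly linear over the trended window.
    """
    if len(days) != len(delta_p_pa) or len(days) < 2:
        raise ValueError("need at least two paired readings")

    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(delta_p_pa) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("readings must be taken at different times")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, delta_p_pa))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    if slope <= 0:
        # Delta-P is not rising, so the line never reaches the threshold.
        return TrendForecast(slope, None)

    crossing_day = (threshold_pa - intercept) / slope
    # Clamp to zero when the fitted line is already at or above the threshold.
    return TrendForecast(slope, max(0.0, crossing_day - days[-1]))


if __name__ == "__main__":
    # Hypothetical weekly readings from one filter's differential pressure sensor.
    reading_days = [0, 7, 14, 21]
    readings_pa = [100.0, 114.0, 128.0, 142.0]
    forecast = forecast_threshold_crossing(reading_days, readings_pa, 250.0)
    print(f"slope: {forecast.slope_pa_per_day:.2f} Pa/day")
    print(f"days until threshold: {forecast.days_to_threshold}")
```

With these sample readings the fitted slope is 2 Pa per day, so the 250 Pa limit would be reached about 54 days after the last reading. In practice the same pattern could feed separate alert and alarm thresholds, with the forecast used to schedule the filter change ahead of time.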