Application and development of machine learning tools in data centers

In the early days of the Internet, data centers were small and simple: even a large e-commerce service needed only a few 19-inch racks for its servers, storage, and network equipment. Today, hyperscale data centers deploy tens of thousands of hardware devices across thousands of racks, sited either near large population centers or in remote areas where electricity is cheap.

As data center operations become more automated, public cloud providers such as AWS and Microsoft Azure employ fewer and fewer senior data center engineers, often fewer than their security staff and general technical workers. With fewer people managing more servers, monitoring power and cooling infrastructure relies increasingly on sensors, now commonly described as IoT hardware. These sensors help identify problems to a certain extent, but in many cases they cannot replace an experienced facility engineer, who can tell from the sound of a machine which fan is about to fail, or locate a leak by the sound of dripping water.

A rack of servers powered by Google's custom Tensor Processing Units (TPUs) for machine learning.

Data center managers need more sensors to monitor modern infrastructure, and a new generation of applications aims to fill this gap by applying machine learning to IoT sensor networks. The idea is to encode human experience as rules that help the system distinguish sounds and images, adding an automated management layer that can predict and prevent failures in data center infrastructure. "Faster recovery times and more efficient capacity provisioning can also reduce data center risk," said Rhonda Ascierto, an analyst at 451 Research.

Combining DCIM and diverse data

The first step is to leverage predictive analytics in data center infrastructure management, or DCIM, software. Take, for example, software from a company called Vigilent, based in Oakland, California. “The control system is based on machine learning software that determines relationships between variables, such as rack temperature, cooling unit settings, cooling capacity, cooling redundancy, power consumption, and risk of failure. It regulates the cooling units by turning the units on and off, including variable frequency drives (VFDs), adjusting the VFDs up and down, and adjusting the temperature setpoints of the units,” Ascierto said. It uses wireless temperature sensors and predicts what will happen if the operator takes certain actions, such as shutting down the cooling unit or increasing the setpoint temperature.
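The "what-if" prediction described above can be sketched as a simple regression over historical operating data. This is a minimal illustration, not Vigilent's actual algorithm; all variable names, data values, and the linear-model choice are assumptions for the sake of the example.

```python
# Hedged sketch: a minimal "what-if" model in the spirit of DCIM predictive
# analytics. The data and the linear model are hypothetical illustrations,
# not Vigilent's actual system.
import numpy as np

# Historical observations: [cooling setpoint (deg C), active cooling units]
# paired with the resulting rack inlet temperature (deg C).
X = np.array([[18.0, 4], [18.0, 3], [20.0, 4], [20.0, 3],
              [22.0, 4], [22.0, 3], [24.0, 4], [24.0, 3]])
y = np.array([24.1, 25.3, 25.9, 27.2, 27.8, 29.1, 29.9, 31.0])

# Fit rack_temp ~ w1*setpoint + w2*units + b via least squares.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_rack_temp(setpoint, units):
    """Predict rack inlet temperature for a proposed operator action."""
    return float(np.array([setpoint, units, 1.0]) @ coef)

# "What happens if we raise the setpoint to 23 deg C and shut one unit down?"
baseline = predict_rack_temp(20.0, 4)
proposed = predict_rack_temp(23.0, 3)
print(f"baseline {baseline:.1f} C -> proposed {proposed:.1f} C")
```

A production system would use far richer features (cooling capacity, redundancy, power draw) and a model that captures nonlinear airflow effects, but the operator-facing question is the same: predict the consequence of an action before taking it.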

Another example comes from Oneserve, a UK company based in Exeter, whose Infinite product combines sensor readings with other data points, such as weather conditions, to provide what it calls "predictive field service management." The aim is to predict maintenance requirements and avoid or minimize downtime. Chris Proctor, CEO of Oneserve, said that applying these technologies lets strategic planning and procurement be handled together: "Data centers will be able to manage assets and resources more accurately and efficiently." (This capability is not yet available in any data center.)

Oneserve focuses on maintenance issues, tracking past problems and letting users record exactly where each one occurred. Today this is still a time-consuming, labor-intensive manual process, but in the future staff will use this data to train machine learning systems.

Mining human knowledge

An example of combining sensor data with operational experience is LitBit, based in San Jose. Its founder and CEO, Scott Noteboom, previously led data center strategy at Yahoo and Apple. LitBit's data center artificial intelligence, called DAC, lets operators train and tune machines that learn from staff and gain the ability to respond to events in the data center, alerting operators or eventually taking action automatically. The key to LitBit's approach is a form of assisted learning: when the system detects a new, anomalous event, it alerts an operator, who then defines a rule for responding to such events in the future. To collect training data, LitBit offers a mobile application that accepts video and converts it into thousands of images.
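The assisted-learning loop, alert a human on a new event, capture their response as a rule, then act automatically next time, can be sketched as follows. Event names, the rule store, and the escalation callback are all hypothetical; this is not LitBit's DAC API.

```python
# Hedged sketch of an assisted-learning loop: flag unseen anomalies, ask a
# human for a response rule, then apply the rule automatically next time.
# Event signatures and actions are hypothetical, not LitBit's actual system.
from dataclasses import dataclass, field

@dataclass
class AssistedResponder:
    rules: dict = field(default_factory=dict)  # event signature -> action

    def handle(self, signature: str, ask_operator) -> str:
        if signature in self.rules:
            return self.rules[signature]        # known event: act automatically
        action = ask_operator(signature)        # unknown event: escalate to a human
        self.rules[signature] = action          # remember the rule for next time
        return f"ALERT+LEARNED: {action}"

responder = AssistedResponder()

# First occurrence: the operator is consulted and a rule is recorded.
first = responder.handle("fan_bearing_whine", lambda sig: "schedule fan swap")
# Second occurrence: the stored rule fires without human involvement.
second = responder.handle("fan_bearing_whine", lambda sig: "unused")
print(first)   # ALERT+LEARNED: schedule fan swap
print(second)  # schedule fan swap
```

The design choice worth noting is that the human stays in the loop only for novel events; everything previously seen is handled by the accumulated rule set, which is how a small operations team can cover a large fleet.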

The startup offers a managed cloud service that can leverage anonymized data from many users to build more complex and accurate models. Some customers will keep their training models confidential, while others may sell them as an additional revenue source. As Ascierto noted, "The value of data center management data is multiplied when it is aggregated and analyzed at scale. By applying algorithms to large data sets aggregated from many customers, including different types of data centers and different locations, providers can predict when equipment will fail and when cooling thresholds will be reached."

When knowledgeable, experienced operators are not on hand, this captured tacit knowledge helps the system identify operational problems and respond faster. Data center AI may never fully replace staff, but it can steadily augment their skills and help them solve problems.

This field is still immature, but it is developing rapidly, and machine learning on sensor data is already being applied across industries. Microsoft Research, for example, has worked with Sierra Systems on machine learning-based audio analysis of oil and gas pipeline defects, using its Cognitive Toolkit to help classify the anomalies detected.
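The core idea behind sound-based fault detection can be shown with a toy frequency-domain classifier. This is a deliberately simplified stand-in for the spectrogram-based deep learning pipelines mentioned above (not the actual Microsoft/Sierra Systems implementation); the sample rate, frequencies, and threshold are all assumed values.

```python
# Hedged sketch: classify equipment sounds by their dominant frequency,
# a toy stand-in for real spectrogram-based audio anomaly classification.
import numpy as np

RATE = 8000  # samples per second (assumed)

def tone(freq_hz, seconds=1.0, noise=0.05, seed=0):
    """Synthesize a noisy tone, standing in for a microphone capture."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(RATE * seconds)) / RATE
    return np.sin(2 * np.pi * freq_hz * t) + noise * rng.standard_normal(len(t))

def dominant_freq(signal):
    """Return the strongest frequency component via an FFT."""
    spectrum = np.abs(np.fft.rfft(signal))
    return np.fft.rfftfreq(len(signal), 1 / RATE)[np.argmax(spectrum)]

# Toy premise: a healthy fan hums near 120 Hz; a worn bearing whines near 900 Hz.
def classify(signal, whine_threshold_hz=500.0):
    return "faulty" if dominant_freq(signal) > whine_threshold_hz else "normal"

print(classify(tone(120)))   # normal
print(classify(tone(900)))   # faulty
```

Real systems learn the normal/faulty boundary from labeled recordings rather than a hand-set threshold, but the pipeline shape, capture audio, transform to the frequency domain, classify, is the same.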

AI-based data center management services are emerging technologies, still under development and requiring a great deal of training. Ascierto also points out that your DCIM software may need more sensors: "If you want to use AI for end-to-end chiller-to-rack decisions, your equipment will need acoustic and vibration sensors installed, as well as environmental sensors and electrical instrumentation. If the goal is to optimize and automate the cooling units' setpoint temperatures, you may need multiple environmental sensors per rack (top, middle, bottom)."

It will take time for AI systems to become fully operational, much like onboarding new data center staff, but these machine learning tools will eventually help you run your data center.
