According to a study by the Uptime Institute, up to one in ten cabinets operates at temperatures above the range recommended by equipment reliability guidelines. As data center cabinet power densities continue to rise, with average power densities reaching 5kW or more per cabinet, the share of cabinets suffering from hot spots is expected to keep growing and soon exceed this percentage. If hot spots are not eliminated, they pose a serious threat over time, not only endangering the reliability and performance of IT equipment, but also affecting the hardware manufacturer's warranty or maintenance agreement. Data center operation and maintenance personnel therefore need to take effective measures as soon as possible to avoid these risks.
1. What is a hot spot?
Many IT professionals check the temperature in the hot aisle, or in the wrong place in the cold aisle, and conclude that they have found a hot spot as soon as they see a high reading. They then take various countermeasures, but the results are often disappointing: instead of eliminating the hot spot, the measures create more hot spots. Understanding what hot spots are, what causes them, and how to identify them is critical to eradicating them.
(1) Definition of a hot spot
Not every high temperature measured at an arbitrary point in the data center is a hot spot. We define a hot spot as a location where the temperature at the air inlet of IT equipment is higher than the value recommended by ASHRAE TC 9.9. Generally, the top of the cabinet is the most likely location for a hot spot. The thermal guidelines of the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) give recommended and allowable ranges for the temperature at the server air inlet.
(2) The origin of hot spots
The cooling capacity installed in a data center is often greater than the cooling capacity required, especially when it is sized entirely from the "nameplate" ratings of the IT equipment. If so, why do hot spots still occur? The reason is that hot spots are caused not by insufficient cooling capacity or excessive heat load, but by cooling capacity that is not fully utilized. In other words, there is enough cooling capacity, but not enough of it reaches the areas that need cooling, and this is caused by poor airflow management. Figure 1 is an example of underutilized cooling capacity from a real case study by Schneider Electric.
The figure shows a typical traditional data center with room-level cooling, where the raised floor and the ceiling serve as the supply and return air ducts. The computer room air conditioner (CRAC) first delivers cold air into the floor plenum at a certain pressure and velocity. The cold air then enters the IT space through the perforated tiles in the raised floor (54% of the CRAC airflow) and through the cable cutouts in the floor (46% of the CRAC airflow, i.e., leakage air). Air leaking through the cable cutouts is lost cooling capacity, because it does not reach the front of the IT equipment but bypasses it. This airflow removes no heat and simply returns to the cooling unit. Most of the airflow through the perforated tiles (96.29% of the perforated tile airflow) passes through the equipment in the IT cabinets, but because airflow management is lacking, not all of it does. A small amount of cold air (3.71% of the perforated tile airflow) bypasses the IT equipment and returns to the cooling unit. Like the leakage airflow, this bypass airflow is lost cooling capacity. At the same time, some "cooling-hungry" IT equipment cannot get enough cold air and has to draw in hot exhaust air from the rear of the cabinet (7.15% of the IT airflow), which often causes hot spots in front of that equipment. In short, measures that reduce airflow leakage, bypass, and recirculation help eliminate hot spots.
(3) How to identify hot spots
Detecting hot spots early is critical to preventing IT equipment from overheating and failing. There are three ways to detect hot spots: patrol inspection, manual temperature measurement, and automatic monitoring.
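As a minimal sketch of the automatic monitoring approach, the snippet below applies the definition given earlier: a reading counts as a hot spot only if it is taken at an IT equipment air inlet and exceeds the recommended limit. The 27 °C limit is an assumption taken from the upper end of the ASHRAE TC 9.9 recommended range and should be replaced with the value that applies to your equipment class; the `InletReading` type, the `find_hot_spots` function, and the cabinet labels are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Assumption: upper end of the ASHRAE TC 9.9 recommended inlet range (18-27 °C).
RECOMMENDED_MAX_INLET_C = 27.0


@dataclass
class InletReading:
    cabinet: str    # hypothetical cabinet label, e.g. "A01"
    position: str   # "bottom", "middle" or "top" of the cabinet front
    temp_c: float   # air temperature measured at the IT equipment inlet


def find_hot_spots(readings, limit_c=RECOMMENDED_MAX_INLET_C):
    """Return inlet readings strictly above the limit, hottest first."""
    hot = [r for r in readings if r.temp_c > limit_c]
    return sorted(hot, key=lambda r: r.temp_c, reverse=True)


if __name__ == "__main__":
    # Illustrative sample data only; not measurements from the case study.
    sample = [
        InletReading("A01", "bottom", 21.0),
        InletReading("A01", "top", 29.5),
        InletReading("B03", "middle", 27.0),
        InletReading("B03", "top", 31.0),
    ]
    for r in find_hot_spots(sample):
        print(f"hot spot: cabinet {r.cabinet} ({r.position}) at {r.temp_c:.1f} °C")
```

Hot aisle or exhaust temperatures are deliberately left out of the input, since high readings there are expected and do not indicate a hot spot.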
Other recommendations to help identify or prevent potential hot spots include using metered rack PDUs to identify and inspect high-density (5kW+) racks, since these racks are more likely to have hot spots. Use CFD software to predict hot spots during the data center design phase or after a decision to move, add, or change racks. CFD simulation provides a detailed three-dimensional analysis of the temperature and pressure contours at the front of the rack and of the airflow distribution around the rack, which helps identify potential hot spots. The power of this tool lies in identifying areas where cooling capacity is wasted and areas where hot and cold airflows mix, leaving cooling capacity underutilized.
2. Evaluate the traditional measures taken
When a hot spot is discovered, data center operators take various countermeasures. However, not all of them are effective. Below are some traditional countermeasures and the reasons they do or do not work. Note that most of them do nothing to reduce air bypass or recirculation.
(1) Lower the temperature setting of the cooling unit
It may seem logical that lowering the supply air temperature will help reduce hot spots, but it should be a last resort, because it reduces the efficiency and cooling capacity of the entire cooling system. The effectiveness of this method depends on the operating status of the CRAC. If the cooling system has excess capacity (i.e., it is running below 100% load and has not reached its cooling limit), lowering the temperature setpoint has a positive effect. For hot spots close to the CRAC, lowering the setpoint can reduce the temperature at the hot spot. However, if the CRAC is already running at maximum capacity (100% load), lowering the setpoint has no effect and will not eliminate the hot spot, because the system has reached its cooling limit.
Every cooling system has a fixed maximum cooling capacity under given environmental conditions. When the temperature setpoint is lowered, this "maximum" cooling capacity is also reduced.
(2) Place perforated floor panels in hot aisles
Some people think this is a good approach because they do not understand the benefits of a cold aisle/hot aisle layout and treat every high temperature as a hot spot. In fact, this approach does not eliminate hot spots in the cold aisle and may cause more of them. In addition, placing perforated tiles in the hot aisle creates airflow bypass and reduces the available cooling capacity. In a cold aisle/hot aisle layout the hot aisle is meant to be hot, so there are no hot spots in the hot aisle. The cold aisle is the "cold reservoir" from which IT equipment draws its cooling, so it is critical to keep the cold aisle at a low temperature. In the early days of large air-cooled equipment, cooling was often supplied through raised floors, and the cooling units were controlled based on the return air temperature. This approach worked because the hot and cold airflows were fully mixed and the room air temperature was uniform. Today, the cold aisle/hot aisle layout deliberately creates two separate temperature zones, one cold and one hot, which results in uneven return air temperatures. People who are used to designs with a uniform room temperature may place perforated tiles in the hot aisle, thinking that this will solve the hot spot problem.
(3) Place the cabinet and perforated floor close to the cooling unit
Some people think it is a good idea to place cabinets and perforated tiles as close to the cooling unit as possible, assuming that cabinets and tiles within a few feet of the cooling unit will receive more cooling capacity. In fact, the opposite happens: the IT equipment may be undercooled, and hot spots are not reliably eliminated.
Although this practice can help capture most of the exhaust heat, it is not predictable and is not an efficient way to eliminate hot spots. The IT equipment is undercooled because the air leaving the cooling unit moves at high velocity, which results in low static pressure in that area. A perforated tile installed there therefore delivers little cooling capacity and may even draw air from the room down into the floor plenum. A simple way to check for an airflow problem under a raised floor is to place a small piece of paper over the perforated tile. If the paper is drawn onto the tile, the perforated tile should be replaced with a solid tile to balance the pressure under the raised floor.
(4) Place the floor fan in front of the hot spot cabinet
Some people think that focusing airflow directly at the front of a hot spot is a good way to eliminate it. However, this method should only be used temporarily in emergency situations, such as when IT equipment is about to lose cooling. It can reduce the operating temperature of the equipment and eliminate the hot spot, but at a significant cost. The floor fan essentially acts as an air mixer: it mixes the hot air exhausted by the equipment with the cold supply air, producing an airflow whose temperature lies between the low supply air temperature and the high exhaust air temperature. It also increases the airflow through the equipment. The mixing of hot and cold air reduces the efficiency of the cooling system, which increases the dehumidification/humidification burden, leaves cooling capacity underutilized, and may cost the system its cooling redundancy. In addition, floor fans are an additional heat source in the data center.
(5) Blow air through the ice and into the cold aisle
Some people think that using ice to cool the air is a good and simple method.
While this method can help relieve hot spots, ice turns into water as it melts and can overflow its container, with serious consequences. Even with packaged ice packs, this method is not the best choice, because there are many simpler and more effective methods, which are discussed in detail below.
(6) Push in the portable refrigeration unit
Some people think this is a good way to solve the problem because it directs cool air straight at the front of a hot spot. However, this method should only be used temporarily in emergency situations, such as when IT equipment is about to lose cooling. Unfortunately, it is often used as a permanent solution. Portable cooling units are generally used in emergencies involving a loss of cooling because data center staff can easily roll them into place. The best practices discussed below are the preferred, low-cost, and effective permanent solutions to hot spots throughout the data center.
(7) Add more refrigeration units
Some people naturally associate hot spots with insufficient cooling capacity, so adding cooling units seems like the ideal solution. However, in most cases there is more than enough cooling capacity, and it is the lack of airflow management that keeps the cooling delivered to the point of demand below the required level. In addition, this approach is not a panacea, and it incurs considerable expense. A survey by the Uptime Institute showed that although some IT rooms have cooling capacity up to 15 times the required amount, 7% to 20% of the cabinets in those rooms still have hot spots. The reason is that the incoming cold air bypasses the air intakes of the IT equipment. The correct approach is to apply the best practices discussed below first and then determine whether additional cooling units are needed.
3. A new approach to eliminating hot spots
The methods above are common, but most of them are not recommended because they do nothing to address the two main causes of hot spots: air bypass and recirculation. To eliminate air bypass and recirculation, the hot and cold air streams must be completely separated, so that hot spots cannot form. The first four best practices below are effective because they address air bypass, recirculation, or both. The last method should only be used after airflow management is fully in place.
(1) Manage cabinet airflow
Many hot spots occur because the hot exhaust air from the equipment recirculates in or around the cabinet. Improving cabinet airflow management is therefore critical to solving hot spots. Open cabinet U spaces and cable openings are the main causes of hot air recirculation, which leads directly to hot spots. One of the simplest and most cost-effective ways to improve cabinet airflow is to close unused cabinet U spaces with blanking panels and to install brush strips at the cabinet cable inlets and outlets. Enterprises should update their data center operating procedures to require blanking panels and brush strips for all moves, adds, and changes.
Some types of switches and routers use side-to-side airflow. If the data center in which these devices are installed uses the traditional front-to-back rack airflow pattern, the hot exhaust air from the switch or router may return to its air intake and cause hot spots. Side-to-side air distribution units can be used to direct cool air to side-to-side airflow equipment in a predictable manner without creating hot spots.
If the average total cooling is adequate but hot spots occur in cabinets with above-average power densities, fan-assisted equipment can be added to improve cooling by improving airflow and increasing cooling capacity. Fan-assisted equipment effectively “borrows” airflow from adjacent cabinets with loads under 3kW to support the cabinet load.
This approach minimizes the temperature difference between the top and bottom of the cabinet and prevents hot exhaust air from the equipment from recirculating to the cabinet inlet. All exhaust equipment must be deployed with care to ensure that the airflow taken from adjacent spaces does not cause adjacent cabinets to overheat. These units should be powered by a UPS to avoid interruptions in cooling during power outages. In high-density environments, overheating can also occur while backup generators are starting up.
(2) Managing airflow in the computer room
After improving airflow management in the cabinets, the next important step is to improve airflow management within the room. First, seal all openings in the raised floor. Use brush strips to seal the cable entry openings at the rear of the cabinets and under the PDUs. Most unintended air leakage is caused by these openings. Also seal the gaps around the cooling units and other floor voids with air-dam foam or pillows, replace missing floor tiles with solid tiles, and identify and replace perforated tiles that cause air bypass. For example, a perforated tile in front of an empty cabinet should be replaced with a solid tile. Also, follow the procedures in the sidebar to rebalance the airflow under the floor. Placing tiles properly and sealing the gaps in the raised floor helps recapture lost cooling capacity.
Another factor that contributes to hot spots is the mixing of hot and cold airflows over the tops of cabinets and around the ends of rows. A best practice for addressing this is to separate the hot and cold airflows through aisle containment and/or cabinet-level containment. Aisle containment not only helps eliminate hot spots, it is also more energy efficient than traditional uncontained data center designs. The rear door of the cabinet can be replaced with an air exhaust device to turn it into an active ducted cabinet.
Note that these devices increase the overall depth of the cabinet by approximately 250mm, which may increase the spacing required between adjacent rows of cabinets. The hot air that is normally exhausted into the hot aisle is collected, pushed upward, and ducted into the return air plenum. This prevents airflow from recirculating in the cabinet and improves the efficiency and cooling capacity of the cooling system. The fans in active vertical duct systems can support cabinet power densities up to 12kW and can overcome poor aisle pressure or pressure drops caused by dense cabling at the server exhausts. However, active vertical duct systems can easily cause unexpected problems in other areas of the data center, so special care should be taken when deploying and installing them. Blanking panels and cabinet side panels must be used with these devices. Active duct systems consume power, so they need to be monitored and maintained.
(3) Transferring Problem Load
Once "problem" loads have been found, they can be moved to lower-density cabinets, thereby eliminating the hot spots. Design the room's cooling for an average per-cabinet value below the potential peak cabinet load, and split the equipment of any cabinet whose load exceeds the design average across several cabinets. Note that spreading the equipment load across multiple cabinets leaves a lot of unused vertical space within the cabinets. These spaces must be sealed with blanking panels to prevent degradation of cooling performance. If a server or other critical equipment can be relocated, this measure can eliminate the hot spot at almost no cost.
(4) Change the location of the temperature and humidity sensor
In most older data centers, temperature probes are installed in the CRAC return airflow, where the airflow is unpredictable.
This also causes uneven CRAC loads, which can cause temperature fluctuations at the server air inlets. Moving the temperature probes to the supply airflow (where the supply air is controlled and predictable) provides more consistent temperatures at the IT equipment inlets. When combined with airflow containment, relocating the temperature probes also makes it possible to raise the supply air temperature, reducing the energy consumption of the cooling system without having to worry about large fluctuations in supply air temperature.
(5) Use data center infrastructure management software to control the airflow of cooling units
Some systems can control the cooling units in a room based on the temperature in front of the IT cabinets. These systems can use fuzzy algorithms to dynamically predict and adjust the fan speed of the cooling units and to calculate which cooling units can be turned off. Controlling the amount of air entering the data center limits bypass airflow. The Vigilent cooling system is an example of such a system.
4. Conclusion
Hot spots can seriously affect server reliability and performance, and can even damage servers. Hot spots usually occur at the air intakes of IT equipment as a result of poor airflow management, such as cold air leakage (i.e., airflow bypass) and recirculation of hot exhaust air from the equipment. Patrol inspection, manual temperature measurement, and automatic monitoring are the three main ways to identify hot spots. Data center operators have adopted many countermeasures to eliminate hot spots, but most of them are unsatisfactory. Some should only be used in emergency situations, while others are useless or even make the problem worse.
The best practices for eliminating hot spots include airflow management in cabinets and computer rooms, airflow containment, relocation of problem equipment, changing the location of temperature sensors, and controlling the airflow of cooling units through data center infrastructure management software. These methods are not only simple to apply, but also low-cost and effective.