Those hot and humid days of summer are upon us again, stress testing the limits of our cooling systems. And while many organizations have given up operating their onsite data centers and moved to colocation and cloud, there are still many small to midsize data centers and “server rooms” in operation. I have written about this before, yet many firms will still see their data center’s cooling systems pushed to, and even beyond, their limits. So you may want to consider putting sunburn lotion on your servers, or you can use some of these cooling tips to keep them from overheating.

While there are many posts about the new ASHRAE expanded Thermal Guidelines and Free Cooling, they do not help if you are in a site with marginal cooling units. This is also a common issue for server rooms located in mixed-use buildings that do not have large dedicated cooling systems, or whose systems lack enough extra capacity for those very hot summer days. Virtually any cooling system’s performance will decrease with higher outdoor temperatures and humidity. Many IT departments are “sweating” out the summer (again), hoping that they will not have servers suddenly crashing from over-temperature shutdowns.

Here are a few tips, tricks and techniques that may not solve the long-term problem, but may help enough to get you through the summer. In many cases, when the heat load of the equipment does not severely exceed the actual capacity of the cooling system, optimizing the airflow may improve the situation until a new or additional cooling system is installed.

  1. If it feels warm, don’t panic - even if you see 80°F in the cold aisle! Yes, this is hotter than the proverbial 70-72°F data center “standard” you were used to (and you may not enjoy working in the room), but it may not be as bad for the servers as you think. If the highest temperature reading at the front of the rack is 80°F or less, you are still within the latest “Recommended” guidelines from ASHRAE TC 9.9. Even if the intake temperature is somewhat higher (up to 90°F), it is still within the A1 “Allowable” guidelines.
  2. Take temperature measurements inside the face of the cabinet at the front of the servers. This is where the servers draw in the cool air, and it is really the only measurement that matters. Take readings at the top, middle and bottom of the front of the racks (assuming that you have a Hot Aisle - Cold Aisle layout). The top of the rack is usually the warmest. If the bottom areas of the racks are cooler, try, where possible, to move the servers nearer the bottom (or coolest area) of the racks. Make sure that you use blanking panels to block off ANY and ALL open unused spaces in the front of the racks. This will prevent hot air from the rear recirculating into the front of the racks.
  3. Don’t worry about rear temperatures - even if they are at 100°F or more (this is not unusual)! Do not place random fans blowing at the rear of racks to “cool them down”; this just causes more mixing of warm air into the cold aisles (I wish I had a dollar for every time I have seen this)!
  4. If you have a raised floor, make sure that the floor grates or perforated tiles are properly located in front of the hottest racks. If necessary, rearrange them or change to different floor grates to match the airflow to the heat load. Be careful not to locate floor grates too close to the CRACs; this will “short circuit” the cool airflow immediately back into the CRACs and rob the rest of the room or row of sufficient cool air.
  5. Avoid bypass airflow. Check the raised floor for openings inside the cabinets. Cable openings in the floor allow air to escape the raised floor plenum where it is not needed, and reduce the cold air available to the floor vents in the cold aisles. Use brush-type air containment collar kits to minimize this problem.
  6. If possible, try to redistribute the heat load evenly across every rack to avoid or minimize “Hot Spots”. Before you move any servers, at the very least manually check the temperature at the top, middle and bottom of the racks. Install permanent temperature sensors in each rack, or at least in every third rack, and a central monitoring system if possible.
  7. Check the rear of racks for cables blocking exhaust airflow. Blocked exhaust causes excessive back pressure for the IT equipment fans and can cause the equipment to overheat - even when there is enough cool air in front. This is especially true of racks full of 1U servers with a lot of long power cords and network cabling. Consider purchasing shorter (1-2 foot) power cords to replace the longer OEM cords shipped with most servers. Use the shortest possible network cables as well. Use cable management to unclutter the rear of the rack so that the airflow is not impeded.
  8. If you have an overhead ducted cooling system, make sure that the cool air outlets are directly over the front of the racks and the return ducts are over the hot aisles. I have seen sites where the ceiling vents and returns are poorly located and the room is very hot, yet the capacity of the cooling system has not been exceeded. The cool air is simply not getting directly to the front of the racks, or the hot air is not being properly extracted. The most important issue is to avoid recirculation; make sure the hot air from the rear of the cabinets can get directly back to the CRAC return without mixing with the cold air. If you have a plenum ceiling, consider using it to capture the warm air, and add a ducted collar going into the ceiling from your CRAC’s top return air intake. Some basic duct work will have an immediate impact on the room temperature. In fact, the warmer the return air, the higher the efficiency and actual cooling capacity of the CRAC.
  9. Consider adding temporary “roll-in” type cooling units only if you can exhaust the heat into an external area. Running the exhaust ducts into a ceiling that goes back to the CRAC does not work. The heat exhaust ducts of the roll-in must exhaust into an area outside of the controlled space.
  10. When the room is not occupied, turn off the lights. This can save 1-3% of the electrical and heat load, which, in a marginal cooling situation, may lower the temperature by 1-2 degrees.
  11. Check to see if there is any equipment that is still plugged in and powered up, but is no longer in production (aka the ever popular Zombie servers). This is a fairly common occurrence and has an easy fix - just shut them off!
  12. If you have blade servers, consider activating the “power capping” feature when cooling systems are not able to handle the full heat load. This may slow down the processors a bit, but it is much better than having an unexpected server crash due to thermal shutdown.
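
The intake checks in tips 1 and 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 80°F and 90°F limits are the rounded figures quoted in tip 1, and the function names and sample readings are hypothetical, not part of any monitoring product.

```python
# Rounded limits quoted in tip 1 (degrees Fahrenheit); assumed here for illustration.
RECOMMENDED_MAX_F = 80.0
ALLOWABLE_A1_MAX_F = 90.0


def classify_intake(temp_f):
    """Classify one front-of-rack intake reading against the two limits."""
    if temp_f <= RECOMMENDED_MAX_F:
        return "recommended"
    if temp_f <= ALLOWABLE_A1_MAX_F:
        return "allowable"
    return "over-limit"


def rack_report(readings):
    """Summarize top/middle/bottom intake readings for one rack.

    readings maps a position name to a temperature in degrees F.
    The rack's status is driven by its hottest intake reading.
    """
    hottest = max(readings, key=readings.get)
    coolest = min(readings, key=readings.get)
    return {
        "hottest": hottest,
        "coolest": coolest,
        "status": classify_intake(readings[hottest]),
    }


# Hypothetical readings taken at the front of one rack.
sample = {"top": 84.0, "middle": 78.5, "bottom": 72.0}
print(rack_report(sample))
```

The "coolest" position is where tip 2 suggests moving servers when there is room, and the status tells you whether the rack's worst intake reading is still inside the range the article describes.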

The bottom line

Of course, make sure that your cooling system is properly serviced and that all exterior heat rejection systems have been cleaned. While there is no true quick fix when your heat load totally exceeds your cooling system’s capacity, sometimes just improving the airflow may increase the overall efficiency by 5-20%. This may get you through the hottest days, until you can upgrade your cooling systems should you need to. In any event, it will lower your energy costs, which is always a good thing.

This year the Covid-19 pandemic has made it more difficult for IT and other support personnel to work onsite, making remote monitoring and control more important than ever. Plan ahead. At the very least, install some basic remote temperature monitoring inside some or all of the cabinets. Set alarm thresholds to provide early warning of developing problems. If all else fails, have a fall-back plan to shut down the least critical systems, so that the more critical servers (e.g. email, financial) can remain operational. Make sure to locate the most critical systems in the coolest area. This is a lot better than getting (or perhaps not getting) high temperature warning email messages, or having the most critical systems unexpectedly shut down from overheating.
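
As a rough sketch of the alarm-and-fallback idea above, the following Python shows one way thresholds and a shutdown order might be expressed. Everything in it is an assumption for illustration: the warning and critical thresholds simply reuse the 80°F and 90°F figures from tip 1, and the cabinet names, system names and priority numbers are hypothetical.

```python
# Assumed alarm thresholds in degrees F (reusing the figures from tip 1).
WARN_F = 80.0
CRITICAL_F = 90.0


def alarm_level(temp_f):
    """Map a cabinet intake temperature to an alarm level."""
    if temp_f > CRITICAL_F:
        return "critical"
    if temp_f > WARN_F:
        return "warning"
    return "ok"


def shed_order(systems):
    """Return the systems to shut down, least critical first.

    systems maps a name to a priority, where 1 means most critical.
    Priority 1 systems are never included in the shed list.
    """
    candidates = [name for name in systems if systems[name] > 1]
    return sorted(candidates, key=lambda name: systems[name], reverse=True)


def evaluate(cabinet_temps, systems):
    """Find the hottest cabinet and decide whether to start shedding load."""
    hottest = max(cabinet_temps, key=cabinet_temps.get)
    level = alarm_level(cabinet_temps[hottest])
    shed = shed_order(systems) if level == "critical" else []
    return hottest, level, shed


# Hypothetical cabinet readings and system priorities.
temps = {"cab-01": 78.0, "cab-02": 92.5, "cab-03": 83.0}
systems = {"email": 1, "financial": 1, "test-lab": 3, "archive": 2}
print(evaluate(temps, systems))
```

Writing the shutdown order down ahead of time, even in a form this simple, means nobody has to decide what is expendable while the room is already overheating.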