The danger with thermal runaway

The right level of cooling can significantly cut power bills and help a company show that it is meeting environmental targets. In recent years, data center owners have embraced the cost savings delivered by raising the input temperature from 16°C to 23°C. At the same time, they know that a growing body of ‘green’ legislation means it is about more than just the money.

Historic cooling approaches
Historically, data centers were cooled by Computer Room Air Conditioning (CRAC) units spread around the outside of the room, with cold air forced under the floor. With low power usage in the racks, this was sufficient to cool all the equipment. Since the advent of blade systems and the increase in switches and storage, power usage per square foot has soared, along with the heat.

Modern cooling
The introduction of aisle containment, free-air cooling, in-row cooling, water cooling, air flow monitoring and better room design has delivered significant improvements in cooling. Some of these, such as aisle containment, can be retrofitted to a data center for limited cost and with little disruption to operations. This matters because a retrofit both extends the life of a data center and makes economic sense.

Many of these technologies, however, are only being deployed in new builds. Free-air cooling, a huge subject in its own right, can be done as part of a complete refurbishment, but some options, such as a heat wheel or a large plenum, have to be part of the building fabric. Water cooling has to be carefully designed and implemented to ensure that there is no risk of power and water coming into contact.

Another approach, which can be used in any data center, is to increase the input temperature. Until the early 2000s it was not unusual for a large percentage of the computer equipment inside a data center to be on a three- to five-year lease. At the same time, advances in internal IT system cooling were not high on the agenda of manufacturers. This meant that generational replacement of hardware gave some cooling efficiency but not a huge amount.

In the last decade, however, there have been a number of significant changes. The end of the dot-com recession and the current recession have meant systems are being kept in service much longer. The introduction of blade systems, and the massive heat increases they bring, has ushered in an era of highly efficient cooling inside the systems.

As a result of all this, increasing the input temperatures into servers and storage systems can produce appreciable savings in power and cooling. The electrical cost of a fan inside a server can be less than the cost of injecting more air when it is just a single server that needs the extra cooling.

With all of this, why the doom and gloom of thermal runaway and data center meltdown?

Thermal runaway
First, there is no suggestion that any of these technologies are not fit for purpose. Each of them can cool data centers at a lower cost than simple CRAC and forced air. The risk arises when a combination of technologies is applied either wrongly or with no proper failsafe planning.

The starting point here is the input temperature. Depending on the technology used for cooling, it can take an hour or more to remove just a couple of degrees of heat from a data center. It takes far less time for heat to increase. A complete failure of cooling could see temperatures rise in minutes. Even after cooling is resumed, the temperatures may continue to rise if the cooling system does not have enough excess capacity to cope.

As we increase input temperatures, we shrink the gap between the acceptable input temperature and the level at which failure becomes more likely. The older the equipment, the lower that failure temperature is. As temperatures rise, the fans inside equipment work harder, pulling in more air to try to cool it, which reduces the volume of cool air available to other systems.
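The shrinking gap can be put into rough numbers. The sketch below is a deliberately crude estimate, not a design tool: it assumes that all IT power heats the room air and that nothing else (equipment, walls, floor) absorbs heat, so it overstates how fast temperature rises. The function name `seconds_to_limit`, the 200 kW load, the 1,000 m^3 room and the 35°C failure threshold are all hypothetical values chosen for illustration.

```python
AIR_DENSITY = 1.2         # kg/m^3, approximate value for air at room conditions
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K), approximate value for dry air


def seconds_to_limit(inlet_c, limit_c, it_load_kw, room_volume_m3):
    """Seconds for room air to warm from inlet_c to limit_c with no cooling.

    Assumption: every watt of IT load goes into the air and nothing else
    absorbs heat, so the answer is pessimistic.
    """
    heat_capacity = AIR_DENSITY * room_volume_m3 * AIR_SPECIFIC_HEAT  # J/K
    rise_rate = it_load_kw * 1000.0 / heat_capacity                    # K/s
    return max(0.0, limit_c - inlet_c) / rise_rate


# Hypothetical room: 200 kW of IT load, 1,000 m^3 of air, 35 C threshold.
for inlet in (16.0, 23.0):
    t = seconds_to_limit(inlet, 35.0, it_load_kw=200.0, room_volume_m3=1000.0)
    print(f"inlet {inlet:.0f} C: about {t:.0f} s of headroom")
```

Whatever the load and room size, in this model the headroom scales with the gap to the threshold, so moving from 16°C to 23°C against a 35°C limit leaves 12/19 of the original time to react.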

Any cooling failure, therefore, has the potential to cause not just a single system failure but a cascade of failures. This is because as other systems begin to overheat, they respond by drawing in more air, increasing the rate at which cool air is replaced by hotter air. This is known as a positive feedback loop.
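The feedback loop can be illustrated with a toy model. This is a sketch under stated assumptions rather than a validated airflow model: the function name `simulate_inlet`, the linear fan response (`fan_gain`), the fixed temperature rise across the equipment (`exhaust_rise`) and every number are hypothetical, picked only to show the shape of the effect. Any air the cooling system cannot supply is assumed to be drawn from hot exhaust.

```python
def simulate_inlet(supply_flow, steps=20, supply_temp=23.0,
                   base_demand=10.0, fan_gain=0.05, exhaust_rise=12.0):
    """Return the inlet temperature at each step of a toy recirculation model.

    supply_flow and base_demand are airflows in the same arbitrary unit.
    """
    inlet = supply_temp
    history = []
    for _ in range(steps):
        # Fans speed up in proportion to how far the inlet is above supply.
        demand = base_demand * (1.0 + fan_gain * (inlet - supply_temp))
        # Demand the cooling system cannot meet is made up with hot exhaust.
        recirculated = max(0.0, demand - supply_flow)
        exhaust_temp = inlet + exhaust_rise
        cool = min(demand, supply_flow)
        inlet = (cool * supply_temp + recirculated * exhaust_temp) / demand
        history.append(inlet)
    return history


healthy = simulate_inlet(supply_flow=10.0)              # supply meets demand
degraded = simulate_inlet(supply_flow=8.0)              # 20% shortfall
no_feedback = simulate_inlet(supply_flow=8.0, fan_gain=0.0)
print(f"healthy: {healthy[-1]:.1f} C")
print(f"degraded, fans fixed: {no_feedback[-1]:.1f} C")
print(f"degraded, fans ramping: {degraded[-1]:.1f} C")
```

Comparing the last two runs isolates the loop: the supply shortfall is identical, but when the fans respond to rising temperature the inlet settles much higher than when fan speed is held fixed.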

Solution
The solution is two-fold:

* Model or test the impact of a complete cooling system failure. Identify the point and speed at which temperature rises.

* Add failover capacity that can be brought into play as soon as a failure occurs to prevent the start of the overheating process.

For many data center owners, this will mean adding some cost back into the data center. While this may seem unpalatable, the alternative is likely to cost more both short term in replacement of equipment and long term in loss of trust and business.
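Both steps of the solution can be explored with a simple time-stepped model before any money is spent. The sketch below assumes a single lumped heat capacity for the room and its contents; the function name `simulate_outage`, the 5,000 kJ/K heat capacity and the loads used in the example are hypothetical. A real study would use measured data or a CFD model.

```python
def simulate_outage(load_kw, failover_kw, failover_delay_s,
                    start_c=23.0, heat_capacity_kj_per_k=5000.0,
                    horizon_s=3600, step_s=1.0):
    """Return (peak_c, final_c) after a total cooling failure at t = 0.

    Failover cooling of failover_kw starts after failover_delay_s seconds.
    """
    temp = peak = start_c
    t = 0.0
    while t < horizon_s:
        cooling_kw = failover_kw if t >= failover_delay_s else 0.0
        # Net heat (kJ per step) divided by heat capacity gives the change in K.
        temp += (load_kw - cooling_kw) * step_s / heat_capacity_kj_per_k
        # Assumption: cooling holds the setpoint and never undershoots it.
        temp = max(temp, start_c)
        peak = max(peak, temp)
        t += step_s
    return peak, temp


# Hypothetical 200 kW room: compare failover sizes and start-up delays.
for delay, failover in ((0, 200.0), (300, 250.0), (300, 150.0)):
    peak, final = simulate_outage(200.0, failover, delay)
    print(f"delay {delay:>3} s, failover {failover:.0f} kW: "
          f"peak {peak:.1f} C, after 1 h {final:.1f} C")
```

The third case reflects the earlier warning: failover that is smaller than the heat load only slows the rise, so the temperature is still climbing at the end of the hour.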
