ASHRAE has released new guidelines for data center liquid cooling systems to help operators deal with the challenges posed by advanced chips.
The organization, which represents heating, refrigerating, and air conditioning engineers, has issued a technical bulletin detailing two “primary concerns” it has about cooling GPUs and other components used to power advanced workloads such as AI.
Demand for AI systems means data center racks are getting hotter and denser, and many operators are looking to liquid cooling as a solution to keep their servers running.
According to ASHRAE’s bulletin, chip power is moving into “uncharted territory”.
“Compute workloads continue to push for faster, more powerful, more efficient chips resulting in extreme chip power, lower temperature requirements, and broader use of liquid cooling,” the association said. “The loss of cooling can be catastrophic when supporting extreme chip powers.
“The extreme chip power is a design and operational challenge.”
The advisory identifies two “primary concerns” caused by increasingly powerful hardware: throttling, or “reduced computational performance due to temperature excursions within IT components,” and the potential for hardware damage caused by rapid spikes in temperature.
To mitigate these problems, ASHRAE suggests a series of technical and operational measures.
On the technical side, ASHRAE recommends using a Coolant Distribution Unit (CDU) to ensure demarcation between the Facility Water System (FWS) and Technology Cooling System (TCS) within the data center.
Data centers should increase thermal inertia to avoid hardware thermal damage due to large load changes and power loss, and incorporate active redundancy to maintain cooling during the changeover from primary to redundant systems. The organization also recommends carrying out transient modeling “to verify the performance of systems, products, and components that do not have empirical data from prior testing.”
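To illustrate why thermal inertia matters, the sketch below estimates how long a coolant loop could absorb heat after losing heat rejection, using a simple lumped energy balance (time = mass × specific heat × allowable temperature rise ÷ heat load). This is not taken from ASHRAE's bulletin: the function name, the 500 kg coolant mass, the 10 K allowable rise, and the 100 kW load are all illustrative assumptions, and a real design would rely on the transient modeling the bulletin recommends.

```python
def ride_through_seconds(coolant_mass_kg: float,
                         specific_heat_j_per_kg_k: float,
                         allowable_rise_k: float,
                         heat_load_w: float) -> float:
    """Rough time for a coolant volume to warm by `allowable_rise_k`.

    Lumped-mass assumption: all IT heat goes into the coolant and
    no heat is rejected to the facility side during the event.
    """
    if heat_load_w <= 0:
        raise ValueError("heat_load_w must be positive")
    stored_energy_j = coolant_mass_kg * specific_heat_j_per_kg_k * allowable_rise_k
    return stored_energy_j / heat_load_w


# Illustrative numbers only (assumptions, not ASHRAE figures):
# 500 kg of water (about 4186 J/kg·K), 10 K allowable rise, 100 kW rack load.
baseline = ride_through_seconds(500, 4186, 10, 100_000)
doubled = ride_through_seconds(1000, 4186, 10, 100_000)
print(f"baseline: {baseline:.1f} s, doubled coolant mass: {doubled:.1f} s")
```

Under these assumptions, doubling the coolant mass doubles the time available before the temperature limit is reached, which is the effect the added thermal inertia is meant to buy.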
On the operational side, ASHRAE says systems should be put in place to monitor coolant quality and filtration, as poor coolant quality can degrade a liquid cooling system, leading to lower efficiency or higher energy consumption.
Its recommendations also include the use of load migration strategies so that data centers are prepared for a cooling system outage. These should “work within the timeframe of the minimum server time-to-throttle based on worst-case failure of the resilient design,” the advisory says.
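One way to read that requirement is as a timing budget: the time needed to detect a cooling failure and move workloads must fit inside the shortest time-to-throttle of any affected server. The sketch below is a hypothetical check along those lines, not a method from the advisory. The `Server` class, the `migration_fits` function, the `safety_margin` factor, and every number are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    # Hypothetical value: seconds until throttling after the worst-case cooling failure.
    time_to_throttle_s: float


def migration_fits(servers: list[Server],
                   detection_s: float,
                   migration_s: float,
                   safety_margin: float = 0.8) -> bool:
    """Return True if detection plus migration fits the tightest throttle budget.

    `safety_margin` (an assumption) reserves part of the budget as headroom.
    """
    if not servers:
        raise ValueError("at least one server is required")
    budget_s = min(s.time_to_throttle_s for s in servers)
    return detection_s + migration_s <= safety_margin * budget_s


# Illustrative fleet: the 90 s server sets the budget (72 s after the margin).
fleet = [Server("gpu-node-a", 120.0), Server("gpu-node-b", 90.0)]
print(migration_fits(fleet, detection_s=10.0, migration_s=50.0))
print(migration_fits(fleet, detection_s=10.0, migration_s=70.0))
```

The check keys on the minimum across servers because, per the advisory's wording, the strategy has to work for the server that throttles first.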
ASHRAE says it will provide a new liquid cooling thermal template in its TC 9.9 Datacom Encyclopedia, to be released later this year.