With Nvidia's latest H100 GPU drawing up to a whopping 700 watts in its SXM form factor and a hefty 400 watts over PCIe, it's no wonder that 2024 has been the year liquid cooling shot to the forefront of minds throughout the data center industry.

The AI boom has forced operators to look beyond the traditional air cooling solutions that the vast majority of data centers leverage to keep their IT systems running efficiently.

Nvidia GPUs need a lot of cooling – Nvidia

Novel liquid cooling solutions are coming to the fore, driven by the need for owners and operators to devise completely new designs for greenfield facilities while, in most cases, also balancing the retrofit and upgrade of existing brownfield sites.

These AI workloads only seem to be getting bigger across every requirement - power, cooling, bandwidth, and data storage. Nvidia's upcoming Blackwell generation, for example, takes power consumption to new heights: the B200 GPU is expected to draw up to 1,200W, while the GB200 - which pairs two B200 GPUs with a Grace CPU - could reach an astounding 2,700W. Set against the 700W H100 SXM, that is roughly a 300 percent jump in a single generation, albeit at the superchip rather than the single-GPU level, reflecting the accelerating energy demands of AI systems.
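
A quick back-of-the-envelope check, using only the figures quoted above, shows where that "roughly 300 percent" comes from; the sketch below assumes the 700W H100 SXM as the baseline for comparison:

```python
# Rough generational power comparison using the figures quoted above.
# Assumption: the baseline is the 700W H100 SXM package.
h100_sxm_w = 700      # H100 SXM, watts
b200_w = 1_200        # B200 GPU, watts (expected)
gb200_w = 2_700       # GB200 superchip (2x B200 + Grace CPU), watts

def pct_increase(old: float, new: float) -> float:
    """Percentage increase from old to new."""
    return (new - old) / old * 100

print(f"H100 SXM -> B200:  {pct_increase(h100_sxm_w, b200_w):.0f}% increase")
print(f"H100 SXM -> GB200: {pct_increase(h100_sxm_w, gb200_w):.0f}% increase")
# H100 SXM -> B200:  71% increase
# H100 SXM -> GB200: 286% increase (the 'roughly 300 percent' figure)
```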

Liquid v air cooling forecast – Omdia

The liquid cooling market is also enjoying a limelight moment, with analysts forecasting that the data center cooling segment will reach a staggering $16.8 billion by 2028, growing at a 25 percent CAGR, with liquid cooling emerging as the predominant technology and the biggest driver of that growth.
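
For context, here is a minimal sketch of what a 25 percent CAGR to $16.8 billion implies; the base year is an assumption made purely for illustration, as the forecast above does not state one:

```python
# Illustrative only: back out what a 25% CAGR to $16.8bn in 2028 implies,
# assuming (hypothetically) a 2024 base year. The forecast quoted above
# does not specify its base year, so treat these figures as a sketch.
cagr = 0.25
target_value_bn = 16.8
target_year = 2028
assumed_base_year = 2024  # assumption, not from the forecast

years = target_year - assumed_base_year
implied_base_bn = target_value_bn / (1 + cagr) ** years
print(f"Implied {assumed_base_year} market size: ${implied_base_bn:.1f}bn")

# Year-by-year trajectory under the same assumptions
for year in range(assumed_base_year, target_year + 1):
    value = implied_base_bn * (1 + cagr) ** (year - assumed_base_year)
    print(f"{year}: ${value:.1f}bn")
```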

As AI compute loads expand through ever wider and more intensive deployments, ultra-high-density AI racks are becoming a reality. These racks can demand 100kW of power, house equipment valued at more than $10 million apiece, and often rely on direct-to-chip or immersion liquid cooling. This shift introduces significant challenges in delivering adequate power, space, and cooling for unprecedented workload levels.

Before settling on an AI compute cooling strategy, owners and operators must evaluate a broad spectrum of engineering considerations. These decisions should account not only for supply chain constraints but also for long-term corporate ESG and sustainability objectives.

Of the plethora of liquid cooling solutions on offer, cold plate technology and direct-to-chip approaches more broadly appear to be leading the charge in adoption. The preference for direct-to-chip and, specifically, cold plate liquid cooling is attributed to its effectiveness in high-density computing environments and its compatibility with existing data center infrastructure. The method offers a balance between performance and implementation complexity, especially for brownfield sites and retrofits. That being said, greenfield sites will most likely drive an uptick in immersion cooling deployments - either single-phase or two-phase.

Since late 2022, vendors have been hard at work finding the middle ground between innovation and risk management, bringing new solutions to market - some of which blur the lines between direct-to-chip and immersion.

Accelsius unveiled its NeuCool two-phase direct-to-chip cooling solution in April 2024. It uses a dielectric refrigerant that evaporates as it absorbs heat from high-power components such as CPUs and GPUs; the vapor is then condensed and recirculated, creating an efficient cooling loop. The system supports up to 2,200W per socket and up to 100kW per rack, making it suitable for current and future high-performance computing needs. It also suits older, air-cooled equipment, where the existing heatsinks are replaced by Accelsius's proprietary CPU and GPU "vaporators." These are designed to slot into the same location and form factor as the heatsinks they replace, so this type of technology could become the de facto solution for existing facilities that need an upgrade to support the requirements of these new workloads.
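
To put the two-phase principle in perspective, here is a minimal sketch of the refrigerant flow needed to absorb the heat loads quoted above purely through evaporation; the latent heat figure is an assumed, typical value for a dielectric refrigerant, not a NeuCool specification:

```python
# Minimal sketch: refrigerant mass flow needed to absorb a heat load by
# evaporation alone (two-phase cooling). The latent heat below is an
# assumed, typical value for a dielectric refrigerant - not a NeuCool spec.
LATENT_HEAT_J_PER_KG = 100_000  # ~100 kJ/kg, assumed for illustration

def required_mass_flow(heat_load_w: float) -> float:
    """Mass flow (kg/s) needed if all heat is absorbed as latent heat."""
    return heat_load_w / LATENT_HEAT_J_PER_KG

for label, load_w in [("2,200W socket", 2_200), ("100kW rack", 100_000)]:
    flow = required_mass_flow(load_w)
    print(f"{label}: ~{flow:.3f} kg/s (~{flow * 60:.1f} kg/min) of refrigerant")
# 2,200W socket: ~0.022 kg/s (~1.3 kg/min)
# 100kW rack:    ~1.000 kg/s (~60.0 kg/min)
```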

This solution can work with or without the use of water and is entirely modular. As such, it allows for seamless integration into existing data center infrastructures. It accommodates standard server racks, facilitating ease of deployment across a range of facilities from Edge to hyperscale data centers. The system is compatible with various final heat rejection systems, including waterless, pumped-refrigerant options.

Chilldyne, on the other hand, has come up with a direct-to-chip solution that eliminates one of the main risks of liquid cooling - leaks. The solution, pictured, works by creating a vacuum that draws coolant into the system and circulates it across cold plates mounted on the CPUs and GPUs.

Chilldyne's direct-to-chip cooling system – Chilldyne

Should a rupture occur in one of the tubes carrying the liquid, the vacuum ensures no coolant is spilled - air is drawn in instead. The system also continuously monitors for pressure changes and alerts the owner/operator if something has gone awry and, in certain configurations, may isolate the affected section, minimizing the impact on other components. After releasing this system, Chilldyne turned its attention to brownfield sites, and in July 2024 launched a plug-and-play liquid cooling starter kit designed to modernize data centers and support AI workloads. The kit includes two coolant distribution units (CDUs) and cold plates rated for up to 2,000W TDP, supporting up to 150kW of cooling per rack.
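
As an illustration of that monitoring principle - and emphatically not Chilldyne's own software - here is a hypothetical sketch of a negative-pressure check loop; the sensor reading, threshold, and alert hook are all assumed placeholders:

```python
# Hypothetical sketch of a negative-pressure monitoring loop - illustrative
# only, not Chilldyne's implementation. The sensor call, threshold, and
# alert mechanism are assumptions made for the example.
import time

VACUUM_THRESHOLD_KPA = -10.0   # assumed: loop should stay below this gauge pressure
POLL_INTERVAL_S = 1.0

def read_loop_pressure_kpa() -> float:
    """Placeholder for a real gauge-pressure sensor reading (kPa, gauge)."""
    return -25.0  # stub value: a healthy vacuum

def alert_operator(message: str) -> None:
    """Placeholder for an email/SNMP/BMS notification."""
    print(f"ALERT: {message}")

def monitor_loop() -> None:
    while True:
        pressure = read_loop_pressure_kpa()
        if pressure > VACUUM_THRESHOLD_KPA:
            # Vacuum lost: likely a breach - air is being drawn in rather
            # than coolant leaking out, but the operator should still act.
            alert_operator(f"Negative pressure lost ({pressure:.1f} kPa gauge)")
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    monitor_loop()
```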

Other vendors are betting big on the new wave of greenfield sites coming online right now, with companies like Asperitas, LiquidStack, and Submer placing their bets on a different flavor of liquid cooling - immersion.

Submer's immersion cooling system – Submer

These systems are either single-phase or two-phase and revolve around immersing servers directly in a tank of dielectric fluid, using the high heat-carrying capacity of these fluids to move heat away from the IT equipment. Such designs offer extremely high cooling efficiency but also pose challenges, particularly around integrating them into existing air-cooled data centers.
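
That heat-carrying advantage is easy to see with a rough single-phase comparison of air against a dielectric fluid; the property values below are assumed, typical figures for illustration, not any vendor's specification:

```python
# Rough single-phase comparison: volume flow needed for air versus a
# dielectric fluid to carry the same heat at the same temperature rise.
# Property values are assumed, typical figures - illustration only.
HEAT_LOAD_W = 100_000   # a 100kW rack, as discussed above
DELTA_T_K = 10.0        # assumed coolant temperature rise

# (density kg/m^3, specific heat J/kg.K) - assumed typical values
media = {
    "air": (1.2, 1_005),
    "dielectric fluid": (800.0, 2_000),
}

for name, (density, cp) in media.items():
    # Q = rho * V_dot * cp * dT  ->  V_dot = Q / (rho * cp * dT)
    vol_flow_m3_s = HEAT_LOAD_W / (density * cp * DELTA_T_K)
    print(f"{name}: ~{vol_flow_m3_s * 1000:,.1f} liters per second")
# air:              ~8,291.9 liters per second
# dielectric fluid: ~6.2 liters per second
```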

Liquid cooling systems (of any flavor) require a significant upfront investment in equipment, whether they are deployed at brownfield or greenfield sites. And while liquid cooling offers long-term energy savings, owners and operators remain on the fence about the need for it in smaller, low-density facilities.
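
A simple payback sketch helps explain that hesitation; every figure below is a hypothetical placeholder chosen for illustration, not market or vendor data:

```python
# Hypothetical payback sketch - every figure here is a placeholder chosen
# for illustration, not market or vendor data. The point is the shape of
# the trade-off, not the numbers.
def simple_payback_years(capex: float, annual_energy_saving: float) -> float:
    """Years to recover the upfront cost from energy savings alone."""
    return capex / annual_energy_saving

# A dense AI hall saves a lot of cooling energy; a small low-density room saves little.
scenarios = {
    "high-density AI hall (hypothetical)": (2_000_000, 600_000),
    "small low-density facility (hypothetical)": (400_000, 30_000),
}

for name, (capex, saving) in scenarios.items():
    print(f"{name}: ~{simple_payback_years(capex, saving):.1f} years payback")
# high-density AI hall:       ~3.3 years
# small low-density facility: ~13.3 years
```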

Adoption of liquid cooling within the data center space is driven by the workload itself, so facilities not offering AI or HPC services are unlikely to see a need to upgrade their cooling infrastructure, as air cooling is more than sufficient for most use cases.

The lack of standardization across the parts that make up a liquid cooling solution is another blocker to adoption - components such as manifolds and reservoirs, and even the way the CDU is connected, vary from vendor to vendor. Regulatory pressure, coupled with the drive for a sustainable data center industry, has also made certain solutions less desirable even though they boast great energy efficiency potential. These concerns primarily revolve around water usage, chemical impacts, energy consumption, and waste management; certain fluids, for example, may require special handling and disposal processes due to their chemical properties.

The liquid cooling market still has considerable room to grow, and as the industry works towards standardization and vendors keep coming up with solutions that address the plethora of risks this technology brings, the breadth of adoption will only increase. After all, the laws of physics have not changed and will not anytime soon: broadly speaking, every added watt of power drawn by a chip becomes a watt of heat that needs removing.