The era of generative AI is well and truly upon us. According to JLL, it is among the top three technologies expected to have the biggest impact on real estate, with investment in AI-powered property technologies (PropTech) reaching a record $4bn in 2022.
JLL's 2024 report also found that AI energy demands, ranging from 300MW to more than 500MW, will require a new generation of more energy-efficient data center designs.
From an industry perspective, the numbers are indeed staggering. Analysts at TD Cowen have stated that the AI wave has driven approximately 2.1GW of US data center leases. Meanwhile, CBRE's European Real Estate Market Outlook 2024 found that data center providers will see an uptick in requests for capacity related to artificial intelligence (AI) requirements, with most of these expected to come from service providers and AI start-ups rather than from the hyperscale and cloud communities.
Now, as AI extends into all aspects of technology products, services, and solutions, many are asking whether the data center industry is truly ready to accommodate its requirements. The answer, for many operators, is no.
Cooling the AI workloads of the future
Today Nvidia, the world's leading authority on high-performance computing (HPC) and AI, is estimated to power more than 95 percent of machine learning workloads and remains the dominant manufacturer of GPU-accelerated technologies.
Last year, the company announced it had won a $5 million grant to rearchitect the data center landscape and build an advanced liquid cooling system addressing many of the challenges facing legacy data centers, including on-premises, enterprise, and older colocation facilities.
Funded by the US Department of Energy, the COOLERCHIPS program has been positioned as one of the most ambitious projects the industry has ever seen, and at a time when processor heat and power capabilities are soaring as Moore’s Law and data center designs reach their physical limits.
Some expect that traditional air-cooled data center technologies may soon become obsolete, especially as AI adoption and supercomputing advancements gather pace, and that Nvidia's cooling system could cost approximately 5 percent less and run up to 20 percent more efficiently than air-cooled approaches. Nvidia also expects existing cooling technologies to begin reaching their limits, as heat loads of more than 40 watts per square centimeter will face significant challenges in the future.
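To put that 40 watts per square centimeter threshold in context, a rough die-level calculation is a useful sanity check. The sketch below assumes a 700W TDP and an approximately 814mm² die for an H100 SXM part; both are publicly quoted figures, but treat them as approximations rather than exact specifications.

```python
# Illustrative die-level heat flux check against the ~40 W/cm^2 threshold.
# Assumed figures: 700 W TDP and an ~814 mm^2 die for an H100 SXM part
# (publicly quoted specs; approximations, not exact datasheet values).

TDP_W = 700.0                 # assumed max power draw of the GPU
DIE_AREA_CM2 = 814 / 100.0    # ~814 mm^2 die area, expressed in cm^2

heat_flux = TDP_W / DIE_AREA_CM2
print(f"Approximate heat flux: {heat_flux:.0f} W/cm^2")
# → Approximate heat flux: 86 W/cm^2
```

Even this rough figure lands well above the 40 W/cm² mark, which illustrates why air-based approaches are expected to struggle with current-generation accelerators.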
This is no wonder, with the latest Nvidia SuperPOD packing up to eight H100 GPUs per system, all connected by Nvidia NVLink. Each DGX H100 is expected to provide up to 32 petaflops of AI performance, around six times more than its predecessor, the DGX A100, which was already straining traditional data center capabilities.
To add further context from a design and energy standpoint, an Nvidia SuperPOD can include up to 32 DGX H100 systems with the associated InfiniBand connectivity infrastructure, drawing up to 40.8kW of power per rack. By today’s standards, that might be considered an obscene amount of processing power and AI capability, but rack and power densities are only expected to increase.
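The rack figure above can be reproduced with some back-of-the-envelope arithmetic. The sketch assumes a maximum draw of 10.2kW per DGX H100 system (Nvidia's published figure) and four systems per rack, which yields the 40.8kW per-rack number cited here; the per-pod total excludes networking and cooling overhead.

```python
# Back-of-the-envelope power math for an Nvidia DGX SuperPOD.
# Assumed figures: 10.2 kW max draw per DGX H100 system and four
# systems per rack (consistent with the 40.8 kW/rack cited above);
# 32 systems per SuperPOD scalable unit.

DGX_H100_KW = 10.2      # assumed max power draw per DGX H100 system
SYSTEMS_PER_RACK = 4    # assumed rack layout
SYSTEMS_PER_POD = 32    # SuperPOD scalable unit cited in the article

rack_kw = DGX_H100_KW * SYSTEMS_PER_RACK
pod_kw = DGX_H100_KW * SYSTEMS_PER_POD

print(f"Per-rack draw: {rack_kw:.1f} kW")  # 40.8 kW, matching the article
print(f"Per-pod draw:  {pod_kw:.1f} kW")   # ~326 kW of IT load alone
```

At roughly 326kW of IT load per pod before any cooling or networking overhead, it is easy to see why a single AI deployment can outstrip the design envelope of a legacy facility.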
Interestingly, Nvidia's new Blackwell GPUs are set to enable businesses to build and run real-time generative AI applications and large language models at up to 25 times less cost and energy consumption than their predecessors, paving a new path for data centers engineered for AI. The question remains: how will data centers need to evolve to accommodate the cooling requirements of AI, and which organizations will win the race?
The future of data center cooling
The discussion around cooling methodologies remains one of the most divisive in the industry. In one camp are those who advocate for air-cooled systems and the benefits of free air cooling over a liquid-cooled approach, which often requires major capital expenditure and a retrofit of legacy data center architecture.
In the other are the owners and operators already undertaking proof of concept (POC) projects and deploying hybrid environments – those who are developing high-performance infrastructure systems precision-engineered to accommodate compute-intensive applications on an industrial scale.
The benefits of liquid cooling
With rack densities now expected to surpass 100kW, it’s clear that liquid cooling will become increasingly popular.
For those embracing the technology, the benefits are significant. Many of today’s liquid cooling solutions leverage the higher thermal transfer properties of water and other fluids to cool high-density racks more efficiently and effectively than legacy measures.
Such approaches are reinforced by studies from organizations including Iceotope and Meta, which confirmed the practicality, efficiency, and effectiveness of precision liquid cooling in meeting the cooling requirements of hyperscalers, a community where liquid cooling has gained considerable favor.
With direct-to-chip (DTC) liquid cooling, between 70 and 75 percent of the heat generated by rack equipment is removed via water, with the remaining 25-30 percent removed via air. Because DTC cooling is more effective from a heat transfer perspective, it can support greater CPU and GPU densities while offering significant heat reuse capabilities.
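Applied to a concrete rack, that split determines how much residual air handling a facility still needs. The sketch below uses the article's 70-75 percent water fraction; the 40.8kW rack load is an assumed input taken from the SuperPOD figures earlier in the piece.

```python
# Split of heat rejection paths under direct-to-chip (DTC) liquid
# cooling, using the 70-75% water / 25-30% air figures cited above.
# The 40.8 kW rack load is an assumed example input.

def dtc_heat_split(rack_kw: float, water_fraction: float = 0.75):
    """Return (kW removed via water, kW removed via air) for one rack."""
    water_kw = rack_kw * water_fraction
    air_kw = rack_kw * (1.0 - water_fraction)
    return water_kw, air_kw

water_kw, air_kw = dtc_heat_split(40.8, water_fraction=0.75)
print(f"Water loop: {water_kw:.1f} kW, residual air cooling: {air_kw:.1f} kW")
# → Water loop: 30.6 kW, residual air cooling: 10.2 kW
```

Even at the upper end of the water fraction, roughly 10kW per rack still leaves the facility dependent on air handling, which is why many operators are deploying hybrid environments rather than liquid-only designs.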
Organizations within the sector have also predicted that liquid cooling can be up to 3,000 times more effective than air. All of this points towards liquid cooling having the potential to become the preferred cooling architecture of the future, and one that will be vital to meeting data center sustainability requirements.
The future for generative AI is both exciting and unknown, but if Moore’s Law is reaching its physical limits, all roads lead toward liquid cooling as the only option for future GPU-powered compute.