Nvidia is experiencing more problems with its Blackwell GPUs, with the AI processors reportedly overheating when linked together in 72-chip data center racks.

The chip designer has asked its suppliers to make multiple changes to its custom rack design in a bid to solve the issue, according to a report in The Information which cites unnamed Nvidia employees, as well as some of the company’s customers.

[Image: Nvidia GB200 NVL72 – Nvidia]

Announced in March, Blackwell is intended to enable Nvidia's customers to train and run more powerful AI models. Amazon, Google, Meta, Microsoft, Oracle Cloud, and OpenAI were among the first companies to sign up to use the GPUs.

At the same time, Nvidia announced a liquid-cooled rack design, the GB200 NVL72, capable of running 72 GB200 GPUs, 36 Grace CPUs, and nine NVLink switch trays, each of which houses two NVLink switches. The idea, the company said, is that this enables the system to operate as a single giant GPU, boosting performance.

Speaking to DCD earlier this year, Ian Buck, Nvidia’s accelerated computing VP, said: "The fabric of NVLink, the spine, is connecting all those 72 GPUs to deliver an overall performance of 720 petaflops of training, 1.4 exaflops of inference."

The system has two miles of NVLink cabling across 5,000 cables, and Buck said that "in order to get all this compute to run that fast, this is a fully liquid cooled design" with 45°C (113°F) coolant going in and 65°C (149°F) coolant coming out.
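The figures quoted above can be cross-checked with some quick arithmetic (a sketch only; the rack counts, performance numbers, and coolant temperatures are taken from the specs and quotes in this article, not independently verified):

```python
# Sanity-check of the GB200 NVL72 figures quoted above.
gpus = 72
switch_trays = 9
switches_per_tray = 2
nvlink_switches = switch_trays * switches_per_tray
print(nvlink_switches)  # 18 NVLink switches per rack

# Per-GPU share of the quoted aggregate performance
training_pflops = 720      # petaflops, whole rack (training)
inference_eflops = 1.4     # exaflops, whole rack (inference)
print(training_pflops / gpus)           # 10.0 petaflops per GPU
print(inference_eflops * 1000 / gpus)   # ~19.4 petaflops per GPU

# Confirm the Celsius -> Fahrenheit conversions for the coolant loop
def c_to_f(c):
    return c * 9 / 5 + 32

print(c_to_f(45), c_to_f(65))  # 113.0 149.0
```

The per-GPU inference figure is roughly double the training figure, consistent with inference typically being quoted at a lower numerical precision.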

Why Nvidia's Blackwell GPUs are overheating

With so many components packed together, it is perhaps no surprise that the density of the rack is causing issues. According to the report in The Information, linking so many GPUs together has caused them to overheat, making the servers less reliable and hurting performance. The company’s smaller, 36-chip rack has been experiencing similar issues, according to Nvidia employees quoted by The Information who asked not to be named.

As designs are revised, customers are said to be getting anxious that Nvidia suppliers will not be able to ship the racks on schedule, with deliveries to many clients due to take place in the first half of 2025. However, the company has yet to announce any delays.

An Nvidia spokesperson declined to comment on whether the rack designs were complete when approached by The Information. “GB200 systems are the most advanced computers ever created,” the spokesperson said. “Integrating them into a diverse range of data center environments requires co-engineering with our customers.” They added that “engineering iterations are normal and expected.”

When will NVL72s ship?

Nvidia had hoped to start shipping GB200s this year, but has already had to delay this by three months after a production error was discovered. Despite this, demand for the devices continues to soar, and a research note published last month by Morgan Stanley analyst Joseph Moore revealed the chips were sold out for 12 months, meaning new orders will not be fulfilled until late 2025 at the earliest.

Some GB200 NVL72s have already made it out into the wild, with Dell founder Michael Dell revealing his company was delivering an undisclosed number of the racks to AI cloud provider CoreWeave. “The AI rocket just got a massive boost!” Dell wrote in a post on X.

Microsoft revealed last month that it had become the first cloud provider to put the GB200 GPU through its paces in its AI cloud servers.