Nvidia has fully installed the racks for its Cambridge-1 supercomputer at Kao Data's site in Harlow.
Ahead of the official launch in the coming weeks, we spoke to the project's lead about why the GPU company built what it claims is the UK's most powerful supercomputer.
The system will feature 80 Nvidia DGX A100 systems, for a combined 400 petaflops of 'AI compute,' or eight petaflops of standard Linpack performance. That would put it 29th on the Top500 list of the world's most powerful supercomputers, and among the top three most energy-efficient machines on the Green500, were it scored today.
Building a supercomputer in a few months
Nvidia's foray into supercomputer operations dates back to 2015, when the company was designing its own server architecture to compete with the HPs and Dells of the world.
Its DGX platform, it says, operates as a reference architecture for how best to deploy its GPUs - but also serves as a healthy business venture in its own right.
The DGX platform came to fruition just as Nvidia began to hire more and more AI researchers. "We weren't a big user of AI yet," Marc Hamilton, Nvidia's VP of solutions architecture and engineering, told DCD.
"We want to attract more researchers. But if we're going to attract AI researchers, we have to give them the tools that they need."
They settled on building a huge supercomputer out of DGXs. "For AI researchers to pay attention, we want to be on the Top500 supercomputer list, but we don't want to be number 495."
The first system, built out of 124 DGX-1s in 2016, came in the top 50. "We have about a dozen different supercomputers across Nvidia of various sizes now," Hamilton said.
The supercomputers are used for Nvidia's chip designs and growing software platforms. But a need to dominate rankings has also driven them.
In competition with Nvidia, Google developed its own TPU accelerator series, following the chips up with a new benchmark to show what it could do - MLPerf. Nvidia quickly updated its most powerful supercomputer. "And we have been number one in all 16 benchmarks," Hamilton said.
That system, Selene, is the fifth-fastest supercomputer on November's Top500 list.
Cambridge-1 likewise serves multiple purposes. Not only will the system prove useful in its own right, and top UK rankings, but it is also likely part of the company's wider charm offensive in the UK.
The system was conceived in October 2020, a month after Nvidia announced its plans to acquire British chip designer Arm for $40bn. That acquisition faces intense regulatory reviews and is currently being challenged by Google, Microsoft, Qualcomm, and others.
The new supercomputer - which Prime Minister Boris Johnson and Health Secretary Matt Hancock may be set to open - is an affirmation of Nvidia's potential to invest heavily in the UK.
"We've got a lot of supercomputers in California and most of our researchers are in California," Hamilton said. "[We decided] why don't we build our supercomputer somewhere where we have a concentration of Nvidia employees - current or future - and customers?"
With this in mind, the company began looking at data centers around the Cambridge area, home to Arm's headquarters. Nvidia found nine potential data centers it thought could support its needs within the area, including providing renewable power.
After an RFI and review process, Nvidia settled on Kao Data. "They were the only one that was a UK company - a lot of these colocation providers are big international companies."
Once Kao was chosen, Nvidia used in-house computational fluid dynamics simulations (as well as Kao's CFD) and digital twins to decide how to build its supercomputer.
"What we decided was sort of Lego blocks - four blocks of 20 DGX A100 systems."
Currently, the company has only filled the racks halfway. "Kao had the room, and it's because colocation pricing is kind of funny - the number one factor is the power, space almost doesn't matter," Hamilton said. But, he said, Nvidia may expand the nearly-1MW deployment to take up the other half of the racks.
Based in California, Hamilton had to manage the project remotely, with travel still limited by Covid-19. "We were also right in the middle of building Selene when the pandemic hit," Hamilton said. "And when you're setting it up, and even running it, you want to know 'did that cable somehow get unplugged? Or is there a red light there?'"
So the company settled on basic telepresence robots - think an iPad on wheels - to patrol the facility checking for faults. That was fine for the cold-aisle Selene layout, but for the hot-aisle Cambridge-1, it meant that all the important parts of the racks were behind closed doors. The solution was to simply install motion-activated doors, like those found in shops.
"It's designed for our little robot, but I'm sure as a technician is walking in with arms full of parts, they'll appreciate it as well."
Beyond switching to a hot-aisle layout (as Kao's airflow comes from the sides, and Selene's from below), the design was incredibly similar to Nvidia's other deployments, with the company emphasizing the benefits of pre-built Lego-like systems.
Where it differs is how it is operated. Every other Nvidia system is exclusively used by the company. Cambridge-1 will be available to a select group of partners. In another perhaps politically savvy move given the ongoing crisis, Cambridge-1 will be made available to healthcare partners GSK, AstraZeneca, King’s College London, Oxford Nanopore, and Guy’s and St Thomas’ National Health Service Foundation Trust, among others.
Nvidia will open up the supercomputer to the companies and organizations, helping them focus on "grand projects," Hamilton said. All are currently DGX customers - a requirement to take part - but none operate systems at the scale of Cambridge-1. The hope is that it will serve as a showpiece, allowing the companies to test out large ideas without having to invest upfront. Should the results prove promising, perhaps it will encourage them to build DGX supercomputers of their own.
But Hamilton was insistent that this does not mark Nvidia's entry into the supercomputing-as-a-service space.
"All the major public clouds around the world have GPUs - Microsoft lets you build a supercomputer exactly the way we did," he said. "It would be foolish to try to compete with their scale and their pricing. This is very, very specific.
"No, we're getting more active in the supercomputer-as-a-Lego-block space."