In the year that Elon Musk built a 100,000-GPU supercomputing cluster in record time and announced plans to double its compute capacity, there’s probably no better time to attend a multi-day supercomputing event.
The theme for the 2024 edition of the annual US-based Supercomputing conference was ‘HPC Creates.’ During his Tuesday morning keynote, the SC24 chair, Philip C. Roth, said he chose this theme because it reflects “all the ways that I see the SC community being creative in what we do, in the work that we do, and the technology that we produce, and how we present it to conference attendees, in the ways that we collaborate about our work.”
Across the five-day conference, I was lucky enough to chat with a range of people from the world of supercomputing, be they vendors, founders, universities, national labs, research centers, students, or simply HPC enthusiasts.
Here are some takeaways from my trip to Atlanta for SC24.
Liquid cooling is king
Before you even entered the Georgia World Congress Center this week, you’d likely have already passed a bus or taxi advertising a liquid cooling offering, a message that followed you into the lobby and then down four levels onto the exhibition hall floor.
“Come check out our liquid cooling solution!” the vendors cry, as 17,000 people walk past server cabinets showing off different offerings – although you have to feel bad for the one company whose liquid-cooled rack display sustained a giant crack across its glass front.
There’s a reason everyone is so keen to talk about liquid cooling; without it, the industry would not be able to sustain and grow AI workloads at their current pace. However, while vendors of all stripes are desperate to show off their offerings, there are a few things worth keeping in mind.
With chips now starting to hit 1,000 watts, liquid cooling is non-negotiable. However, single-phase cooling is not a silver bullet for chips drawing more than 1kW, so it’s unclear how long it can keep pace with Nvidia’s and other chipmakers’ high-speed product roadmaps. Additionally, two-phase immersion cooling, which could handle processors with higher TDPs, is not without its own challenges in terms of infrastructure, cost, and the environmental questions raised by the expensive dielectric liquids it relies on.
One thing’s for sure: liquid cooling is the king of the conference center and an inevitability in a world of 1kW chips. But, as reports of overheating problems with Nvidia’s NVL72 racks suggest, could single-phase liquid cooling soon reach its upper limit, in much the same way air cooling has become almost obsolete for the increasingly dense workloads the industry continues to chase?
Quantum is very much a part of the HPC conversation
According to IDC, the global quantum computing market is expected to grow from $1.1 billion in 2022 to $7.6 billion in 2027. For the second year running, SC24 had its own dedicated Quantum Alley, where more than one chandelier-like dilution refrigerator was on display.
While hybrid quantum computing – the integration of classical and quantum computing – is not a new concept, at SC24, conversations about how to take this next step and make such integrations a reality seemed to be front of mind for many already involved in the industry.
This is partly because bringing quantum computers together with existing classical computing infrastructure is not without its challenges. As one quantum computing vendor explained, quantum will not work in isolation and will require significant integration with data centers that have the resources necessary to support the technology.
To address this, he said his company was spending a lot of time with data center providers, finding out how they’re thinking about quantum and helping to prepare them for its future use.
On the customer side, as is perhaps to be expected with a technology that is still somewhat theoretical, there’s still some uncertainty around adoption. With classical compute infrastructure changing so quickly right now, conversations suggest customers worry the same will be true of quantum, and they are apprehensive about deploying something only for rapid technological advances to leave them on the back foot once again.
There also appears to be continued uncertainty around the need for quantum technologies outside of computing centers, universities, and research institutions, with quantum vendors saying enterprises aren’t yet convinced they have use cases that would be best addressed by quantum computing.
Despite the challenges, it was also interesting to speak with traditional compute vendors who, at the moment at least, are not planning to enter the quantum market but have set up dedicated research teams to evaluate the technology and consider how it might be integrated with their own compute infrastructure in the future.
What’s new in storage?
It goes without saying that advancements in compute have dominated headlines in the last several years and, while 100,000-GPU-strong clusters might be the thing that everyone is focused on, HPC would not be able to create anything without storage and networking.
As it was pointed out to me by one (admittedly disruptive storage) vendor: “When you look at all those AI block diagrams, you’re lucky if you even have storage mentioned at all.”
Consequently, the storage landscape now seems ripe for innovation, particularly if AI continues on its current trajectory, both consuming and creating vast amounts of data, all of which needs to be stored somewhere. Processing your data with an aggressively large GPU cluster is one thing, but that data has to come from somewhere and go somewhere; otherwise, either the training sets or the insights generated from the training could be lost.
Some estimates suggest that, at today’s pace, the amount of data being generated will increase 1,000-fold over the next 30 years. The challenge for today’s incumbent vendors will be: how do you scale up current offerings enough to keep up with that demand?
Non-storage vendors also acknowledged there is value to be added at the storage layer of the data center stack, with one executive at an HPC hardware provider joking that now would probably be a good time to buy shares in a successful storage company, as he expected an acquisition to be a likely outcome for at least one of them in the near future. (N.B., DCD is not a financial advisor, and you should not consider this to be financial advice.)
What’s next?
Considering how fast the pace of change has been in recent years, many of the people I spoke to found it difficult to say with any real certainty what the next 12 months might hold.
Like death and taxes, some things are inevitable, namely Nvidia and AMD moving ahead with their yearly product release cycles.
Furthermore, as companies of all sizes publicly announce a shift from training to the less computationally intensive work of inference, could we soon see tens of millions of dollars’ worth of H100s sitting in a corner gathering dust?
Additionally, given supply chain issues, hardware costs, concerns about overheating, and changing priorities, we could see companies that have yet to invest in their compute infrastructure being more intentional about what they purchase, rather than opting for the flashiest option.
On the cooling front, liquid cooling companies remain bullish that increasingly dense AI workloads will continue to see their business tick upwards, but, much like air cooling before it, might single-phase liquid cooling one day hit its own ceiling?
Two-phase immersion cooling businesses might feel they can sleep a little easier, but the challenges and costs of retrofitting the technology into legacy infrastructure are likely to keep data center providers up at night.
Whether 2025 becomes the year that storage has its moment remains to be seen. But, whatever happens, it’s likely that AI data centers will continue to grapple with a number of challenges, namely overheating, power consumption, and the eye-watering sums being spent on building even bigger supercomputing clusters than those we’ve already seen.
Here’s to SC25!