The role of state supercomputers
Most AI training activity is now focused on the huge resources available to the tech giants, who build virtual supercomputers in their clouds. But in earlier days, research was largely carried out on supercomputers in government research labs.
During the 2010s, the world’s advanced nations raced to build facilities with enough power to perform AI research, along with other tasks like molecular modeling and weather forecasting. Now those machines have been left behind, but their resources are being used by smaller players in the AI field.
When the US government launched Summit in 2018, at the Oak Ridge National Laboratory, the 13-megawatt machine was the world's most powerful supercomputer. Now, by traditional Linpack benchmarks (FP64), it is the fifth fastest supercomputer in the world at 200 petaflops, using older models of Nvidia’s GPUs.
For the frontiers of AI, it is too old and too slow, but the open source EleutherAI group is happy to pick up the scraps. "We get pretty much all of Summit," said EleutherAI’s Quentin Anthony.
"A lot of what you're bottlenecked by is that those old [Tesla] GPUs just don't have the memory to fit the model. So then the model is split across a ton of GPUs, and you're just killed by communication costs," he said.
"If you don't have the best and latest hardware you just can't compete - even if you're given the entire Summit supercomputer."
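Anthony's memory bottleneck can be sketched with back-of-the-envelope arithmetic. The figures below (model size, bytes per weight, GPU memory) are illustrative assumptions, not numbers from the article:

```python
# Rough sketch: why a large model must be sharded across many GPUs.
# All numbers here are hypothetical, chosen only to illustrate the math.
params = 20e9          # a 20-billion-parameter model (assumed)
bytes_per_param = 2    # FP16 weights: 2 bytes each
gpu_memory = 16e9      # an older 16 GB GPU (assumed)

model_bytes = params * bytes_per_param   # 40 GB of weights
min_gpus = model_bytes / gpu_memory      # GPUs needed for the weights alone
print(min_gpus)  # 2.5 -> at least 3 GPUs before activations and optimizer state
```

In practice, activations and optimizer state multiply that footprint several times over, which is why such models end up split across many more GPUs and pay heavy communication costs.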
A similar story is found in Japan, where Fugaku was the world’s fastest machine when it launched in 2020.
“We have a team trying to do GPT-like training on Fugaku, we’re trying to come up with the frameworks to build foundation models on it and scale to a fairly large number of nodes,” said Professor Satoshi Matsuoka, director of Japan’s RIKEN Center for Computational Science.
“By global standards for systems, Fugaku is still a very fast AI machine,” he said. “But when you compare it to what OpenAI has put together, it's less performant. It's much faster in HPC terms, but with AI codes it's not as fast as 25,000 A100s [Nvidia GPUs].”
Morgan Stanley estimates that OpenAI’s next GPT system is being trained on 25,000 Nvidia GPUs, worth around $225m.
Fugaku was built with 158,976 Fujitsu A64FX Arm processors designed for massively parallel computing, but it does not have any GPUs.
“Of course, Fugaku Next, our next-generation supercomputer, will have heavy optimization towards running these foundation models,” Matsuoka said.
The current supercomputer, and the research team using it, have helped push the Arm ecosystem forward, and helped solve issues of operating massively parallel architectures at scale.
“It's our role as a national lab to pursue the latest and greatest advanced computing, including AI, but also other aspects of HPC well beyond the normal trajectory that the vendors can think of,” Matsuoka said.
“We need to go beyond the vendor roadmap, or to encourage the vendors to accelerate the roadmap with some of our ideas and findings - that's our role. We're doing that with chip vendors for our next-generation machine. We're doing that with system vendors and with the cloud providers. We collectively advance computing for the greater good.”
Morality and massive machines
Just as open source developers are offering much-needed transparency and insight into the development of this next stage of artificial intelligence, state supercomputers provide a way for the rest of the world to keep up with the corporate giants.
"The dangers of these models should not be inflated, we should be very, very candid and very objective about what is possible,” Matsuoka said. “But, nonetheless, it poses similar dangers if it falls into the wrong hands as something like atomic energy or nuclear technologies.”
State supercomputers have for a long time controlled who accesses them. “We vet the users, we monitor what goes on,” he said. “We've made sure that people don't do Bitcoin mining on these machines, for example.”
Proposals for compute use are submitted, and the results are checked by experts. “A lot of these results are made public, or if a company uses it, the results are supposed to be for the public good,” he continued.
Nuclear power stations and weapons are highly controlled and protected by layers of security. “We will learn the risks and dangers of AI,” he said. “The use of these technologies could revolutionize society, but foundation models that may have illicit intent must be prevented. Otherwise, it could fall into the wrong hands, it could wreak havoc on society. While it may or may not wipe out the human race, it could still cause a lot of damage.”
That requires state-backed supercomputers, he argued. “These public resources allow for some control, to the extent that with transparency and openness, we can have some trustworthy guarantees. It's a much safer way than just leaving it to some private cloud.”
Building the world’s largest supercomputers
"We are now at a realm where if we are to get very effective foundation models, we need to start training at basically multi-exascale level performance in low precision," Matsuoka explained.
While traditional machine learning and simulation models use 32-bit “single-precision” floating point numbers (and sometimes 64-bit “double-precision” floating point numbers), generative AI can use lower precision.
Shifting to the half-precision floating-point format FP16, and potentially even FP8, means that you can fit more numbers in memory and in the cache, as well as transmit more numbers per second. This move massively improved the computing performance of these models, and has changed the design of the systems used to train them.
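The memory half of that argument is easy to see with NumPy (the array size is arbitrary):

```python
import numpy as np

# The same number of values takes half the memory in FP16 as in FP32,
# so twice as many fit in cache, and twice as many can be moved per
# second over the same memory or network bandwidth.
n = 1_000_000
fp32 = np.zeros(n, dtype=np.float32)
fp16 = np.zeros(n, dtype=np.float16)

print(fp32.nbytes)  # 4000000 bytes
print(fp16.nbytes)  # 2000000 bytes
```

NumPy has no native FP8 type, but FP8 would halve the footprint again.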
Fugaku is capable of 442 petaflops on the FP64-based Linpack benchmark, and achieved two exaflops (2×10^18 operations per second) using the mixed FP16/FP64 precision HPL-AI benchmark.
OpenAI is secretive about its training resources, but Matsuoka believes that "GPT-4 was trained on a resource that's equivalent to one of the top supercomputers that the state may be putting up," estimating that it could be a 10 exaflops (FP16) machine "with AI optimizations."
“Can we build a 100 exaflops machine to support generative AI?” Matsuoka asked. “Of course we can. Can we build a zettascale machine on FP8 or FP16? Not now, but sometime in the near future. Can we scale the training to that level? Actually, that’s very likely.”
This will mean facing new challenges of scale. “Propping up a 20,000 or a 100,000 node machine is much more difficult,” he said. Going from a 1,000-node machine to 10,000 does not simply require scaling by a factor of 10. “It's really hard to operate these machines,” he said, “it’s anything but a piece of cake.”
It again comes down to the question of when and where models will start to plateau. “Can we go five orders of magnitude better? Maybe. Can we go two orders of magnitude? Probably. We still don't know how far we can go. And that's something that we'll be working on.”
Some people even warn that HPC will be left behind by the cloud, because what governments can invest is outclassed by the hyperscalers’ research budgets.
Weak scaling and the future of HPC
To understand what the future might hold for HPC, we must first understand how the large parallel computing systems of today came to be.
Computing tasks including AI can be made to run faster by breaking them up and running parts of them in parallel on different machines, or different parts of the same machine.
In 1967, computer scientist and mainframe pioneer Gene Amdahl noted that parallelization has limits: no matter how many cores a program runs on, its speedup is capped by the portions that cannot be broken down and run in parallel.
But in 1988, Sandia Labs' John Gustafson essentially flipped the issue on its head and changed the focus from the speed to the size of the problem.
"So the runtime will not decrease as you add more parallel cores, but the problem size increases," Matsuoka said. "So you're solving a more complicated problem."
That's known as weak scaling, and it's been used by the HPC community for research workloads ever since.
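The contrast between Amdahl's fixed-size ("strong") scaling and Gustafson's weak scaling fits in a few lines of Python, where p is the parallelizable fraction of the work and n is the core count:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup for a fixed problem size is capped
    by the serial fraction, no matter how many cores are added."""
    return 1.0 / ((1.0 - p) + p / n)

def gustafson_speedup(p, n):
    """Gustafson's law: grow the problem with the machine, and the
    scaled speedup keeps increasing with the core count."""
    return (1.0 - p) + p * n

# With 95% parallel work on 10,000 cores:
print(amdahl_speedup(0.95, 10_000))     # ~20x: stuck near the 1/(1-p) ceiling
print(gustafson_speedup(0.95, 10_000))  # ~9,500x: scales with the machine
```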
"Technologies advanced, algorithms advanced, hardware advanced, to the extent that we now have machines with this immense power and can utilize this massive scaling,” Matsuoka said. “But we are still making progress with this weak scale, even things like GPUs, it's a weak scaling machine."
That is “the current status quo right now,” he said.
This could change as we near the end of Moore’s Law, the observation that the number of transistors that can be put on a chip, and with it the power of a CPU, doubles roughly every two years. Moore’s Law has delivered a continuously increasing number of processor cores per dollar spent on a supercomputer, but as semiconductor fabrication approaches fundamental physical limits, that will no longer be the case.
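As a toy illustration of the compounding that is now ending, a clean two-year doubling cadence works out as follows (the real cadence was never this tidy):

```python
# Idealized Moore's-law growth: the transistor budget multiplies by
# 2^(years / doubling_period). Purely illustrative.
def moore_multiplier(years, doubling_period=2.0):
    return 2 ** (years / doubling_period)

print(moore_multiplier(10))  # 32.0 -> a 32x budget after a decade
```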
“We will no longer be able to achieve the desired speed up just with weak scaling, so it may start diverging,” Matsuoka warned.
Already we’re beginning to see signs of different approaches. With deep learning models like generative AI able to rely on lower precision like FP16 and FP8, chip designers have added matrix multiply units to their latest hardware to make it significantly faster at these lower precisions.
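The pattern those matrix units implement, multiplying low-precision inputs while accumulating in higher precision, can be imitated in NumPy to show why accumulation precision matters (matrix size and random seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
a64 = rng.random((256, 256))
b64 = rng.random((256, 256))
a16, b16 = a64.astype(np.float16), b64.astype(np.float16)

ref = a64 @ b64  # FP64 reference result

# Everything in FP16: rounding error builds up across the accumulation.
err_pure = np.abs((a16 @ b16).astype(np.float64) - ref).max()

# FP16 inputs, FP32 accumulation: the pattern hardware matrix units use.
mixed = a16.astype(np.float32) @ b16.astype(np.float32)
err_mixed = np.abs(mixed.astype(np.float64) - ref).max()

print(err_mixed < err_pure)  # True: higher-precision accumulation helps
```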
“It’s still weak scaling, but most HPC apps can't make use of them, because the precision is too low,” Matsuoka said. “So machine designers are coming up with all these ideas to keep the performance scaling, but in some cases, there are divergences happening which may not lead to a uniform design where most of the resources can be leveraged by all camps. This would lead to an immense diversity of compute types.”
This could change the supercomputer landscape. “Some people claim it's going to be very diverse, which is a bad thing, because then we have to build these specific machines for a specific purpose,” he said. “We believe that there should be more uniformity, and it’s something that we are actively working on.”
The cloudification of HPC
RIKEN, Matsuoka’s research institute, is looking at how to keep up with the cadence of hyperscalers, which are spending billions of dollars every quarter on the latest technologies.
“It's not easy for the cloud guys either - once you start these scaling wars, you have to buy into this game,” Matsuoka said.
State-backed HPC programs take around 5-10 years between each major system, working from the ground up on a step-change machine. During this time cloud-based systems can cycle through multiple generations of hardware.
“The only way we foresee to solve this problem is to be agile ourselves by combining multiple strategies,” said Matsuoka. He wants to keep releasing huge systems, based on fundamental R&D, once or twice a decade - but to augment them with more regular updates of commercial systems.
He hopes that a parallel program could deliver new machines faster, but at a lower cost. “It will not be a billion dollars [like Fugaku], but it could be a few hundred million. These foundation models and their implications are hitting us at a very rapid pace, and we have to act in a very reactive way.”
RIKEN is also experimenting with the 'Fugaku Cloud Platform,' to make its supercomputer available more widely in partnership with Fujitsu.