Generative AI looks set to change how we work, create, and live. Governments, businesses, and individuals are all grappling with what it means for the economy and our species, but they struggle because we simply don’t know what AI will be capable of, or what the costs and benefits of applying it will be.
Behind this transformation lies a deeper story, of vast changes in compute architectures, networking topologies, and data center design. Deploying the massive computing resources these systems require could change the cloud industry, and put the traditional supercomputing sector at risk.
To understand what this moment means, and what could be coming next, DCD spent four months talking to nearly two dozen AI researchers, semiconductor specialists, networking experts, cloud operators, supercomputing visionaries, and data center leaders.
This story begins with the models, the algorithms that fundamentally determine how an AI system works. We look at how they are made, and how they could grow. In operation, we look at the twin requirements of training and inferencing, and the so-called ‘foundation models’ which can be accessed by enterprises and users. We also ask what the future holds for open-source AI development.
From there, we move to the world of supercomputers, understanding their use today and why generative AI could upend the traditional high-performance computing (HPC) sector. Next, we talk to the three hyperscalers that have built gigantic AI supercomputers in the cloud.
Then we turn to chips, where Nvidia has a lead in the GPU processors that power AI machines. We talk to seven companies trying to disrupt Nvidia - and then hear from Nvidia's head of data centers and AI to learn why unseating the leader will be so hard.
But the story of compute is meaningless without understanding networking, so we talk to Google about a bold attempt to overhaul how racks are connected.
Finally, we learn about what this all means for the data center. From the CEO of Digital Realty, to the CEO of DE-CIX, we hear from those set to build the infrastructure of tomorrow.
Making a model
Our journey through this industry starts with the model. In 2017, Google published the 'Attention is All You Need' paper that introduced the transformer model, which allowed for significantly more parallelization and reduced the time to train AIs.
This set off a boom in development, with generative AI models all built from transformers. These systems, like OpenAI’s large language model (LLM) GPT-4, are known as foundation models: one company develops a pre-trained model, which others can then adapt and use.
“The model is a combination of lots of data and lots of compute,” Rishi Bommasani, co-founder of Stanford’s Center for Research on Foundation Models, and lead author of a seminal paper defining those models, told DCD. “Once you have a foundation model, you can adapt it for a wide variety of different downstream applications,” he explained.
Every such foundation model is different, and the costs to train them can vary greatly. But two things are clear: The companies building the most advanced models are not transparent about how they train them, and no one knows how big these models will scale.
Scaling laws are an area of ongoing research, which tries to work out the optimal balance between the size of the model, the amount of data, and the computational resources available.
Raising a Chinchilla
"The scaling relations with model size and compute are especially mysterious," a 2020 paper by OpenAI's Jared Kaplan noted, describing the power-law relationship between model size, dataset size, and the compute power used for training.
As each factor increases, so does the overall performance of the large language model.
This theory led to larger and larger models, with increasing parameter counts (the values that a model can change as it learns) and more tokens (the units of text that the model processes, essentially the data). Optimizing these parameters involves multiplying sets of numbers, or matrices, which takes a lot of computation, and means larger compute clusters.
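As a very rough illustration of why bigger models and more data mean bigger clusters, a widely used back-of-envelope rule estimates dense transformer training at around six floating-point operations per parameter per token. The model size and token count in the sketch below are illustrative assumptions, not figures for any real system:

```python
# A common rule of thumb, not the papers' full scaling laws: training a dense
# transformer costs roughly six FLOPs per parameter per token, so C ~= 6 * N * D.
# The model size and dataset below are illustrative assumptions.

def training_flops(parameters: float, tokens: float) -> float:
    """Approximate training compute, in FLOPs, for a dense transformer."""
    return 6 * parameters * tokens

# e.g. a hypothetical 70bn-parameter model trained on 1.4tn tokens
print(f"~{training_flops(70e9, 1.4e12):.1e} FLOPs")  # about 5.9e+23
```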
The Kaplan paper was superseded in 2022 by a new approach from Google subsidiary DeepMind, known as 'Chinchilla scaling laws,' which again tried to find the optimal parameter and token counts for training an LLM under a given compute budget. It found that the models of the day had far too many parameters relative to the number of tokens they were trained on.
While the Kaplan paper said that a 5.5× increase in the size of the model should be paired with a 1.8× increase in the number of tokens, Chinchilla found that parameter and token sizes should be scaled in equal proportions.
The Google subsidiary trained the 70 billion-parameter Chinchilla model based on this compute-optimal approach, using the same compute budget as a previous model, the 280bn-parameter Gopher, but with four times as much data. Tests found that it outperformed Gopher as well as other comparable models, while using a quarter of the compute for fine-tuning and inference.
Crucially, under the new paradigm, DeepMind found that Gopher, which already had a massive compute budget, would have benefited from being trained with more compute on 17.2× as much data.
An optimal one trillion parameter model, meanwhile, would need some 221.3 times Gopher's compute budget to feed it proportionally more data, pushing the limits of what's possible today. That is not to say one cannot train a one trillion parameter model (indeed Google itself has), it's just that the same compute could have been used to train a smaller model with better results.
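To make the 'equal proportions' idea concrete, here is a minimal sketch of a Chinchilla-style allocation, assuming the same six-FLOPs-per-parameter-per-token approximation and the roughly 20-tokens-per-parameter ratio often quoted from the paper; the paper's actual fitted coefficients differ:

```python
import math

# A minimal sketch of Chinchilla-style "equal scaling," assuming the common
# C ~= 6 * N * D training-cost approximation and a ~20 tokens-per-parameter
# ratio. The paper's fitted coefficients differ; these are illustrative.

TOKENS_PER_PARAM = 20       # assumed compute-optimal data-to-parameter ratio
FLOPS_PER_PARAM_TOKEN = 6   # rough training FLOPs per parameter per token

def chinchilla_optimal(compute_budget_flops: float):
    """Split a compute budget into an approximately optimal (params, tokens) pair."""
    params = math.sqrt(compute_budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

# Doubling the budget grows parameters and tokens by the same ~1.41x factor.
for budget in (1e24, 2e24):
    n, d = chinchilla_optimal(budget)
    print(f"budget {budget:.0e}: ~{n:.1e} params, ~{d:.1e} tokens")
```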
Based on Chinchilla’s findings, semiconductor research firm SemiAnalysis calculated the rough computing costs of training a trillion parameter model on Nvidia A100s would be $308 million over three months, not including preprocessing, failure restoration, and other costs.
Taking things further, Chinchilla found that an optimal 10 trillion parameter model would need some 22,515.9 times as much data and compute as the optimal Gopher model. Training such a system would cost $28.9bn over two years, SemiAnalysis believes, although those costs will have come down with the release of Nvidia’s more advanced H100 GPUs.
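For a sense of how such dollar figures are reached, the sketch below turns a FLOP count into a cost estimate. The peak throughput is Nvidia's published A100 figure, but the utilization and hourly price are assumptions for illustration, not SemiAnalysis's inputs:

```python
# A hedged back-of-envelope for turning a FLOP count into a dollar figure.
# The peak throughput is Nvidia's published A100 BF16 tensor-core number;
# the utilization and hourly price are assumptions for illustration only,
# not SemiAnalysis's inputs.

A100_PEAK_FLOPS = 312e12          # BF16 tensor-core peak, FLOPs per second
ASSUMED_UTILIZATION = 0.40        # assumed fraction of peak actually achieved
ASSUMED_PRICE_PER_GPU_HOUR = 2.0  # assumed all-in dollars per GPU-hour

def rough_training_cost(total_flops: float) -> float:
    """Very rough dollar cost of a training run on A100s."""
    gpu_hours = total_flops / (A100_PEAK_FLOPS * ASSUMED_UTILIZATION) / 3600
    return gpu_hours * ASSUMED_PRICE_PER_GPU_HOUR

# e.g. a hypothetical 1e25-FLOP training run
print(f"~${rough_training_cost(1e25):,.0f}")
```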
It is understood that OpenAI, Anthropic, and others in this space have changed how they optimize compute since the paper’s publication to be closer to that approach, although Chinchilla is not without its critics.
As these companies look to build the next generation of models, and hope to show drastic improvements in a competitive field, they will be forced to throw increasingly large data center clusters at the challenge. Industry estimates put the training costs of GPT-4 at as much as 100 times that of GPT-3.5.
OpenAI did not respond to requests for comment. Anthropic declined to comment, but suggested that we talk to Epoch AI Research, which studies the advancement of such models, about the future of compute scaling.
“The most expensive model where we can reasonably compute the cost of training is Google’s [540bn parameter] Minerva,” Jaime Sevilla, the director of Epoch, said. “That took about $3 million to train on their internal data centers, we estimate. But you need to train it a number of times to find a promising model, so it’s more like $10m.”
In use, that model may also need to be retrained frequently, to take advantage of the data gathered from that usage, or to maintain an understanding of recent events.
“We can reason about how quickly compute needs have been increasing so far and try to extrapolate this to think about how expensive it will be 10 years from now,” Sevilla said. “And it seems that the rough trend of cost increases goes up by a factor of 10 every two years. For top models, that seems to be slowing down, so it goes up by a factor of 10 every five years.”
Trying to forecast where that will lead is a fraught exercise. “It seems that in 10 years, if this current trend continues - which is a big if - it will cost somewhere between $3 billion and $3 trillion for all the training runs to develop a model,” Sevilla explained.
“It makes a huge difference which, as the former is something that companies like Microsoft could afford to do. And then they won't be able to push it even further, unless they generate the revenue in order to justify larger investments.”
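The arithmetic behind those bounds is simple compounding: 10× every two years is a 100,000× multiplier over a decade, while 10× every five years is only 100×. The starting cost in the sketch below is an illustrative assumption, not an Epoch figure:

```python
# The compounding behind those bounds: 10x every two years is 10^5 over a
# decade; 10x every five years is 10^2. The starting cost is an illustrative
# assumption, not a figure from Epoch.

def growth_factor(years: float, years_per_10x: float) -> float:
    """Cost multiplier after `years` if costs rise 10x every `years_per_10x` years."""
    return 10 ** (years / years_per_10x)

assumed_cost_today = 30e6  # assumed all-in cost of today's training runs, in dollars

for years_per_10x in (2, 5):
    factor = growth_factor(10, years_per_10x)
    print(f"10x every {years_per_10x} years: {factor:,.0f}x, "
          f"or ~${assumed_cost_today * factor:,.0f} in a decade")
```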
Since we talked to Sevilla, TechCrunch reported that Anthropic now plans to develop a single model at a cost of $1bn.
What to infer from inference
Those models, large and small, then have to actually be used. This is the process of inference - which requires far less compute than training on a per-query basis, but consumes far more compute overall, as multiple instances of one trained AI are deployed to do the same job in many places.
Microsoft’s Bing AI chatbot (based on GPT-4) only had to be trained a few times (and is retrained at an unknown cadence), but is used by millions of people on a daily basis.
"Chinchilla and Kaplan, they're really great papers, but are focused on how to optimize training,” Finbarr Timbers, a former DeepMind researcher, explained. “They don't take into account inference costs, but that's going to just totally dwarf the amount of money that they spent training these models.”
Timbers, who joined the generative AI image company Midjourney (which was used to illustrate this piece) after our interview, added: “As an engineer trying to optimize inference costs, making the model bigger is worse in every way except performance. It's this necessary evil that you do.
“If you look at the GPT-4 paper, you can make the model deeper to make it better. But the thing is, it makes it a lot slower, it takes a lot more memory, and it just makes it more painful to deal with in every way. But that's the only thing that you can do to improve the model.”
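A rough sketch shows why serving can dwarf training. It assumes the common approximations of about six FLOPs per parameter per training token and two per generated token; the model size and usage figures are hypothetical, not numbers for Bing or any other real chatbot:

```python
# Why serving can dwarf training: assume ~6 FLOPs per parameter per training
# token and ~2 FLOPs per parameter per generated token. Every number below is
# hypothetical, not a figure for Bing or any other real chatbot.

params = 100e9                  # hypothetical model size
training_tokens = 2e12          # hypothetical training dataset

training_flops = 6 * params * training_tokens
flops_per_generated_token = 2 * params

assumed_daily_users = 100e6     # assumed
assumed_tokens_per_user = 1000  # assumed tokens generated per user per day

daily_inference_flops = (assumed_daily_users * assumed_tokens_per_user
                         * flops_per_generated_token)
days_to_match_training = training_flops / daily_inference_flops
print(f"Cumulative inference compute passes training compute "
      f"after ~{days_to_match_training:.0f} days")
```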
It will be hard to track how inference scales, because the sector is becoming less transparent as the leading players are subsumed into the tech giants. OpenAI began as a not-for-profit and is now a for-profit business tied to Microsoft, which has invested billions in the company. Another leading player, DeepMind, was acquired by Google in 2014.
Publicly, there are no Chinchilla-esque scaling laws for inference that show optimal model designs or predict how it will develop.
Inference was not a priority for earlier approaches, as the models were mostly developed as prototype tools for in-house research. Now that they are beginning to be used by millions, it is becoming a paramount concern.
“As we factor in inference costs, you'll come up with new scaling laws which will tell you that you should allocate much less to model size because it blows up your inference costs,” Bommasani believes. “The hard part is you don't control inference fully, because you don't know how much demand you will get.”
Not all scaling will happen uniformly, either.
Large language models are, as their name suggests, rather large. “In text, we have models that are 500bn parameters or more,” Bommasani said. That doesn’t need to be the case for all types of generative AI, he explained.
“In vision, we just got a recent paper from Google with models with 20bn parameters. Things like Stable Diffusion are in the billion parameter range so it’s almost 100× smaller than LLMs. I'm sure we'll continue scaling things, but it's more a question of where will we scale, and how we will do it.”
This could lead to a diversification in how models are made. “At the moment, there’s a lot of homogeneity because it's early,” he said, with most companies and researchers simply following and copying the leader, but he’s hopeful that as we reach compute limits new approaches and tricks will be found.
“Right now, the strategies are fairly brutish, in the sense that it's just ‘use more compute’ and there's nothing deeply intellectually complicated about that,” he said. “You have a recipe that works, and more or less, you just run the same recipe with more compute, and then it does better in a fairly predictable way.”
As the economy catches up with the models, they may end up changing to focus on the needs of their use cases. Search engines are intended for heavy, frequent use, so inference costs will dominate, and become the primary factor for how a model is developed.
Keeping this sparse
As part of the effort to reduce inference costs, it’s also important to note sparsity - the practice of removing as many unneeded parameters as possible from a model without hurting its accuracy. Outside of LLMs, researchers have been able to remove as many as 95 percent of the weights in a neural network without significantly impacting accuracy.
However, sparsity research is again in its early days, and what works on one model doesn't always work on another. Equally important is pruning, where the memory footprint of a model can be reduced dramatically, again with a marginal impact on accuracy.
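As a minimal illustration of what pruning looks like in practice, the sketch below zeroes out the smallest 95 percent of weights in a single layer using PyTorch's built-in pruning utilities; real sparsity work on large models is far more involved:

```python
import torch
import torch.nn.utils.prune as prune

# A minimal sketch of magnitude pruning on one layer, using PyTorch's
# built-in utilities. It simply zeroes the smallest weights; real sparsity
# work on large models is far more involved.

layer = torch.nn.Linear(1024, 1024)

# Zero the 95 percent of weights with the smallest absolute values.
prune.l1_unstructured(layer, name="weight", amount=0.95)

sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()
print(f"Fraction of weights now zero: {sparsity:.2%}")

# Make the pruning permanent, removing the mask bookkeeping.
prune.remove(layer, "weight")
```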
Then there's mixture of experts (MoE), where the model does not reuse the same parameters for all inputs as is typical in deep learning. Instead, MoE models select different parameters for each incoming example, picking the best parameters for the task at a constant computational cost by embedding small expert networks within the wider network.
"However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs, and training instability," Google researchers noted in a 2022 paper where they outlined a new approach that solved some of those issues. But the company has yet to deploy it within its main models, and the optimal size and number of experts to put within a model is still being studied.
Rumors swirl that GPT-4 uses MoEs, but nobody outside of the company really knows for sure. Some of the nominally largest models out of China take advantage of them, but they are not especially performant.
SemiAnalysis' chief analyst Dylan Patel believes that 2023 "will be the year of the MoE," as current approaches strain the ability of today's compute infrastructure. However, it will have its own impact, he told DCD: "MoEs actually lead to more memory growth versus compute growth," as parameter counts have to increase for the additional experts.
But, he said, no matter which approach these companies take to improving the efficiency of training and inference, “they’d be a fool to say ‘hey, with all these efficiencies, we're done scaling.’”
Instead, “the big companies are going to continue to scale, scale, and scale. If you get a 10× improvement in efficiency, given the value of this, why not 20× your compute?”
Where does it end?
As scale begets more scale, it is hard to see a limit to the size of LLMs and multimodal models, which can handle multiple forms of data, like text, sound, and images.
At some point, we will run out of fresh data to give them, which may lead to us feeding them with their own output. We may also run out of compute. Or, we could hit fundamental walls in scaling laws that we have not yet conceived of.
For humanity, the question of where scaling ends could be critical to the future of our species.
"If the scaling laws scale indefinitely, there will be some point where these models become more capable than humans at basically every cognitive task,” Shivanshu Purohit, head of engineering at EleutherAI and research engineer at Stability AI, said.
“Then you have an entity that can think a trillion times faster than you, and it's smarter than you. If it can out-plan you and if it doesn't have the same goals as you…”
That’s far from guaranteed. “People's expectations have inflated so much so fast that there could be a point where these models can't deliver on those expectations,” Purohit said.
Purohit is an “alignment” researcher, studying how to steer AI systems towards their designers' intended goals and interests, so he says a limit to scaling “would actually be a good outcome for me. But the cynic in me says that maybe they can keep on delivering, which is bad news.”
EleutherAI colleague Quentin Anthony is less immediately concerned. He says that growth generally has limits, making an analogy with human development: “If my toddler continues to grow at this rate, they're gonna be in the NBA in five years!”
He said: “We're definitely in that toddler stage with these models. I don't think we should start planning for the NBA. Sure we should think ‘it might happen at some point,’ but we'll see when it stops growing.”
Purohit disagrees. “I guess I am on the opposite end of that. There's this saying that the guy who sleeps with a machete is wrong every night but one.”