As large language models (LLMs) and other generative AI systems remain the workload du jour, data centers have adapted to support deployments of tens of thousands of GPUs for training and inference.

Nvidia remains the leader in the training race, with its high-end GPUs dominating the market. But, as the generative AI market matures, the size of the models and how inference is run on them could be set to change.

"We're in that part of the hype cycle where being able to say 'the model has hundreds of billions of parameters that took months to train and required a city's worth of power to do it' is seen as actually a good thing right now," Ampere's chief product officer Jeff Wittich told DCD.

"But we're missing the point of it, which is the efficiency angle. If that's what it took to do it, was that the right way to go about modeling?"

Wittich is one of a number of industry figures who believe the future will not consist purely of these mega-models, but also of countless smaller, highly specialized systems: "If you have an AI that's helping people to write code, does it need to know the recipe for souffle?"

That version of tomorrow would prove lucrative for Ampere, which develops high-performance Arm-based CPUs. "Even today, you could run a lot of LLM models on something that's more efficient," he said.

"You could run them on CPUs, but people just aren't because they went and built gigantic training clusters with GPUs, and then use them to train and inference the models."

Part of the problem is the speed at which the market is currently moving: generative AI is still a nascent sector with everything to fight for. Nvidia GPUs - if you can get them - perform fantastically and have a deep software library to support rapid development.

"It's just 'throw the highest power stuff at it that we can, to be the fastest and the biggest," Wittich said. "But that'll be the thing that'll come back to haunt us. It's so power hungry, and it's so costly to do this that when that starts to matter this could be the thing that dooms this, at least in the short term."

GPUs will still be at the heart of training, especially for the larger models, Wittich said, but he questions whether they are truly the optimal chip for inference. "People are going and building the same stuff for the inferencing phase when they don't need to, because there is a more efficient solution for them to use," he said.

"We've been working with partners Wallaroo.AI on CPU-based inferencing, optimizing the models for it, and then scaling out - and they can get a couple of times more inferencing results throughput at the same latency without consuming any more power."

Taking OpenAI’s Whisper generative speech recognition model as an example, Ampere claims that its 128-core Altra CPU consumes 3.6 times less power per inference than Nvidia’s A10 (of course, the more expensive and power-hungry A100 has better stats than the A10).
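
The power-per-inference figure is Ampere's own claim and is not something the snippet below reproduces. It only shows that the open-source openai-whisper package will run the same checkpoint entirely on a CPU when asked to; the audio file name is a placeholder.

    # Illustrative only: CPU-only Whisper inference with the open-source
    # openai-whisper package (pip install openai-whisper). The benchmark
    # numbers quoted above are Ampere's claims, not produced by this code.
    import whisper

    model = whisper.load_model("base", device="cpu")  # no GPU required
    result = model.transcribe("audio.wav")            # placeholder audio path
    print(result["text"])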

Inferencing with a high memory footprint will likely remain better suited to GPUs, but Wittich believes that the majority of models will be a better fit for CPUs. The company’s AI team has developed the AI-O software library to help companies shift code from GPUs to CPUs.
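
AI-O's own API is not shown here; as a stand-in, the generic PyTorch pattern below illustrates the kind of shift being described - the same inference code uses a GPU when one is present and falls back to the CPU when it is not. The ResNet model and batch size are arbitrary placeholders.

    # Generic GPU-to-CPU fallback in PyTorch, as a stand-in for the kind of
    # porting AI-O is described as helping with (this is not AI-O's API).
    import torch
    import torchvision.models as models

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = models.resnet18(weights=None).eval().to(device)  # placeholder model

    with torch.inference_mode():
        batch = torch.randn(8, 3, 224, 224, device=device)   # placeholder batch
        logits = model(batch)
    print(f"ran inference on {device}, output shape {tuple(logits.shape)}")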

CPU developers are also slowly borrowing from GPUs. Ampere - along with Intel, AMD, and others - has integrated ever more AI compute functions into its hardware.

"When you look at the design of Ampere One, we did specific things at the micro-architectural level that improve inference performance," Wittich said, pointing to the company's 2021 acquisition of AI company OnSpecta. "AI is one of these things where stuff that was very specialized years ago eventually becomes general purpose."

There are always trade-offs in design, however: "If a block is included, it is stealing area, power, and validation resources."

He added: "If something is used 80-90 percent of time, that's what I want on every single one of our CPUs. If it's 20-30 percent of the time, I can create product variations that allow me to incorporate that when it's needed.

“You don't want a bunch of esoteric accelerators in the CPU that are always drawing power and always consuming area."

Of course, GPUs and CPUs are not the only game in town: a number of chip providers are developing dedicated inferencing chips that boast competitive performance and power consumption figures.

Here, Wittich counters with the other issue of industry bubbles: they often pop.

"A lot of the AI inferencing chips that are out there are really good at one type of network or one type of model," he said. "The more specialized you get, usually the better you get at it.

"But the problem is, you better have guessed correctly and be pretty confident that the thing that you're really, really good at is the thing that is important a couple of years from now."

If generative AI takes a dramatic turn away from the current model architectures, or the entire industry collapses (or, perhaps, coalesces around just OpenAI), then you could be left holding the bag.

When Bitcoin crashed in value, miners were left with thousands of highly specialized ASICs that were useless for any other task. Many of the chips were simply destroyed and sent to landfills.

Ethereum miners, on the other hand, mostly relied on GPUs. Several providers, like CoreWeave, have successfully pivoted those GPU fleets to the current AI wave.

CPUs are inherently general purpose, meaning that a company doesn't have to bet the farm on a specific business model. "We know that overall compute demand is going to grow over the next couple of years, whether it's inferencing, database demand, media workloads, or something else," Wittich said.

"You're safe regardless of what happens after you get out of the boom phase."