The launch of Gemini last week marked Google's biggest effort to reassert its dominance as an artificial intelligence company after OpenAI's GPT models spent the year as the world's most popular generative AI platform.

To create the model family, the company embarked on a huge infrastructure build-out, setting out to prove that models can be trained without relying on Nvidia's GPUs.

Google declined to provide specific details about how the three Gemini models were built, but here's what DCD learned from interviews, research papers, and public comments.

Google TPU data center – Google

Gemini Ultra, the largest of its models, was trained across multiple data centers, a step up from previous single-facility models like PaLM-2.

"We don't disclose exactly the details of how many locations but it was trained across multiple sites, and multiple clusters within those sites," Google Cloud CEO Thomas Kurian told DCD.

"We use a technology called multi-host to enable us to distribute the training. The reason we typically distribute the training is to ensure that if one side, for example, has power issues or other things, we have resilience. It also allows us to deploy a larger cluster of machines, because of space and power considerations."

Multi-host is set to be rolled out to Google Cloud customers, which should soon give us a better understanding of the latency limitations (and therefore distance restrictions) of facilities operating together in one training run.

To pull it off, Google connects TPU SuperPods (themselves 4,096 chips apiece) using its intra-cluster network (linking chips within a cluster) and its inter-cluster network (linking separate clusters), a Google DeepMind paper states. The company exploits model parallelism within a SuperPod and data parallelism across multiple SuperPods.
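The paper says the Gemini models were trained with JAX and Pathways, but neither Google nor the paper details the exact sharding configuration. The general pattern, though - a device mesh with a "model" axis inside a pod and a "data" axis across pods - can be sketched in a few lines of JAX. The mesh shape, axis names, and layer sizes below are illustrative assumptions, not Gemini's real setup.

```python
# A minimal sketch of hybrid parallelism: data parallelism across "pods",
# model (tensor) parallelism within one. All sizes are illustrative.
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"  # 8 fake CPU devices for the demo

import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = mesh_utils.create_device_mesh((2, 4))     # e.g. 2 "pods" x 4 chips each
mesh = Mesh(devices, axis_names=("data", "model"))

# The batch is split across the "data" axis (across pods); the weight matrix
# is split along its output dimension across the "model" axis (within a pod).
x = jax.device_put(jnp.ones((8, 1024)), NamedSharding(mesh, P("data", None)))
w = jax.device_put(jnp.ones((1024, 4096)), NamedSharding(mesh, P(None, "model")))

@jax.jit
def layer(x, w):
    # XLA inserts the cross-chip collectives implied by the input shardings.
    return jnp.tanh(x @ w)

print(layer(x, w).sharding)  # output typically sharded over both mesh axes
```

The same program structure applies whether the "data" axis spans hosts in one building or, as with Ultra, clusters in different ones; the compiler and runtime handle the collectives.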

At the heart of that is Google's networking platform, Jupiter.

It relies on in-house optical circuit switching (OCS) - a project previously known as Mission Apollo - that replaces the data center spine.

In traditional network topologies, signals jump back and forth between electrical and optical. “It's all been hop by hop, you convert back to electronics, you push it back out to optics, and so on, leaving most of the work in the electronic domain,” Amin Vahdat, Google’s systems and services infrastructure team lead, told us earlier this year.

“This is expensive, both in terms of cost and energy.”

With its custom-built OCS, the company “leaves data in the optical domain as long as possible,” using tiny mirrors to redirect beams of light from a source point and send them directly to the destination port as an optical cross-connect.

This dramatically reduces latency, as well as costs - making multi-host possible.

For Ultra, the OCS was used to dynamically reconfigure 4x4x4 chip cubes into arbitrary 3D torus topologies in around 10 seconds. "We decided to retain a small number of cubes per SuperPod to allow for hot standbys and rolling maintenance," the paper says.
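As a purely hypothetical illustration (not Google code) of what a 3D torus means here: every chip at coordinate (x, y, z) has six neighbors, with links that wrap around at the edges, and reconfiguring via the OCS amounts to choosing a different torus shape.

```python
# Hypothetical illustration of a 3D torus: each chip has six neighbors,
# with links that wrap around at the edges of the topology. The shape
# argument is arbitrary; (4, 4, 4) matches a single TPUv4 cube.
def torus_neighbors(coord, shape=(4, 4, 4)):
    x, y, z = coord
    sx, sy, sz = shape
    return [
        ((x - 1) % sx, y, z), ((x + 1) % sx, y, z),  # +/- x, wrapping around
        (x, (y - 1) % sy, z), (x, (y + 1) % sy, z),  # +/- y
        (x, y, (z - 1) % sz), (x, y, (z + 1) % sz),  # +/- z
    ]

# A corner chip still has six neighbors - the "missing" ones wrap to the far side.
print(torus_neighbors((0, 0, 0)))
# Reconfiguring via the OCS amounts to picking a different shape,
# e.g. an 8x4x4 torus stitched together from two cubes:
print(torus_neighbors((7, 0, 0), shape=(8, 4, 4)))
```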

Each version of Gemini (Nano, Pro, and Ultra) was trained on Google's TPUs, with the company using a mixture of TPUv5e and TPUv4.

The Pro model was trained in a matter of weeks, "leveraging a fraction of the Ultra’s resources," the Google DeepMind report states.

"Training Gemini Ultra used a large fleet of TPUv4 accelerators," it adds.

While TPUv5e offers a 2.3× improvement in price-performance over TPUv4 for training large language models (according to Google), it was only officially announced this November. It would have been available internally at Google earlier, but likely not in quantities large enough for Ultra.

However, the fact that the ChatGPT-competitive models were trained on Google's hardware marks an important step in breaking Nvidia's stranglehold over AI development, the company has argued.

"I've been seeing multiple misinformed posts and articles recently claiming 'no ML is happening without Nvidia's GPU' or 'ML is only done with Nvidia's CUDA,'" Google's head of growth for Cloud TPU Max Sapozhnikov said on Twitter/X.

"It's great to see this myth being demystified - there is A LOT of great ML happening on TPUs with zero dependence on GPUs or CUDA."

He added: "Anthropic, Midjourney, Salesforce, and many others are already building their stack on TPUs, leveraging the benefits of hardware cost and power efficiency as well as XLA compiler optimizations."

Anthropic also uses Amazon's Trainium hardware, after AWS invested in the company alongside Google.

Google has yet to share the thermal design power (TDP) of its latest TPUs, but a 2021 research paper reveals that the TPUv3 had a TDP of 450W (660W if you include power for the DSA memory system plus its share of the server host power), up from 280W/460W on the previous chip.

The company has liquid-cooled its TPUs since 2018. "We have deployed large-scale systems that have very dense footprints and have advanced features like liquid cooling, which allows you to get significantly more throughput from the system," Kurian said.

While liquid cooling is necessary to cope with the rising temperatures of high-density data centers, it has an unintended side effect: it can make hardware more vulnerable to cosmic rays.

Earlier this year, researchers at NTT and Hokkaido University warned in a paper that "if semiconductors are cooled by water, the thermal neutron count is expected to increase significantly."

All hardware at scale can be susceptible to radiation issues, and as process nodes get smaller, that risk increases. Radiation strikes can crash machines or corrupt data outright - or, more concerningly, cause silent data corruption (SDC), where the chip is "still processing bits, but the data it gave you was wrong - and it didn't know it was wrong," semiconductor radiation researcher Andrew Keller told DCD last year.

Keller set out to find out how often a data center in Denver, Colorado, with 100,000 operational FPGAs would be impacted by radiation. He found that such a deployment would experience a configuration memory issue every half-hour on average, and an SDC event every 0.5 to 11 days.

It's not clear how susceptible Google's TPUs are to radiation, but the company notably mentioned cosmic rays in the initial release of its Gemini research paper: "Genuine machine failures are commonplace across all hardware accelerators at such large scales, due to external factors such as cosmic rays." Curiously, the paper was quietly updated to remove that mention - we have asked Google why.

With Gemini Ultra, Google said that it expected "SDC events to impact training every week or two. Rapidly detecting and removing faulty hardware required several new techniques that exploit deterministic replay to isolate incorrect computations, combined with proactive SDC scanners on idle machines and hot standbys."

The company added that its "fully deterministic infrastructure" allowed it to quickly identify root causes (including hardware failures) during the development leading up to the Ultra model, "and this was a crucial ingredient towards stable training."
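Google has not published its detection code, but the core of the deterministic-replay idea can be sketched: because a training step is a deterministic function of its inputs, replaying the same step on a hot-standby chip and comparing the results bitwise separates a faulty chip from a software bug. The step function, device selection, and sizes below are invented for illustration, not Google's implementation.

```python
# A hedged sketch of deterministic replay for SDC detection (illustrative only).
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=2"  # 2 fake devices for the demo

import jax
import jax.numpy as jnp

@jax.jit
def train_step(params, batch):
    # Stand-in for a real training step; what matters is that it is deterministic.
    return params - 1e-3 * jnp.dot(batch.T, batch @ params)

suspect, standby = jax.devices()[:2]
params = jnp.ones((512, 64))
batch = jnp.ones((32, 512))

def replay_on(device, params, batch):
    # Commit the inputs to the given chip and replay the same step there.
    p = jax.device_put(params, device)
    b = jax.device_put(batch, device)
    return jax.device_get(train_step(p, b))

out_suspect = replay_on(suspect, params, batch)
out_standby = replay_on(standby, params, batch)

# With deterministic math, any bitwise mismatch points at faulty hardware
# rather than a software bug.
if not (out_suspect == out_standby).all():
    print("possible silent data corruption: quarantine the suspect chip")
else:
    print("replica outputs match bitwise")
```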

Finally, the paper discloses that the company eschewed the "conventional approach of periodic checkpointing of weights to persistent cluster storage." Instead, the company "made use of redundant in-memory copies of the model state, and on any unplanned hardware failures, we rapidly recover directly from an intact model replica."

This differed from PaLM and PaLM-2 training runs, leading to "a substantial speed up in recovery time, despite the significantly larger training resources being used."
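The paper does not describe the mechanism in more depth, but its shape can be sketched: every data-parallel replica already holds a complete copy of the model state in memory, so a failed replica is rebuilt from a healthy peer rather than from a checkpoint on persistent storage. The ReplicaState class and recover function below are hypothetical illustrations, not Google's code.

```python
# A hedged sketch of in-memory replica recovery (illustrative only).
import jax.numpy as jnp

class ReplicaState:
    """In-memory model state held by one data-parallel replica (hypothetical)."""
    def __init__(self, replica_id, params):
        self.replica_id = replica_id
        self.params = params      # weights / optimizer state kept in memory
        self.healthy = True

def recover(replicas, failed_id):
    """Restore a failed replica from any intact peer's in-memory copy."""
    donor = next(r for r in replicas if r.healthy and r.replica_id != failed_id)
    failed = next(r for r in replicas if r.replica_id == failed_id)
    # JAX arrays are immutable, so a shallow copy of the state dict suffices -
    # a fast memory-to-memory transfer rather than a read from cluster storage.
    failed.params = dict(donor.params)
    failed.healthy = True
    return failed

# Three data-parallel replicas holding identical state.
params = {"w": jnp.ones((1024, 1024)), "step": jnp.array(0)}
replicas = [ReplicaState(i, dict(params)) for i in range(3)]

replicas[1].healthy = False           # unplanned hardware failure on replica 1
recover(replicas, failed_id=1)        # rebuilt from a peer, no checkpoint reload
```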

The goodput, which the paper defines as the time spent computing useful new steps over the elapsed time of the training job, "for the largest-scale training job increased from 85 percent to 97 percent," the company said.