Meta has shared details of the hardware, network, storage, design, performance, and software behind its two new 24,000-GPU data center-scale clusters, which the company is using to train its Llama 3 large language model.

The new training clusters are based on Meta’s AI Research SuperCluster (RSC), which was unveiled in 2022.

Developed to support AI research and development in areas such as natural language processing, speech recognition, and image generation, the newly announced clusters each contain 24,576 Nvidia H100 Tensor Core GPUs. This is a significant increase over the original RSC, which contained 16,000 Nvidia A100 GPUs.

Meta said this increase allows the clusters to support larger and more complex models than the RSC, paving the way for advancements in generative AI product development.

By the end of 2024, the company aims to grow its infrastructure build-out to include 350,000 Nvidia H100s, part of a portfolio with compute power equivalent to almost 600,000 H100s.

While the GPU count is the same, the two clusters differ in network infrastructure. Both interconnect 400 Gbps endpoints, but Meta has built one with a remote direct memory access (RDMA) over Converged Ethernet (RoCE) network fabric based on the Arista 7800 with Wedge400 and Minipack2 OCP rack switches, while the other features an Nvidia Quantum2 InfiniBand fabric.
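From a training job's point of view, the difference between the two fabrics largely shows up as transport configuration at launch time. As a rough, hypothetical sketch using generic NCCL environment variables (these are not Meta's actual settings, and the adapter and interface names are placeholders):

```python
import os

# Rough sketch: steering NCCL toward one fabric or the other at job launch.
# Adapter and interface names are placeholders, not Meta's actual values.
FABRIC = "roce"  # or "infiniband"

if FABRIC == "infiniband":
    # RDMA traffic goes straight over the InfiniBand HCAs.
    os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"
else:
    # RoCE also uses the IB verbs transport, but RoCE v2 usually needs an
    # explicit GID index so NCCL picks the routable (v2) address.
    os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"
    os.environ["NCCL_IB_GID_INDEX"] = "3"  # common RoCE v2 value; site-specific

# Interface used for NCCL's out-of-band bootstrap handshake.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
```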

Additionally, both clusters have been built using Grand Teton, Meta's in-house, open GPU hardware platform designed to support large AI workloads. The follow-up to the Zion-EX platform, Grand Teton offers 4x the host-to-GPU bandwidth, 2x the compute and data network bandwidth, and 2x the power envelope of its predecessor.

The clusters have also been developed using Meta’s Open Rack power and rack architecture, infrastructure that has been specifically designed to support solutions such as Grand Teton and provide greater flexibility in the data center environment.

The company’s Open Rack v3 hardware allows power shelves to be installed anywhere in the rack rather than bolted to the busbar, enabling more flexible rack configurations.

For these new clusters, Meta said the number of servers per rack has been customized to balance throughput capacity per server against rack count reduction and the associated power efficiency.

For storage, the clusters use a Linux Filesystem in Userspace (FUSE) API backed by a version of Meta’s ‘Tectonic’ distributed storage solution. The company has also partnered with Hammerspace to jointly develop a parallel network file system (NFS).
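Because that storage is exposed through FUSE, training jobs can treat it as an ordinary POSIX path. A hypothetical sketch of what that looks like in practice (the mount point and helper functions below are illustrative, not Meta's code):

```python
import os
import torch

# Hypothetical sketch: because the store is exposed through FUSE, a training
# job reads and writes checkpoints with ordinary POSIX file APIs. The mount
# point below is a placeholder, not a real Meta path.
CHECKPOINT_DIR = "/mnt/tectonic/llama-checkpoints"

def save_checkpoint(model: torch.nn.Module, step: int) -> str:
    os.makedirs(CHECKPOINT_DIR, exist_ok=True)
    path = os.path.join(CHECKPOINT_DIR, f"step_{step:08d}.pt")
    torch.save(model.state_dict(), path)  # plain file write via the FUSE mount
    return path

def load_checkpoint(model: torch.nn.Module, path: str) -> None:
    model.load_state_dict(torch.load(path, map_location="cpu"))
```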

Both clusters are based on the YV3 Sierra Point server platform with the latest high-capacity E1.S SSDs. Optimal network utilization was achieved through changes to network topology and routing, and by deploying the Nvidia Collective Communications Library (NCCL) – a library of standard communication routines optimized for Nvidia GPUs and networking.
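To illustrate the role NCCL plays, here is a minimal, hypothetical sketch of the kind of collective it accelerates, expressed through PyTorch's NCCL backend (the launch command and script name are illustrative):

```python
import torch
import torch.distributed as dist

# Minimal sketch of the kind of collective NCCL accelerates: each rank holds a
# tensor on its own GPU, and an all-reduce sums them across the whole job.
# Launched with something like `torchrun --nproc_per_node=8 allreduce_demo.py`.

def main() -> None:
    dist.init_process_group(backend="nccl")  # NCCL handles the GPU-to-GPU transport
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes a tensor filled with its own rank id.
    t = torch.full((4,), float(rank), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # result: the sum of all rank ids

    print(f"rank {rank}: {t.tolist()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```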

Meta said it’s also continuing to evolve its PyTorch foundational AI framework to make it ready for training across hundreds of thousands of GPUs.
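As one illustration of what training at that scale involves (a generic sketch, not Meta's stack), PyTorch's Fully Sharded Data Parallel (FSDP) spreads a model's parameters, gradients, and optimizer state across ranks:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Sketch only, not Meta's training stack: FSDP shards parameters, gradients,
# and optimizer state across ranks, with NCCL moving the data, so models far
# larger than a single GPU's memory can be trained. Launch with torchrun as in
# the previous sketch.

def main() -> None:
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
    model = FSDP(model)  # each rank now holds only a shard of the parameters
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()      # gradient reduction happens over NCCL
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```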

In a blog post co-written by Kevin Lee, technical program manager; Adi Gangidi, production network engineer; and Mathew Oldham, director, production engineering, the company said it maintains its commitment to open innovation in AI software and hardware and has launched the AI Alliance in an effort to build an open ecosystem that brings “transparency, scrutiny, and trust to AI development and leads to innovations that everyone can benefit from that are built with safety and responsibility top of mind.”

The blog post continued: “As we look to the future, we recognize that what worked yesterday or today may not be sufficient for tomorrow's needs. That's why we are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond. Our goal is to create systems that are flexible and reliable to support the fast-evolving new models and research.”