Meta has expanded its Grand Teton platform to support AMD’s latest Instinct MI300X GPUs and unveiled its new high-powered rack, known as Catalina.
The announcements were made at the Open Compute Project (OCP) Summit 2024, currently taking place in San Jose, California.
First launched by the company at OCP in 2022, Meta’s Grand Teton is a GPU-based hardware platform designed to support large AI workloads. The latest version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces.
In a post on Meta’s engineering blog, Dan Rabinovitsj, VP at Meta, and Omar Baldonado, director of DC networking at Meta and former OCP networking project co-lead, wrote that this high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads.
They added: “In addition to supporting a range of accelerator designs, now including the AMD Instinct MI300X, Grand Teton offers significantly greater compute capacity, allowing faster convergence on a larger set of weights. This is complemented by expanded memory to store and run larger models locally, along with increased network bandwidth to scale up training cluster sizes efficiently.”
Elsewhere at OCP, Meta also unveiled Catalina, the company’s new high-powered rack designed for AI workloads. Based on Nvidia’s GB200 Grace Blackwell Superchip, Catalina contains what Meta has dubbed the Orv3, a high-power rack (HPR) capable of supporting up to 140kW.
In the same blog post, Rabinovitsj and Baldonado noted that the liquid-cooled solution also consists of a power shelf supporting a compute tray, a switch tray, the Orv3 HPR, the Wedge 400 fabric switch, a management switch, a battery backup unit, and a rack management controller.
Finally, Meta detailed its incoming Disaggregated Scheduled Fabric (DSF) for next-generation AI clusters, and 51T fabric switches based on Broadcom and Cisco ASICs.
The company said its DSF is powered by the open OCP-SAI standard and FBOSS, Meta’s own network operating system for controlling network switches. It also supports an open, standard Ethernet-based RoCE interface to endpoints and accelerators across several GPUs and NICs from vendors including Nvidia, Broadcom, and AMD.
To meet growing AI needs, Meta has also developed its own NIC with Marvell. Dubbed FBNIC, it contains the first network ASIC designed by Meta.
“AI won’t realize its full potential without collaboration,” Rabinovitsj and Baldonado wrote. “We need open software frameworks to drive model innovation, ensure portability, and promote transparency in AI development.”