The xAI Colossus supercomputer in Memphis, Tennessee, is set to double in capacity.

In separate statements, Nvidia and xAI CEO Elon Musk each announced that the facility is adding a further 100,000 Nvidia Hopper GPUs to the cluster.

[Image: Colossus supercomputing cluster server racks. Credit: Michael Dell]

Commenting on an X/Twitter post from an xAI employee who said the team in Memphis is hiring, Musk wrote: “Soon to become a 200k H100/H200 training cluster in a single building.”

Alongside the expansion, the two companies also revealed that the cluster relies on the Nvidia Spectrum-X ethernet networking platform, rather than Nvidia InfiniBand, for its Remote Direct Memory Access (RDMA) network.

The announcements coincided with the release of a sponsored video from ServeTheHome, a server, storage, and networking outlet, which includes a tour of the Colossus facility. According to the video’s presenter, Patrick Kennedy, each GPU has its own 400GbE Nvidia BlueField-3 SuperNIC, which xAI has paired with the Spectrum-X SN5600, a 64-port 800 Gbps ethernet switch.
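For a rough sense of scale, the quoted figures can be turned into some back-of-the-envelope arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, the 100,000-GPU starting point is inferred from the "double" and "200k" statements, and the assumption that each 800 Gbps port can be split evenly into two 400 Gbps links is ours, not something stated by Nvidia, xAI, or ServeTheHome.

```python
# Back-of-the-envelope arithmetic using figures quoted in the article.
NIC_GBPS = 400          # one 400GbE BlueField-3 SuperNIC per GPU
SWITCH_PORTS = 64       # SN5600 port count
PORT_GBPS = 800         # SN5600 per-port speed
GPUS_NOW = 100_000      # assumption: inferred from "double" and "200k"
GPUS_PLANNED = 200_000  # Musk: "200k H100/H200 training cluster"


def switch_capacity_tbps(ports: int, port_gbps: int) -> float:
    """Sum of port speeds for one switch, in Tbps (one direction)."""
    return ports * port_gbps / 1000


def aggregate_nic_tbps(gpus: int, nic_gbps: int) -> float:
    """Sum of per-GPU NIC line rates across the cluster, in Tbps."""
    return gpus * nic_gbps / 1000


def nic_links_per_switch(ports: int, port_gbps: int, nic_gbps: int) -> int:
    """NIC-speed links one switch could terminate.

    Assumption: every port is split evenly into NIC-speed links,
    and no ports are reserved for uplinks to higher tiers.
    """
    return ports * (port_gbps // nic_gbps)


if __name__ == "__main__":
    print("Per-switch capacity (Tbps):",
          switch_capacity_tbps(SWITCH_PORTS, PORT_GBPS))
    print("Aggregate NIC line rate now (Tbps):",
          aggregate_nic_tbps(GPUS_NOW, NIC_GBPS))
    print("Aggregate NIC line rate planned (Tbps):",
          aggregate_nic_tbps(GPUS_PLANNED, NIC_GBPS))
    print("400G links per switch (upper bound):",
          nic_links_per_switch(SWITCH_PORTS, PORT_GBPS, NIC_GBPS))
```

These are line-rate sums, not measured throughput, and they say nothing about how the three-tier fabric is actually wired.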

In a statement, Nvidia said the platform has been designed to deliver “superior performance to multi-tenant, hyperscale AI factories,” adding that Spectrum-X enables the system to maintain 95 percent data throughput, with zero application latency degradation or packet loss due to flow collisions across all three tiers of the network fabric.

“AI is becoming mission-critical and requires increased performance, security, scalability, and cost-efficiency,” said Gilad Shainer, senior vice president of networking at Nvidia. “The Nvidia Spectrum-X ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis, and execution of AI workloads, and in turn accelerates the development, deployment, and time to market of AI solutions.”

Housed in a 750,000 sq ft (69,677 sqm) former Electrolux plant, the Colossus supercomputer was assembled in 122 days, with Musk announcing the cluster had gone live on July 22, 2024.

However, xAI has been criticized for running the facility on gas power and for failing to consult locals, who argue they will be directly affected by the company’s operations in the area and their “harmful local consequences.” Musk also came under fire for shipping thousands of GPUs reserved for public company Tesla to private ventures X and xAI.