Just as silicon is being pushed to its very limits to handle huge AI models, networking and the architecture of data centers are facing challenges.
“With these large systems, no matter what, you can't fit it on a single chip, even if you're Cerebras,” SemiAnalysis’ Dylan Patel said. “Well, how do I connect all these split-up chips together? If it's 100 that’s manageable, but if it's thousands or tens of thousands, then you're starting to have real difficulties, and Nvidia is deploying just that. Arguably it's either them or Broadcom that have the best networking in the world.”
But the cloud companies are also becoming more involved. They have the resources to build their own networking gear and topologies to support growing compute clusters.
Amazon Web Services has deployed clusters of up to 20,000 GPUs, with AWS’ own purpose-built Nitro networking cards. “And we will deploy multiple clusters,” the company’s Chetan Kapoor said. “That is one of the things that I believe differentiates AWS in this particular space. We leverage our Nitro technology to have our own network adapters, which we call Elastic Fabric Adapters.”
The company is in the process of rolling out its second generation of EFA. “And we're also in the process of increasing the bandwidth on a per node basis, around 8× between A100s and H100s,” he said. “We're gonna go up to 3,200Gbps, on a per node basis.”
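As a rough check on those figures, the arithmetic can be written out. This is a back-of-the-envelope sketch, not an AWS specification: the variable names are illustrative, and because Kapoor says "around 8×", the implied A100-generation figure is approximate.

```python
# Figures quoted above; everything derived from them is approximate.
H100_NODE_GBPS = 3200   # per-node bandwidth target, in gigabits per second
SCALE_FACTOR = 8        # "around 8x" between A100 and H100 nodes

# Implied per-node bandwidth of the earlier A100-based nodes (gigabits/s).
implied_a100_node_gbps = H100_NODE_GBPS / SCALE_FACTOR

# The same H100 figure in gigabytes per second (8 bits per byte).
h100_node_gigabytes_per_s = H100_NODE_GBPS / 8

print(f"Implied A100 node bandwidth: ~{implied_a100_node_gbps:.0f} Gbps")
print(f"H100 node bandwidth: {h100_node_gigabytes_per_s:.0f} GB/s")
```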
At Google, an ambitious multi-year effort to overhaul the networks of its massive data center fleet is beginning to pay off.
The company has begun to deploy Mission Apollo custom optical switching technology at a scale never seen before in a data center.
Traditional data center networks use a spine and leaf configuration, where computers are connected to top-of-rack switches (leaves), which are then connected to the spine, which consists of electronic packet switches. Mission Apollo replaces the spine with entirely optical interconnects that redirect beams of light with mirrors.
“The bandwidth needs of training, and at some scale inference, is just enormous,” said Google’s Amin Vahdat.
Apollo has allowed the company to build networking “topologies that are more closely matched to the communication patterns of these training algorithms,” he said. “We have set up specialized, dedicated networks to distribute parameters among the chips, where enormous amounts of bandwidth are happening synchronously and in real-time.”
This has multiple benefits, he said. At this scale, single chips or racks fail regularly, and “an optical circuit switch is pretty convenient at reconfiguring in response, because now my communication patterns are matching the logical topology of my mesh.”
“I can tell my optical circuit switch, ‘go take some other chips from somewhere else, reconfigure the optical circuit switch to plug those chips into the missing hole, and then keep going.’ There's no need to restart the whole computation or, worst case, start from scratch.”
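The repair step Vahdat describes amounts to changing which physical hardware sits behind a logical position in the mesh. The sketch below is a toy illustration under stated assumptions, not Google's control software: the function `remap_failed`, the slot numbers, and the block IDs are all hypothetical.

```python
def remap_failed(logical_to_physical, spares, failed):
    """Swap a failed physical block for a spare without changing the logical mesh.

    logical_to_physical: dict mapping logical slot -> physical block ID
    spares: list of idle physical block IDs
    failed: physical block ID that has just failed
    Returns (new_mapping, remaining_spares).
    """
    if failed not in logical_to_physical.values():
        # The failed block was not part of this job; nothing to do.
        return dict(logical_to_physical), list(spares)
    if not spares:
        raise RuntimeError("no spare blocks available")
    replacement = spares[0]
    new_mapping = {
        slot: (replacement if phys == failed else phys)
        for slot, phys in logical_to_physical.items()
    }
    return new_mapping, list(spares[1:])


# Hypothetical job using three blocks, with two idle spares elsewhere.
job = {0: 10, 1: 11, 2: 12}
job_after, spares_after = remap_failed(job, spares=[40, 41], failed=11)
print(job_after)     # logical slot 1 now points at spare block 40
print(spares_after)
```

The logical slots, which are what the training job sees, stay the same; only the physical block behind slot 1 changes, which is why the computation can continue.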
Apollo also helps deploy capacity flexibly. The company’s TPUv4 scales up to blocks of 4,096 chips. “If I schedule 256 here, 64 there, 128 here, another 512 there, all of a sudden, I'm going to create some holes, where I have a bunch of 64 blocks of chips available.”
In a traditional network architecture, a customer who wanted 512 chips could not be served from those scattered free blocks. “If I didn't have an optical circuit switch, I'd be sunk, I'd have to wait for some jobs to finish,” Vahdat said. “They're already taking up portions of my mesh, and I don't have a contiguous 512 even though I might have 1,024 chips available.”
But with the optical circuit switch, the company can “connect the right pieces together to create a beautiful 512-node mesh that's logically contiguous. So separating logical from physical topology is super powerful.”
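The fragmentation problem can be shown with a toy model. The sketch below is an illustration under stated assumptions, not Google's scheduler: it treats a 4,096-chip pod as a one-dimensional row of 64-chip blocks (real TPU meshes are multi-dimensional), and the function names and the free-block pattern are hypothetical.

```python
BLOCK_CHIPS = 64                     # assumed block size, from the "64 blocks of chips" quote
POD_BLOCKS = 4096 // BLOCK_CHIPS     # 64 blocks in a 4,096-chip pod


def allocate_contiguous(free, chips):
    """Return a run of physically adjacent free block indices, or None."""
    need = chips // BLOCK_CHIPS
    run = []
    for i, is_free in enumerate(free):
        if is_free:
            run.append(i)
            if len(run) == need:
                return run
        else:
            run = []   # a busy block breaks the run
    return None


def allocate_any(free, chips):
    """Return any free block indices, or None.

    Models an optical circuit switch that can join scattered
    blocks into one logically contiguous mesh.
    """
    need = chips // BLOCK_CHIPS
    picked = [i for i, is_free in enumerate(free) if is_free][:need]
    return picked if len(picked) == need else None


# Hypothetical fragmented pod: every fourth block is free (16 blocks = 1,024 chips).
free = [i % 4 == 0 for i in range(POD_BLOCKS)]

print(sum(free) * BLOCK_CHIPS)          # free chips in total
print(allocate_contiguous(free, 512))   # no adjacent run of 8 blocks exists
print(allocate_any(free, 512))          # scattered blocks can still be joined
```

With 1,024 chips free but no two free blocks adjacent, the contiguous allocator fails for a 512-chip request while the second allocator succeeds, which is the gap between physical and logical topology that Vahdat describes.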
Colos and wholesalers
If generative AI becomes a major workload, then every data center in the world could find that it has to rebuild its network, said Ivo Ivanov, CEO of Internet exchange DE-CIX. “There are three critical sets of services we see: 1) Cloud exchange, so direct connectivity to single clouds, 2) Direct interconnection between different clouds used by the enterprise, and 3) Peering for direct interconnect to other networks of end users and customers.”
He argued: “If these services are fundamental for creating the environment that generative AI needs in terms of infrastructure, then every single data center operator today needs to have a solution for an interconnection platform.”
That future-proof network service has to be seamless, he said: “If data center operators do not offer this to their customers today, and in the future, they will just reduce themselves to operators of closets for servers.”