If you're lucky enough to be able to get your hands on thousands of GPUs and a data center with enough power and cooling to support them, your headache has only just begun.
Artificial intelligence and machine learning workloads require these GPUs to be connected by a dense and adaptable network that also connects CPUs, memory, and storage.
The slightest bottleneck can ripple through a system, causing issues and slow performance across the entire training run. But with countless interconnected nodes, it's easy for traffic to pile up.
Chip giant Broadcom hopes that part of the solution may lie in AI and software itself. Its new Trident 5-X12 switching silicon will be the first to use the company's on-chip neural-network inference engine, NetGNT, aimed at improving networks in real time.
NetGNT (Networking General-purpose Neural-network Traffic-analyzer) is "general purpose," Robin Grinley, principal PLM in Broadcom's Core Switching Group, told DCD. "It doesn't have one specific function; it's meant for many different use cases," Grinley said.
The small neural network sits in parallel to the regular packet processing pipeline, which the customer programs with a number of static rules (drop policies, forwarding policies, IP ranges, etc.).
"In comparison, NetGNT has memory, which means it's looking for patterns," Grinley says. "These are patterns across space, different ports in the chip, or across time. So as flows come through the chip, and various events happen, it can look for higher-level patterns that are not really catchable by some static set of rules that you've programmed into these low-level tables."
A customer could train the neural network on previous DDoS attacks to help it identify a similar event in the future.
"Now, the action could be local on the chip, it could be just okay. When you see this, this one of these DDoS flows starting up, disrupt the flow, and drop the packet. In parallel, it can also do things like create a notification when you first identify this and send it up to the Network Operation Center."
AI and ML runs can sometimes suffer from an incast event, where the number of storage servers sending data to a client exceeds an Ethernet switch's ability to buffer packets, which can cause significant packet loss.
"It can detect this - if there's an accumulation of the buffer due to an incast, it could read that signature and say, 'Okay, I can take very fast action to start back pressure, maybe some of these flows, or do something else,'" Grinley said. "In an AI/ML workload, it goes in phases, and you only have a few milliseconds between phases. You don't have time to involve software in the loop and try to make some decisions as to what to do."
With NetGNT running in parallel, "there's no software in the loop where the more complex the packet processing, the longer it'll take. Whether NetGNT is on or off, the latency for any packet through our chip is the same."
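The incast signature Grinley describes, a buffer that fills as many senders converge on one port, can be sketched with a toy simulation. This is an illustration under stated assumptions, not how the chip works: the function name, buffer capacity, drain rate, and the fixed high-water threshold (used here in place of a learned signature) are all hypothetical.

```python
def simulate(senders: int, ticks: int, backpressure: bool,
             capacity: int = 100, drain_per_tick: int = 10,
             pkts_per_sender: int = 4, high_water: float = 0.6):
    """Return (dropped_packets, tick_when_back_pressure_started_or_None)."""
    occupancy = 0
    dropped = 0
    paused_at = None
    for tick in range(ticks):
        # Once back pressure is on, senders are throttled to the drain rate.
        if paused_at is not None:
            arriving = drain_per_tick
        else:
            arriving = senders * pkts_per_sender
        occupancy += arriving
        if occupancy > capacity:
            # Buffer overflow: the excess packets are lost.
            dropped += occupancy - capacity
            occupancy = capacity
        occupancy = max(0, occupancy - drain_per_tick)
        # Signature check: the buffer has climbed past the high-water mark.
        if backpressure and paused_at is None and occupancy >= high_water * capacity:
            paused_at = tick
    return dropped, paused_at


without_bp = simulate(senders=8, ticks=10, backpressure=False)
with_bp = simulate(senders=8, ticks=10, backpressure=True)
```

With these made-up numbers, eight senders overflow the buffer and lose 130 packets in ten ticks when nothing intervenes; with the check enabled, back pressure starts at tick 2 and no packets are dropped.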
Given the unique requirements of different networks, it's important to note that NetGNT does not work out of the box. "The only thing that we provide here is the hardware model: How many neurons? How are they connected? What are the big weights, et cetera."
The rest is up to the customer, who has to train the model. "Somebody has to go look at huge amounts of packet trace data - here's my network when it's operating fine; this is the thing I want to track, incast, denial of service, some other security event, whatever it is," Grinley said.
"Some network architect has to go through all of this massive packet trace data and tag that stuff. And then you shovel all of that training data into a supervised training algorithm and spit out the weights that go into our neural network engine."
This means that the accuracy of the system is somewhat dependent on the quality of the data, the length of the training run, and the skill of the person tagging and training their system.
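The workflow described here (tag the traces, run supervised training, load the resulting weights) can be shown in miniature. The sketch below trains a single perceptron, which is far simpler than whatever model Broadcom's hardware implements; the two features, the tagged samples, and the function names are hypothetical and exist only to show how labeled data turns into weights.

```python
def train_perceptron(samples, epochs=50, lr=0.1):
    """samples: list of (features, label) with label 0 or 1. Returns (weights, bias)."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if activation > 0 else 0
            error = label - predicted  # 0 when the prediction is right
            weights = [w + lr * error * x for w, x in zip(weights, features)]
            bias += lr * error
    return weights, bias


def classify(weights, bias, features):
    """Apply trained weights to one feature vector."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


# Hypothetical tagged data: (buffer fill, fraction of ports sending to one target).
# Label 0 = "network operating fine", label 1 = "incast".
tagged = [
    ((0.10, 0.10), 0), ((0.20, 0.30), 0), ((0.30, 0.20), 0), ((0.25, 0.15), 0),
    ((0.80, 0.90), 1), ((0.90, 0.70), 1), ((0.70, 0.80), 1), ((0.85, 0.95), 1),
]

weights, bias = train_perceptron(tagged)
```

The learned weights are only as good as the tagging: mislabel a few of the samples above and the boundary moves, which is the dependency on data quality in a nutshell.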
"They probably going to have to hire some AI/ML experts who know how to run it, and then they'll go run it in the cloud or wherever," Grinley said. It would also be up to the customer how often they re-train the system.
"You can reload it while the chip is running," he added. "So they can reload it daily if they want, but the training time typically is probably on the order of a few days to a week."
Separately from NetGNT, Broadcom aims to help reduce bottlenecks with 'cognitive routing,' which was first rolled out with Tomahawk-5. "If you're using Trident-5 as a ToR and Tomahawk-5 as the fabric, these features operate together," Grinley said.
In older products, dynamic load balancing was confined to just the chip. Congestion would be spotted, and the flow autonomously moved to a less loaded link. "Now, that works fine on your chip," Grinley said. "But if you go three or four hops down in the network, that new path that you're choosing may be congested somewhere else."
Now, the platform attempts to handle global load balancing, he said. "These chips, when they sense congestion, can send notifications back downstream as well as upstream. And let all of the chips operate with a global view of the congestion and they can deal with it."
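The gap between local and global load balancing is easy to see in a small example. The sketch below is an assumption-laden illustration, not Broadcom's algorithm: the path names, the link utilization figures, and the "pick the path with the least-loaded worst hop" rule are all hypothetical.

```python
# Hypothetical link utilization (0.0 to 1.0) for each hop along three candidate paths.
PATHS = {
    "A": [0.70, 0.20, 0.30],
    "B": [0.30, 0.40, 0.95],  # looks best locally, congested three hops away
    "C": [0.50, 0.45, 0.40],
}


def pick_local(paths):
    """Choose by first-hop load only: all a single chip can see on its own."""
    return min(paths, key=lambda name: paths[name][0])


def pick_global(paths):
    """Choose by the worst hop on each path, assuming every hop reports its congestion."""
    return min(paths, key=lambda name: max(paths[name]))


local_choice = pick_local(PATHS)
global_choice = pick_global(PATHS)
```

Here the local rule picks path B because its first link is the quietest, and runs straight into the 95 percent loaded link downstream. The global rule, which sees every hop, picks path C instead.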
This is running on embedded Arm cores on the chip, because "it's not something that you can wait for the host CPU."
As the system develops, and the compute on the chip improves, Grinley sees their various efforts converging. "NetGNT could go into cognitive routing version two, or some new load balancing scheme, or some neat telemetry scheme.
"Once you have this little inference engine sitting there, then you can hook it in for performance, for security, for network optimization. I think we're gonna figure out a lot more use cases as time goes on."