Operating at scale can have interesting side effects. Events that are highly unlikely to occur at the local scale become near certainties once that low probability is multiplied by the number of hardware components in use.
Take background radiation. The small amount that bathes us all is mostly harmless to both humans and modern electronics. On a single chip like a field-programmable gate array (FPGA), the chances of minuscule levels of radiation causing damage are equally tiny.
At the scale of something like a hyperscale data center, that calculation begins to change.
This feature appeared in Issue 44 of the DCD Magazine.
What if, for example, there was a data center in Denver, Colorado, with 100,000 operational FPGAs? Would the large sample size mean that low-probability radiation-induced errors suddenly became a lot more likely?
Researchers at Brigham Young University hoped to find out, testing the susceptibility of FPGAs to normal background radiation.
All hardware at scale can be susceptible to radiation issues, but the researchers focused on FPGAs for two reasons. First, the CPUs and GPUs designed by the likes of AMD and Nvidia are built with error-correcting code (ECC) and data backups in mind, while FPGAs are reprogrammed for a desired application, so they may not include as much protection.
Secondly, the study's lead researcher, Andrew Keller, explained: "While there's a lot of state in GPUs and CPUs as well, none of that state necessarily is used to configure how things are connected.
"There's a large portion of configuration memory in the FPGA that is dedicated to configuring routes, electrical connections between components, and those connections can be configured in many, many different ways."
Radiation could corrupt a configuration bit "and electrically disconnect components," in an FPGA, Keller said. "Whereas on an application-specific integrated circuit (ASIC), those are hard-wired connections, they're not configurable, they're not going to change with radiation."
When ionizing radiation passes through an FPGA, it can deposit enough energy to disrupt the proper flow of electricity through the device. This 'funneling phenomenon' can have different effects, including altering the value stored in a memory cell.
In an FPGA, that data could include circuit configuration and state, which means that a change in the value could disconnect components, or short them together. It could also change the intended behavior of a component.
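To make the mechanism concrete, here is a minimal illustrative sketch (not any real FPGA bitstream format): a routing multiplexer whose configuration field selects which wire drives a net. The wire names and encoding are invented for illustration; the point is that a single flipped configuration bit silently changes which components are connected.

```python
# Illustrative model of a routing mux: a 3-bit configuration field
# selects which of eight sources drives a net. The encoding is
# hypothetical; real bitstream formats are vendor-specific.
sources = ["wire_a", "wire_b", "wire_c", "wire_d",
           "wire_e", "wire_f", "wire_g", "unconnected"]

config = 0b010             # mux configured to route wire_c onto the net
assert sources[config] == "wire_c"

upset = config ^ (1 << 2)  # a single-event upset flips bit 2 of the field
print(sources[upset])      # -> wire_g: the net is now driven by the wrong source
```

On an ASIC the equivalent connection is hard-wired metal, so there is no configuration bit for radiation to flip.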
Across 100,000 FPGAs, this slightly higher susceptibility becomes noticeable. In the paper, "The Impact of Terrestrial Radiation on FPGAs in Data Centers," Keller found that such a deployment would experience a configuration memory issue every half-hour on average, and silent data corruption (SDC) every 0.5 to 11 days.
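The fleet-wide figure can be turned around into a per-device number with some back-of-envelope arithmetic. The one-upset-per-half-hour rate across 100,000 FPGAs is from the paper; the derivation below assumes upsets strike devices independently, so the fleet rate is simply the per-device rate multiplied by the fleet size.

```python
# Back-of-envelope: relate the fleet-wide upset rate to a per-device rate,
# assuming upsets are independent and identically distributed across devices.
FLEET_SIZE = 100_000
FLEET_MTBU_HOURS = 0.5  # mean time between config upsets, fleet-wide (from the paper)

# Fleet rate = per-device rate * fleet size, so mean time between upsets scales up.
per_device_mtbu_hours = FLEET_MTBU_HOURS * FLEET_SIZE   # 50,000 hours
per_device_mtbu_years = per_device_mtbu_hours / (24 * 365)

print(f"Per-device mean time between upsets: {per_device_mtbu_hours:,.0f} h "
      f"(~{per_device_mtbu_years:.1f} years)")
```

Roughly 5.7 years per device, which is why a single board looks rock-solid while the fleet sees an upset every half-hour.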
It is that latter issue that is more concerning. "One of the challenges that an FPGA might face is a radiation effect causes your design to wedge or stall without the system knowing that it is wedged or stalled,” Keller said.
“With SDC, the FPGA was still processing bits. But the data it gave you was wrong. And it didn't know it was wrong.”
The study’s advisor, Prof. Dr. Mike Wirthlin, told DCD that silent data corruption is "the biggest risk - you get a wrong computation. If you're doing something small, this is such a low risk, but for some people this is a really big deal. For example, if you do financial calculations."
Another potential impact of the radiation is that an FPGA could be rendered completely unusable - although this is probably preferable for a data center operator, because at least they know something is wrong.
Luckily, the industry already has a way to significantly reduce the risk of these events. Configuration scrubbing, an upset-mitigation technique that detects and corrects upsets in an FPGA's configuration memory, can cut the rate of SDC by a factor of three to 22, Keller and Wirthlin found.
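The idea behind scrubbing can be sketched in a few lines. This is a simplified model, not a vendor implementation: configuration memory is represented as a list of integer frames checked against a golden reference copy. Real scrubbers typically rely on per-frame ECC or a CRC over readback data rather than holding a full golden image in host memory.

```python
# Simplified configuration-scrubbing loop: read back each configuration
# frame, compare it to a known-good ("golden") copy, and rewrite any
# frame that has been upset before the error can propagate.
def scrub(config_mem, golden):
    """Correct upset frames in config_mem; return how many were fixed."""
    corrected = 0
    for i, (frame, good) in enumerate(zip(config_mem, golden)):
        if frame != good:
            config_mem[i] = good  # write back the known-good frame
            corrected += 1
    return corrected

golden = [0xDEADBEEF, 0x12345678, 0x0F0F0F0F]  # illustrative frame values
config = list(golden)
config[1] ^= 1 << 7        # simulate a radiation-induced single-bit flip

print(scrub(config, golden))  # -> prints 1 (one frame corrected)
assert config == golden
```

Because the scrubber runs continuously in the background, an upset is typically repaired within one scrub cycle, which is what shrinks the window in which corrupted configuration can produce silent data corruption.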
“The vendors provide scrubbing, but most of the products that use FPGAs probably do not use it,” Wirthlin said, adding that larger hyperscalers appear to be better at enabling such protections, but there is no published data on this.
Vendors also appear to be getting better at building FPGAs that can tolerate higher and higher amounts of radiation per bit. But there is yet again a matter of scale to consider - while they are becoming more reliable per bit, FPGAs are packing more and more bits per chip.
Larger chips, larger risks?
It's not clear yet what this means for overall reliability, Keller said. "We are doing some studies trying to map how many bits there are versus how much radiation a single bit takes to upset, and it's not conclusive. My gut feeling says we're getting better."
There is yet another question of scale to consider - as chip components shrink, "as a single particle passes through the device, it now affects multiple cells," Keller said. "It's a multi-cell upset, not a single-bit upset."
But, curiously, "we are not seeing that," Keller said. "There's something that's being done that we don't know about, that's preventing that climax."
He added: "Vendors are doing a great job, they're aware of the problem, they're pushing to make their systems as reliable as possible."
With companies like Xilinx and Intel working to reduce the risk, and mitigation efforts like scrubbing available, Prof. Dr. Wirthlin noted that the most important thing was for the user to be aware of the risk.
"Some people get overly alarmed, but we're not trying to scare people," he said. "It's an issue that you just need to be aware of, and there's plenty of techniques and hardware to address it. Just follow the proper recommendations and the risk can be adequately mitigated.”