The chip industry has a problem: how does it continue to crank up performance, generation after generation, to justify new purchases and premium prices?

Conventional chip circuitry is already pushing past limits set only a few years ago, with leading fabs, including TSMC and Samsung, now advancing towards a 3nm process node that was once thought might not even be possible. The 3nm process node will increase transistor density by around one-third compared with current 5nm circuitry, either goosing performance by 10 to 15 percent or delivering the same compute power at up to 30 percent lower power consumption.


In many respects, that’s all the semiconductor industry’s problem, but the intricacy it creates manifests all the way up the chain to system operation in the data center. Lifetime reliability is harder to maintain, and data centers rely on expensive redundancy and fall-back equipment to carry the load. The means no longer scale to the end: uninterrupted uptime and service availability.

Fortunately, there are now tools that can help.

One of those tools – one that could benefit users, too – is on-chip telemetry from embedded agents that can communicate operating conditions, including performance margins, application stress, temperature, voltage and frequency, not just to chip developers and manufacturers as they iterate to produce new and better systems, but right the way through the usage cycle, too.

“We provide deep data monitoring for electronics components inside the data center, which boils down to the servers themselves, the different systems deployed there, driven by CPUs, GPUs, storage devices and so on,” says Uzi Baruch, chief strategy officer at proteanTecs, an Israeli start-up established by a number of electronics industry veterans. “Even communication devices, such as network switches. It’s a new category for performance and health monitoring of data center IT equipment.”

“The way that we approach the problem takes you into a ‘deep tech’ kind-of offering. We basically start from the chip level...the chip is normally the ‘brain’ of the system,” says Baruch, pointing out that processors and other integrated circuits are sensitive to workload, heat, and voltage, as well as accelerated performance degradation, all of which can prematurely age the chip and cause a system error.

“We have developed monitors that are embedded inside the chip itself, with each one reporting on a variety of key parameters, at an extremely high coverage. When someone is designing [a chip] we’re providing them with those monitors, or ‘agents’, to embed inside the chip. Then, when the chip comes to life – when it’s tested, assembled in a system, or shipped to a data center – then the agents report data to the outside world,” says Baruch. “At that point the chip serves as a system sensor.”
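The article does not describe proteanTecs’ actual data format, but as a rough illustration of the kind of parameters those agents report – temperature, voltage, frequency, performance margin – a per-agent telemetry record might look something like the following minimal Python sketch. The names (AgentReading, to_payload) and values are invented for illustration only.

from dataclasses import dataclass, asdict
import json
import time

@dataclass
class AgentReading:
    """One hypothetical report from an on-chip monitoring agent."""
    agent_id: str            # which embedded monitor produced the reading
    timestamp: float         # seconds since the epoch
    temperature_c: float     # local die temperature
    voltage_v: float         # supply voltage seen by the agent
    frequency_mhz: float     # clock frequency at the monitored block
    timing_margin_ps: float  # remaining performance margin, in picoseconds

def to_payload(readings):
    """Bundle agent readings into a JSON payload for an analytics platform."""
    return json.dumps([asdict(r) for r in readings])

# Example: two invented readings from different points on the die
readings = [
    AgentReading("agent-017", time.time(), 71.5, 0.74, 2900.0, 42.0),
    AgentReading("agent-231", time.time(), 83.2, 0.72, 2900.0, 17.5),
]
print(to_payload(readings))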

The agents are built for analytics, so machine learning algorithms can be applied to turn the raw data into actions on which users can base decisions. Those embedded agents, moreover, are neither space nor resource intensive, as proteanTecs’ Nir Sever explained in an earlier article with DCD.

“We have a target to take no more than about one percent of the logic. But when you’re talking about gate counts, it has zero practical effect on your area because, in chip design, we talk about utilization, which is the percentage of the area transistors occupy versus the entire area of the silicon. With our agents, it’s not a high number,” Sever told DCD.

“But the real magic happens in the cloud,” according to Baruch. “The agent fusion, machine learning inference, and targeted solutions … that’s what makes our chip telemetry so unique.”

The data that proteanTecs’ agents provide is mediated by a cloud-based analytics platform, which turns the complex, highly technical information from the chip into a usable form that can be sliced, diced and analysed to glean wider insights.

The analytics platform is what enables modelled estimations, calculated predictions, adaptive learning based on historical data – even from production – and pinpoint root cause analysis.
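The article gives no detail on how those predictions are calculated, but one simple way to turn historical agent data into a prediction is to fit a trend to a degrading parameter and extrapolate when it will cross a failure threshold. This is an illustrative sketch only, not proteanTecs’ method; the function name and the 5 ps threshold are assumptions.

import numpy as np

def predict_threshold_crossing(timestamps, margins_ps, threshold_ps=5.0):
    """
    Fit a straight line to historical timing-margin readings and estimate
    when the margin will fall below the threshold. Returns None if the
    margin is not trending downwards.
    """
    slope, intercept = np.polyfit(timestamps, margins_ps, 1)
    if slope >= 0:
        return None  # margin is stable or improving
    return (threshold_ps - intercept) / slope  # predicted crossing time

# Invented history: margin shrinking by roughly 0.5 ps per day
days = np.arange(0, 30)
margins = 40.0 - 0.5 * days + np.random.normal(0, 0.3, size=days.size)
crossing = predict_threshold_crossing(days, margins)
print(f"Predicted to cross the 5 ps threshold around day {crossing:.0f}")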

What’s in it for me?

For users – whether in enterprises or data centers – the aim is to provide considerably more insight into system performance, so that equipment operating in server farms and data center halls can be fine-tuned, maintained and optimised. This, says Baruch, enables not only predictive maintenance but also proactive action if, for example, sub-optimal cooling is affecting multiple servers or server blades: potential issues can be highlighted before they become serious problems.

“We provide a software stack that can be deployed inside the data center as part of the standard hardware health monitoring infrastructure. Hyperscalers, for example, already have a variety of management systems that monitor activity in the data center. We integrate with those to enhance visibility,” says Baruch. “Our analytics platform automatically deploys actions to the existing systems.”

During the machine learning phase of proteanTecs’ analytics, a baseline is set. “So once you have the baseline, everything associated with predictive maintenance, fault diagnostics, anomaly detection, performance monitoring, and the application load then integrates with the standard management software running the data center,” he adds.
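Baruch does not spell out how that baseline is applied, but the general pattern is familiar: learn the normal range of each monitored parameter, then flag readings that stray too far from it. The sketch below is a minimal illustration under that assumption, with invented metrics, data and thresholds.

import statistics

def learn_baseline(history):
    """Learn a per-metric baseline (mean, standard deviation) from normal operation."""
    return {
        metric: (statistics.mean(values), statistics.stdev(values))
        for metric, values in history.items()
    }

def detect_anomalies(baseline, reading, z_limit=3.0):
    """Return the metrics in a new reading that deviate strongly from the baseline."""
    flagged = {}
    for metric, value in reading.items():
        mean, stdev = baseline[metric]
        if stdev > 0 and abs(value - mean) / stdev > z_limit:
            flagged[metric] = value
    return flagged

# Invented training data gathered during normal operation
history = {
    "temperature_c": [70.1, 71.3, 69.8, 70.6, 71.0],
    "timing_margin_ps": [40.2, 39.8, 40.5, 40.0, 39.9],
}
baseline = learn_baseline(history)
print(detect_anomalies(baseline, {"temperature_c": 86.4, "timing_margin_ps": 22.0}))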

The platform provides alerts around performance and reliability degradation and potential faults before they arise. “You can actually analyse the component itself and see what caused it to fail. But that’s more like a post-mortem. What you really want to do is keep the data center and its servers up-and-running,” says Baruch.

In many cases, he continues, it’s the application running on the server that is causing the stress – for example, an application pounding one particular core instead of spreading the load more evenly over all four, eight or, in the case of AI processors, even thousands of cores*. “We’ll show you what the root cause is and how, over time, it’s working,” says Baruch, adding that it’s also possible to troubleshoot exactly how the miscreant software is overtaxing the CPU and, therefore, to take effective action. Indeed, proteanTecs claims that software suppliers themselves can also use the analytics to tune application performance before new software or updates are shipped.
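As a rough illustration of the kind of root-cause signal Baruch describes, per-core utilization telemetry makes that sort of imbalance straightforward to spot. This is a hypothetical sketch, not how proteanTecs’ analytics are implemented; the function name, ratio limit and sample figures are invented.

def find_hot_cores(core_utilization, ratio_limit=2.0):
    """
    Flag cores whose utilization is far above the package average,
    suggesting the workload is not spreading its threads evenly.
    core_utilization: list of per-core utilization percentages.
    """
    average = sum(core_utilization) / len(core_utilization)
    return [
        (core, util)
        for core, util in enumerate(core_utilization)
        if average > 0 and util / average > ratio_limit
    ]

# Invented sample: one core pinned near 100 percent while the rest idle
utilization = [97.0, 6.0, 5.5, 7.2, 4.9, 6.1, 5.8, 6.4]
print(find_hot_cores(utilization))  # [(0, 97.0)]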

And it’s not just conventional processors from Intel, AMD and Nvidia that Baruch has in mind. With independent foundries now at the cutting edge of fabrication, hyperscale data center operators have increasingly been developing their own server CPUs, and are also open about the kind of chip telemetry they think they need to fine-tune their data centers.

Indeed, Amazon, Facebook, Microsoft and Google have all pursued this path, based on the Open Compute concept, with Amazon arguably furthest ahead with its Graviton family, which it has ramped up into mass deployments “in a bid to vertically integrate more of its technology stack for differentiation and, longer term, to drive costs further down,” according to Uptime Institute research director Daniel Bizo.

“Everything associated with chip health and performance monitoring, they speak about publicly in open forums… it’s a known problem for them,” adds Baruch. “So we know we haven’t invented a problem and come up with a solution.”

*Cerebras Systems’ WSE-2 processor, for example, boasts 2.6 trillion 7nm transistors and 850,000 cores, and is optimized for AI. Its predecessor, the WSE (Wafer Scale Engine), offered a mere 400,000 cores.