Deep dive: How taking data direct from server CPUs can improve data center performance and slash power consumption

What if you could pull a wide range of data from server chips, related to their performance or the health of their interconnect or even their surrounding system – anything you can think of that might be valuable in optimising both server and data center?

The data, of course, would be highly technical. So how about a series of dashboards that use machine learning to interpret not just individual chip data, but data aggregated by server hall, product type, or whatever else you needed to see?

Then, perhaps, how about alerts for processors struggling with performance degradation, application stress, environmental effects, and other essential information?

All this is possible right now, according to Nir Sever, senior director of product at proteanTecs, an Israeli start-up that went public with its deep data monitoring technology for advanced electronics in late 2019.

While the company’s technology offers a range of benefits, from chip design through chip and system manufacturing all the way to usage in a range of environments, it is the Proteus Analytics Platform that holds the most potential for data center operators.

Not only can it help identify issues with server hardware before they even become a problem, but it can also help operators to cut power consumption and improve operating efficiency.

Total control

“The key factor that we can control with regards to reducing the power consumed by electronics is the operating voltage supply,” says Sever. “Other effects, such as the size of the chip or adjusting the clock frequencies, have a linear contribution, but voltage has an exponential contribution.

“In the past, we enjoyed scaling [down] the supply voltage of electronics. Whenever you added a new chip manufacturing process generation, the supply voltage could be reduced in a process called Dennard scaling,” says Sever.

Broadly speaking, Dennard scaling means that each new process generation doubles transistor density and makes the circuit about 40 percent faster, while power consumption per unit of chip area remains the same. For many years, each succeeding generation of chip technology could deliver such major performance gains without making any new power demands.
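The arithmetic behind those figures can be sketched with the classic constant-field scaling rules. The symbols below are introduced here for illustration and do not come from the interview: κ is the linear scaling factor per generation, C is switched capacitance, V is supply voltage, f is clock frequency, and n is the number of transistors per unit area.

```latex
\begin{align*}
\text{dimensions, voltage, capacitance:}\quad & L,\ V,\ C \;\to\; L/\kappa,\ V/\kappa,\ C/\kappa \\
\text{clock frequency:}\quad & f \;\to\; \kappa f \\
\text{power per transistor:}\quad & P = C V^{2} f \;\to\; \frac{C}{\kappa}\cdot\frac{V^{2}}{\kappa^{2}}\cdot \kappa f = \frac{P}{\kappa^{2}} \\
\text{transistors per unit area:}\quad & n \;\to\; \kappa^{2} n \\
\text{power density:}\quad & nP \;\to\; \kappa^{2} n \cdot \frac{P}{\kappa^{2}} = nP
\end{align*}
```

With κ = √2 ≈ 1.41, transistor density doubles and clock frequency rises by roughly 40 percent, while power per unit area is unchanged. The whole result depends on the supply voltage falling by the same factor each generation.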

But for the past ten years or so, the power performance improvements attributable to Dennard scaling have dried up. “We are in a crisis of power density,” says Sever. “That’s important because it means that if you have (or have to have) a power density increase, it means you have cooling requirements that are even more stringent than before, so your cooling equipment must be more responsive, too.”

Over time, says Sever, ‘silicon ageing’ on hard-run data center servers means that voltages need to be increased in order to achieve the same level of performance, “but if you’re adding 10 percent to the voltage, you’re adding 20 percent to the power consumption,” he says.
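Sever's figure is in line with the standard first-order model of CMOS dynamic power, P ≈ C·V²·f, in which power rises with the square of the supply voltage. The following is a minimal sketch under that assumption only. The function names are illustrative rather than anything from proteanTecs, and leakage power is ignored.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float = 1.0) -> float:
    """Relative dynamic power under P ~ C * V^2 * f, with capacitance held fixed.

    Both arguments are multipliers relative to the baseline (1.0 = unchanged).
    """
    return voltage_scale ** 2 * frequency_scale


def power_increase_percent(voltage_increase_percent: float) -> float:
    """Percentage rise in dynamic power for a given percentage rise in voltage."""
    scale = 1.0 + voltage_increase_percent / 100.0
    return (relative_dynamic_power(scale) - 1.0) * 100.0


if __name__ == "__main__":
    # Assumed example values: voltage raised to compensate for ageing.
    for pct in (5, 10, 15):
        print(f"+{pct}% voltage -> +{power_increase_percent(pct):.1f}% dynamic power")
```

Under this model a 10 percent voltage increase works out to roughly 21 percent more dynamic power, close to the 20 percent Sever quotes. Real silicon also draws leakage power, which this sketch does not model.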

That invariably means, of course, that the cooling system – which can’t be swapped out as easily as a server blade – needs to be cranked up accordingly. However, cooling overheating blade-server CPUs is not as straightforward as it ought to be: temperatures can spike in seconds, or even fractions of a second, in a process known as ‘thermal runaway’, while it can take server-hall cooling systems an hour or more to cut CPU temperatures by just a couple of degrees.

Hence, questions of chip temperature, silicon ageing, interconnect performance and more are all essential to the art of fine-tuning data center servers, on the one hand, while maximising sustainability, on the other – all the while also maintaining uptime and reliability.

ProteanTecs’ solution is based on in-chip agent monitoring and cloud-based analytics with built-in machine learning.

The company’s business model is to offer something for everyone throughout the chip lifecycle in order to maximise adoption, from designers to users. “It’s really about putting intelligence into these advanced chips,” says Sever, “so that they can report on their own health and performance, generating completely new data that’s highly actionable.”

The data generated from the chip is encapsulated in what proteanTecs calls Universal Chip Telemetry (UCT) language, which is transmitted to the company’s Proteus analytics suite hosted in the cloud. Access to these analytics is what customers, whether they are server manufacturers or data center wranglers, are paying for.

“It’s the first of its kind, end-to-end, full-lifecycle electronics visibility platform. Proteus is comprised of two main pillars: on-chip agents, and an analytics platform. The on-chip agents are IP embedded within your chip during the design stage, which are built-for-analytics and generate novel data that is inaccessible externally,” says Noam Brousard, vice president of product at proteanTecs.

“The agents provide parametric information about the chip itself, as well as its surrounding board, application and environment. They operate in-mission so there is no need to disrupt system operation for maintenance or test. Machine learning algorithms are applied to that information, turning the data into actionable insights for much deeper decision making.”

The analytics platform is also compatible with data center hardware health systems (HHS). “The Proteus platform plugs-in to that and feeds actions,” says Sever. “It’s automated, automatically deploying actions learned via Proteus to the HHS.”

Payback time

But what kind of ‘hit’ does the agent technology impose on the silicon?

According to Sever, barely any.

“It’s not free. We have a target to take no more than about one percent of the logic. But when you’re talking about gate counts, it has zero practical effect on your area because, in chip design, we talk about utilization, which is the percentage of the area transistors occupy versus the entire area of the silicon. With our agents, it’s not a high number.

“Our agents are built for analytics. They are spread around, but are not running at the same speed as the chip itself because they just need to measure phenomena. They don’t need to run at a high clock frequency, therefore, they don’t consume a lot of power and they don’t need to operate all the time, either, just at the level of granularity you need for your observations,” says Sever.

Ultimately, proteanTecs isn’t just about data center performance optimization, but about optimization of power and performance across the lifecycle of the chip, starting with design, characterization and volume manufacturing.

“The Proteus software provides visibility into the post-production variability of power and performance. Based on this, it provides insights, such as application tuning, bin planning, outlier detection and performance yield prediction for accelerated time-to-market and ten-fold improved quality,” says Brousard.

Hence, the idea of the technology isn’t just to help data center operators optimize their operations and maximise sustainability, but also to ensure that data center silicon is optimized at time-zero deployment, full stop.
