Machine learning, a subset of artificial intelligence, is about to transform everything - if you believe the breathless reports from governments and the media. It will change or replace the workforce, create new businesses, and revolutionize existing industries, bringing new innovations, strategies and dangers.
Data centers are standing by to provide a home where those AI systems will be trained and run. But how will the algorithms transform the data centers themselves? Machine learning promises to unlock efficiencies, but that will mean handing over control to the machines. Are we ready for that?
“You can call it AI, you can call it machine learning, you can call it different names - but at the end of the day what you’re trying to do is predict a future outcome based on past data,” IBM’s VP of analytics development, Dinesh Nirmal, told DCD.
The tech giant turned to its largest enterprise customers with their own private data centers, and asked them what they hoped to achieve with this ability to peer into the future. “The two main things that came out of it were improved uptime and efficiency,” Nirmal said.
With that desire in mind, data center predictive analytics company Romonet last year updated its offering to include new machine learning capabilities.
“The current machine learning aspects are at the initial data processing stage of the platform where raw data from sensors and meters is normalized, cleaned, validated and labeled prior to being fed into the predictive modeling engine,” Zahl Limbuwala, Romonet’s co-founder, said.
“The biggest improvement for customers is an unparalleled ability to see what exactly is going on in their data centers.”
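Romonet has not published the details of that pipeline, but the stages Limbuwala lists - normalize, clean, validate, label - follow a familiar pattern. A minimal sketch in Python, in which the column names, units, thresholds and subsystem mapping are illustrative assumptions rather than anything from Romonet's platform:

```python
import pandas as pd

# Hypothetical raw sensor feed with columns: timestamp, meter_id, value, unit.
# All names and limits below are illustrative assumptions.
def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()

    # Normalize: bring mixed units (e.g. W and kW) onto one scale.
    df.loc[df["unit"] == "W", "value"] /= 1000.0
    df["unit"] = "kW"

    # Clean: drop duplicates and obviously impossible readings.
    df = df.drop_duplicates(subset=["timestamp", "meter_id"])
    df = df[(df["value"] >= 0) & (df["value"] < 10_000)]

    # Validate: flag gaps longer than the expected reporting interval.
    df = df.sort_values("timestamp")
    df["gap"] = df.groupby("meter_id")["timestamp"].diff() > pd.Timedelta("15min")

    # Label: tag each reading with the subsystem it belongs to, so the
    # downstream predictive model can learn per-subsystem behavior.
    subsystem_map = {"PDU-1": "power", "CRAH-3": "cooling"}  # illustrative
    df["subsystem"] = df["meter_id"].map(subsystem_map).fillna("unknown")

    return df
```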
The signal through the noise
With data being produced by all kinds of instrumentation systems - electrical power management systems (EPMS), data center infrastructure management systems (DCIMs), branch circuit monitoring systems (BCMs), environmental monitoring systems, building management systems (BMS) sensors, the BMS itself, and more - Limbuwala said that facility operators can suffer ‘snow blindness’ from the information overload.
The machine learning platform improves the ‘signal-to-noise’ ratio to such an extent, he said, that “we [and our customers] often find surprises within the way their data centers are running.”
DCIM company Vigilent also hopes to use machine learning to discover surprises within data centers, having embraced the technology more than eight years ago. The company uses machine learning to make predictions and then uses those predictive capabilities to “take actions automatically,” Vigilent’s founder and CTO, Dr Cliff Federspiel, told DCD.
“Some DCIM systems have the ability to use tools like computational fluid dynamics on the thermal side for making certain kinds of forecasts, but as far as I know, there isn’t really a whole lot of data-oriented decision-making capabilities in DCIM systems, and in particular there aren’t ones where the decision-making is automated.”
After Vigilent sets the system up in a data center, it usually spends a week monitoring the facility. Then automation is turned on, and the customer measures the difference “to get an idea of what the system is doing for them from the point of energy savings. Then, over time, the system automatically makes changes to the state of various types of cooling equipment, temperature set points, flow rates and things like that.”
Vigilent claims that facilities usually see a 35 percent reduction in the energy spent on cooling, but the company has seen “some dramatic examples where it’s been as high as 75 percent, particularly in market sectors that are very conservative and have an extreme level of redundancy,” Dr Federspiel said.
“If they don’t have a smart system, they often just run everything, no matter what, no matter how much overcapacity they have. So in cases where the customer either has a very low load, or they’re very conservative, they end up just using a lot more power than they need to, to keep the environment right.”
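Vigilent has not disclosed its control logic, but the behavior Dr Federspiel describes - forecast first, then adjust set points - can be sketched as a simple feedback loop. The function names, target temperature and step sizes below are illustrative assumptions, not the company's product:

```python
# A minimal sketch of predict-then-act cooling control.
# predict_inlet_temp() and set_crah_setpoint() are hypothetical stand-ins
# for a real forecasting model and building-management interface.

TARGET_C = 25.0   # desired rack inlet temperature (illustrative)
DEADBAND = 0.5    # ignore small errors to avoid chasing noise

def control_step(zone, predict_inlet_temp, set_crah_setpoint, current_setpoint):
    predicted = predict_inlet_temp(zone)   # ML forecast for the next interval
    error = predicted - TARGET_C

    if error > DEADBAND:
        # Predicted to run hot: lower the supply-air set point slightly.
        new_setpoint = current_setpoint - 0.5
    elif error < -DEADBAND:
        # Predicted to run cool: raise the set point and save fan/chiller energy.
        new_setpoint = current_setpoint + 0.5
    else:
        new_setpoint = current_setpoint

    set_crah_setpoint(zone, new_setpoint)
    return new_setpoint
```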
Intelligently managing capacity is an area where machine learning can come to the fore. In the case of Tegile, the hybrid storage vendor has begun to use a cloud-based predictive analytics engine, IntelliCare, to monitor and manage its flash arrays.
“Today we collect data from all our 3,000+ arrays every hour,” Tegile CEO Rohit Kshetrapal said. “As we collect the data every hour, this allows us to predict three things - component failure, performance and balancing of the array.
“We can also tell you a year in advance when you’re going to run out of space.”
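Tegile has not said how IntelliCare models capacity, but a run-out date can be projected, at its simplest, by fitting a trend to usage samples and extrapolating to the array's limit. A minimal illustration with made-up numbers, not the vendor's method:

```python
import numpy as np

def days_until_full(timestamps_days, used_tb, capacity_tb):
    """Fit a linear trend to capacity usage and extrapolate to exhaustion.

    timestamps_days: sample times in days since the first sample.
    used_tb: terabytes used at each sample.
    capacity_tb: total usable capacity of the array.
    """
    slope, intercept = np.polyfit(timestamps_days, used_tb, 1)
    if slope <= 0:
        return None  # usage flat or shrinking; no projected run-out
    return (capacity_tb - intercept) / slope - timestamps_days[-1]

# Example: ~0.1 TB/day growth on a 100 TB array currently holding ~60 TB.
t = np.arange(0, 30)
usage = 57.0 + 0.1 * t
print(days_until_full(t, usage, 100.0))  # roughly 400 days - just over a year out
```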
Intelligent prediction
Ilia Zintchenko, CEO of Mindi, also hopes to predict performance, failures and more with the help of AI. His London-based company has operated in stealth mode since being established over a year ago, and has huge plans for its AI system, ‘Autopilot.’
“What we’re building is an AI-based framework to provide what we call ‘anti-fragility’ to IT environments,” Zintchenko told DCD in the company’s first interview with the press.
“So basically minimizing the effects of unexpected events - whether that’s predicting them or reducing the number, or alleviating them altogether. Those events could be hypervisor failures, software failures, hardware failures, power outages, cooling system problems, security breaches, general resource contention and so on.”
Initially, though, Mindi is aiming to deal with resource contention: “When you have multiple workloads running on the same servers, what usually happens is the ‘noisy neighbor’ problem - one piece of workload starts to use all the bandwidth on the server, or all the I/O or CPU, and that really slows down all the other workloads.
“What our software does is predict the resource demands of applications running in a data center and then use live migration to actually move that around the computing environment in real time to minimize these resource contention problems.”
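Mindi has yet to publish how Autopilot makes these decisions, but the loop Zintchenko describes - predict per-workload demand, then migrate away from contended hosts - might look roughly like this. The host and VM structures, the 85 percent headroom limit and predict_demand() are all illustrative assumptions, not the company's implementation:

```python
# Rough sketch of contention-driven live migration planning.
# predict_demand() stands in for whatever model forecasts each VM's usage.

def plan_migrations(hosts, predict_demand, cpu_limit=0.85):
    """Return a list of (vm, source_host, target_host) migration candidates."""
    moves = []
    for host in hosts:
        demands = {vm: predict_demand(vm) for vm in host["vms"]}
        total = sum(demands.values())
        if total <= cpu_limit * host["cpu_capacity"]:
            continue  # no predicted contention on this host

        # Pick the noisiest neighbor: the VM with the largest predicted demand.
        noisy = max(demands, key=demands.get)

        # Find a host with enough predicted headroom to take it.
        candidates = [
            h for h in hosts
            if h is not host
            and sum(predict_demand(v) for v in h["vms"]) + demands[noisy]
                <= cpu_limit * h["cpu_capacity"]
        ]
        if candidates:
            target = min(candidates,
                         key=lambda h: sum(predict_demand(v) for v in h["vms"]))
            moves.append((noisy, host, target))
    return moves
```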
Looking further, Mindi hopes to optimize data center server utilization, since many enterprise facilities regularly operate at just 20 percent utilization. “In that case you can start to actually play around with the voltage and frequency of the CPU, memory, the GPU, the motherboard and so on to really drastically reduce the power consumption without actually affecting performance too much, simply because 80 percent of the CPU is not used anyway,” Zintchenko said.
“But if you are making use of the servers in a better way, you have a lot more money and footprint to save by reducing the number of servers by, let’s say, 30 percent, rather than just reducing the power by 30 percent.”
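On Linux servers, the voltage and frequency scaling Zintchenko refers to is typically exposed through the kernel's cpufreq interface. A minimal sketch of capping frequency on lightly used cores, assuming the standard sysfs paths and root access; the 20 percent utilization threshold is illustrative:

```python
# Sketch of utilization-driven frequency capping via the Linux cpufreq
# sysfs interface. Requires root; available governors and limits vary by system.

CPUFREQ = "/sys/devices/system/cpu/cpu{n}/cpufreq/{attr}"

def read_attr(cpu, attr):
    with open(CPUFREQ.format(n=cpu, attr=attr)) as f:
        return f.read().strip()

def write_attr(cpu, attr, value):
    with open(CPUFREQ.format(n=cpu, attr=attr), "w") as f:
        f.write(str(value))

def cap_frequency(cpu, utilization):
    """Lower the maximum allowed frequency on lightly used cores."""
    min_khz = int(read_attr(cpu, "cpuinfo_min_freq"))
    max_khz = int(read_attr(cpu, "cpuinfo_max_freq"))
    if utilization < 0.2:
        # Barely used: cap close to the minimum frequency to cut power draw.
        write_attr(cpu, "scaling_max_freq", min_khz)
    else:
        # Busy: restore the full frequency range.
        write_attr(cpu, "scaling_max_freq", max_khz)
```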
Intelligently balancing tasks can reduce not only the number of servers needed, but also energy costs, as a joint research initiative by scientists from Northwestern University and the Argonne National Laboratory discovered.
“We found that for different machines, even with [the] same power setup, they have variations in power consumption and temperature,” Northwestern PhD student Kaicheng Zhang said. “So this difference and variation can be exploited intuitively by putting high-demand applications on cooler nodes, and low-demand applications on hotter nodes, to balance the peak temperature and improve performance,” Zhang continued.
The team then tested this ‘COOLR’ concept on Chameleon, the US National Science Foundation’s cloud computing testbed, reducing fan power consumption by 17 percent. But the group aims to test it out on a larger system, Professor Seda Memik, of the computer engineering division at Northwestern, said. “In this case, the machine learning algorithm shines when the system goes large scale and it’s harder to use human knowledge.”
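The published COOLR work goes further than this, but the intuition Zhang describes - the most demanding jobs go to the coolest nodes - reduces to a simple greedy pairing. A sketch with made-up job and node figures, not the project's actual scheduler:

```python
# Greedy thermal-aware placement: pair the most power-hungry jobs with the
# coolest nodes. Illustrates the intuition only.

def thermal_aware_placement(jobs, nodes):
    """jobs: {name: predicted_power_watts}; nodes: {name: current_temp_c}.
    Returns {job: node} assignments, assuming one job per node."""
    hot_jobs = sorted(jobs, key=jobs.get, reverse=True)   # most demanding first
    cool_nodes = sorted(nodes, key=nodes.get)             # coolest first
    return dict(zip(hot_jobs, cool_nodes))

placement = thermal_aware_placement(
    {"cfd_solver": 180, "web_cache": 40, "batch_etl": 95},
    {"node-a": 34.0, "node-b": 29.5, "node-c": 31.2},
)
print(placement)  # cfd_solver -> node-b (coolest), web_cache -> node-a (hottest)
```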
But as human knowledge is called into question, so too is the species’ usefulness in the data center.
The human factor
“Humans aren’t very scalable - particularly skilled and experienced people,” Romonet’s Limbuwala told DCD.
“Automation and data-driven systems, standardization, robotics, et cetera will eventually reduce the need for many skilled humans to be physically present at a data center site.”
Vigilent’s Dr Federspiel concurred: “There’s already not that many people in a data center. One of our big markets is telecoms, and they have thousands of buildings, many of which are run as ‘lights out’ and have been for a long time.”
But Ian Dixon, vice president of operations at colocation company Colt, had doubts over whether AI is ready to take over from humans just yet.
“We tend to operate with automation providing information and humans providing interpretation and action. I think we can get more intelligent data, but at the moment I’m not a big fan of the data center taking control of itself,” he said.
Talking to DCD at a Colt facility just outside of London, Dixon gestured at the surrounding servers: “Look at the complexity of the technology on this site, it’s the same as on a microchip, just a bit bigger. There’s so much interconnectivity and so much information, but it’s not made by one vendor, it’s made by multiple vendors.”
For Colt, the priority right now is ensuring the company gets information from the systems in terms of condition-based maintenance.
“That to me is the way we should be going, but you need rich information to do that. We need improved alarming from systems, improved information, improved correlation of alarming and so on,” Dixon said.
“In terms of intervention, I’m still sitting in the camp of ‘I want a human pushing that button.’ We’ve still got manned sites and I see that for a few years yet.”
Dr Federspiel understood the trepidation: “When we were getting the company going, there was a lot of apprehension about letting an AI operate your data center, so we offered a recommendation system as well as the automation one.”
Vigilent’s VP Bob Thronson added: “We have over 500 data centers and telecom exchanges we’re currently deployed in globally - virtually all of those use the automatic control system.”
For IBM’s Nirmal and his survey of large customers, there’s only one long-term objective: “How do you make self-managing, self-healing and self-optimizing data centers? That’s the goal that every data center is aspiring to.”
And, perhaps in that future, all that will be left is the machine itself.
This article appeared in the October/November issue of DCD Magazine.