There are few aspects of today’s world that aren’t being touched by artificial intelligence and machine learning. Automated systems can be used to predict faults and respond to capacity needs, and are paving the way for the era of the ‘lights out’ data center.
But while pre-packaged AI & ML solutions that are ready to go out of the box are becoming available, they often still require integration to function beyond individual point solutions. And while DIY AI deployments are entirely doable, they require investment in sensors to collect the data, and the expertise to manipulate it into something usable.
“Giant industry players have been doing it for several years already, but most data center companies are just beginning to set up their data gathering and MLOps pipelines,” says Maciej Mazur, Product Manager for AI/ML, Canonical.
But while the likes of Google are well-placed to develop and deploy such technology, how available are AI & ML for small to medium data center owners and operators?
AI showing its value in the data center
For the hyperscalers and public cloud providers, AI & ML are already part and parcel of data center deployment and operations. Google has previously detailed how it uses DeepMind AI for cooling and was able to cut PUE by 15 percent through the automatic management of variables including fans, cooling systems, and windows. The company has also used DeepMind to predict the output of wind turbines up to 36 hours in advance, which it used to predict power needs for its facilities connected to wind farms.
“Alibaba Cloud deployed ML-based temperature alert systems in its global data centers,” explains Wendy Zhao, Senior Director & Principal Engineer, Alibaba Cloud Intelligence. “We took hundreds of temperature sensors’ time series monitoring data, using an ensembled graph model to quickly and precisely identify a temperature event due to cooling facility faults.”
“It generated alerts much further in advance and gave the data center operations team precious time to respond to the fault, hence reducing failure impact.”
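As a rough illustration of how such an alert might work, the sketch below runs two simple detectors over a single sensor's readings and only alerts when both agree. Alibaba's production system uses an ensembled graph model over hundreds of sensors, so this is a hedged simplification: the thresholds and data are invented for illustration.

```python
# Minimal sketch of an ensemble-style temperature alert, loosely inspired by
# the approach Zhao describes. Two simple detectors vote on one sensor's time
# series; all thresholds and data are illustrative assumptions.
from statistics import mean, stdev

def zscore_alarm(window, latest, k=3.0):
    """Flag readings more than k standard deviations above the window mean."""
    return latest > mean(window) + k * stdev(window)

def slope_alarm(window, latest, max_rise=0.5):
    """Flag a rapid rise: latest reading jumps more than max_rise deg C
    above the previous sample."""
    return latest - window[-1] > max_rise

def ensemble_alert(window, latest):
    # Require both detectors to agree before alerting, trading a little
    # sensitivity for far fewer false alarms.
    return zscore_alarm(window, latest) and slope_alarm(window, latest)

history = [22.1, 22.0, 22.2, 22.1, 22.3, 22.2, 22.1, 22.2]  # deg C, one per minute
print(ensemble_alert(history, 24.0))  # True: abrupt, statistically unusual rise
```

Requiring agreement between a statistical detector and a rate-of-change detector is one simple way to get alerts early without drowning operators in false positives.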
Cooling and predictive maintenance were the most cited use cases for AI in the data center, according to the people DCD spoke to. Power management, workload management, and security were cited as potential use cases that have yet to see significant traction.
“There have been exciting innovations, particularly from organisations like Google, which is showing the huge potential AI can provide,” says Dave Sterlace, Head of Technology, Global Data Center Solutions at ABB. “The potential is there and is being demonstrated but it’s not hugely widespread yet.”
Oliver Goodman, Head of Engineering, Telehouse Europe, says his company collects data on facility temperature, humidity, and how 'hard' infrastructure is working in order to understand what could be done to extend the serviceable life of equipment, and whether savings could be made in terms of energy efficiency and the capital expenditure of upgrades and part replacements.
“AI can collect data such as customer load, aisle temperatures and humidities in each data hall and perform an action based on certain trigger or set points. So, if the customer load goes beyond a certain level, cooling infrastructure can be ramped up or down to provide sufficient cooling in the most energy efficient way.”
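A minimal sketch of that set-point logic might look like the following; the trigger levels and the hysteresis gap are illustrative assumptions, not Telehouse's actual control scheme.

```python
# Sketch of the trigger/set-point behavior Goodman describes: ramp cooling in
# response to customer load crossing thresholds. All values and the control
# interface are assumptions for illustration.
HIGH_LOAD_KW = 450.0   # trigger point to ramp cooling up
LOW_LOAD_KW = 350.0    # trigger point to ramp cooling down (hysteresis gap)

def cooling_action(load_kw, cooling_on_high):
    """Return the new cooling state for the current customer load.

    The gap between the two triggers stops the plant oscillating when the
    load hovers near a single set point."""
    if load_kw > HIGH_LOAD_KW:
        return True        # ramp up: run additional cooling units
    if load_kw < LOW_LOAD_KW:
        return False       # ramp down: idle the extra units to save energy
    return cooling_on_high  # inside the dead band: keep the current state

state = False
for load in (300, 420, 480, 430, 340):  # kW readings from a data hall
    state = cooling_action(load, state)
    print(load, "kW ->", "high cooling" if state else "baseline cooling")
```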
Callum Faulds, Director at Linesight UK, adds that AI has been useful during the pandemic to keep a minimal number of staff on-site and keep those who are there safe.
“Safety and security applications such as automatic temperature checks, touchless authorization, payment and control systems, and traffic monitoring, which played a vital role throughout the pandemic are likely to remain in the future.”
“Machine learning can also be deployed in addition to AI to automatically understand load patterns and predict when fluctuations will occur, as well as for infrastructure operations; for example, for load transfer or intelligent switching between redundant and resilient equipment. This frees up the operational resource to concentrate on maintenance and repairs rather than plant running cycles.”
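As a sketch of what predicting fluctuations can mean in practice, the snippet below fits a tiny autoregressive model to recent load readings and forecasts the next interval. The synthetic data, lag count, and units are assumptions for illustration only.

```python
# Sketch: anticipating load fluctuations with a tiny autoregressive model,
# fit by ordinary least squares on lagged readings. Synthetic data.
import numpy as np

def fit_ar(history, lags=3):
    """Fit load[t] ~ w . load[t-lags:t] + b with ordinary least squares."""
    X = np.array([history[i:i + lags] for i in range(len(history) - lags)])
    y = np.array(history[lags:])
    Xb = np.hstack([X, np.ones((len(X), 1))])   # add a bias column
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef

def forecast(history, coef, lags=3):
    window = np.array(history[-lags:] + [1.0])  # last lags readings + bias
    return float(window @ coef)

# Hourly IT load in kW with a gentle upward drift (synthetic).
load = [400 + 2 * t + 5 * np.sin(t / 3) for t in range(48)]
coef = fit_ar(load)
print(f"next-hour load forecast: {forecast(load, coef):.1f} kW")
```

A forecast like this is what lets an operator pre-stage cooling or schedule a load transfer before the fluctuation arrives, rather than reacting to it.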
However, ABB’s Sterlace adds that conversations with many customers still center on the early phases: what machine learning and AI are, and the potential benefits they could bring, rather than deployment.
Vendors still too closed for widespread AI, data needs to be more open
Vendors are beginning to incorporate machine learning and AI into their products, but these are often still point solutions that don’t play well with others. Though sometimes talked about, a single ‘Siri for the data center’ or pane of glass that can manage every aspect of a facility has yet to emerge.
“The suppliers are providing solutions, but most are providing a solution that is specific to particular products, often from this same vendor, causing vendor lock-in,” says David Cheriton, Chief Data Center Scientist, Juniper Networks.
“Data center operations & management is still very piecemeal due to the typically heterogeneous equipment stack,” adds Michael Cantor, CIO at Park Place Technologies. “Different vendors are at different levels of capability, and I would say that few are embedding true AI/ML into their operations stack.”
However, Telehouse’s Goodman says much of the innovation from the large hyperscalers will ultimately filter down to smaller operators, as third parties that have been helping the likes of Google spin off and bring their own products to market.
“The costs of developing an in-house AI are high, and become more so when you have the infrastructure variances of a colocation company. However, the AI product market is growing, with many new players each year, and this will be a fantastic opportunity for data center improvement, including legacy sites, regardless of the size and quantity of your estate.”
Developing AI in-house not impossible for those willing to try
While the most advanced models and use cases will require dedicated in-house data science expertise, it is possible to begin developing your own models using self-service ML tools such as AWS SageMaker, assuming facilities are able to collect the right data.
“An SME can hire a very small team of data engineers that can use ready-made models, for example, from the NVIDIA NGC catalog,” says Canonical’s Mazur. “Anyone can set up an MLOps pipeline and start collecting data in a way that is useful for data scientists. Regarding the scale, it’s better to take existing models available online and adjust them to the data center for smaller use cases, but it’s worthwhile to invest in custom ML models for over 1,000 servers.”
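At its simplest, “collecting data in a way that is useful for data scientists” can amount to appending tidy, structured rows as readings arrive. The sketch below does that with a CSV file; the file name, field names, and the read_sensor stub are assumptions, standing in for a real driver.

```python
# Minimal sketch of the data-collection end of an MLOps pipeline: one tidy
# CSV row per sample, so later modeling work starts from structured data.
import csv
from datetime import datetime, timezone

FIELDS = ["timestamp", "sensor_id", "metric", "value", "unit"]

def read_sensor(sensor_id):
    """Stub standing in for a real driver (Modbus, SNMP, IPMI, ...)."""
    return {"metric": "inlet_temp", "value": 22.4, "unit": "C"}

def append_sample(path, sensor_id):
    reading = read_sensor(sensor_id)
    row = {"timestamp": datetime.now(timezone.utc).isoformat(),
           "sensor_id": sensor_id, **reading}
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:          # new file: write the header once
            writer.writeheader()
        writer.writerow(row)

append_sample("telemetry.csv", "hall1-rack12-inlet")  # hypothetical sensor ID
```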
“A model can be trained with a couple of months of data collection with a sampling rate around a few minutes,” says Alibaba’s Zhao. “Some equipment already provides structured monitoring data. It would be really useful to establish some industry standards of monitoring data format for major data center equipment manufacturers to follow, which will accelerate the adoption of AI/ML technologies.”
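No such standard exists yet, but as a hedged sketch of what a common monitoring record might contain, consider something like the following; every field name here is an assumption, not an existing specification.

```python
# Hypothetical common monitoring record of the kind Zhao argues for: a small
# dataclass serialized to JSON. Field names are assumptions, not a standard.
import json
from dataclasses import dataclass, asdict

@dataclass
class MonitoringRecord:
    equipment_id: str     # e.g. a CRAC unit or UPS serial
    vendor: str
    metric: str           # "supply_temp", "fan_speed", ...
    value: float
    unit: str
    sampled_at: str       # ISO 8601, UTC

rec = MonitoringRecord("crac-07", "AcmeCooling", "supply_temp",
                       18.2, "C", "2021-06-01T12:00:00Z")
print(json.dumps(asdict(rec)))
```

If every manufacturer emitted records in one agreed shape, the same model code could be pointed at equipment from any vendor, which is exactly the adoption accelerant Zhao describes.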
While many systems used to rely on closed protocols, which made it difficult to extract the data for use in a wider AI or control system, Goodman says we are now seeing much better adoption through the use of common and open protocol communication interfaces, something his company deliberately specifies when selecting vendors.
“At the moment there’s a lot of data collected that’s not always well utilized, that we can extract more benefit from. That’s where most operators can look to make big gains, with the move towards implementing more sensors helping beyond that,” he says. “As sensor technology becomes cheaper, and the communications networks behind them and the collection of data become more robust, then the AI products behind them will become more established, but only if there is a compelling use case.”
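Open interfaces are what make that extraction cheap. The sketch below polls a hypothetical HTTP/JSON sensor gateway using only the Python standard library; the URL and payload shape are assumptions, and real equipment more often speaks Modbus, BACnet, or SNMP.

```python
# Sketch: pulling a reading from an open, documented interface. The gateway
# URL and JSON payload shape here are hypothetical.
import json
from urllib.request import urlopen

def poll_sensor(url):
    with urlopen(url, timeout=5) as resp:
        payload = json.load(resp)        # e.g. {"value": 22.4, "unit": "C"}
    return payload["value"], payload["unit"]

value, unit = poll_sensor("http://sensor-gw.local/api/hall1/temp")  # hypothetical
print(f"hall 1 supply air: {value} {unit}")
```

The point of specifying open protocols in procurement is that integration shrinks to a few lines like these, instead of reverse-engineering a vendor's closed format.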
Retrofitting AI in the data center still a challenge, but not impossible
Implementing AI & ML in a greenfield site built with the latest and greatest equipment and techniques is entirely possible if a company so desires. But many data centers are decades old, containing equipment that predates the latest innovations in the space, which requires more work to make ‘smart’.
“Although modern data centers do not resemble traditional commercial buildings – they are purpose-built pieces of infrastructure – many are still built using traditional methods and adopt design strategies which often cause the individual subsystems to be divided into standalone systems or silos and make implicit assumptions about control system failures, often at considerable cost,” says ABB’s Sterlace.
“For example, traditional management systems that do not actually perform any management – they just monitor – and in which individual subsystems manage themselves and exclude any possibility of mutual coordination, are common. No overall system supplier is tasked with unification or consolidation.”
While retrofitting AI can be a tall ask, it is entirely possible if companies are willing to put in the time and effort to install sensors – and, in colo facilities, assuming customers will allow close access to their hardware during any such project – and create the data models.
“Companies considering developing their own AI/ML for data center management will need sensors in all parts of the data center to monitor temperature, humidity and electricity draw by rack, row, cage, room, etc,” says Yann Lechelle, CEO of Scaleway. “In order to monitor mechanical electrical equipment, a proper information system must be put in place to log the data in an industrial way. Only then can proper data processing occur. In our latest data center, we have 2,500 sensors per room for 11 rooms.”
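A quick back-of-the-envelope check shows why logging "in an industrial way" matters at that scale; the bytes-per-sample figure and the five-minute interval below are assumptions (the interval borrows Zhao's "few minutes" sampling rate).

```python
# Rough data-volume estimate for Lechelle's deployment: 2,500 sensors per
# room across 11 rooms. Sampling interval and record size are assumptions.
sensors = 2_500 * 11              # 27,500 sensors in the facility
samples_per_day = 24 * 60 // 5    # one reading every 5 minutes
bytes_per_sample = 100            # assumed size of one structured record

daily = sensors * samples_per_day * bytes_per_sample
print(f"{sensors:,} sensors -> {daily / 1e9:.1f} GB/day, "
      f"{daily * 365 / 1e12:.2f} TB/year of raw telemetry")
```

The raw volume is modest by storage standards; the hard part is keeping 27,500 streams reliably timestamped, labeled, and queryable, which is the information system Lechelle is describing.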
“It is totally doable to retrofit an old facility into an AI-driven world with external IoT devices and we have explored and verified its feasibility in some of Alibaba Cloud’s facilities as well,” adds Zhao.
As an example, Canonical’s Mazur says he created a predictive maintenance solution for a telecom operator:
“I used simple boards, essentially Raspberry Pi equivalents, that were connected to older devices to gather data and run small ML models locally. These models were exported to the cloud, then similar devices competed with each other, like in the AWS DeepRacer league.”
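Mazur's actual deployment is not public, but the pattern he describes, a model small enough to run locally on a Raspberry Pi-class board and flag drift in a device's signature, can be sketched with something as simple as an exponentially weighted moving average; the detector and thresholds below are assumptions.

```python
# Illustrative edge predictive-maintenance detector in the spirit of Mazur's
# description: light enough for a cheap board, flags drift for upstream
# reporting. The EWMA approach and all values are assumptions.
class EwmaDriftDetector:
    """Exponentially weighted moving average with a fixed drift threshold."""
    def __init__(self, alpha=0.1, threshold=2.0):
        self.alpha, self.threshold, self.ewma = alpha, threshold, None

    def update(self, reading):
        if self.ewma is None:
            self.ewma = reading
            return False
        drift = abs(reading - self.ewma)
        self.ewma = (1 - self.alpha) * self.ewma + self.alpha * reading
        return drift > self.threshold   # True -> report upstream

detector = EwmaDriftDetector()
for temp in (41.0, 41.2, 40.9, 41.1, 44.5):   # deg C from an aging device
    if detector.update(temp):
        print(f"maintenance flag at {temp} deg C")  # fires on the 44.5 jump
```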
To do this, though, operators need to be brave and plan ahead. Obviously uptime and availability are sacrosanct, and so any sort of retrofit needs to be done in a thoughtful way that doesn’t impact operations.
“One of the key challenges in automating existing facilities is the fear of breaking things while the existing systems are running. Needless to say, it is hard to retrofit without touching,” explains Juniper’s Cheriton.