A recent research report has demonstrated that it is theoretically possible to bring down an entire data center by manipulating its servers into suddenly creating a power surge. At first glance this seems quite alarming, but some simple steps can protect you from being ‘attacked’.

Some have expressed surprise that data center owners have been ‘over-subscribing’ their power allocation in a way that makes this ‘attack’ possible. In reality this is nothing new; over-subscription of the available power and cooling in a data center is the norm for service providers, colo as well as cloud.

So yes, in theory, many data centers (not just cloud, but colo and enterprise too) could be made to overload part of their power distribution system through aggressive workload deployment.

This has come about through a number of factors and is, for many, a by-product of the lack of IT knowledge in the data center groups.

  1. Energy Star-type programs and recent changes from Intel have made server power draw very dependent upon workload. DCIM tools which “provision” a standard percentage or number of watts per server now simply add risk, and even those with deep pockets (e.g. those in the financial industry) who actively try to waste energy by turning off power management can no longer stop Intel chipsets from power-managing themselves.
  2. Historically very flat IT loads have allowed operators to get greater utilization from their assets by managing to actual draw rather than to the PSU rating or peak power of the IT equipment.
  3. DCIM tools know absolutely nothing about IT, service deployment etc. (this is as it should be; they should simply provide an interface to expose their DCIM data to competent IT management platforms).
  4. The people with access to the M&E metering normally have no access to IT and vice-versa, which is another big thank you to “corporate security” in many cases.

There is nothing wrong with over-subscription. In any large enough environment you get a statistical levelling of load, as not all customers or platforms will peak at the same time, and this is one of the basic advantages of a service provider. The only questions are what level of over-subscription you are prepared to allow and how you manage it.
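
To make the statistical levelling concrete, here is a rough sketch, not a capacity-planning tool. It assumes each server is independently either idle or at peak; the function names, the wattages and the 30 kW feed are all hypothetical figures chosen for illustration.

```python
import random


def oversubscription_ratio(peak_watts, capacity_watts):
    """Sum of per-server peak draws divided by the feed's rated capacity."""
    return sum(peak_watts) / capacity_watts


def overload_probability(idle_watts, peak_watts, p_busy, capacity_watts,
                         trials=10_000, seed=0):
    """Estimate how often the combined draw exceeds capacity.

    Assumption (for illustration only): each server is independently at
    peak with probability p_busy and at idle otherwise.
    """
    rng = random.Random(seed)
    overloads = 0
    for _ in range(trials):
        draw = sum(
            peak if rng.random() < p_busy else idle
            for idle, peak in zip(idle_watts, peak_watts)
        )
        if draw > capacity_watts:
            overloads += 1
    return overloads / trials


# Hypothetical zone: 100 servers, 150 W idle, 400 W peak, 30 kW feed.
idle = [150] * 100
peak = [400] * 100
capacity = 30_000

ratio = oversubscription_ratio(peak, capacity)               # about 1.33
normal_day = overload_probability(idle, peak, 0.2, capacity)  # independent load
coordinated = overload_probability(idle, peak, 1.0, capacity)  # every server driven to peak
```

With independent loads the feed is over-subscribed by a third yet practically never overloads; the risk only appears when something drives every server to peak at once, which is exactly the case the over-subscription level has to be managed against.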

It is well known that the peak power draw of the IT equipment exceeds the capacity of the power infrastructure in most data centers, in particular the power-on surge current. If you drop power to a data hall you can’t just turn it back on; you frequently have to go circuit by circuit, as trying to bring up an entire PDU at once would just pop the breakers when all the IT equipment PSUs draw their inrush current together.
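
The circuit-by-circuit restart can be sketched as a simple batching problem. This is a minimal illustration under stated assumptions: `plan_power_on`, the circuit names and the amp figures are hypothetical, and real inrush is a short transient that depends on the PSUs and breaker curves rather than a single number per circuit.

```python
def plan_power_on(circuits, inrush_limit_amps):
    """Group circuits into power-on batches, in the order given.

    circuits: list of (name, estimated_inrush_amps) tuples.
    A new batch is started whenever adding the next circuit would push
    the batch's combined inrush above inrush_limit_amps.
    """
    batches = []
    current, current_amps = [], 0.0
    for name, amps in circuits:
        if amps > inrush_limit_amps:
            raise ValueError(f"circuit {name} alone exceeds the inrush limit")
        if current and current_amps + amps > inrush_limit_amps:
            batches.append(current)
            current, current_amps = [], 0.0
        current.append(name)
        current_amps += amps
    if current:
        batches.append(current)
    return batches


# Hypothetical PDU with four circuits and a 100 A inrush allowance.
plan = plan_power_on([("A", 60), ("B", 50), ("C", 40), ("D", 30)], 100)
# plan == [["A"], ["B", "C"], ["D"]]
```

Each batch would be energized and allowed to settle before the next one is switched on.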

In terms of mitigating this “attack vector” there are many simple and sensible approaches. Note that these are all driven from the IT deployment / management tooling and not DCIM:

  1. Stop deploying new VMs to a row / zone of the floor when the PDU(s) feeding that zone hit comfort % of rated power (note that in many data centers the power system is 2N or N+1, so actually overloading equipment is rather harder, as you have (est.) 120% to 250% of the ‘limit’ power available).
  2. Use Intel Node Manager (as Baidu and many others do) to assign rack, row and room level power budgets; this tool allows you to reach out and slow down the clocks on whole data centers’ worth of servers to meet a specified power budget, and this fundamentally defeats the “attack”.
  3. Actively drop VMs in a row / zone of the floor when the PDU(s) feeding the zone hit scary % of the rated power; in many cloud-type environments (see AWS), loss of machines and restarts elsewhere are normal for customers.
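
The first and third approaches amount to a two-threshold policy in the deployment tooling, which might look something like the sketch below. The names `Action` and `zone_action` are hypothetical, and the 80% and 95% defaults are placeholders standing in for “comfort %” and “scary %”; the right values depend on your own redundancy and risk appetite. Power capping (the second approach) is left out because it would be driven through the vendor's own tooling.

```python
from enum import Enum


class Action(Enum):
    DEPLOY = "deploy"  # zone may accept new VMs
    HOLD = "hold"      # stop placing new VMs in this zone
    SHED = "shed"      # actively drop or migrate VMs out of this zone


def zone_action(pdu_draw_watts, pdu_rated_watts, comfort=0.80, scary=0.95):
    """Pick a placement action from the measured draw on the zone's PDU(s).

    comfort and scary are fractions of rated power; the defaults are
    illustrative placeholders, not recommendations.
    """
    if not 0 < comfort < scary:
        raise ValueError("expected 0 < comfort < scary")
    utilization = pdu_draw_watts / pdu_rated_watts
    if utilization >= scary:
        return Action.SHED
    if utilization >= comfort:
        return Action.HOLD
    return Action.DEPLOY


# Hypothetical 100 kW zone at three different measured draws.
quiet = zone_action(50_000, 100_000)  # Action.DEPLOY
busy = zone_action(85_000, 100_000)   # Action.HOLD
hot = zone_action(97_000, 100_000)    # Action.SHED
```

The point of keeping this in the scheduler rather than in DCIM is that only the scheduler knows which workloads it can hold back or move.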

Things not to do to mitigate the “attack”:

  1. Try to fix it by filling disks up with junk power metering data; don’t buy “smart” power strips and try to meter every single server, that will only help a forensic investigation after the event and still won’t tell you which workloads tipped you over.
  2. Add “I’m scared” + “fudge factor” % to your power allocations “in case” of a workload spike; this just wastes money, energy and capacity and won’t help you anyway. If you can’t understand and control the spikes they’ll get you sooner or later.