A software bug and a failure in AWS’ emergency backup power supply were responsible for extending the major service outage on Sunday, the company revealed.
Heavy rainfall and winds gusting up to 96 km/h knocked out power from a utility provider, forcing the data center onto emergency power. AWS has two backup power systems to deliver emergency supply, but both failed for some server instances, according to The Register.
While the company worked to restore the instances that had been knocked out of action, engineers discovered “a latent bug” in the company’s instance management software. A minority of instances had to be restored to working condition manually, meaning they were not fully operational until later on Monday.
Availability of data was also disrupted in some instances where dead disks required manual repair.
Interrupted power supply
AWS’ diesel rotary uninterruptible power supply (DRUPS), which integrates a diesel generator and a mechanical UPS, would usually cover the shortfall under such circumstances.
“Under normal operation, the DRUPS uses utility power to spin a flywheel which stores energy. If utility power is interrupted, the DRUPS uses this stored energy to continue to provide power to the data center while the integrated generator is turned on to continue to provide power until utility power is restored,” Amazon said.
On Sunday, “a set of breakers responsible for isolating the DRUPS from utility power failed to open quickly enough.”
These breakers are installed to “assure that the DRUPS reserve power is used to support the data center load during the transition to generator power. Instead, the DRUPS system’s energy reserve quickly drained into the degraded power grid.”
As a result, the data center did not receive the power it needed to keep running: operations failed and large amounts of data became unavailable.
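The sequence Amazon describes can be illustrated with a rough simulation. The sketch below is only a toy model: the function name `ride_through` and every figure in it (stored energy, data center load, drain into the grid, generator start time) are hypothetical values chosen for illustration, not numbers published by AWS. It shows why the speed of the isolating breakers matters: while they remain closed, the flywheel's reserve feeds the degraded grid as well as the data center.

```python
def ride_through(breaker_open_ms, generator_start_ms=10_000,
                 stored_kj=6000.0, load_kw=400.0, grid_drain_kw=4000.0,
                 step_ms=100):
    """Return True if the flywheel reserve lasts until the generator takes over.

    All default values are hypothetical and exist only to illustrate the idea.
    """
    energy_kj = stored_kj
    for t_ms in range(0, generator_start_ms, step_ms):
        drain_kw = load_kw  # the data center always draws from the reserve
        if t_ms < breaker_open_ms:
            # Breakers still closed: the reserve also drains into the grid.
            drain_kw += grid_drain_kw
        energy_kj -= drain_kw * (step_ms / 1000.0)  # kW * s = kJ
        if energy_kj <= 0:
            return False  # reserve depleted before the generator is online
    return True


# Breakers open almost immediately: the reserve bridges the gap.
print(ride_through(breaker_open_ms=100))
# Breakers open too slowly: the reserve is gone before the generator starts.
print(ride_through(breaker_open_ms=2000))
```

With these assumed figures, a fast breaker leaves the flywheel with energy to spare, while a two-second delay empties it in a little over a second, well before the generator is ready.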
AWS has pledged to add more circuit breakers so that generators can start before the UPS systems are depleted if utility power fails again. It also plans software changes, expected to be available in Sydney in July, to make its APIs more resilient.