Amazon traces cloud outage to faulty breaker

Amazon Web Services has released details about the root cause of an outage in one of its public cloud's availability zones, which began on the evening of 14 June and lasted until the next morning, US Pacific time.

In a note posted on the cloud’s status dashboard, the company said the outage was caused by a cable fault in the power distribution system of the electric utility serving the data center that hosts the affected availability zone, part of the cloud's US-East-1 region in northern Virginia.

The entire facility was switched over to back-up generator power, but one of the generators overheated and shut down because of a defective cooling fan. The virtual-machine instances and virtual-storage volumes powered by this generator were then transferred to a secondary back-up power system, fed by a separate power-distribution circuit with its own back-up generator capacity.

However, one of the breakers on this back-up circuit was configured incorrectly: it was set to open at too low a power threshold, and it tripped as soon as the load was transferred to the circuit.

“After this circuit breaker opened … the affected instances and volumes were left without primary, back-up, or secondary back-up power,” Amazon’s note read.
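The chain of failures described above can be sketched as a simple model. This is an illustration only, not Amazon's actual control logic: the class, the function name and all kilowatt figures are hypothetical, chosen to show how a breaker threshold set below the transferred load leaves no usable power tier.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PowerSource:
    """One tier in a hypothetical power failover chain."""
    name: str
    available: bool = True
    # Load (kW) above which this tier's breaker opens; None means no limit is modelled.
    breaker_trip_kw: Optional[float] = None

    def can_carry(self, load_kw: float) -> bool:
        if not self.available:
            return False
        if self.breaker_trip_kw is not None and load_kw > self.breaker_trip_kw:
            return False  # breaker opens as soon as the load arrives
        return True


def first_usable_source(chain: Sequence[PowerSource], load_kw: float) -> Optional[str]:
    """Return the name of the first tier able to carry the load, or None."""
    for source in chain:
        if source.can_carry(load_kw):
            return source.name
    return None


LOAD_KW = 500.0  # illustrative figure, not from Amazon's note

chain = [
    PowerSource("utility feed", available=False),       # cable fault
    PowerSource("back-up generator", available=False),  # overheated and shut down
    PowerSource("secondary back-up circuit", breaker_trip_kw=300.0),  # threshold too low
]

print(first_usable_source(chain, LOAD_KW))  # None: no tier can carry the load
```

With the hypothetical threshold raised above the load, the same function would return the secondary back-up circuit, which is the behavior the redundant design was meant to provide.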

Customers in this availability zone that were running multi-availability-zone configurations “avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.”

Among the customers that suffered downtime were a number of popular web services, including Pinterest, Heroku, Quora and Foursquare, according to news reports. Heroku, for example, reported widespread outages of its production and development infrastructures that lasted for eight hours.

The faulty circuit breaker opened around 9pm, Amazon says, and the failed generator was restarted around 10:20pm. Most affected VM instances recovered by 10:50pm, and most cloud-storage volumes were “returned to customers” by about 1am.

Amazon said it had completed an audit of all of its back-up power distribution circuits and found another breaker that “needed corrective action.”

“We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes,” Amazon said.
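A configuration check of the kind Amazon describes might look something like the sketch below. Amazon has not published its audit procedure, so everything here is an assumption: the inventory format, the breaker names, the kilowatt values and the 25 percent safety margin are all hypothetical.

```python
def misconfigured_breakers(breakers, safety_margin=1.25):
    """Return IDs of breakers whose trip threshold is below expected peak load times the margin."""
    flagged = []
    for breaker_id, cfg in breakers.items():
        required_kw = cfg["expected_peak_kw"] * safety_margin
        if cfg["trip_kw"] < required_kw:
            flagged.append(breaker_id)
    return sorted(flagged)


# Hypothetical inventory: configured trip threshold versus expected peak load, in kW.
inventory = {
    "circuit-a": {"trip_kw": 800.0, "expected_peak_kw": 500.0},
    "circuit-b": {"trip_kw": 300.0, "expected_peak_kw": 500.0},
    "circuit-c": {"trip_kw": 600.0, "expected_peak_kw": 500.0},
}

print(misconfigured_breakers(inventory))  # ['circuit-b', 'circuit-c']
```

Running such a check on a schedule, rather than once after an incident, is what would make it part of a regular testing and audit process.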
