Amazon traces cloud outage to faulty breaker

Amazon Web Services has released details about the root cause of an outage in one of its public cloud's availability zones, which started on the evening of 14 June and lasted until the next morning, US Pacific time.

In a note posted on the cloud's status dashboard, the company said the outage was caused by a cable fault in the power distribution system of the electric utility serving the data center that hosts the affected availability zone, part of the cloud's US-East-1 region in northern Virginia.

The entire facility was switched over to back-up generator power, but one of the generators overheated and powered off because of a defective cooling fan. The virtual-machine instances and virtual-storage volumes powered by this generator were transferred to a secondary back-up power system, provided by a separate power-distribution circuit that has its own back-up generator capacity.

But one of the breakers on this back-up circuit was configured incorrectly: it was set to open at too low a power threshold, and it opened as soon as the load was transferred to the circuit.

“After this circuit breaker opened … the affected instances and volumes were left without primary, back-up, or secondary back-up power,” Amazon’s note read.

Customers in this availability zone who were running multi-availability-zone configurations “avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.”

Customers that suffered downtime included a number of popular web services, among them Pinterest, Heroku, Quora and Foursquare, according to news reports. Heroku, for example, reported widespread outages of its production and development infrastructures that lasted for eight hours.

The faulty circuit breaker opened around 9pm, Amazon said, and the failed generator was restarted around 10:20pm. Most affected VM instances recovered by 10:50pm, and most cloud-storage volumes were “returned to customers” by about 1am.

Amazon said it had completed an audit of all of its back-up power distribution circuits and found another breaker that “needed corrective action.”

“We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes,” Amazon said.
