Amazon traces cloud outage to faulty breaker

Amazon Web Services has released details about the root cause of an outage in one of its public cloud’s availability zones, which started on the evening of 14 June and lasted until the next morning, US Pacific time.

In a note posted on the cloud’s status dashboard, the company said the outage was caused by a cable fault in the power distribution system of the electric utility that served the data center hosting the affected availability zone, part of the cloud’s US-East-1 region in northern Virginia.

The entire facility was switched over to back-up generator power, but one of the generators overheated and powered off because of a defective cooling fan. The virtual-machine instances and virtual-storage volumes that were powered by this generator were transferred to a secondary back-up power system, provided by a separate power-distribution circuit that has its own back-up generator capacity.

However, one of the breakers on this back-up circuit was configured incorrectly: it was set to open at too low a power threshold, so it opened as soon as the load was transferred to the circuit.

“After this circuit breaker opened … the affected instances and volumes were left without primary, back-up, or secondary back-up power,” Amazon’s note read.
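The failure sequence described above can be sketched as a simple failover chain. This is only an illustrative model, not Amazon's actual control logic: the `PowerSource` class, the `select_source` function, and all kilowatt figures are hypothetical and chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PowerSource:
    """Hypothetical model of one stage in a power failover chain."""
    name: str
    available: bool = True
    # Load in kW above which the breaker opens; None means no limit is modeled.
    breaker_trip_kw: Optional[float] = None

    def can_carry(self, load_kw: float) -> bool:
        if not self.available:
            return False
        if self.breaker_trip_kw is not None and load_kw > self.breaker_trip_kw:
            return False  # breaker opens as soon as the load is applied
        return True


def select_source(chain: List[PowerSource], load_kw: float) -> Optional[str]:
    """Return the name of the first source able to carry the load, else None."""
    for source in chain:
        if source.can_carry(load_kw):
            return source.name
    return None


# Assumed scenario: utility feed lost, generator overheated, and the
# secondary circuit's breaker set below the load it must carry.
LOAD_KW = 500.0  # hypothetical load
chain = [
    PowerSource("utility", available=False),
    PowerSource("back-up generator", available=False),
    PowerSource("secondary back-up", breaker_trip_kw=300.0),  # set too low
]
print(select_source(chain, LOAD_KW))
```

With the breaker threshold raised above the load (for example 800 kW in this model), `select_source` would return the secondary circuit instead of `None`, which is the behavior the redundant design was meant to provide.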

Customers in this availability zone that were running multi-availability-zone configurations “avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.”

Among the customers affected by the downtime were a number of popular web services, including Pinterest, Heroku, Quora and Foursquare, according to news reports. Heroku, for example, reported widespread outages of its production and development infrastructures that lasted for eight hours.

The faulty circuit breaker opened around 9pm, Amazon says, and the failed generator was restarted around 10:20pm. Most affected VM instances recovered by 10:50pm, and most cloud-storage volumes were “returned to customers” by about 1am.

Amazon said it had completed an audit of all of its back-up power distribution circuits and found another breaker that “needed corrective action.”

“We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes,” Amazon said.
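A configuration check of the kind Amazon describes might look roughly like the sketch below. Amazon has not published how its audit works, so the record fields (`id`, `trip_kw`, `expected_load_kw`), the `find_misconfigured` function, and the 25 percent safety margin are all assumptions made for illustration.

```python
from typing import Dict, List


def find_misconfigured(breakers: List[Dict], margin: float = 1.25) -> List[str]:
    """Return IDs of breakers whose trip threshold is below the expected
    load multiplied by a safety margin (the margin is an assumed value)."""
    return [
        b["id"]
        for b in breakers
        if b["trip_kw"] < b["expected_load_kw"] * margin
    ]


# Hypothetical inventory; none of these values come from Amazon's note.
inventory = [
    {"id": "circuit-A", "trip_kw": 600.0, "expected_load_kw": 400.0},
    {"id": "circuit-B", "trip_kw": 450.0, "expected_load_kw": 400.0},
]
print(find_misconfigured(inventory))
```

Running such a check as part of regular testing, rather than only after an incident, is the point of the process change quoted above.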
