Amazon traces cloud outage to faulty breaker

Amazon Web Services has released details about the root cause of an outage in one of its public cloud’s availability zones, which started on the evening of 14 June and lasted until the next morning, US Pacific time.

In a note posted on the cloud’s status dashboard, the company said the outage was caused by a cable fault in the power distribution system of the electric utility that served the data center hosting the US-East-1 region of the cloud in northern Virginia.

The entire facility was switched over to back-up generator power, but one of the generators overheated and shut down because of a defective cooling fan. The virtual-machine instances and virtual-storage volumes powered by this generator were transferred to a secondary back-up power system, provided by a separate power-distribution circuit with its own back-up generator capacity.

But one of the breakers on this back-up circuit was configured incorrectly and opened as soon as the load was transferred to it. The breaker had been set to open at too low a power threshold.

“After this circuit breaker opened … the affected instances and volumes were left without primary, back-up, or secondary back-up power,” Amazon’s note read.

Customers in this availability zone that were running multi-availability-zone configurations “avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.”

Among the customers that suffered downtime were a number of popular web services, including Pinterest, Heroku, Quora and Foursquare, according to news reports. Heroku, for example, reported widespread outages of its production and development infrastructures that lasted for eight hours.

The faulty circuit breaker opened around 9pm, Amazon said, and the failed generator was restarted around 10:20pm. Most affected VM instances recovered by 10:50pm, and most cloud-storage volumes were “returned to customers” by about 1am.

Amazon said it had completed an audit of all of its back-up power distribution circuits and found another breaker that “needed corrective action.”

“We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes,” Amazon said.
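Amazon's note does not describe how the breaker audit was carried out. Purely as an illustration, a configuration check of the kind it mentions could compare each breaker's trip threshold with the load its circuit is expected to carry after a transfer. The sketch below is an assumption from start to finish: the `Breaker` record, the `misconfigured` helper, the `margin` parameter and every figure are hypothetical and do not come from Amazon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Breaker:
    """Hypothetical record for one breaker on a back-up circuit."""
    name: str
    trip_threshold_kw: float  # load at which the breaker opens
    expected_load_kw: float   # load the circuit must carry after a transfer


def misconfigured(breakers, margin=1.0):
    """Return breakers whose trip threshold is below expected load times margin."""
    return [
        b for b in breakers
        if b.trip_threshold_kw < b.expected_load_kw * margin
    ]


# Illustrative values only; none of these figures come from Amazon's note.
breakers = [
    Breaker("circuit-a", trip_threshold_kw=500.0, expected_load_kw=400.0),
    Breaker("circuit-b", trip_threshold_kw=300.0, expected_load_kw=400.0),
]

flagged = [b.name for b in misconfigured(breakers)]
print(flagged)
```

In this made-up data, only the breaker that would open below its expected load is flagged. Raising `margin` above 1.0 would also flag breakers that leave too little headroom.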
