
(Updated with play-by-play summary of events and comments from App Engine engineering director)

Google returned its cloud-based application-platform service to normal operation at 11:45am PDT on Friday, after an outage that lasted more than four hours and caused about half of all application requests to fail.

When Google App Engine, the company’s Platform-as-a-Service (PaaS) offering, went down, it affected a multitude of customers.

The outage at first affected nearly all App Engine services and users, according to an earlier emailed update from the App Engine engineering team. “We currently show that a majority of App Engine users and services are affected,” the update read.

Online storage service Dropbox and blogging platform Tumblr also experienced outages; however, there was no indication that the three companies' downtime incidents were connected.

As of 10:50am PDT, App Engine and Tumblr were still down (Google reported 55.9 percent App Engine availability), while the landing page on Dropbox's website was online. Tumblr's homepage and Dropbox's service status page were unreachable.

"The malfunction appears to be limited to a single component which routes requests from users to the application instance they are using, and does not affect the application instances themselves," Google said in an update posted around 10:50am.

In a blog entry posted late in the afternoon, the App Engine team explained more about what had happened that morning. Here's their summary of events:

  • 4:00 am - Load begins increasing on traffic routers in one of the App Engine datacenters.
  • 6:10 am - The load on traffic routers in the affected datacenter passes our paging threshold.
  • 6:30 am - We begin a global restart of the traffic routers to address the load in the affected datacenter.
  • 7:30 am - The global restart plus additional load unexpectedly reduces the count of healthy traffic routers below the minimum required for reliable operation. This causes overload in the remaining traffic routers, spreading to all App Engine datacenters. Applications begin consistently experiencing elevated error rates and latencies.
  • 8:28 am - google-appengine-downtime-notify@googlegroups.com is updated with notification that we are aware of the incident and working to repair it.
  • 11:10 am - We determine that App Engine’s traffic routers are trapped in a cascading failure, and that we have no option other than to perform a full restart with gradual traffic ramp-up to return to service.
  • 11:45 am - Traffic ramp-up completes, and App Engine returns to normal operation.
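The dynamic behind the 7:30am entry above — each traffic router that drops out shifts its load onto the remaining healthy ones, which can push them over capacity in turn — can be illustrated with a minimal sketch. The numbers and the function are hypothetical, not Google's actual routing code:

```python
def simulate_cascade(num_routers, capacity_per_router, total_load):
    """Return how many routers remain healthy once the system stabilizes.

    Assumes load is spread evenly across healthy routers, and that any
    router pushed past its capacity fails, redistributing its share.
    """
    healthy = num_routers
    while healthy > 0:
        load_per_router = total_load / healthy
        if load_per_router <= capacity_per_router:
            return healthy  # load is sustainable; the cascade stops here
        healthy -= 1  # an overloaded router drops out, shifting load to the rest
    # Every router has failed: only a full restart with gradual
    # traffic ramp-up (as Google performed) can restore service.
    return 0


# With headroom, all routers stay up; once total load exceeds the fleet's
# combined capacity, the failure cascades all the way to zero.
print(simulate_cascade(10, 100, 800))   # sustainable load
print(simulate_cascade(10, 100, 1100))  # cascading failure
```

In this toy model the tipping point is sharp: the fleet either settles at full strength or collapses entirely, which is why Google's remedy was to add routing capacity, i.e., raise the combined-capacity ceiling.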

The company said it had increased its routing capacity in response to the incident, to make such cascading failures less likely in the future. No client data was lost, and applications returned to normal operation without developers' intervention, Peter Magnusson, App Engine's engineering director, wrote in the blog post.

"There is no need to make any code or configuration changes to your applications."

To compensate customers for the breached Service Level Agreement, Google will give every paid application a 10 percent discount on its usage bill for October.

This was the week's second major cloud-service outage. On Monday, Amazon Web Services, which provides the leading Infrastructure-as-a-Service offering, widely used by some of the most popular web properties, experienced problems at its northern Virginia data center, bringing down websites such as Reddit and Netflix, among others.