On Monday, Google Cloud’s automated failover system caused two applications to go down for nearly two hours, between 12:33 and 14:23 PST.
The issues originated with Memcache, an App Engine service that speeds up application performance by caching the results of frequently accessed Datastore queries.
Best left alone
As explained on the Google Cloud Status Dashboard, the problems began when, during a configuration update, the global database that specifies data center availability within Memcache itself became unavailable. An automated safeguard, whereby “the configuration is considered invalid if it cannot be refreshed within 20 seconds,” then brought down Memcache as well.
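The fail-closed behavior described above can be sketched in a few lines. This is a hypothetical illustration, not Google's actual code: `fetch_config_from_global_db` and `refresh_config` are invented names, and the only detail taken from the source is the 20-second refresh deadline after which the cached configuration is treated as invalid.

```python
import time

REFRESH_DEADLINE_SECONDS = 20  # per the status dashboard post-mortem


def fetch_config_from_global_db():
    """Hypothetical stand-in for querying the global availability database.
    Here it simulates the outage by always timing out."""
    raise TimeoutError("global database unavailable")


def refresh_config():
    """Return the refreshed config, or None if it could not be refreshed
    within the deadline. Note the fail-closed design: a config that cannot
    be refreshed in time is treated as invalid, not merely stale, so the
    service stops serving rather than run on old data."""
    start = time.monotonic()
    try:
        new_config = fetch_config_from_global_db()
        if time.monotonic() - start <= REFRESH_DEADLINE_SECONDS:
            return new_config
    except TimeoutError:
        pass
    return None  # invalid -> Memcache takes itself down


print(refresh_config())  # None while the global database is unreachable
```

The trade-off this illustrates is that fail-closed logic protects against serving stale availability data, but couples the service's liveness to the database it refreshes from.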
This sent a surge of traffic to another service, Datastore, causing elevated latency and errors, followed by a second surge when Memcache came back online.
Customers using Managed VMs also reportedly experienced failures for their HTTP and App Engine API requests, but those using the updated version of the service, Flexible Environment, did not report any issues.
Responding to an automated alert, Google engineers attempted to revert the most recent changes to the application’s configuration file, but the rollback failed: it required updating the configuration in the global database, which was still unavailable. They eventually succeeded by sending the update request with a longer deadline.
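The eventual fix, resending the write with a longer deadline, can be sketched as a retry loop that escalates the timeout on each attempt. Everything here is illustrative: `send_config_update`, the deadline values, and the 60-second success threshold are assumptions, not details from Google's incident report.

```python
def send_config_update(payload, deadline_seconds):
    """Hypothetical stand-in for the RPC that writes the configuration to
    the global database. For illustration, it is assumed to succeed only
    when given at least 60 seconds."""
    if deadline_seconds < 60:
        raise TimeoutError(f"deadline of {deadline_seconds}s exceeded")
    return "ok"


def update_with_escalating_deadline(payload, deadlines=(20, 60, 120)):
    """Retry the update, extending the deadline on each attempt, mirroring
    the engineers' workaround of sending a request with a longer deadline."""
    for deadline in deadlines:
        try:
            return send_config_update(payload, deadline)
        except TimeoutError:
            continue  # too short; try again with a longer deadline
    raise RuntimeError("configuration update failed at every deadline")


print(update_with_escalating_deadline({"datacenters": []}))  # ok
```

The first attempt at the default 20-second deadline fails, just as the engineers' initial rollback did; only the longer deadline lets the write complete.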
This is not the first time in recent months that Google has suffered from self-inflicted grief. In August, Google Cloud experienced 19 hours of load balancer issues, which were only resolved when technicians rolled back changes previously made to its configuration.