Google App Engine stutters
Google has apologized after its US-Central cloud infrastructure region suffered errors and high latency for nearly two hours.
The problem arose when traffic routers underwent a software update at the same time as applications were being moved between data centers.
Explaining the incident on the Google Cloud status page, the company said: “The incident was triggered by a periodic maintenance procedure in which Google engineers move App Engine applications between datacenters in US-CENTRAL in order to balance traffic more evenly.
“As part of this procedure, we first move a proportion of apps to a new datacenter in which capacity has already been provisioned. We then gracefully drain traffic from an equivalent proportion of servers in the downsized datacenter in order to reclaim resources. The applications running on the drained servers are automatically rescheduled onto different servers.
“During this procedure, a software update on the traffic routers was also in progress, and this update triggered a rolling restart of the traffic routers. This temporarily diminished the available router capacity.”
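The capacity squeeze Google describes can be illustrated with a little arithmetic. The sketch below is not Google's tooling: the function names, fleet size, per-router throughput and load figures are all hypothetical, chosen only to show how a rolling restart that is harmless on its own can tip into overload once extra rescheduling traffic arrives.

```python
def available_capacity(total_routers: int, restarting: int, per_router_rps: float) -> float:
    """Requests per second the router fleet can serve while some routers restart."""
    if not 0 <= restarting <= total_routers:
        raise ValueError("restarting must be between 0 and total_routers")
    return (total_routers - restarting) * per_router_rps


def is_overloaded(load_rps: float, total_routers: int, restarting: int,
                  per_router_rps: float) -> bool:
    """True when offered load exceeds what the routers still in service can handle."""
    return load_rps > available_capacity(total_routers, restarting, per_router_rps)


# Hypothetical fleet: 10 routers at 1,000 requests/s each, restarted 2 at a time.
normal_load = 7_000.0        # assumed steady-state traffic
rescheduling_load = 1_500.0  # assumed extra requests from apps being rescheduled

print(is_overloaded(normal_load, 10, 0, 1_000.0))                      # False: no restart
print(is_overloaded(normal_load, 10, 2, 1_000.0))                      # False: restart alone
print(is_overloaded(normal_load + rescheduling_load, 10, 2, 1_000.0))  # True: both at once
```

With these made-up numbers, neither the restart nor the rescheduling exceeds capacity alone; only the overlap does, which matches the shape of the failure described in the status report.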
Some 21 percent of Google App Engine applications hosted in the US-Central region saw error rates above ten percent, while another 16 percent saw lower error rates. The issue lasted from 13:13 to 15:00 PDT.
Google said that App Engine automatically redirected requests to other data centers to reduce the overload, and that engineers manually redirected traffic at 13:56. Fixing a configuration error that caused an imbalance of traffic in the other data centers fully resolved the incident.
The company said that it has added more traffic routing capacity in order to prevent similar issues from occurring in the future. “We will also change how applications are rescheduled so that the traffic routers are not called and also modify the system’s retry behavior so that it cannot trigger this type of failure.
“We know that you rely on our infrastructure to run your important workloads and that this incident does not meet our bar for reliability. For that we apologize.”