Google’s Compute Engine cloud service was more or less out of action during an incident lasting around two hours and 40 minutes on Thursday 19 February, thanks to a failure of its routing tables. The search giant has applied a fix that should prevent a repeat performance while it pins down the underlying cause.
From late on Wednesday (PST), the Google Compute Engine virtual network stopped updating routing information. Outgoing traffic gradually declined and the majority of the service was down early on Thursday, as the system’s cached routes expired. The service is running normally as we write, and Google has extended the lifetime of the cache to stop the event happening again.
Google apology
“We consider GCE’s availability over the last 24 hours to be unacceptable, and we apologise if your service was affected by this outage,” said a Google Cloud Platform statement. “Today we are completely focused on addressing the incident and its root causes, so that this problem or other hypothetical similar problems cannot recur in the future.”
Google says the Compute Engine instances all continued running; they just rapidly lost the ability to talk to anything outside their own private networks. The problem started when outbound traffic from Google Compute Engine instances dropped by ten percent late on Wednesday. The loss grew rapidly until 70 percent of outbound traffic was down, a state that lasted around 40 minutes and straddled midnight, Pacific time. Normal traffic had resumed by 01:20, Google says.
The root cause is apparently still unknown. Google says “the internal software system which programs GCE’s virtual network for VM egress traffic stopped issuing updated routing information,” but the cause of the problem is “still under active investigation”.
The Compute Engine virtual machines could still communicate using cached route information, but this gradually decayed as the cache entries expired. Google engineers spotted the problem, and decided that reloading the route information would fix it. “They were able to force a reload to fix the networking approximately 60 minutes after the issue was identified and well before all entries had expired,” says Google.
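Google hasn’t published the internals of the routing subsystem, but the behaviour it describes, cached routes that go stale once the control plane stops issuing updates, plus an operator-forced reload that repopulates them, can be sketched roughly as follows. The names here (RouteCache, force_reload) and the six-hour default lifespan are illustrative assumptions, not Google’s actual code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

def _now() -> datetime:
    return datetime.now(timezone.utc)

@dataclass
class RouteEntry:
    next_hop: str
    expires_at: datetime

@dataclass
class RouteCache:
    ttl: timedelta = timedelta(hours=6)                  # assumed original lifespan
    routes: dict[str, RouteEntry] = field(default_factory=dict)

    def update(self, prefix: str, next_hop: str) -> None:
        """Called whenever the control plane issues fresh routing information."""
        self.routes[prefix] = RouteEntry(next_hop, _now() + self.ttl)

    def lookup(self, prefix: str) -> str | None:
        """Return the cached next hop, or None once the entry has expired."""
        entry = self.routes.get(prefix)
        if entry is None or entry.expires_at <= _now():
            return None                                  # egress for this prefix is now lost
        return entry.next_hop

    def force_reload(self, fresh_routes: dict[str, str]) -> None:
        """Operator-triggered repopulation, the kind of fix Google describes."""
        for prefix, next_hop in fresh_routes.items():
            self.update(prefix, next_hop)
```

If update() stops being called, lookup() keeps answering from the cache until each entry’s expiry time passes, which matches the gradual decline Google describes; force_reload() stamps fresh expiry times on every route.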
Google won’t know why the routes stopped updating, or whether it could happen again, until it pins down the cause. But we can be pretty sure any repeat won’t have the same impact, because Google has already applied an admirably quick, simple and obvious fix: the routing cache lifespan has been increased from several hours to a week, which gives Google enough time to push out new route information by hand.
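In terms of the sketch above, that mitigation amounts to constructing the cache with a much longer lifespan; the week-long figure is the one Google has given, while the hours-long original is an assumption.

```python
# Reusing the hypothetical RouteCache sketched above:
old_cache = RouteCache(ttl=timedelta(hours=6))    # assumed original: entries stale within hours
new_cache = RouteCache(ttl=timedelta(weeks=1))    # after the fix: a week to push routes by hand
```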
Google has previously been one of the most reliable cloud services, according to CloudHarmony, with one hour’s downtime in the last twelve months.
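For scale, CloudHarmony’s figure of an hour of downtime over twelve months works out at roughly 99.99 percent availability; the arithmetic is simple enough to show.

```python
hours_per_year = 365 * 24                  # 8,760 hours
downtime_hours = 1                         # CloudHarmony's figure for the past twelve months
availability = 100 * (1 - downtime_hours / hours_per_year)
print(f"{availability:.3f}%")              # 99.989%
```

Thursday’s incident alone, at around two hours and 40 minutes, would knock a further 0.03 percentage points or so off that figure.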
Limited outages are actually relatively common, according to application monitoring firm Dynatrace, which runs an Outage Analyzer showing cloud outages in real time, and which is keen to warn business users that no cloud contract makes a service immune to failure.
“This is a prime example of the role third-parties play in businesses’ digital strategies, and the vulnerabilities they face if they do not have the ability to detect and respond immediately to ensure their end users are not impacted,” said David Jones, Dynatrace’s digital performance expert. Jones suggests users should run their own monitoring to check that cloud providers are meeting their service level agreements (SLAs).
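Jones doesn’t name any particular tooling, but the principle is straightforward: probe the service yourself, compute availability from your own measurements, and compare the result with the contracted figure. A minimal sketch of that loop, in which the endpoint, probe interval and SLA target are all hypothetical:

```python
import time
import urllib.request

SLA_TARGET = 99.95                            # hypothetical contractual availability, percent
PROBE_URL = "https://example.com/healthz"     # hypothetical endpoint to monitor
INTERVAL_SECONDS = 60

def probe_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except Exception:
        return False

probes = 0
successes = 0
while True:
    probes += 1
    if probe_once(PROBE_URL):
        successes += 1
    measured = 100 * successes / probes
    if measured < SLA_TARGET:
        print(f"Availability {measured:.3f}% is below the {SLA_TARGET}% SLA")
    time.sleep(INTERVAL_SECONDS)
```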