Google has issued an apology and post-mortem after a series of software errors took its Compute Engine offline worldwide for 18 minutes earlier this week.
The outage, which began shortly after 19:00 Pacific Time on Monday, was caused by bad IP configuration data propagating across Google’s network of data centers. The company retired a block of IP addresses used by Google Compute Engine (GCE), but the process backfired, creating an inconsistent network configuration. A previously unknown bug prevented the system from rolling back to a good configuration and instead retired all of GCE’s IP blocks, a configuration that eventually spread across Google’s infrastructure and made GCE unreachable.
“No risk of recurrence”
Google is giving customers credits higher than those specified in its service level agreements: Google Compute Engine users will get 10 percent off their bills, and Google Cloud VPN users will get 25 percent off. Google App Engine, Google Cloud Storage, and other cloud products were not affected.
“We recognize the severity of this outage, and we apologize to all of our customers for allowing it to occur,” said Benjamin Treynor Sloss, VP of 24x7 (yes, that’s his job title) at Google, adding that the cause was fully understood and GCE is “not at risk of a recurrence”, before giving the gory details.
Google propagates changes to its IP address configuration using the Internet’s BGP protocol so that its services can be reached by other Internet users. At 14:50 PT, Google retired an unused block of IP addresses from GCE and set the propagation off. But a timing “quirk” meant the change never reached a second configuration file, and the update process saw the two as inconsistent.
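Google has not published the tooling involved, but the failure mode is easy to picture: two copies of the IP-block configuration that must match before propagation continues. The sketch below illustrates that kind of consistency check; the file names and JSON format are assumptions for illustration, not Google’s actual system.

```python
# Illustrative sketch only: verify that two copies of an IP-block
# configuration agree before propagating a change any further.
import json


def load_blocks(path: str) -> set[str]:
    """Read a JSON list of CIDR blocks from a config file."""
    with open(path) as f:
        return set(json.load(f))


def configs_consistent(primary_path: str, secondary_path: str) -> bool:
    """Propagation should only continue if both copies match."""
    return load_blocks(primary_path) == load_blocks(secondary_path)


if __name__ == "__main__":
    # In the incident, a timing quirk left the second file un-updated,
    # so a check like this would report an inconsistency.
    if not configs_consistent("gce_blocks_primary.json", "gce_blocks_secondary.json"):
        raise SystemExit("IP block configs are inconsistent; aborting propagation")
    print("Configs consistent; safe to propagate")
```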
At that point things went further wrong because of a “previously-unseen software bug”. The network management software should have aborted the change and kept the previous known good configuration; instead it started removing all GCE IP blocks and pushing that configuration to the network.
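The intended fail-safe is simple to express: if the proposed configuration cannot be validated, keep the last known good one rather than pushing anything new. A minimal sketch of that logic, with purely illustrative names, might look like this:

```python
# Illustrative sketch of the intended fail-safe: never push a configuration
# that strips every GCE block; fall back to the last known good one instead.
from typing import Iterable, List


def validate(blocks: Iterable[str]) -> bool:
    """A configuration that removes every GCE block is never acceptable."""
    return len(list(blocks)) > 0


def next_config(proposed: List[str], last_known_good: List[str]) -> List[str]:
    """Return the configuration to push to the network."""
    if validate(proposed):
        return proposed
    # This is the branch the buggy management software effectively skipped:
    # instead of keeping the old configuration, it began removing all blocks.
    return last_known_good
```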
Even at this stage, the system should have been able to block the bad update, thanks to a “canary” procedure which tests it on a single site first. That site correctly raised the alarm, but a “second software bug” ignored the warning and set off the full rollout of the update. One by one, Google sites removed all reference to GCE IP addresses.
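A canary rollout of this kind is straightforward to outline. The sketch below is an illustrative shape, not Google’s pipeline: the change goes to one site first, and the rollout halts if that site complains, which is exactly the step the second bug skipped.

```python
# Illustrative sketch of canary gating: push to one site, check its health,
# and halt the rollout if the canary raises an alarm.
from typing import Callable, List


def rollout(sites: List[str],
            push: Callable[[str], None],
            canary_healthy: Callable[[str], bool]) -> None:
    canary, rest = sites[0], sites[1:]
    push(canary)
    if not canary_healthy(canary):
        # The correct behaviour: stop here and page an engineer.
        raise RuntimeError(f"Canary site {canary} rejected the change; rollout halted")
    for site in rest:
        push(site)
```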
Google only spotted the problem at 18:14, when GCE response times rose because more users were being routed to the remaining sites, which were farther from the end users. At the same time, Cloud VPN failed in the asia-east1 region. Engineers could not immediately determine what was wrong, but took the decision to roll back to a known good configuration. Although the service went offline, it was back up in 18 minutes.
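The signal that finally exposed the problem was latency creeping up as traffic concentrated on ever fewer, more distant sites. A monitoring rule of roughly this shape (the metric name and threshold are assumptions for illustration) would trip under those conditions:

```python
# Illustrative sketch: flag a latency regression when median response time
# climbs well above its baseline, as happened while sites dropped GCE routes.
def latency_regression(current_p50_ms: float, baseline_p50_ms: float,
                       tolerance: float = 1.5) -> bool:
    """Return True when median latency exceeds 1.5x its baseline."""
    return current_p50_ms > baseline_p50_ms * tolerance


# Example: doubling of median latency trips the alert.
assert latency_regression(current_p50_ms=120.0, baseline_p50_ms=60.0)
```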
Engineering changes
For the future, Sloss promised that Google will make 14 separate engineering changes to fix the system of safeguards, including direct monitoring for any decrease in capacity; further changes could follow.
“This incident report is both longer and more detailed than usual precisely because we consider the April 11th event so important, and we want you to understand why it happened and what we are doing about it,” concludes Sloss. “It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services, and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform.”