Google Cloud experienced widespread issues on Sunday, June 2, impacting the search giant's own services, as well as those of its cloud customers.
The intermittent outage, which has since been resolved, was blamed on "high levels of network congestion."
Networking not working
Google services like YouTube, Nest and Gmail, as well as Cloud customers like Snapchat, Shopify, Vimeo and Discord, were impacted by the problem, which began around 12:15 Pacific time.
"We are experiencing high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, G Suite and YouTube. Users may see slow performance or intermittent errors," the company said on its status page at the time.
While the congestion occurred in the US, its impact was felt globally, and it was described by network monitoring company ThousandEyes as a "large scale" outage.
The problem was resolved as of 4:00pm Pacific time, with Google promising to "conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a detailed report of this incident once we have completed our internal investigation. This detailed report will contain information regarding SLA credits."
In a statement, the company apologized for the inconvenience and thanked customers for their "patience and continued support." It added: "Please rest assured that system reliability is a top priority at Google, and we are making continuous improvements to make our systems better."
The outage, which among other things meant that Nest users could not control their thermostats, comes after several major disruptions have hit the largest cloud companies in recent years, highlighting the difficulty of building a resilient service, even with enormous resources.
Update: Google's VP of 24x7, Benjamin Treynor Sloss, said in a blog post: "In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.
"Google’s engineering teams detected the issue within seconds, but diagnosis and correction took far longer than our target of a few minutes. Once alerted, engineering teams quickly identified the cause of the network congestion, but the same network congestion which was creating service degradation also slowed the engineering teams’ ability to restore the correct configurations, prolonging the outage. The Google teams were keenly aware that every minute which passed represented another minute of user impact, and brought on additional help to parallelize restoration efforts."
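The triage behavior Sloss describes, dropping larger, less latency-sensitive traffic so that small latency-sensitive flows still get through a constricted link, can be sketched roughly as follows. This is a simplified illustration only, not Google's actual networking code; the flow names, sizes and capacity figures are invented for the example.

```python
# Illustrative sketch of congestion triage: when offered traffic exceeds
# the remaining link capacity, prefer small, latency-sensitive flows and
# drop larger, less latency-sensitive ones. (Hypothetical model, not
# Google's implementation.)

from dataclasses import dataclass


@dataclass
class Flow:
    name: str
    size: int               # bandwidth demand, arbitrary units
    latency_sensitive: bool


def triage(flows, capacity):
    """Admit flows in priority order until the link is full.

    Latency-sensitive flows are considered first; within each class,
    smaller flows are preferred. Flows that do not fit are dropped.
    """
    admitted, dropped, used = [], [], 0
    # Sort key: latency-sensitive flows first (False sorts before True),
    # then by ascending size.
    for f in sorted(flows, key=lambda f: (not f.latency_sensitive, f.size)):
        if used + f.size <= capacity:
            admitted.append(f)
            used += f.size
        else:
            dropped.append(f)
    return admitted, dropped


if __name__ == "__main__":
    flows = [
        Flow("video-upload", 60, False),
        Flow("api-call", 5, True),
        Flow("bulk-sync", 40, False),
        Flow("dns-lookup", 1, True),
    ]
    # The misconfiguration left the regions with less than half their
    # usual capacity, so the offered load no longer fits.
    admitted, dropped = triage(flows, capacity=50)
    print("admitted:", [f.name for f in admitted])
    print("dropped: ", [f.name for f in dropped])
```

In this toy run the small latency-sensitive flows are preserved while the largest bulk transfer is shed, which matches the "urgent packages couriered by bicycle through a traffic jam" analogy in the post.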