Network peer takes on too much and Google cloud loses connectivity
Last week, Google Compute Engine lost the ability to connect to several Internet destinations, because a network carrier bit off more than it could chew.
The 70-minute network glitch happened on Monday 23 November, and meant that Google Compute Engine instances in the cloud giant’s europe-west1 region could not reach “a subset of Internet destinations”, mostly in the Middle East and Eastern Europe, according to a report on the Google Cloud Platform blog. Traffic volume in the region fell by 13 percent before the problem was fixed.
Google’s St Ghislain site in Belgium runs europe-west1-b
Source: Google / Wikimapia
On Monday morning, Google engineers switched on a new link to a network carrier that was already a Google peer. This time, however, the new peering arrangement was set up wrongly, and signalled that it could route traffic to many more destinations than Google engineers had anticipated.
Normally, automatic safety checks would have spotted the likely trouble and fixed it by allocating only as much traffic as the link could handle. In this instance, however, the link was added manually, and the engineers did not spot the problem. Automatic congestion checks do not kick in until an hour after a link comes online.
The result was that network performance was hit for an hour, plus the nine minutes or so it took Google engineers to fix the issue.
To prevent a recurrence, Google will no longer allow manual link activation.
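To make the role of the skipped safety check concrete, here is a minimal sketch of the idea of allocating only as much traffic as a link can handle. It is an illustration only, not Google's implementation: the function names and the figures (50 Gbps of demand, a 10 Gbps link) are hypothetical and do not come from the incident report.

```python
def allocate_traffic(demand_gbps, link_capacity_gbps):
    """Cap the traffic sent over a new link at its capacity.

    Returns (traffic_on_link, traffic_left_on_other_paths).
    Hypothetical helper, for illustration only.
    """
    on_link = min(demand_gbps, link_capacity_gbps)
    return on_link, demand_gbps - on_link


def unchecked_overload(demand_gbps, link_capacity_gbps):
    """Traffic exceeding the link's capacity when no cap is applied."""
    return max(0.0, demand_gbps - link_capacity_gbps)


# Hypothetical figures: the peer appears to offer routes for 50 Gbps
# of traffic, but the link can only carry 10 Gbps.
with_check = allocate_traffic(50.0, 10.0)
without_check = unchecked_overload(50.0, 10.0)
```

With the cap in place, only the 10 Gbps the link can carry is moved onto it and the rest stays on existing paths. Without the cap, all of the demand is steered to the link and the excess is congested or dropped.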
Google issued an apology: “This is not the level of quality and reliability we strive to offer you, and we have taken and are taking immediate steps to improve the platform’s performance and availability.”
Mis-advertised routing is a recurrent problem in international networks, according to a report on The Register.
The europe-west1-b data center at St Ghislain in Belgium, within the same zone, suffered persistent disk storage trouble in August after a lightning strike.