A subset of Google Cloud's Network Load Balancers has suffered connectivity issues in the us-east1, us-central1, europe-west1, asia-northeast1 and asia-east1 regions.

The problem, which left the network load balancers unable to connect to their backends, lasted some 19 hours before it was fully addressed.

Google Cloud loading – DCD/Sebastian Moss

Unload

Google first acknowledged the issue at 00:52 Pacific time (the incident was first reported by The Register), and by 06:00 said it had “determined the infrastructure component responsible for the issue and mitigation work is currently underway.”

An hour later the company added: “Our previous actions did not resolve the issue. We are pursuing alternative solutions.”

Fast forward another hour and a half, and Google said it had “identified the event that triggers this issue and [was] rolling back a configuration change to mitigate this issue.”

After further tweaks, the company said that by 10:30 no new instances would have the problem, but that existing instances were still affected.

Google gave instructions to affected users, which were quite complex: “Create a new TargetPool. Add the affected VMs in a region to the new TargetPool. Wait for the VMs to start working in their existing load balancer configuration. Delete the new TargetPool. DO NOT delete the existing load balancer config, including the old target pool. It is not necessary to create a new ForwardingRule.”
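Google's instructions did not specify a tool, and the same steps could be carried out with the gcloud CLI or the Cloud Console. For illustration, the sketch below walks through that workaround using the Compute Engine v1 API via the google-api-python-client library; the project, region, zone, pool and instance names are placeholders, and error handling and polling of the returned operations are omitted.

```python
# Minimal sketch of Google's described workaround, assuming placeholder
# project/region/zone/VM names and application default credentials.
from googleapiclient import discovery

PROJECT = "my-project"           # placeholder project ID
REGION = "us-central1"           # one of the affected regions
ZONE = "us-central1-b"           # zone of the affected VMs
TEMP_POOL = "temp-recovery-pool"
AFFECTED_VMS = ["vm-1", "vm-2"]  # VMs behind the misbehaving load balancer

compute = discovery.build("compute", "v1")

# 1. Create a new, temporary TargetPool in the affected region.
compute.targetPools().insert(
    project=PROJECT,
    region=REGION,
    body={"name": TEMP_POOL},
).execute()

# 2. Add the affected VMs to the new TargetPool.
instance_refs = [
    {"instance": f"https://www.googleapis.com/compute/v1/projects/{PROJECT}"
                 f"/zones/{ZONE}/instances/{vm}"}
    for vm in AFFECTED_VMS
]
compute.targetPools().addInstance(
    project=PROJECT,
    region=REGION,
    targetPool=TEMP_POOL,
    body={"instances": instance_refs},
).execute()

# 3. Wait (checked out of band) until the VMs start serving through their
#    existing load balancer configuration again, then delete the temp pool.
compute.targetPools().delete(
    project=PROJECT,
    region=REGION,
    targetPool=TEMP_POOL,
).execute()

# Per Google's instructions, the existing load balancer configuration -
# including the old target pool and forwarding rule - is left untouched.
```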

Finally, some time later, Google said: “The issue with Network Load Balancers has been resolved for all affected projects as of 20:18 US/Pacific. We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a more detailed analysis of this incident once we have completed our internal investigation.”

It is unclear what the overall impact of the connectivity issues was on Google Cloud customers, although network architect Tim Armstrong at cloud-heating company Nerdalize expressed his discontent at the length of the problem. He added that he felt compensation was in order.

Armstrong told DCD that Nerdalize “use Google Cloud for some of our monitoring and development environments. We experienced a disruption to those services. We could work around it, but it was an inconvenience to say the least.”

The issue comes less than a week after Google accidentally caused widespread outages in Japan through a Border Gateway Protocol mistake. The company “leaked” a full routing table to Verizon, causing Japanese traffic to be sent to Google as if it provided transit services. It does not, so the misdirected traffic either filled links beyond their capacity or was dropped by access control lists.