A major network outage affecting disparate services such as WhatsApp, Reddit, CloudFlare and AWS on Monday was apparently caused by an engineer at TeliaSonera, who misconfigured a router and accidentally sent most of Europe’s traffic to Hong Kong, The Register reports.
The downtime started at 12:10 UTC when, according to a blog post on CloudFlare’s website, the Internet security and content delivery provider detected ‘massive’ packet loss on Telia’s network.
CloudFlare reported that a fix was implemented at 13:43 UTC, with the issue resolved by 14:22.
Data packets are the small units into which data is divided so it can be transported along a network path. In this case the transit provider, Telia, appeared to have dropped packets before they reached their destination.
The Register said the issue was caused by an individual engineer redirecting European traffic to Hong Kong.
During the outage, CloudFlare’s status page said the company was “observing network performance issues in some European locations.”
Such was the severity of the incident that, hours after it ended, TeliaSonera sent a note to other network operators apologizing for the downtime.
Telia reliability questioned
Swedish multinational TeliaSonera is a Tier 1 network provider operating its own global fiber backbone, delivering a foundation for the exchange of Internet traffic around the world. CloudFlare uses a number of transit providers, including TeliaSonera.
In response to the downtime, CloudFlare’s chief executive officer Matthew Prince tweeted: “Reliability of Telia over last 60 days unacceptable. Deprioritizing them until we are confident they’ve fixed their systemic issues.”
In its blog post, TeliaSonera said it has moved towards automation of its systems to help deal with this type of incident and minimize future outages.
Meanwhile, CloudFlare is working on a mechanism that proactively detects packet loss and moves traffic away from providers experiencing an outage. The system is currently active only in the company’s most remote locations, so it did not trigger during Monday’s incident, but the capability will be extended to all of its points of presence over the next fortnight.
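As a rough illustration of how loss-based rerouting can work in principle, the Python sketch below measures probe loss per transit provider and moves traffic off any provider above a threshold. It is not CloudFlare’s implementation: the provider names, the 5 percent threshold and the functions `healthy_providers` and `pick_provider` are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass

# Assumed threshold: treat a provider as unhealthy above 5% probe loss.
LOSS_THRESHOLD = 0.05


@dataclass
class ProbeResult:
    """Outcome of a batch of test packets sent via one transit provider."""
    provider: str
    sent: int
    received: int

    @property
    def loss_rate(self) -> float:
        # Fraction of probes that never arrived; 0.0 if nothing was sent.
        if self.sent == 0:
            return 0.0
        return (self.sent - self.received) / self.sent


def healthy_providers(results, threshold=LOSS_THRESHOLD):
    """Return the results whose measured loss is at or below the threshold."""
    return [r for r in results if r.loss_rate <= threshold]


def pick_provider(preferred, results, threshold=LOSS_THRESHOLD):
    """Keep the preferred provider while it is healthy, else fail over.

    Falls back to the healthy provider with the lowest loss. If no provider
    is healthy, the preferred one is kept, since there is nowhere better to go.
    """
    healthy = healthy_providers(results, threshold)
    if any(r.provider == preferred for r in healthy):
        return preferred
    if healthy:
        return min(healthy, key=lambda r: r.loss_rate).provider
    return preferred


# Hypothetical measurements: transit-a is dropping 38% of its probes.
results = [
    ProbeResult("transit-a", sent=1000, received=620),
    ProbeResult("transit-b", sent=1000, received=998),
    ProbeResult("transit-c", sent=1000, received=990),
]
chosen = pick_provider("transit-a", results)
print(chosen)
```

In this made-up scenario the preferred provider exceeds the threshold, so traffic shifts to the healthy provider with the least loss. A production system would also need to avoid flapping between providers, for example by requiring several consecutive bad measurements before moving traffic.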
AWS also reported that its services had been affected by interrupted connectivity.
“Between 5:10am and 6:01am PDT an external provider outside our network experienced an issue which impacted internet connectivity between some customer networks and the EU-WEST-1 Region,” said an AWS representative.
“Connectivity to instances and services in the region was not impacted by the event. The issue is resolved and the service is working normally.”