Netflix stood behind Amazon Web Services following what was one of the biggest outages for the movie streaming and rental company in more than a year.
While Netflix’s three-hour downtime on 29 June was triggered by a power outage that occurred amid a massive storm on the US east coast and knocked out an Amazon data center, Netflix’s root-cause analysis of the outage uncovered issues with the movie company’s own system, Greg Orzell and Ariel Tseitlin, technologists who oversee Netflix’s cloud set-up, wrote in a blog post.
Orzell and Tseitlin stood by their decision to move from the data center to the cloud several years ago. “While it’s easy and common to blame the cloud for outages because it’s outside of our control, we found that our overall availability over the past several years has steadily improved,” Orzell and Tseitlin wrote.
When they investigate the root causes of their biggest outages, they usually add resiliency patterns to mitigate similar disruptions in the future.
Following the latest outage, “our own root-cause analysis uncovered some interesting findings, including an edge-case in our internal mid-tier load-balancing service,” they wrote.
A feature in Netflix’s cloud infrastructure that was implemented to mitigate downtime caused by Amazon outages acted in an unexpected way, causing a cascading failure. The feature, which is part of the cloud load-balancing set-up, stops removing unhealthy instances when a lot of them fail at once.
It was implemented as a short-term freeze that could be lifted once a large-scale issue had been investigated. Following the Amazon outage, however, Netflix engineers found the freeze hard to lift.
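The behavior described above can be sketched in a few lines of Python. This is only an illustration, not Netflix’s code: the class `InstancePool`, the method names and the 50% `freeze_threshold` are all assumptions made for the example, and the post does not say what threshold the real service uses.

```python
from dataclasses import dataclass


@dataclass
class InstancePool:
    """Hypothetical load-balancer pool; names and threshold are assumptions."""

    instances: set[str]
    freeze_threshold: float = 0.5  # assumed fraction of failures that triggers the freeze
    frozen: bool = False

    def evict_unhealthy(self, unhealthy: set[str]) -> set[str]:
        """Remove unhealthy instances, unless too many are failing at once."""
        failing = self.instances & unhealthy
        if self.frozen:
            # While frozen, nothing is removed, even instances that are really dead.
            return set()
        if self.instances and len(failing) / len(self.instances) >= self.freeze_threshold:
            # Mass failure: assume the health signal may be wrong and freeze the pool.
            self.frozen = True
            return set()
        self.instances -= failing
        return failing

    def lift_freeze(self) -> None:
        """Manual step an operator takes once the large-scale issue is understood."""
        self.frozen = False
```

In this sketch a single failing instance is removed as usual, but once half the pool fails together the pool freezes and keeps routing to every instance until someone calls `lift_freeze()`. That protects against false alarms, but it also keeps dead servers in rotation when the failures are real.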
“Getting out of this state proved both cumbersome and time consuming, causing services to continue to try and use servers that were no longer alive due to the power outage,” Orzell and Tseitlin wrote.
This led to a second-order issue: clients trying to connect to servers that were no longer available took up all client threads, causing gridlock in most of the services.
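One common way to contain this kind of thread exhaustion is to cap how many calls to a single dependency may be in flight and to reject the rest immediately. The sketch below shows that general pattern only. The post does not say Netflix used it, and the `Bulkhead` class and its parameter are hypothetical names chosen for the example.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency (hypothetical helper)."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args):
        # Fail fast instead of waiting when every slot is already in use,
        # so a dead dependency cannot tie up all of the caller's threads.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn(*args)
        finally:
            self._slots.release()
```

With a cap in place, calls to an unreachable server can hold at most `max_concurrent` threads, and the remaining threads stay free to serve requests that do not depend on it. In practice this would be paired with short connection timeouts so that the occupied slots are released quickly.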
“The state of the cloud will continue to mature and improve over time,” the Netflix engineers wrote. “We’re working closely with Amazon on ways that they can improve their systems, focusing our efforts on eliminating single points of failure that can cause region-wide outages and isolating the failures of individual zones.”