The British Airways IT meltdown last month is a curious story. Originally, reports suggested a maintenance worker had inadvertently switched off the power supply. But CBRE, the contractor that manages the facilities at BA’s aging data center near Heathrow Airport, said that reports that the cause had been determined were “not founded in fact”, according to this article in The Guardian. The piece says, “BA said its investigation was ongoing and the cause had not been determined.” Another article, from Sky News, says the calamity began when an engineer disconnected the UPS to the data center and then reconnected the servers in “an uncontrolled and uncommanded fashion”. As we all know by now, that took days to resolve. An independent commission has been appointed to study the matter.
BA has two data centers that are located a few miles apart, but there has been no word on what happened at the other data center. And BA certainly isn’t alone, even among airlines. Delta Air Lines suffered similarly in August 2016 when a switch box carrying power into the company’s headquarters failed, grounding flights worldwide. A single point of failure had also brought down systems at Southwest Airlines the previous month, although on that occasion the problem was in a network router.
Are you prepared for this?
This brings up the point: are you prepared for this kind of catastrophic failure? The old adage that no one got fired for having too many backups is certainly true. Clearly, BA (and Delta and Southwest) are now improving their backup procedures, and hopefully, others will learn from these experiences.
Power supply issues aren’t so obvious, despite one of the BA executives expressing wonderment as to why reconnecting power would cause problems. Large-scale servers require a great deal of care in how they are turned off and turned on: it isn’t like flipping a light switch. And because servers depend on other network resources, the order in which you turn things on is critical.
Any backup plan has to account for this, and apparently, BA’s was lacking in this area. You might also want to train your managers on what disaster recovery involves for your workplace, so your executives don’t look so foolish if this should happen to you.
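The point about power-on order can be made concrete with a small sketch. The service names and their dependencies below are invented for illustration and have nothing to do with BA's actual systems; the idea is simply that if you record which services need which others, a topological sort gives you a safe start-up sequence, and its reverse gives a reasonable shutdown sequence.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical services. Each one maps to the set of services
# that must already be running before it is started.
DEPENDENCIES = {
    "storage": set(),
    "network_core": set(),
    "dns": {"network_core"},
    "database": {"storage", "network_core"},
    "app_server": {"database", "dns"},
    "web_frontend": {"app_server"},
}


def power_on_order(dependencies):
    """Return a start-up order in which every service follows its prerequisites."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # A cycle means there is no safe order; the plan itself needs fixing.
        raise ValueError(f"circular dependency: {err.args[1]}") from err


def power_off_order(dependencies):
    """Shut down in reverse, so nothing loses a service it still relies on."""
    return list(reversed(power_on_order(dependencies)))


if __name__ == "__main__":
    print("power on: ", power_on_order(DEPENDENCIES))
    print("power off:", power_off_order(DEPENDENCIES))
```

A real runbook would also wait for each service to pass a health check before starting the next one, which this sketch leaves out.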
The trick is identifying your single points of failure, and often you don’t realize what they are until disaster strikes, no matter how much you plan and try to simulate a potential outage.
Single points of failure
I was reminded of a story that I wrote more than ten years ago. I visited the offices of Freddie Mac in suburban DC, doing this article for Network Computing about how they were moving from mainframe systems to IP-based applications.
Back then they used three Internet service providers. They thought three were enough, until there was a fire in one of the Baltimore highway tunnels underneath the harbor. Now, Baltimore is 50-some miles away from their offices, but it turns out two of Freddie Mac’s ISPs had data lines that passed through that tunnel as well, and the company lost some connectivity.
Since the fire, the IT staff at Freddie Mac asks to see the routing maps from any ISP it considers working with, though service providers have been more hesitant to supply this kind of information post-9/11. Too bad we can’t have their clout to get this information for our own enterprise connections. They also added a fourth line, making sure it goes through a different telco CO and uses another path.
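If you can get routing maps out of your providers, the check Freddie Mac learned the hard way is easy to automate. Here is a minimal sketch, assuming you have reduced each provider's route to a list of named physical segments. The provider names and segments below are made up, not Freddie Mac's real topology.

```python
from collections import defaultdict

# Hypothetical route data: provider -> physical segments its line passes through.
ROUTES = {
    "isp_a": ["office_co", "tunnel", "exchange_1"],
    "isp_b": ["office_co_2", "tunnel", "exchange_2"],
    "isp_c": ["office_co", "bridge", "exchange_3"],
}


def shared_segments(routes):
    """Map each segment used by more than one provider to those providers."""
    users = defaultdict(set)
    for provider, segments in routes.items():
        for segment in segments:
            users[segment].add(provider)
    return {seg: sorted(provs) for seg, provs in users.items() if len(provs) > 1}


def surviving_providers(routes, failed_segment):
    """List the providers still reachable if one segment is lost."""
    return sorted(p for p, segs in routes.items() if failed_segment not in segs)


if __name__ == "__main__":
    for segment, providers in shared_segments(ROUTES).items():
        left = surviving_providers(ROUTES, segment)
        print(f"{segment}: shared by {providers}, survivors if lost: {left}")
```

Any segment that shows up in the output is a place where your supposedly redundant lines are not redundant at all.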
It isn’t just about eliminating single points of failure, but also about adding multiple layers of protection. Freddie Mac has three firewalls to separate network traffic into various layers of protection. They have different security zones depending on the application, the user, and the context of the user. They have a standard DMZ (de-militarized zone), and behind that is a zone where applications are authorized to play. Behind that is another zone, where the data lives. Given the volume of financial transactions they handle (billions of dollars move through their networks daily), this makes a lot of sense.
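The layered-zone idea can be sketched as a simple rule: a connection may only be opened one zone inward, so nothing reaches the data without passing through every layer in front of it. The zone names and the one-hop rule here are my own simplification for illustration, not a description of Freddie Mac's actual firewall policy.

```python
# Hypothetical zones, ordered from least trusted to most protected.
ZONES = ["internet", "dmz", "application", "data"]


def is_allowed(source, destination):
    """Permit a new connection only if it goes exactly one zone inward."""
    return ZONES.index(destination) - ZONES.index(source) == 1


if __name__ == "__main__":
    for src, dst in [("internet", "dmz"), ("dmz", "application"), ("internet", "data")]:
        verdict = "allow" if is_allowed(src, dst) else "deny"
        print(f"{src} -> {dst}: {verdict}")
```

This models only who may initiate a connection. Replies on an established connection, and the per-user and per-application context mentioned above, are left out.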
This post is reprinted from David Strom’s curated Inside Security email newsletter. It covers a wide range of security topics interesting to enterprise IT managers. You can subscribe here.