The incident that brought down British Airways’ IT systems over the weekend, causing cancellations and delays affecting 75,000 passengers around the world, was due to a power surge in a London data center, according to the airline’s CEO, but the exact cause is yet to be determined.
BA chief Alex Cruz is leading the inquiry into the outage with the assistance of several power supply specialists. The airline has two data centers, Boadicea House and Comet House, with three halls in each, located close to Heathrow airport. The failure evidently resulted in both data centers going down, but details are still scarce on how this could have happened.
Human error the likely cause
“It’s likely BA are still figuring out what the causal factors were and the root cause was,” said Ed Ansett of i3 Solutions. “Invariably, this type of major event is due to at least two, often three coinciding events. I can’t recall any data center-related total system failures that were due to a single piece of equipment.”
The official line is that the failure happened in two stages, and had nothing to do with outsourced IT staff, nor was it due to a cyber attack. According to BA, the problems started when one of the UPS systems at Heathrow’s Boadicea House data center, which was powered by a combination of mains, battery and diesel, “was shut down”.
That ambiguous phrase could mean that the UPS failed to react properly to an outage or surge from the mains supply. Both local power provider SSE and the National Grid have denied there was a problem with the mains, but the truth may be different, said data center power expert Mike Foskett: “[BA’s facilities] will either be supplied by the 11kV Heathrow ring, which is notoriously unstable, or the new HV supply to T5. BA will have accurate power monitoring and would easily be able to prove if there was a surge or a dip in either voltage or frequency, how long it lasted for and when.”
Alternatively, the UPS may have failed, but that wouldn’t explain the meltdown: “UPSs fail from time to time, that’s a reality that has to be planned for and it’s one reason why redundant systems exist in the first place,” Ansett said. “In a large enterprise, the failure of a UPS system should have no effect on its systems.”
Even with a failed UPS, there should have been no outage, as BA has two data centers. The biggest question, according to Ansett, is why the IT services in the primary data center didn’t immediately fail over to the secondary data center.
Cruz’s answer so far is that power was brought back to Boadicea House. The IT equipment was powered up in an “uncontrolled fashion,” causing a surge and “catastrophic physical damage” to the airline’s servers, a source told The Telegraph. This makes little sense, as Boadicea should have been out of use and the services running from Comet House by then.
Cruz stated that damage to the communications hardware affected all 200 systems across the airline’s network, which normally sees “tens of millions of messages” shared on a daily basis.
The inquiry now seeks to determine whether the initial cause of the UPS failure was mechanical or human error, and why the IT services were not switched over to the secondary facility.
Foskett believes people were the cause: “It is more likely to have been human error, perhaps during a planned operation for some upgrade work.”
Analysts have estimated that the outage could cost BA’s parent company, International Airlines Group (IAG), £100 million ($128m) in refunds and compensation.
It is not the first time that an IT failure has caused disruption to the airline’s service: in 2016, issues with check-in systems caused widespread passenger delays in June, and again in September.
Similarly, outages have caused multiple disruptions for both United Airlines and Delta Air Lines throughout the past year, and just a few days ago, an IT crash in a UK data center brought down passenger processing systems in airports across Australia and New Zealand.