British Airways (BA) has admitted that human error caused the disastrous outage which grounded thousands of planes over a weekend last month. However, there is still no full explanation of what happened, and Willie Walsh, the CEO of BA’s parent company IAG, still appears confused about the sequence of events. Calls are mounting for an inquiry beyond the probe announced by BA.
At an industry conference in Mexico, Walsh said an engineer disconnected a power supply and then reconnected it minutes later, causing a power surge which damaged IT equipment, according to a report by the BBC. However, this story differs somewhat from a leaked internal email, and does nothing to explain why business continuity equipment and procedures did not protect against the disconnection or the reconnection.
“Difficult to understand”
Walsh himself expressed confusion, reportedly saying: “It’s very clear to me that you can make a mistake in disconnecting the power. It’s difficult for me to understand how to make a mistake in reconnecting the power.”
He said the engineer’s actions were unauthorized and that BA would learn from the mistake, telling reporters that the engineer was authorized to be in the data center, but was not authorized “to do what he did.”
However, BA has two data centers designed to provide continuous service for the airline’s operations, and the story does not explain why this setup, a common practice for data center operators and service providers, failed.
“The main question is: why didn’t the IT services in the primary data center immediately failover to their secondary data center?” asked business continuity expert Ed Ansett of i3 Solutions. This week, Ansett co-launched a new group, the Data Centre Infrastructure Incident Reporting Network (DCIRN), which offers a neutral forum for firms like BA to share data about data center failures and help the industry learn from them.
An internal email leaked to the media last week suggests that the mistake was made by a contractor doing maintenance work. The email said: “This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries… After a few minutes of this shutdown, it was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the systems and significantly exacerbated the problem.”
From all this, it appears that the blame for the failure has shifted away from the power grid (there was no mains outage or issue affecting both data centers) and the UPS systems, which appear to have been bypassed rather than to have failed.
As well as the big question about the lack of failover, the remaining questions DCD would like answered include whether the story about the contractor is true, and why and how it was even possible for an engineer working for BA or a contractor to carry out such an operation, authorized or not. Like everyone else, DCD is waiting for more information from BA.
IAG has commissioned an independent company to conduct an inquiry, and Walsh has promised to disclose details of the findings, which goes some way toward answering calls for BA to share details of what went wrong.
DCIRN was announced this week, and will be formally launched in August by the UK Data Centre Interest Group.