The airline industry has a pretty poor track record on IT services. This weekend a failure at BA grounded all the airline’s flights from Heathrow and Gatwick for a day. Last year, Delta Air Lines lost nearly $100 million because of a tech breakdown. So far, BA’s responses have stopped short of explaining how the meltdown occurred.

Airline failures probably get more publicity than other IT meltdowns. All the victims are in one place, and all the holidaymakers have smartphones with which to express their outrage. By comparison, a website crash that takes a government service down for a day wouldn’t have nearly the same emotional impact.

Evasive action

Airlines are also tight-lipped about the causes of their failures. BA’s CEO Alex Cruz first said there was a supply issue, and then blamed the failure on a power surge, coupled with a failure of the backup power system. He’s opened up a little since then, but is still issuing what is essentially a non-explanation, about as useful as railway companies complaining about leaves on the line.

Speaking to Sky News, Cruz said: “On Saturday morning around 9:30 there was indeed a power surge that had a catastrophic effect over some communications hardware which eventually affected the messaging across our systems.”

The communications failure propagated to some 200 systems on the BA network, and BA was unable to quickly restore things, because either the backup failed (according to some versions) or else the “backup systems could not trust the messaging that had to take place amongst them” (according to others).

The failure happened on systems in the UK, and not on services outsourced abroad.

This is all well and good, but it’s not a real explanation, as a power surge is not a bizarre event. It’s one that infrastructure should be prepared for. If you want to know how to prepare for such an event, we can recommend courses - Data Center Power Professional and Business Continuity Management, for instance - given by our colleagues in DCPRO. If a power surge took out crucial BA systems, then there was a flaw in the power design. And if BA’s IT services could not recover from that failure, then there was a flaw in BA’s business continuity management.

Crash register?

BA’s secrecy is not a surprise, of course. The details of an IT failure in the enterprise sector, like the details of the IT systems themselves, are regarded as commercially confidential. Any failure will be investigated thoroughly by a forensic IT investigator. But the results will almost certainly not be made public, so as not to embarrass BA.

As DCD associate Ed Ansett of i3 Consulting has pointed out, this secrecy ensures that any failure will happen again. Data centers are complex systems including technical and human elements, and a failure will result from a complex interaction between these parts of the system. Investigators can get to the bottom of the failure, but secrecy in the data center industry means that knowledge will never be shared. The investigators sign a non-disclosure agreement and their client fixes the fault. But every other data center in the world could then suffer the same problem, because the data is not shared.

To see this in the aviation industry is massively ironic. If an actual plane has an accident, the crash data is automatically shared, by a legal agreement between airlines. There is a full independent investigation of the cause of the accident, and the results are published. This is a mandatory requirement.

There’s no mandatory investigation of a big IT systems failure, because at present, there are no obvious deaths when the systems go down. As we become more dependent on IT services, this will change, says Ansett: “Over the course of the next few years, as society becomes more technology-dependent, it is entirely possible that failures will start to kill people.”

DCD is aware that large service providers such as Google and Amazon already publish quite detailed (and sometimes embarrassing) descriptions of the causes of failures in their services. Their services are in fact more reliable than many enterprise systems, and it’s possible that failures such as BA’s might add to the pressure to move out of in-house facilities into the cloud.

However, large enterprise users running their own IT will eventually have to publish crash reports for their IT failures. DCD is aware of moves behind the scenes to try to facilitate this, perhaps by sharing details anonymously, or privately amongst peers.

Right now, BA’s public image has suffered a catastrophic meltdown. Its customer service failed completely when faced with the loss of all IT services, and its lack of communication since has compounded the misery it caused its customers. The PR loss is comparable to the one United Airlines created for itself, without the excuse of an IT failure, when it used force to eject a ticketed passenger from a plane.

One way for BA to help turn that around might be to start to break open the secrecy around IT crashes. If BA could find or create a forum to share its experience for others to learn from, it might go some way towards redeeming its failings, and start to help others.