Visa has explained what led to a significant system outage in Europe at the beginning of the month, after the UK’s Treasury Committee asked the company to detail what happened.
The company’s European head Charlotte Hogg offered “unreserved” apologies for the outage, but said that during the disruption 91 percent of UK cardholders’ transactions were processed normally. Despite that claim, a number of retailers stopped processing card transactions during the outage and moved to a temporary cash-only model.
More money, more problems
Visa operates two active-active redundant data centers in the UK, each of which is meant to be able to independently handle 100 percent of the transactions for Visa in Europe.
“The centers communicate with each other through messages regarding the system status, in order to remain synchronized. Each center has built into it multiple forms of backup in equipment and controls. Specifically relevant to this incident, each data center includes two core switches (a piece of hardware that directs transactions for processing) - a primary switch and a secondary switch,” Hogg said in a letter to the Treasury Committee.
“If the primary switch fails, in normal operation the backup switch would take over. In this instance, a component within a switch in our primary data center suffered a very rare partial failure which prevented the backup switch from activating. As a result, it took far longer than it normally would to isolate the system at the primary data center; in the interim, the malfunctioning system at the primary data center continued to try to synchronize messages with the secondary site. This created a backlog of messages at the secondary data center, which, in turn, slowed down that site’s ability to process incoming transactions.”
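The failure mode Hogg describes can be illustrated with a toy model. The Python sketch below is not Visa's implementation: the `Switch` class, the probe-based failover logic and the backlog simulation are all hypothetical. It assumes one common way a partial failure can block failover, namely a device that still answers a basic liveness probe while no longer handling traffic properly, and it shows how a queue grows when messages arrive faster than a site can process them.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    OK = "ok"
    PARTIAL = "partial"  # assumed: answers liveness probes but mishandles traffic
    DEAD = "dead"


@dataclass
class Switch:
    """Hypothetical core switch; not modelled on any real product."""
    name: str
    health: Health = Health.OK

    def answers_probe(self) -> bool:
        # A partially failed switch still looks alive to a simple probe.
        return self.health is not Health.DEAD

    def forwards_traffic(self) -> bool:
        # Only a fully healthy switch handles transactions correctly.
        return self.health is Health.OK


def select_by_liveness(primary: Switch, backup: Switch) -> Switch:
    """Naive failover: use the backup only if the primary stops answering."""
    return primary if primary.answers_probe() else backup


def select_by_end_to_end(primary: Switch, backup: Switch) -> Switch:
    """Stricter failover: the primary must actually forward traffic."""
    return primary if primary.forwards_traffic() else backup


def simulate_backlog(arrivals_per_tick: int, capacity_per_tick: int, ticks: int) -> list[int]:
    """Return the queue depth after each tick for a single message queue."""
    depth = 0
    history = []
    for _ in range(ticks):
        depth = max(0, depth + arrivals_per_tick - capacity_per_tick)
        history.append(depth)
    return history


primary = Switch("primary", Health.PARTIAL)
backup = Switch("backup")

naive_choice = select_by_liveness(primary, backup).name
strict_choice = select_by_end_to_end(primary, backup).name
backlog = simulate_backlog(arrivals_per_tick=120, capacity_per_tick=100, ticks=3)
```

In this model the naive check keeps selecting the partially failed primary, while the end-to-end check moves to the backup. The backlog figures are arbitrary and only show that a queue grows steadily whenever arrivals exceed processing capacity.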
Hogg added that Visa “do not yet understand precisely why the switch failed at the time it did, but we are working with the manufacturer to conduct a forensic analysis of the switch.” She also confirmed it was not related to a cyber attack.
When the switch failed, it took some five hours to deactivate the system due to the complexity of the fault. The issue caused two periods of peak disruption, one lasting 10 minutes and the other 50 minutes, during which the failure rate was 35 percent. Over the course of 10 hours, around 5.2 million transactions failed to process.
Visa has turned to international accountancy firm EY to review the incident. By the end of the year, Visa expects to transition to its global VisaNet system, which “is based on a different technical architecture from the European system, has multiple data centers, and serves multiple geographies. VisaNet has four active-active images that work in tandem, and has significantly more capacity and scale compared to the European system.”
Hogg claims that VisaNet “can isolate and remove a failing component with one command, taking mere minutes to remove the malfunctioning component from the processing environment.”
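As a rough illustration of what isolating one member of an active-active group implies for the rest, the Python sketch below models a pool of four images. The `ActiveActivePool` class and its methods are hypothetical, and it assumes traffic is split evenly across whichever images remain in rotation, which Visa's letter does not state.

```python
class ActiveActivePool:
    """Hypothetical pool of active images; assumes an even traffic split."""

    def __init__(self, nodes):
        self.active = list(nodes)

    def isolate(self, node):
        # Take a single node out of rotation in one call.
        self.active.remove(node)

    def share_per_node(self):
        # Fraction of total traffic each remaining node carries.
        if not self.active:
            raise RuntimeError("no active nodes left")
        return 1 / len(self.active)


pool = ActiveActivePool(["image-1", "image-2", "image-3", "image-4"])
share_before = pool.share_per_node()

pool.isolate("image-2")
share_after = pool.share_per_node()
```

Under these assumptions each image carries a quarter of the traffic before the isolation and a third afterwards, so every image needs spare capacity to absorb the load of a removed peer.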
Ensuring services are resilient can be difficult, with some turning to 2N hardware, and others shifting to the cloud. In the latest issue of DCD Magazine, we looked into how businesses are trying to improve their resilience.