What can the data center industry learn from the SGX outage of November 5?
Trading at the Singapore Stock Exchange (SGX) closed for almost three hours on November 5, 2014 after its systems failed to cope with a voltage fluctuation caused by a lightning strike. As we reported at that time, the lightning strike at 2.18pm took the SGX out of action until 5.15pm for the securities market and 7pm for the derivatives market, exacerbated by incomplete data that resulted in a decision not to switch to the secondary data center.
The unscheduled downtime of 2 hours 42 minutes for securities and 4 hours 27 minutes for derivatives is a serious blow to the reputation of Singapore as a financial hub. The Singapore Government took the outage seriously enough to set up a high-level Board Committee of Inquiry (BCOI) to independently oversee the investigation into the incident, which took place at SGX’s primary data center at Keppel Digihub.
A series of unfortunate events
The BCOI report was completed earlier this year and sent to SGX on March 31, 2015. SGX released it publicly at the end of June, as it announced an investment of up to US$15M to beef up its infrastructure and address gaps in its service recovery capabilities in the wake of the November outage - and a separate software-related outage on December 3.
DatacenterDynamics examined the full BCOI report and spoke to a number of data center experts to find out what exactly happened, and to understand what the industry can learn from the incident as a whole.
In the immediate aftermath, SGX management appointed i3 Solutions to ascertain the cause of the power outage, which was identified and rectified by Nov 8. The findings were later validated by experts from Environmental Systems Design Inc (ESD), which was appointed by the BCOI to independently investigate the outage.
The root cause was explained in the BCOI report, point 24, which we reproduce below:
- “The trigger event for the power outage was the simultaneous voltage dip in the electricity supply from Singapore Power on both feeds at PDC that activated both emergency power systems (the Diesel Rotary Uninterruptible Power Supplies (DRUPS)).
- There was a malfunction to a component of one of the two DRUPS, which caused its output frequency to differ from the frequency of the other DRUPS.
- The downstream Static Transfer Switches (STS) could not compensate for the difference in frequency, and caused an out-of-phase power transfer which in turn caused a surge in current that led to a total power outage to the PDC, which shut down all of SGX’s IT systems and equipment and tripped a number of circuit breakers.
- The key cause of the power outage to the PDC was the power transfer by the STS when the frequencies of the output of the two DRUPS were unsynchronized.”
The STS is an electrical switch that connects two power lines so that the load can be switched from one line to the other, keeping the power supply continuous if one line fails. In this case, there was an error in the design, not picked up in subsequent testing, which meant that the STS could not handle an out-of-phase power transfer, and therefore could not compensate for the difference in frequency caused by the malfunctioning DRUPS. This caused a surge in the output current and culminated in a total power outage at the data center.
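To illustrate why a frequency mismatch makes a transfer unsafe, here is a minimal Python sketch of the kind of synchronization check a transfer switch could apply before switching. It is not taken from the BCOI report or from any vendor's firmware: the function names, the 15-degree tolerance and the example frequencies are all assumptions chosen for illustration. The only physics used is that two sources differing by Δf hertz drift apart by 360 × Δf degrees of phase every second.

```python
def phase_difference_deg(freq_a_hz, freq_b_hz, elapsed_s, initial_deg=0.0):
    """Phase angle between two AC sources after elapsed_s seconds.

    The result is folded into the range 0..180 degrees, where 0 means
    the two waveforms are aligned and 180 means they are exactly opposed.
    """
    drift = initial_deg + 360.0 * (freq_a_hz - freq_b_hz) * elapsed_s
    wrapped = drift % 360.0
    return min(wrapped, 360.0 - wrapped)


def safe_to_transfer(freq_a_hz, freq_b_hz, elapsed_s, max_phase_deg=15.0):
    """Allow a transfer only while the sources are close to in phase.

    max_phase_deg is a hypothetical tolerance, not a figure from the report.
    """
    return phase_difference_deg(freq_a_hz, freq_b_hz, elapsed_s) <= max_phase_deg


if __name__ == "__main__":
    # Two sources at the same frequency stay aligned indefinitely.
    print(safe_to_transfer(50.0, 50.0, elapsed_s=10.0))
    # A 0.5 Hz mismatch puts the sources fully out of phase after one second.
    print(phase_difference_deg(50.0, 49.5, elapsed_s=1.0))
    print(safe_to_transfer(50.0, 49.5, elapsed_s=1.0))
```

Even a small frequency error takes the two feeds far out of phase within seconds, which is why a switch that transfers without such a check can connect the load to a waveform pointing the opposite way.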
It boils down to human error
“I would expect the STS to check for frequency synchronization to ensure that it is safe to switch, to prevent a fault being transferred to the other feed. Therefore, there may be configuration errors involved in this case,” said a senior engineer who saw the report and spoke to DCD on condition of anonymity.
To be clear, the BCOI report does note that the overall design of the PDC was generally robust and that it met “industry resilience standards and best practices.” Given that equipment can fail at inopportune times, why didn’t the design call for the use of an STS with dynamic voltage capability in the first place – and why wasn’t this caught by experts who approved the design?
Speaking of mission critical facilities in general, Martin Ciupa of Green Global Solutions said: “Human error related downtime is often the chief culprit. Human error failure modes are not always fully mapped – it’s very hard, and needs external consultants [that are not] part of day to day team to be brought in.”
The senior executive of a data center operator, also speaking on condition of anonymity, told DCD about the need to consider the bigger picture in terms of the design and implementation of that data center. From an engineering perspective, everything is limited by what you have designed and built, he said.
In fairness, the design flaw is not one that would be apparent to most engineers, according to Ed Ansett, the co-founder and chairman of i3 Solutions, the company that helped SGX identify and rectify the fault last year. “I have come across the failure type at SGX before so I knew what to do. It isn’t a new issue, it is well known amongst knowledgeable data center professionals,” he told DCD.
As to why the switch to the secondary data center (SDC) did not happen, the BCOI report noted that the information available to the crisis management team at SGX at the time meant that they had to consider the possibility of a temporary communications loss between the PDC and SDC. Because the matching engines in the PDC are designed to continue operating in the event of a communications loss, order books could have become mismatched, which made it unsafe to fail over.
Based on this information, Wong Tew Kiat, managing director of Organisation Resilience Management Pte Ltd and a veteran in business continuity and data center management, suggested that the PDC and SDC may have had different IT configurations. “Such different IT configurations at both PDC [Primary data center] and SDC are likely to be due to the technologies that had changed along the years that it was complex for them to change in parallel at the SDC in time,” he told us.
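The reasoning behind that decision can be sketched as a simple rule: fail over only when the primary site is positively known to have stopped, and treat "unknown" the same as "still running". The Python below is an illustration of that rule only, not SGX's actual procedure; the names `Decision` and `failover_decision` and the three-state input are assumptions made for this example.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    FAIL_OVER = "fail over to the secondary site"
    HOLD = "hold: primary may still be matching orders"


def failover_decision(primary_stopped: Optional[bool]) -> Decision:
    """Decide whether it is safe to activate the secondary site.

    primary_stopped is True if the primary is confirmed down, False if it
    is confirmed running, and None if its state cannot be determined
    (for example, because the link between the sites may have dropped).
    """
    # Only a positive confirmation is good enough. If the primary might
    # still be accepting orders, starting the secondary would create two
    # independent order books that could diverge.
    if primary_stopped is True:
        return Decision.FAIL_OVER
    return Decision.HOLD


if __name__ == "__main__":
    for state in (True, False, None):
        print(state, "->", failover_decision(state).value)
```

With incomplete data the input is effectively `None`, so the rule returns `HOLD`, which matches the choice the report describes.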
Pointing to how BCM risks are often assessed by individuals such as the head of business continuity management or enterprise risk management, Wong said: “We will no longer focus only on business continuity, but also need to integrate the risk management from IT hardware, application software, network communications and data center [in a more holistic manner].”
Learning as an industry
The BCOI report does outline various areas for improvement, including SGX’s internal procedures, monitoring capabilities, recovery times and communication with stakeholders, all areas in which SGX has announced it will invest. If anything, though, its findings make it evident that the outage was not one that SGX could have prevented, because it had to rely on the work of external experts.
According to the report:
- “The facilities at SGX’s PDC [Primary data center], including the power supply architecture and systems, are provided by the DCP. SGX does not have in-house expertise to design, construct or operate data centers from the facilities perspective. SGX therefore relies on the DCP for their expertise.”
With the ball back with data center operators, what can the industry do to improve itself? While there is no simple answer to this question, several of the experts we spoke to suggested that sharing of knowledge about data center failures could benefit the industry as a whole.
“The fundamental issue, and it’s a global issue, is broadly speaking there is no mandate to disclose the reason for the failure,” said Ansett. “Why would you? It’s embarrassing, expensive and damages reputations. So, if it doesn’t breach the statute books, then companies will naturally avoid explaining what happened outside their organizations.”
On this front, Ansett praised the publication of the BCOI report by SGX. “SGX is helping other organizations avoid the same type of failure as well as assuring investors and customers that they are on top of the problem,” he told us.
“Each outage in a data center is a learning experience for both service providers and end-users,” agreed Goh Thiam Poh, director of operations, Equinix Singapore. “In the wake of a failure, when post mortem activities yield visibility into the details, it’s easy for customers to sit back and say, ‘Had I known, I never would have done this’ or ‘Had I known, I would have done that’,” said Goh to DCD.
“None of this fixes the past, but it can have a significant impact on the future if we analyze the results, continue to innovate, plan accordingly, and implement best practices in support of the plans,” he said.
For now, SGX has accepted full responsibility for the outage, and is in the midst of implementing the BCOI’s recommendations to improve and strengthen its technology infrastructure. The Monetary Authority of Singapore (MAS) has also announced that it will monitor and verify, through an independent expert, that the remedial measures are implemented satisfactorily.