India’s National Stock Exchange (NSE) has blamed failover logic and configuration issues from its Storage Area Network (SAN) vendor for an outage last month.
On February 24, the NSE went down, halting trading for several hours, with the NSE saying both of its telecoms suppliers were having issues.
After a root-cause analysis, the exchange says the disruption to its two telecoms links was caused by digging and construction activity between its primary and secondary facilities. However, undocumented failover logic that did not meet its specified requirements then caused availability issues that left numerous services, such as clearing and settlement, unable to operate.
Undocumented SAN failover logic leads to outage
In a statement posted to its site explaining the cause of the outage, NSE said it operates a primary data center in the Bandra Kurla Complex in Mumbai, a ‘Near Disaster Recovery’ (NDR) site nearby in Kurla, and a disaster recovery (DR) site in Chennai. Replication to the NDR site is designed so that if the links between the primary and NDR sites are disrupted, the primary continues operations without any direct effect. NSE said operations had continued without interruption during previous link failures.
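The design intent NSE describes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `ReplicationLinks`, `PrimaryState`, and `primary_state` are hypothetical, and the code does not represent the vendor's SAN software or NSE's actual configuration, neither of which has been published.

```python
from dataclasses import dataclass
from enum import Enum


class PrimaryState(Enum):
    # Hypothetical states for the primary site's storage.
    SERVING_REPLICATED = "serving hosts, replicating to NDR"
    SERVING_DEGRADED = "serving hosts, replication suspended"


@dataclass
class ReplicationLinks:
    # Hypothetical model: one flag per telecoms provider link
    # between the primary and NDR sites.
    provider_a_up: bool
    provider_b_up: bool

    def any_up(self) -> bool:
        return self.provider_a_up or self.provider_b_up


def primary_state(links: ReplicationLinks) -> PrimaryState:
    """Stated design intent: the primary keeps serving hosts
    whether or not the NDR site is reachable."""
    if links.any_up():
        return PrimaryState.SERVING_REPLICATED
    # Both links down: replication pauses, but the primary stays online.
    return PrimaryState.SERVING_DEGRADED
```

In this sketch no combination of link states takes the primary offline, which is the behavior NSE says it expected and had observed during earlier link failures.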
On the date of the outage, NSE said it had “instability in links” from both service providers primarily due to digging and construction activity along the path between the two sites.
However, instead of operations continuing as designed, the NSE reportedly saw “unexpected behavior” from its SAN system. The primary SAN became inaccessible to the host servers, leaving the risk management system of NSE Clearing unavailable, along with other systems including clearing and settlement, index, and surveillance.
NSE says subsequent incident analysis showed that the problem was caused by failover logic and configuration implemented by the vendor which “did not conform to NSE’s stated design requirements.”
“The SAN system at the primary data center stopped functioning, which was completely unexpected. The specific failure logic used by the vendor is not documented, was not communicated to NSE, and was not appropriate for NSE’s setup. The resultant SAN failure led to the incident on February 24th.”
Going forward, the NSE says “various steps” are under implementation to address the SAN and telecom link issues, including orders for two additional telecom provider links, removal of the unnamed SAN software, and solutions to ‘de-risk dependency of critical applications to a single storage device.’
The exchange also said it has “a strong technology governance process in place” and invests heavily in its technology infrastructure on a continuous basis.
“Over the last 3-4 years, NSE has almost tripled its annual cash spend on capital and operational expenses on technology to approximately Rs. 900 crores,” the company said in a statement.
“NSE regularly tests its DR readiness in line with SEBI regulations wherein quarterly drills are conducted and live trading sessions from DR site are conducted twice a year.”
India’s market regulator, the Securities and Exchange Board of India (SEBI), says it has seen the root cause analysis and will undertake a number of monitoring and failover/disaster recovery tests, as well as further investigations into the outage.