Asia: Data center veterans speak on reliability

The service-level agreement (SLA) number is undoubtedly an important measure of the reliability a data center offers, but the devil is in the details, warns Alvin Siagian, vice president and director at NTT Indonesia.

“When you say it’s 99.982 percent, what is behind it? What is the planned, and what is the unplanned [downtime]? What is the severity level for an incident to be considered as an outage?” asked the industry veteran who has helmed various IT outsourcing and data center roles since 1993.
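To put the figure Siagian cites in perspective, an availability percentage can be converted into the downtime it permits. The sketch below is illustrative only: the function name is hypothetical, a 365-day year is assumed, and it makes no attempt to separate planned from unplanned downtime, which is exactly the detail a contract has to spell out.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored (assumption)


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime budget implied by an availability percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return period_hours * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Compare a few commonly quoted availability levels.
    for pct in (99.9, 99.982, 99.995):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.0f} minutes) per year")
```

Under these assumptions, 99.982 percent works out to roughly 1.58 hours, or about 95 minutes, of downtime per year. Whether a given incident counts against that budget depends on the severity and exclusion definitions in the agreement.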

Voradis Vinyaratn, the executive director and acting managing director of T.C.C. Technology (TCCtech) in Thailand, agrees: “Most people think of the SLA as a magic instrument. In fact, the SLA does not guarantee that there will be no problems.”

Common misconceptions

But what exactly is the role of the SLA today? According to Heng Wai Mun, the executive director at OneAsia, the SLA is an agreement that serves as a framework to govern the relationship between the parties, setting out the contractual expectations on both the buyer and the seller.

Despite its widespread use, the SLA’s relationship to reliability is fraught with misunderstanding, says the executive, who has over two decades of experience in the IT and telecoms industry.

“I think the most common misconception is the ‘better’ the SLA is, the higher number of 9s can be found in the agreement. While service availability percentage is indeed an important parameter, it cannot be the sole parameter needed to evaluate an SLA,” said Heng.

“For example, there are vendors who provide 100 percent SLAs, but have many caveats and exclusions to what an outage might mean. Also, the payouts might be so watered down that [an outage] becomes just a calculated risk that the vendor is willing to take,” he said.

Moving beyond an outage

When it comes to reliability, Siagian sees the SLA as merely one component among many that organizations should keep an eye on. A proper understanding of a provider’s capability also requires an appraisal of its reactive and maintenance capabilities, he says.

“What are your maintenance procedures like? If you say you want to keep me on, tell me. What is the response time, what is the resolution time? How long does it take for someone to be on site? What is the mean time to repair (MTTR)?” said Siagian.
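The response-time and MTTR figures Siagian asks about can, in principle, be derived from a provider’s incident history. The following is a minimal sketch, assuming each incident carries three timestamps; the `Incident` class, its field names, and the sample dates are hypothetical. Definitions of MTTR vary between providers (repair, restore, or resolve), so this version simply measures from the report of an incident to the restoration of service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Incident:
    """One incident record; the field names are illustrative assumptions."""
    reported: datetime   # when the issue was raised
    responded: datetime  # when the provider acknowledged and began work
    resolved: datetime   # when service was restored


def _mean(durations: Iterable[timedelta]) -> timedelta:
    """Average a collection of durations."""
    items: List[timedelta] = list(durations)
    if not items:
        raise ValueError("at least one incident is required")
    return sum(items, timedelta()) / len(items)


def mean_response_time(incidents: List[Incident]) -> timedelta:
    """Average time from report to first response."""
    return _mean(i.responded - i.reported for i in incidents)


def mean_time_to_repair(incidents: List[Incident]) -> timedelta:
    """Average time from report to restoration (one common MTTR definition)."""
    return _mean(i.resolved - i.reported for i in incidents)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    log = [
        Incident(datetime(2024, 1, 10, 9, 0),
                 datetime(2024, 1, 10, 9, 15),
                 datetime(2024, 1, 10, 11, 0)),
        Incident(datetime(2024, 3, 2, 22, 0),
                 datetime(2024, 3, 2, 22, 45),
                 datetime(2024, 3, 3, 2, 0)),
    ]
    print("Mean response time:", mean_response_time(log))
    print("Mean time to repair:", mean_time_to_repair(log))
```

Asking a provider how it defines the start and end points of each interval is as important as the resulting numbers, since two operators can report very different MTTR values for the same incidents.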

Another vital but seldom-discussed component is the problem resolution process. This covers incident and outage management, and begins with an initial analysis to identify which party is responsible for a problem and whether a workaround can be enacted quickly to minimize downtime.

“Once you have a workaround in place, you bring the servers up. After that, you troubleshoot to find the root cause, plan and make the necessary changes, and discuss how you improve matters going forward,” he said, noting that the participation of pertinent vendors and subcontractors is necessary. “You decide who performs, who helps, who approves for each of the items. Then we talk about the SLA.”

Getting it right

All three executives agree that relying solely on the SLA stated in a contract is not adequate, and that it must be backed by additional due diligence. Heng explained the rationale: “I believe a proper evaluation should be done on a technical and operations level to understand whether a vendor has the capability to fulfil the stated SLA.”

“Does the operator have the necessary expertise and experience to manage the environment? Does the operator in turn outsource their facilities management to a third party? These are all key questions to ask and to base the evaluation on,” says Heng.

For his part, Voradis cautioned against an overemphasis on specifications. He pointed out that failures can happen when key elements such as staff training and process validation are weak or overlooked, even when the “hardware and software may be good.”

Voradis advised getting a better feel for the operations: “That’s why we strongly recommend a site visit as [part of the evaluation process] where you can talk to the people who would be looking after one of your most important assets. It is of utmost importance for you to walk away with confidence in them.”

There are other considerations to bear in mind too, starting with the physical and cyber security of a colocation facility. This should include ensuring that operational procedures are sound, and that business risk mitigation and remedy plans are in place, Voradis told us.

This sentiment was echoed by Siagian: “Check resource schedule on the internal resources such as operations, technical support, capacity management and quality management – basically, the organizational structure. Check their business continuity planning. Do they ever exercise it?”

Siagian left us with a somewhat unorthodox suggestion in closing. He said: “Interview customers that have gone through problems. You don’t want [to speak only with] happy customers.”
