For decades, data center builders have gone to great lengths to create facilities that can deliver services with extreme levels of reliability. Now, a new generation of IT people says the cloud can do this just as well.

Where should CIOs turn for resilient services - the cloud, or redundant data center hardware? Big surprise: the answer is neither. True resilience comes from properly understanding the services you are running.

A redundant story

Redundant generators – Micron21

Traditionally, data centers have been made more reliable by using redundant architectures. There are backup servers with backup storage ready to take over, and the facility itself is made reliable by battery backup from uninterruptible power supplies, duplicate power feeds and spare cooling capacity.

The wisdom of data center reliability has been codified in the Tier system of the Uptime Institute, as well as the European standard EN 50600. It comes down to duplication - sometimes referred to as “2N” - of power and cooling, and concurrent maintainability, the ability to service the facility without downtime.

Duplicating resources can push operators to maintain two separate data centers, far enough apart (say 100km) that a natural disaster won’t strike them both, and to mirror live data between them in so-called “active-active synchronization.”

That’s complicated and expensive, say enterprise IT staff, who have been quietly adopting cloud services as an alternative, and have noticed that these are pretty reliable, by and large. Cloud services are implemented on duplicated hardware, running in multiple data centers, and are accessed at the level of virtual machines or applications, which run across multiple servers.

This can be a big plus in terms of reliability, if it is designed to minimize fault propagation. Services like Netflix are built so that individual modules can pick the cheapest resources to run on, says Liam Newcombe, CTO of data center analytics and TCO firm Romonet: “You can take out entire servers or buildings and Netflix doesn’t notice. When the building comes back on, it replicates back the few transactions it missed, and these are replicated out to a distributed database.”
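As a rough sketch of that pattern - not Netflix’s actual code, and with invented names throughout - writes go to whichever replicas are reachable, and a replica that was offline catches up by replaying the transactions it missed:

```python
# Minimal sketch of the pattern Newcombe describes (illustrative only):
# writes go to whichever replicas are reachable, and a node that was offline
# catches up by replaying the transactions it missed from a healthy peer.

class Replica:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.log = []  # ordered list of committed transactions

    def apply(self, txn):
        if self.online:
            self.log.append(txn)

    def catch_up(self, peer):
        """Replicate back the transactions missed while offline."""
        missed = peer.log[len(self.log):]
        self.log.extend(missed)


def write(replicas, txn):
    # The write succeeds as long as at least one replica is reachable.
    live = [r for r in replicas if r.online]
    if not live:
        raise RuntimeError("no replicas available")
    for r in live:
        r.apply(txn)


replicas = [Replica("dc-a"), Replica("dc-b"), Replica("dc-c")]
replicas[1].online = False            # take out an entire "building"
for txn in ("t1", "t2", "t3"):
    write(replicas, txn)              # the service doesn't notice
replicas[1].online = True
replicas[1].catch_up(replicas[0])     # replay the few missed transactions
assert replicas[1].log == ["t1", "t2", "t3"]
```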

What if this is all the duplication you need? If applications are running in the cloud, then reliability becomes something for the cloud provider to worry about. It’s especially tempting since using cloud services is simpler - and often cheaper - than managing your own facility.

Reliability experts say it’s not as simple as that - but they acknowledge that a major shift is happening. Richard Hartmann builds data centers for colocation provider SpaceNet in Germany. He makes them reliable, and is a passionate advocate for the EN 50600 reliability standard. Despite this, he says, there’s a new generation of users - the cloud natives - who will see standards like this as irrelevant: “No one in the cloud will have EN 50600.”

Cloud native people don’t care about having redundant sites, he says, and even if they did, you can’t build reliable architectures like active-active synchronization on top of cloud services.

The cloud is a different thing

Server outage – Getty Images

To move to the cloud, one has to accept that reliability is implemented differently: “Cloud natives have a totally different view of what redundancy means. They no longer care about the underlying infrastructure as long as they have enough components. If you have ten database servers, you don’t care if half of them go down.”
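A minimal sketch of that mindset - the server names are invented for illustration - treats the database servers as an interchangeable pool and simply routes around the dead ones:

```python
import random

# Ten interchangeable database servers; half of them fail.
SERVERS = [f"db-{i:02d}" for i in range(10)]
DOWN = set(random.sample(SERVERS, 5))


def query(sql):
    # Try the pool in random order and use the first server that answers.
    for server in random.sample(SERVERS, len(SERVERS)):
        if server in DOWN:
            continue  # a dead server is routed around, not alarmed over
        return f"{server} answered: {sql}"
    raise RuntimeError("the entire pool is down")


print(query("SELECT 1"))
```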

Of course, cloud outages do happen. To quote Metallica’s James Hetfield, it’s all fun and games, till someone loses an eye.

In 2017, Amazon made an error, accidentally removing servers that provided the index to its Simple Storage Service (AWS S3). Services went down all over the Internet, including Quora, Giphy, Instagram, IMDb, American Airlines, Imgur, and Slack, to name but a few.

The error at AWS had such serious consequences because these services had all unknowingly been built with S3 as a single point of failure. So the new cloud-native mindset doesn’t dispense with the need for a considered approach to reliability - it shifts the responsibility up into the design of the service.
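Designing that dependency out is conceptually simple, even if doing it across a real estate of services is not. The hypothetical sketch below - the function and cache are invented, not real AWS APIs - shows the shape of the fix: treat the object store as one dependency among several, and degrade gracefully when it fails:

```python
class StoreUnavailable(Exception):
    pass


def fetch_from_primary(key):
    # In a real service this would call the primary object store (e.g. S3).
    # Here it simply simulates an outage.
    raise StoreUnavailable("primary object store is down")


# A stale but serviceable copy kept somewhere else entirely.
FALLBACK_CACHE = {"homepage.json": b'{"status": "degraded but up"}'}


def fetch(key):
    try:
        return fetch_from_primary(key)
    except StoreUnavailable:
        # Degrade gracefully instead of failing the whole site.
        if key in FALLBACK_CACHE:
            return FALLBACK_CACHE[key]
        raise


print(fetch("homepage.json"))
```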

And how will service designers know whether their design is reliable? The Uptime Institute, the source of the Tier standards for reliable facilities, wants to help here, with an Uptime Hybrid Reliability assessment to gauge the reliability of hybrid cloud implementations - but it’s not as simple as offering a new set of Tier guidelines for the cloud.

“A company may have databases right across the cloud, but the CIO is still responsible for providing availability,” says Todd Traver, VP of IT optimization strategies at the Institute. In some ways the situation is worse, because that CIO is expected to offer guarantees for services provided using elements from third parties.

As Newcombe explains, even if those underlying cloud services offer a service level agreement (SLA), it can’t be used to deliver a service level agreement to customers down the line. “Service penalties don’t flow through an SLA chain,” he says. “You can’t pass losses down the chain.”
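Some invented figures make the point concrete - the numbers below are illustrative only, not taken from any real contract:

```python
# Why an upstream SLA credit cannot back a downstream SLA (illustrative figures).

cloud_bill = 5_000           # what you pay the cloud provider per month
credit_rate = 0.10           # a typical capped service credit for a breach
upstream_credit = cloud_bill * credit_rate          # $500 at most

customer_fees = 400_000      # what your customers pay you per month
penalty_rate = 0.05          # what your own SLA obliges you to refund them
downstream_penalty = customer_fees * penalty_rate   # $20,000

print(f"Credit received from the provider: ${upstream_credit:,.0f}")
print(f"Penalty owed to your customers:    ${downstream_penalty:,.0f}")
# The gap between the two is yours to absorb.
```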

Cloud providers will only repay the service fee of their direct customers, not the much higher fees paid for applications built on top of those cloud services. In the end, a cloud provider is not an insurance company, and won’t reimburse losses. All this means that CIOs feel they have lost control, says Traver: “They no longer provide or even manage the various elements of their IT. And it is not good enough to just cross your fingers and hope AWS will be there!

“In the past, they had a 2N data center, which would never go down,” he says. “The applications would be non-resilient; they 100 percent depended on a data center. Now they are spread across multiple locations. The data center is critically important still - but also the application you use. There are a lot more pieces and parts.”

A CIO who might previously have implemented a service under their own control in a Tier III or Tier IV data center now has to contend with a hybrid that combines their own IT resources with multiple cloud services.

That’s a complicated task. Uptime’s approach is to add in the other factors: as well as the in-house data center, the hybrid assessment takes in the networks, platforms and applications involved, and the nature of the overarching organization.

In other words, it shifts responsibility back to the CIO, or the provider of a service. Resilience is no longer about building a reliable data center. It’s about whether you yourself are managing a reliable service, constructed from multiple components.

In the end, Traver says, a CIO tasked with providing reliable infrastructure must do due diligence on all the services used, perhaps even including a check of the data center architectures of their cloud providers - paying attention to the issues that are created by the combination that has been adopted.

The never-ending responsibility

Kolotek data center in Montreal – Kolotek

It’s also going to be a continuous responsibility - those who get the Hybrid Reliability stamp of approval should check and recheck throughout the year, in case architectures or components subtly change.

“Companies are coming to us who have either had a large outage in the past, or are concerned they don’t know how resilient they are,” Traver says. At the time of writing, no stamps of approval have been issued, but those working towards one include the operator of a cloud database spread across multiple locations that “recently had a bit of an outage” and wanted to know: “Why did we not see this coming?”

Uptime consultants will carry out the assessment over the course of a week or so. During that time, Traver expects some flaws will be found and fixed, and the process should educate the in-house staff to a level where they can self-assess to stay online until the next audit: “It will be like going for your annual physical.”

Uptime plans to cover what happens in the event of a failure, perhaps helping the provider define a degraded service level, which should help in aligning expectations with what can be delivered.

So has the cloud changed everything? Actually no, says Newcombe. It is just making visible a gap between customer assumptions and reality which has always existed in conventional data centers.

For instance, a data center which promises 99.95 percent uptime might imply the site will be down for less than 0.05 percent of the year - a little over four hours. However, Newcombe points out: “Your outsourced provider might well be entitled to take a couple of hours every Sunday.” Over a year, that planned maintenance would massively outweigh the 0.05 percent allowed for unplanned outages - and would also probably be longer than outages in cloud services over the same period.
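The arithmetic behind that point, using the figures above, is simple enough:

```python
hours_per_year = 365 * 24                   # 8,760 hours

# 99.95 percent uptime leaves 0.05 percent for unplanned outages.
unplanned_budget = hours_per_year * 0.0005  # about 4.4 hours a year

# Two hours of permitted maintenance every Sunday.
planned_maintenance = 2 * 52                # 104 hours a year

print(f"Unplanned outage budget:    {unplanned_budget:.1f} hours/year")
print(f"Permitted planned downtime: {planned_maintenance} hours/year")
```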

Bringing this issue to the surface could benefit everyone, because the techniques used in web service design are available, to some extent, to all. Once services have been broken up, the architect should be able to decide how critical each part of the service is, how important the data it produces is, and how vital its availability really is.

Marketing retweets may be expendable, but paid transactions are not. The data in the payroll system is crucial, but it only needs to be continuously available for a short period when the wages are being calculated.
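A classification of that kind - the service names and requirements below are invented for illustration - can be as simple as a table the architect fills in:

```python
# Hypothetical criticality map; the entries are illustrative only.
services = {
    "marketing-retweets": {"data_critical": False, "availability": "best effort"},
    "payments":           {"data_critical": True,  "availability": "continuous"},
    "payroll":            {"data_critical": True,  "availability": "during the pay run only"},
}

for name, needs in services.items():
    print(f"{name}: critical data = {needs['data_critical']}, "
          f"needs to be up: {needs['availability']}")
```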

More fundamentally, firms are relying on legacy applications which are simply not written in a way which is compatible with the cloud, or with modern microservice architectures. “Legacy apps should have been rewritten that way in the ‘90s, but they weren’t,” Newcombe says.

Such apps can be run fairly well in legacy data centers, but they can’t be migrated into the cloud in a stable way. Breaking those applications up and moving them to web services will be fantastically complex. And while it may be tempting to blame this crisis on the cloud, the real culprit is the inertia which has left those legacy applications in use.

“We need to have apps which are fault isolating and degrade gracefully,” Newcombe says. “That is an entire discipline that people like Netflix have got right, and other people have got wrong.”

In the end, the cloud model is inevitable, and users and businesses must adapt to it. In doing so, they will be exchanging one set of risks and threats for another - and hopefully increasing their awareness in the process.

“If you already own your own data center, it’s an extraordinarily cheap resource if you can run it,” Newcombe adds. It’s also the best place to keep legacy applications like SAP, which are designed to run there.

The in-house data center will continue to be popular with companies wanting control of customer data, either because of measures like Europe’s GDPR, or simply because they like to have control.

Cloud infrastructure will be more open and usable, and - if done correctly - the cloud model can match or exceed the reliability of over-complicated enterprise data centers.

Whichever way you move, the important thing will be to understand the trade-offs you are making.

This article appeared in the June/July issue of DCD magazine.