Resiliency describes the extent to which a system, digital infrastructure or application architecture is able to maintain its intended service levels, with minimal or no impact on users or business objectives, in spite of planned and unplanned disruptions. It also describes the ability of a system, infrastructure or application to recover full business operations after a disruption or disaster has occurred.

It has occasionally been said that the move toward distributed and cloud architectures and business models represents a trend toward a more tolerant availability model, in which failures are accepted in exchange for cheaper services. This is not the case: the underlying tradeoffs in data availability do not make businesses any more tolerant of failures or slowdowns. In fact, the opposite is true. In the 2017 Uptime survey, only eight percent of respondents said their management is less concerned about outages than it was a year ago. Incidents and failures appear to ripple outward, and have an ever-greater impact.

Resiliency is all too obvious when it is absent. Some recent data center and core infrastructure failures have led to national and global headlines and major financial and reputational losses. Many of these incidents show that failures of equipment or processes quickly escalate, and frequently involve multiple facility and IT systems.

How is resiliency changing?

Power equipment at QTS Irving – QTS

A de facto design principle of almost all data centers today, and much of IT, is a primary focus on physical or infrastructure resiliency – the ability to continue running, in spite of maintenance, power or equipment failures, through redundancy of equipment and power distribution.

But over the next decade, we expect that, for a sizeable but as yet undetermined number of operators, resiliency and redundancy at the individual data center level will, in whole or in part, be complemented or replaced by resiliency at the IT level. This is not necessarily a resiliency tactic or strategy, but an inevitable consequence of applications themselves becoming more distributed.

Achieving distributed resiliency will not happen uniformly. At its best, it will involve the rapid handover of workloads to alternative venues in the event of failures, under automated software control and supported by reliable networks. This has been variously called ‘software-level resiliency,’ ‘network-level resiliency’ or ‘cloud-level resiliency.’
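
To make the idea concrete, the sketch below shows, in highly simplified form, what automated handover can look like: a controller polls each site’s health endpoint and directs traffic to the first healthy one. The site URLs and the update_dns() hook are hypothetical placeholders, and in practice this role is played by load balancers, service meshes or DNS-based traffic managers rather than a polling loop.

    # Simplified failover sketch: poll each site and route traffic to the
    # first one that responds as healthy. Site URLs are hypothetical.
    import urllib.request

    SITES = [
        "https://eu-west.example.com/health",
        "https://us-east.example.com/health",
        "https://ap-south.example.com/health",
    ]

    def first_healthy(sites):
        for url in sites:
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    if resp.status == 200:
                        return url
            except OSError:
                continue  # site unreachable or unhealthy; try the next one
        return None  # no healthy site found

    def update_dns(target):
        # Placeholder: a real system would call a DNS or traffic-manager API here.
        print("Routing traffic to", target)

    active = first_healthy(SITES)
    if active:
        update_dns(active)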

Although adoption of these technologies will have a big impact on data centers and data center design, it has wider implications as a key part of digital transformation for all companies. If resiliency can be fully and safely achieved based on distributed, lightweight systems and data centers, it will enable companies to rewrite the way they plan, build and spend on IT and digital services.

Potential benefits of distributing resiliency

To some degree, the general move to distributed applications will require a new approach to ensuring resiliency, regardless of whether this is an intentional strategic/architectural move. CIOs and service providers will find that, over time, they must tactically address weaknesses in their infrastructure that cause incidents and loss of service.

However, many of the benefits associated with a deliberate move to distributed resiliency are compelling. The long-term vision is that resiliency ultimately becomes autonomic – managing itself, shifting loads and traffic across geographies according to needs, replicating data and optimizing for performance and economics with little intervention. The short- and medium-term promise is that, after a transition, CIOs will have a more reliable, agile infrastructure that costs less and can support distributed modern applications far better. The key benefits, some theoretical and some already manifesting themselves, are as follows.

Availability

As long as at least three data centers are involved and the networking between them has sufficient capacity and diversity, extremely high availability should be achievable. Simultaneous failure of two or more data centers is extremely unlikely, and becomes less likely still as more data centers or regions are added.
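
A rough, illustrative calculation shows why. Assuming sites fail independently and any single surviving site can carry the workload (both strong assumptions in practice), the probability that at least one of N sites is up is one minus the product of the individual failure probabilities. The 99.5 percent per-site figure below is an arbitrary example:

    # Illustrative only: availability of 'at least one of N sites up',
    # assuming independent failures and that any surviving site can carry
    # the full workload.
    def at_least_one_up(site_availability, n_sites):
        p_down = 1.0 - site_availability
        return 1.0 - p_down ** n_sites

    for n in (1, 2, 3):
        print(n, "site(s):", at_least_one_up(0.995, n))
    # 1 site:  0.995          (~1.8 days of downtime a year)
    # 2 sites: ~0.999975      (~13 minutes a year)
    # 3 sites: ~0.999999875   (~4 seconds a year)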

Efficiency

Cloud resiliency (whether public clouds are involved or not) means that once more than two data centers are used, spare active capacity is spread across a number of them. Because the spare capacity is shared, utilization can be pushed to higher levels: the more data centers involved, the more efficient the operation becomes.
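
The arithmetic is straightforward under a simple ‘survive the loss of any one site’ assumption (real capacity planning is, of course, more involved): if the remaining sites must absorb the load of a failed site, each site can safely run at (N-1)/N of its capacity, so the reserved headroom shrinks as the number of sites grows.

    # Illustrative only: maximum safe utilization per site if the loss of any
    # single site must be absorbed by the remaining N-1 sites.
    def max_safe_utilization(n_sites):
        return (n_sites - 1) / n_sites

    for n in (2, 3, 5, 10):
        print(f"{n} sites: each can run at up to {max_safe_utilization(n):.0%}")
    # 2 sites: 50%   3 sites: 67%   5 sites: 80%   10 sites: 90%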

Agility

A distributed approach means that applications can be developed to run at any of the sites, or across all – and appropriate infrastructure investments can also be made with greater flexibility.

Elimination or reduction of disaster recovery costs

A move to cloud-based resiliency will ultimately mean a move to an all-active distributed data center model. As a result, the processes and resources required for effective disaster recovery (DR) may no longer be needed, or may at least be reduced and redefined, with governance and policies becoming more important. Cloud-based DR services, which are now multiplying, represent an effective partial solution, avoiding the costs of unused capacity.

Tried and tested resiliency

One of the problems with traditional resiliency and DR strategies is that testing is extremely difficult and sometimes risky, because tests must be carried out on live systems, sometimes at a data center-wide level. Distributed resiliency, by contrast, relies on soft failovers, with ‘hot-swappable’ software and cloud services, so many foreseeable failures can be tested easily, repeatedly and in different ways.
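
As a simple illustration, failover logic of this kind can be exercised in an ordinary automated test, with an outage simulated by removing a site from the healthy set. The route_traffic() function below is a hypothetical stand-in for whatever failover mechanism is actually in use.

    # Illustrative only: a repeatable test that simulates losing the primary
    # site and checks that traffic is redirected to a surviving one.
    def route_traffic(healthy_sites, preferred="site-a"):
        if preferred in healthy_sites:
            return preferred
        return min(healthy_sites) if healthy_sites else None

    def test_failover_when_primary_is_down():
        assert route_traffic({"site-a", "site-b", "site-c"}) == "site-a"
        assert route_traffic({"site-b", "site-c"}) == "site-b"  # primary lost
        assert route_traffic(set()) is None                     # total outage

    test_failover_when_primary_is_down()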

Ability to support (some) distributed applications

Almost by definition, single-site or traditional resiliency approaches cannot fully support applications that have components in multiple locations across networks. When failures occur, locally running applications may be left without core components. Distributed strategies will not eliminate these problems, but by replicating components across multiple sites, they can make such incidents less likely.

Maintaining integrity and availability

Historically, the easiest way to manage the resiliency of an application, a system or key data has been to keep a master copy and guard it like the crown jewels, often in a heavily protected, highly resilient data center. Any second copy is inevitably imperfect, and may represent merely a best effort to replicate as much of the data as quickly as possible. This architecture carries its own risks and is unsuited to the needs of digital businesses today.

With cloud services and architectures now part of the mix, or even the totality, the CIO must determine which type (or types) of resiliency is most appropriate for each type of application and data, based on business needs and technical risk, and then architect the best combination of IT infrastructure. This will span data center resiliency, applications, databases and networking, and must take into account organizational structure, processes, tools and automation. From all this, the organization must then deliver comprehensive and consistent applications that meet and exceed customer expectations for service availability and resiliency.

Andy Lawrence is executive director of Uptime Institute Research