During the current Covid-19 crisis, enterprise dependency on cloud platform providers (Amazon Web Services, Microsoft Azure, Google Cloud Platform) and on software as a service (Salesforce, Zoom, Teams) has increased. Operators report booming demand, with Microsoft’s Chief Executive Officer Satya Nadella saying: “We have seen two years’ worth of digital transformation in two months.”

This success brings with it new challenges and growing responsibilities. Corporate operators tend to classify their applications and services into levels or categories, which define what they require of each one in terms of service levels, access, transparency, and accountability. Test and development applications, shadow IT, and smaller and newer online businesses were the initial and main drivers of cloud demand in the first decade, and these have relatively light availability needs. But for critical production applications, now the biggest target for cloud companies, enterprises have expectations and compliance requirements that are far more stringent.


The creeping criticality of Zoom

During the pandemic, a lot of applications (such as video conferencing, but a lot more besides) have become more critical than they were before. This is a trend that was already underway, but whereas it was previously happening almost imperceptibly, it is now much more obvious. Uptime Institute has described this trend as “creeping criticality” — and it can lead to a circumstance we call “asymmetric resiliency,” which occurs when the level of resiliency required by the application is not matched by the infrastructure supporting it.

The big public cloud operators do not think this applies to them, because all have addressed the issue of availability at the outset. The dominant cloud architecture, with multisite replication using availability zones which are all built on fairly robust data centers (if not always uniformly or demonstrably so), has proved largely effective. Cloud services do suffer some infrastructure-related outages, according to Uptime Institute data, but no more than most enterprises.

Unfortunately for the cloud companies, assurances of availability or of adherence to best practices are not enough. In both the 2019 and the latest 2020 global annual surveys of IT and critical infrastructure operators, Uptime Institute asked respondents if they put mission-critical workloads into the cloud, and whether resiliency and visibility are an issue. The answers in each year are almost identical: Over 70 percent did not put any critical applications in the cloud, with a third of this group (21 percent of the total sample) saying the reason was a lack of visibility/accountability about resiliency. And one in six of those who do place applications in the public cloud also say they do not have enough visibility (see chart below).

The pressure to be more open and accountable is growing. During the Covid-19 pandemic, governments have tried to better understand the vulnerability of their national critical infrastructure, and it hasn’t always been easy. In the UK, the government has expressed concern that the cloud players were unforthcoming with good information.

The rules being developed and applied in the financial services sector, especially in Europe, may provide a model for other sectors to follow. The EBA (European Banking Authority) says that banks may not use a cloud, hosting or colocation company for critical applications unless that provider is open to full, on-site inspection. That presents a challenge for the banks, which must verify that dozens or hundreds of data centers are well built and well run. Those rules have proved important, because without them, banks seeking access did not always get a welcoming response.

The challenge for financial services regulators and operators is that the banking system is highly interdependent, with many end-to-end processes spanning multiple businesses and multiple data centers and data services. Failure of one can mean a systemic failure. Minimization of risk and full accountability are essential in the sector, and assurances of 99.99 percent availability are not enough.
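A rough way to see why a per-service figure of 99.99 percent offers limited comfort in an interdependent system is to multiply availabilities along a chain. The sketch below is illustrative only: the ten-service chain, the function names, and the assumption that services fail independently are all hypothetical and do not come from Uptime Institute data or any regulator's method.

```python
HOURS_PER_YEAR = 24 * 365  # ignores leap years; good enough for a rough estimate


def serial_availability(availabilities):
    """Availability of a process that needs every listed service to be up.

    Assumes failures are independent, which real systems rarely satisfy.
    """
    total = 1.0
    for a in availabilities:
        total *= a
    return total


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical end-to-end process that depends on ten services,
# each promising 99.99 percent availability.
chain = [0.9999] * 10
combined = serial_availability(chain)

print(f"single service: {expected_downtime_hours(0.9999):.2f} hours/year")
print(f"chain of {len(chain)}: {combined:.4%} available, "
      f"{expected_downtime_hours(combined):.2f} hours/year")
```

Under these assumptions, the expected downtime of the whole process is roughly ten times that of any single service, which is one reason a headline availability number says little without visibility into what sits behind it.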

During the pandemic, the cloud providers have performed well, but failures or slowdowns have occurred. Microsoft’s success, for example, has led to constraints in the availability of Azure services in many regions. For test, dev or even some smaller businesses, this is not too serious; for critical enterprise IT, it might be.

Uptime Institute data suggests that less than 10 percent of mission-critical workloads are running in the public cloud, and that this number is not rising as fast as overall cloud growth (even if new applications like Microsoft Teams have taken off globally). With more accountability and visibility, this number might increase faster.