Microsoft has issued a strategy paper that offers some insight into its approach to operating cloud-scale data centers, where "service availability is increasingly being engineered at the software level rather than by focusing on hardware redundancy."
Here are some extracts:
At cloud-scale, equipment failure is an expected operating condition – whether it be servers, circuit breakers, power interruption, lightning strikes, earthquakes, or human error – no matter what happens, the service should gracefully failover to another cluster or data center while maintaining end-user service level agreements (SLAs).
...At Microsoft, we’ve begun to follow a different model, with a strategic focus on resilient software. We work to drive communications that are more inclusive between developers, operators, and the business. By sharing common business goals and key performance indicators, it has allowed us to more deeply measure the holistic quality and availability of our applications. As developers create new software features, they interact with the data center and network teams through a development operations model. This enables everyone to participate in the day-to-day incident triage and bug fixes, while also leveraging chaos-type scenario testing events to determine what is likely going to fail in the future.
The operations team on-boards the software applications and develops a playbook on how to operate them. Focus is placed on the capabilities that need to be provided by the underlying infrastructure, service health, compliance and service level agreements, incident and event management, and how to establish positive cost control around the software and service provided.
The software and the playbook are then layered on top of public, private, and hybrid cloud services that provide an infrastructure abstraction layer where workloads are placed virtually, capacity is advertised, and real-time availability is communicated with the services running on top of the cloud infrastructure.
From a hardware standpoint, the focus is on smart physical placement of the hardware against infrastructure. We define physical and logical failure domains and recognize that workload placement within the data center is a multi-disciplined skillset. We manage our hardware against a full-stack total cost of ownership (TCO) model. And we consider performance per dollar per watt, not just cost per megawatt or transactions per second. At the data center layer, we are focused on efficient performance of these workloads – how do we maintain high availability of the service while making economic decisions around the hardware that is acquired to run them.
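As an illustration only (this is not from the paper), the "performance per dollar per watt" metric could be computed roughly as below. The `ServerOption` type, the field names, and every figure are hypothetical, and throughput is used here as a stand-in for "performance".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerOption:
    """Hypothetical hardware candidate; all fields are illustrative assumptions."""
    name: str
    transactions_per_second: float  # stand-in for "performance"
    tco_dollars: float              # full-stack total cost of ownership
    average_watts: float            # average power draw under load


def perf_per_dollar_per_watt(option: ServerOption) -> float:
    """Performance divided by both cost and power; higher is better."""
    return option.transactions_per_second / (option.tco_dollars * option.average_watts)


# Made-up numbers, chosen only to show that raw throughput and the
# combined metric can rank the same two options differently.
options = [
    ServerOption("high_end", 20000.0, 10000.0, 500.0),
    ServerOption("commodity", 12000.0, 4000.0, 300.0),
]

fastest = max(options, key=lambda o: o.transactions_per_second)
best_value = max(options, key=perf_per_dollar_per_watt)
```

With these invented figures, the option with the highest raw throughput is not the one with the best performance per dollar per watt, which is the kind of trade-off the extract alludes to.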
We automate events, processes, and telemetry, integrating those communications through the whole stack – the data center, network, server, operations, and back into the application to inform future software development.
A tremendous amount of data analytics is available to provide decision support via runtime telemetry and machine learning that completes the loop back to the software developers, helping them write better code to keep service availability high.
The telemetry and tools available today to debug software are several orders of magnitude more advanced than even the best data center commissioning program or standard operating procedure. Software error handling routines can resolve an issue far faster than a human with a crash cart. For example, during a major storm, smart algorithms can decide in the blink of an eye to migrate users to another data center because it is less expensive than starting the emergency back-up generators.
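The storm example amounts to a cost comparison made in software. A minimal sketch of that kind of decision might look like the following; the function name, its parameters, and the capacity check are assumptions made for illustration, and the paper does not describe Microsoft's actual algorithm.

```python
def choose_response(
    migration_cost: float,
    generator_cost: float,
    target_has_capacity: bool,
) -> str:
    """Pick the cheaper way to ride out a power event.

    Hypothetical rule: migrate users only if another data center can
    absorb them and doing so costs less than running the generators.
    """
    if target_has_capacity and migration_cost < generator_cost:
        return "migrate"
    return "start_generators"


# Illustrative, made-up costs in dollars.
decision = choose_response(
    migration_cost=1200.0,
    generator_cost=5000.0,
    target_has_capacity=True,
)
```

A real system would presumably weigh many more inputs, such as latency, SLA impact, and the expected duration of the outage, but the shape of the decision is the same.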
Hardware will fail and as cloud providers and a new generation of application developers embrace this fact, service availability is increasingly being engineered at the software platform and application level rather than by focusing on hardware redundancy. By developing against compute, storage, and bandwidth resource pools, hardware failures are abstracted from the application and developers are incented to excel against constraints in latency, instance availability, and budget.
What a cloud should provide
In the cloud, software applications should be able to understand the context of their environment. Smartly engineered applications can migrate around different machines and different data centers almost at will, but the availability of the service is dependent on how that workload is placed on top of the physical infrastructure.
Data centers, servers, and networks need to be engineered in a way that deeply understands failure and maintenance domains to eliminate the risk of broadly correlated failures within the system.
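One way to picture failure-domain-aware placement is a scheduler that never puts two replicas of the same service in the same domain. The sketch below is a simplified illustration under that assumption; the host names, the rack-level domains, and the `place_replicas` helper are all hypothetical and not taken from the paper.

```python
def place_replicas(hosts: dict[str, str], replicas: int) -> list[str]:
    """Choose hosts so that no two replicas share a failure domain.

    `hosts` maps a host name to its failure domain (for example a rack
    or a power feed). Raises ValueError if there are too few distinct
    domains to satisfy the request.
    """
    if replicas <= 0:
        return []
    chosen: list[str] = []
    used_domains: set[str] = set()
    # Sort for a deterministic result; a real scheduler would also
    # weigh load, capacity, and maintenance windows.
    for host, domain in sorted(hosts.items()):
        if domain in used_domains:
            continue
        chosen.append(host)
        used_domains.add(domain)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough distinct failure domains for the requested replicas")


# Hypothetical inventory: two hosts share rack-a.
inventory = {"a1": "rack-a", "a2": "rack-a", "b1": "rack-b", "c1": "rack-c"}
placement = place_replicas(inventory, 3)
```

Because each replica lands in a different domain, the loss of any single rack in this toy inventory takes out at most one replica.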
Additionally, we reduce the hardware redundancy in this space by focusing on TCO-driven metrics like performance per dollar per watt, and balancing that against risk and revenue. At cloud-scale, each software revision cycle is an opportunity to improve the infrastructure. The tools available to the software developers – whether it is debuggers or coding environments – allow them to understand failures much more rapidly than we can model in the data center space.
The full paper: Cloud-Scale Data Centers.