

Microsoft's 'Software Resilient Data Centers'


Microsoft has issued a strategy paper that offers some insight into its approach to operating cloud-scale data centers, where "service availability is increasingly being engineered at the software level rather than by focusing on hardware redundancy."
 

Here are some extracts: 

At cloud-scale, equipment failure is an expected operating condition – whether it be servers, circuit breakers, power interruption, lightning strikes, earthquakes, or human error – no matter what happens, the service should gracefully failover to another cluster or data center while maintaining end-user service level agreements (SLAs).
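
To make that failover idea concrete, here is a minimal sketch of health-check-driven routing across clusters. The cluster names, probe function, and thresholds are illustrative assumptions, not details from Microsoft's paper:

```python
# Minimal sketch of health-check-driven failover across clusters.
# Cluster names, thresholds, and the probe function are illustrative,
# not taken from Microsoft's paper.
import random
import time

CLUSTERS = ["us-east-1", "us-west-2", "eu-north-1"]  # priority order

def probe(cluster: str) -> bool:
    """Stand-in health probe; a real service would check load balancer
    and application-level health endpoints."""
    return random.random() > 0.2  # ~80% chance a cluster reports healthy

def pick_serving_cluster(clusters=CLUSTERS) -> str:
    """Return the first healthy cluster in priority order.

    The point of the pattern: failure of any single cluster is an
    expected condition, so routing simply moves on to the next cluster
    instead of treating the failure as an emergency."""
    for cluster in clusters:
        if probe(cluster):
            return cluster
    raise RuntimeError("no healthy cluster available")

if __name__ == "__main__":
    for _ in range(3):
        print("serving from:", pick_serving_cluster())
        time.sleep(0.1)
```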


In many companies today, data center capacity is consumed through a rigid series of processes where each element of the stack is designed and optimized in a silo. The software is developed assuming 100 percent available hardware and scale-up performance...


Resilient Software
...At Microsoft, we’ve begun to follow a different model, with a strategic focus on resilient software. We work to drive communications that are more inclusive between developers, operators, and the business. Sharing common business goals and key performance indicators has allowed us to more deeply measure the holistic quality and availability of our applications. As developers create new software features, they interact with the data center and network teams through a development operations model. This enables everyone to participate in the day-to-day incident triage and bug fixes, while also leveraging chaos-type scenario testing events to determine what is likely going to fail in the future.
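
The "chaos-type scenario testing" the paper mentions can be pictured with a small sketch like the one below: kill a random subset of instances and check whether the service still clears a quorum. Everything here (the Instance class, the 10 percent kill rate, the 90 percent quorum) is an invented illustration, not Microsoft's tooling:

```python
# Minimal sketch of a chaos-type test: kill a random subset of instances
# and verify the service still meets a (hypothetical) availability target.
import random
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    healthy: bool = True

def inject_failures(instances, fraction=0.2, seed=None):
    """Mark a random fraction of instances as failed."""
    rng = random.Random(seed)
    victims = rng.sample(instances, k=max(1, int(len(instances) * fraction)))
    for inst in victims:
        inst.healthy = False
    return [v.name for v in victims]

def service_available(instances, quorum=0.9) -> bool:
    """Toy availability check: enough healthy instances remain."""
    healthy = sum(1 for i in instances if i.healthy)
    return healthy / len(instances) >= quorum

if __name__ == "__main__":
    fleet = [Instance(f"web-{i:02d}") for i in range(20)]
    killed = inject_failures(fleet, fraction=0.1, seed=42)
    print("killed:", killed)
    print("service still available:", service_available(fleet))
```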


The operations team on-boards the software applications and develops a playbook on how to operate them. Focus is placed on the capabilities that need to be provided by the underlying infrastructure; service health, compliance, and service level agreements; incident and event management; and how to establish positive cost control around the software and service provided.


The software and the playbook are then layered on top of public, private, and hybrid cloud services that provide an infrastructure abstraction layer where workloads are placed virtually, capacity is advertised, and real-time availability is communicated to the services running on top of the cloud infrastructure.
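
A rough sketch of what such an abstraction layer might look like in code is below. The CloudAbstraction and CapacityAd names and the placement rule are hypothetical, intended only to show capacity being advertised and workloads being placed against real-time availability:

```python
# Sketch of an infrastructure abstraction layer: the cloud advertises
# capacity and real-time availability, and workloads are placed against it.
# Class and method names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CapacityAd:
    region: str
    free_cores: int
    available: bool  # real-time availability signal

@dataclass
class CloudAbstraction:
    ads: list = field(default_factory=list)

    def advertise(self, ad: CapacityAd):
        """The infrastructure publishes what it can currently offer."""
        self.ads.append(ad)

    def place(self, cores_needed: int) -> str:
        """Place a workload on any region that is up and has headroom;
        the application never sees which physical hardware is behind it."""
        for ad in self.ads:
            if ad.available and ad.free_cores >= cores_needed:
                ad.free_cores -= cores_needed
                return ad.region
        raise RuntimeError("no capacity currently advertised")

if __name__ == "__main__":
    cloud = CloudAbstraction()
    cloud.advertise(CapacityAd("dublin", free_cores=128, available=True))
    cloud.advertise(CapacityAd("quincy", free_cores=64, available=False))
    print("placed in:", cloud.place(cores_needed=32))
```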


From a hardware standpoint, the focus is on smart physical placement of the hardware against infrastructure. We define physical and logical failure domains and recognize that workload placement within the data center is a multi-disciplined skillset. We manage our hardware against a full-stack total cost of ownership (TCO) model. And we consider performance per dollar per watt, not just cost per megawatt or transactions per second. At the data center layer, we are focused on efficient performance of these workloads – how do we maintain high availability of the service while making economic decisions around the hardware that is acquired to run them.
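
As a worked illustration of the "performance per dollar per watt" comparison, with made-up numbers for two hypothetical server SKUs:

```python
# Worked example of the performance-per-dollar-per-watt comparison the
# paper mentions, using invented figures for two hypothetical server SKUs.
def perf_per_dollar_per_watt(throughput_tps: float,
                             tco_dollars: float,
                             power_watts: float) -> float:
    """Higher is better: transactions per second, normalized by both
    full-stack TCO and power draw."""
    return throughput_tps / (tco_dollars * power_watts)

sku_a = perf_per_dollar_per_watt(throughput_tps=50_000,
                                 tco_dollars=12_000, power_watts=450)
sku_b = perf_per_dollar_per_watt(throughput_tps=65_000,
                                 tco_dollars=18_000, power_watts=600)

print(f"SKU A: {sku_a:.2e}  SKU B: {sku_b:.2e}")
# SKU A scores higher even though SKU B has more raw throughput, which is
# the point of optimizing the ratio rather than transactions per second alone.
```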

We automate events, processes, and telemetry, integrating those communications through the whole stack – the data center, network, server, and operations – and back into the application to inform future software development.

A tremendous amount of data analytics is available to provide decision support via runtime telemetry and machine learning that completes the loop back to the software developers, helping them write better code to keep service availability high.
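
The aggregation half of that loop can be sketched very simply: collect runtime error telemetry per software component and surface the hotspots back to developers. The event fields and component names below are invented, and the machine-learning step is deliberately left out:

```python
# Sketch of closing the telemetry loop: aggregate runtime error telemetry
# per software component so developers can see where availability is lost.
# Event fields and component names are invented for illustration.
from collections import Counter

telemetry_events = [
    {"component": "auth", "outcome": "error"},
    {"component": "auth", "outcome": "ok"},
    {"component": "storage", "outcome": "error"},
    {"component": "auth", "outcome": "error"},
    {"component": "frontend", "outcome": "ok"},
]

def error_hotspots(events):
    """Count errors per component and return them worst-first."""
    errors = Counter(e["component"] for e in events if e["outcome"] == "error")
    return errors.most_common()

for component, count in error_hotspots(telemetry_events):
    print(f"{component}: {count} errors")
```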


The telemetry and tools available today to debug software are several orders of magnitude more advanced than even the best data center commissioning program or standard operating procedure. Software error handling routines can resolve an issue far faster than a human with a crash cart. For example, during a major storm, smart algorithms can decide in the blink of an eye to migrate users to another data center because it is less expensive than starting the emergency back-up generators.
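
That generator-versus-migration decision is, at heart, a cost comparison. A toy version, with invented figures and ignoring SLA risk, might look like this:

```python
# Sketch of the storm example: compare the estimated cost of migrating
# users to another data center against running the back-up generators.
# All figures are invented; a real decision would also weigh SLA risk.
def should_migrate(expected_outage_hours: float,
                   generator_cost_per_hour: float,
                   migration_cost: float) -> bool:
    """Migrate when it is cheaper than riding out the event on generators."""
    generator_cost = expected_outage_hours * generator_cost_per_hour
    return migration_cost < generator_cost

if __name__ == "__main__":
    decision = should_migrate(expected_outage_hours=6,
                              generator_cost_per_hour=800.0,
                              migration_cost=1_500.0)
    print("migrate users:", decision)  # True: 1,500 < 4,800
```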


Hardware will fail and as cloud providers and a new generation of application developers embrace this fact, service availability is increasingly being engineered at the software platform and application level rather than by focusing on hardware redundancy. By developing against compute, storage, and bandwidth resource pools, hardware failures are abstracted from the application and developers are incented to excel against constraints in latency, instance availability, and budget.


What a cloud should provide

 

In a hardware-abstracted environment, there is a lot of room for the data center to become an active participant in the real-time availability decisions made in the software applications.
Resilient software solves for problems beyond the physical world. However, to get there, the development of the software requires an intimate understanding of the physical in order to abstract it away.

 

In the cloud, software applications should be able to understand the context of their environment. Smartly engineered applications can migrate around different machines and different data centers almost at will, but the availability of the service is dependent on how that workload is placed on top of the physical infrastructure.
 

Data centers, servers, and networks need to be engineered in a way that deeply understands failure and maintenance domains to eliminate the risk of broadly correlated failures within the system.
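
One common way to act on failure and maintenance domains is to spread replicas so that no single domain holds more than one copy. The sketch below uses illustrative rack labels and a simple round-robin rule:

```python
# Sketch of failure-domain-aware placement: spread replicas across
# distinct domains (rack, room, or site) so one domain failure cannot
# take out every copy. Domain labels are illustrative.
from itertools import cycle

def place_replicas(replica_count: int, failure_domains: list[str]) -> dict:
    """Round-robin replicas across failure domains.

    Raises if there are fewer domains than replicas, because that would
    force two replicas to share a correlated failure mode."""
    if replica_count > len(failure_domains):
        raise ValueError("not enough failure domains for independent replicas")
    placement = {}
    domains = cycle(failure_domains)
    for i in range(replica_count):
        placement[f"replica-{i}"] = next(domains)
    return placement

if __name__ == "__main__":
    print(place_replicas(3, ["rack-A", "rack-B", "rack-C", "rack-D"]))
    # {'replica-0': 'rack-A', 'replica-1': 'rack-B', 'replica-2': 'rack-C'}
```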

 

Additionally, we reduce the hardware redundancy in this space by focusing on TCO-driven metrics like performance per dollar per watt, and balancing that against risk and revenue. At cloud-scale, each software revision cycle is an opportunity to improve the infrastructure. The tools available to the software developers – whether it is debuggers or coding environments – allow them to understand failures much more rapidly than we can model in the data center space.


The full paper: Cloud-Scale Data Centers

 

Related images

  • Hot and cold aisle at Microsoft's Dublin data center. It looks familiar, but Microsoft says it is shifting the resiliency from the physical to the logical...
