
Microsoft's 'Software Resilient Data Centers'


Microsoft has issued a strategy paper which offers some insight into its approach to operating cloud-scale data centers, where "service availability is increasingly being engineered at the software level rather than by focusing on hardware redundancy."
 

Here are some extracts: 

At cloud-scale, equipment failure is an expected operating condition – whether it be servers, circuit breakers, power interruption, lightning strikes, earthquakes, or human error – no matter what happens, the service should gracefully failover to another cluster or data center while maintaining end-user service level agreements (SLAs).
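In practice, "graceful failover" of this kind reduces to a simple control loop: try the preferred cluster, and on any failure or failed health check move on to the next one. The sketch below is a minimal illustration of that idea; the health_check and serve_from helpers are hypothetical and not taken from the Microsoft paper.

```python
# A minimal sketch of cross-cluster failover; health_check() and
# serve_from() are hypothetical helpers, not APIs from the paper.
from typing import Callable, Iterable


def serve_request(request: dict,
                  clusters: Iterable[str],
                  health_check: Callable[[str], bool],
                  serve_from: Callable[[str, dict], dict]) -> dict:
    """Try each cluster in priority order; fail over on any error."""
    last_error = None
    for cluster in clusters:
        if not health_check(cluster):      # e.g. breaker trip, power loss
            continue
        try:
            return serve_from(cluster, request)
        except Exception as exc:           # hardware failure is expected
            last_error = exc
            continue                       # fail over to the next cluster
    raise RuntimeError(f"all clusters unavailable: {last_error}")
```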


In many companies today, data center capacity is consumed through a rigid series of processes where each element of the stack is designed and optimized in a silo. The software is developed assuming 100 percent available hardware and scale-up performance...


Resilient Software
...At Microsoft, we’ve begun to follow a different model, with a strategic focus on resilient software. We work to drive more inclusive communication between developers, operators, and the business. Sharing common business goals and key performance indicators has allowed us to measure the holistic quality and availability of our applications more deeply. As developers create new software features, they interact with the data center and network teams through a development operations model. This enables everyone to participate in the day-to-day incident triage and bug fixes, while also leveraging chaos-type scenario testing events to determine what is likely to fail in the future.
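The "chaos-type scenario testing" mentioned here usually amounts to deliberately injecting failures in a test environment and confirming the service still meets its SLA. A minimal sketch, with invented Instance and service_available names:

```python
# A minimal chaos-testing sketch: randomly kill instances and check that
# the service survives. All names here are hypothetical illustrations.
import random


class Instance:
    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def kill(self):
        self.alive = False


def service_available(instances) -> bool:
    """The service is 'up' if at least one replica survives."""
    return any(i.alive for i in instances)


def chaos_round(instances, kill_fraction=0.3, seed=None):
    rng = random.Random(seed)
    victims = rng.sample(instances, k=max(1, int(len(instances) * kill_fraction)))
    for v in victims:
        v.kill()                           # simulate an unexpected failure
    assert service_available(instances), "SLA breached under induced failure"


if __name__ == "__main__":
    fleet = [Instance(f"vm-{i}") for i in range(10)]
    chaos_round(fleet, seed=42)
    print("service survived the chaos round")
```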


The operations team onboards the software applications and develops a playbook on how to operate them. Focus is placed on the capabilities that need to be provided by the underlying infrastructure, service health, compliance and service level agreements, incident and event management, and how to establish positive cost control around the software and service provided.


The software and the playbook are then layered on top of public, private, and hybrid cloud services that provide an infrastructure abstraction layer where workloads are placed virtually, capacity is advertised, and real-time availability is communicated to the services running on top of the cloud infrastructure.
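Such an abstraction layer can be pictured as a thin placement API: regions advertise capacity and a real-time availability flag, and workloads are matched against those advertisements. The names below (CapacityAd, place_workload) are hypothetical illustrations, not Microsoft's API.

```python
# A sketch of an infrastructure abstraction layer: capacity is advertised,
# availability is signalled in real time, and workloads are placed against it.
from dataclasses import dataclass


@dataclass
class CapacityAd:
    region: str
    free_cores: int
    available: bool          # real-time availability signal


def place_workload(cores_needed: int, ads: list) -> str:
    """Pick the first advertised region with enough available capacity."""
    for ad in ads:
        if ad.available and ad.free_cores >= cores_needed:
            return ad.region
    raise RuntimeError("no region can host this workload right now")


ads = [CapacityAd("us-east", 8, False),    # e.g. under maintenance
       CapacityAd("eu-west", 64, True)]
print(place_workload(16, ads))             # -> eu-west
```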


From a hardware standpoint, the focus is on smart physical placement of the hardware against the infrastructure. We define physical and logical failure domains and recognize that workload placement within the data center is a multi-disciplined skillset. We manage our hardware against a full-stack total cost of ownership (TCO) model, and we consider performance per dollar per watt, not just cost per megawatt or transactions per second. At the data center layer, we are focused on efficient performance of these workloads: how do we maintain high availability of the service while making economic decisions around the hardware that is acquired to run them?
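As a worked example of the "performance per dollar per watt" comparison, consider two hypothetical servers (all figures below are invented): a cheaper, lower-power machine can beat a faster one once cost and power draw are in the denominator.

```python
# A worked example of the performance-per-dollar-per-watt metric mentioned
# above; the server figures are invented for illustration.
def perf_per_dollar_per_watt(transactions_per_sec, total_cost_usd, power_watts):
    return transactions_per_sec / (total_cost_usd * power_watts)


# Server A: cheaper and lower power, but slower.
a = perf_per_dollar_per_watt(transactions_per_sec=50_000,
                             total_cost_usd=6_000, power_watts=300)
# Server B: faster, but costs and draws more.
b = perf_per_dollar_per_watt(transactions_per_sec=80_000,
                             total_cost_usd=12_000, power_watts=450)
print(f"A: {a:.4f}  B: {b:.4f}")   # A wins on this TCO-style metric
```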

We automate events, processes, and telemetry, integrating those communications through the whole stack – the data center, network, server, operations – and back into the application to inform future software development.

A tremendous amount of data analytics is available to provide decision support via runtime telemetry and machine learning that completes the loop back to the software developers, helping them write better code to keep service availability high.
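Closing that loop can be as simple as aggregating runtime failure telemetry per component so developers see where availability is being lost. A toy sketch, with an invented event format:

```python
# A minimal sketch of feeding runtime telemetry back to developers:
# count failures per component so the hotspots land on the backlog.
from collections import Counter

events = [
    {"component": "auth-api", "kind": "timeout"},
    {"component": "auth-api", "kind": "timeout"},
    {"component": "storage",  "kind": "disk_error"},
]

hotspots = Counter(e["component"] for e in events if e["kind"] != "ok")
for component, failures in hotspots.most_common():
    print(f"{component}: {failures} failures this window")
```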


The telemetry and tools available today to debug software are several orders of magnitude more advanced than even the best data center commissioning program or standard operating procedure. Software error-handling routines can resolve an issue far faster than a human with a crash cart. For example, during a major storm, smart algorithms can decide in the blink of an eye to migrate users to another data center because it is less expensive than starting the emergency back-up generators.
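The storm scenario boils down to a cost comparison: the expected cost of migrating users versus the expected cost of running on generators for the duration of the outage. A toy version, with invented cost figures:

```python
# A toy version of the storm decision described above: migrate users only
# if it is cheaper than running the back-up generators. All figures invented.
def cheaper_to_migrate(users: int,
                       migrate_cost_per_user: float,
                       generator_cost_per_hour: float,
                       expected_outage_hours: float) -> bool:
    migration_cost = users * migrate_cost_per_user
    generator_cost = generator_cost_per_hour * expected_outage_hours
    return migration_cost < generator_cost


if cheaper_to_migrate(users=200_000, migrate_cost_per_user=0.002,
                      generator_cost_per_hour=1_500, expected_outage_hours=4):
    print("migrate users to the other data center")
else:
    print("start the back-up generators")
```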


Hardware will fail, and as cloud providers and a new generation of application developers embrace this fact, service availability is increasingly being engineered at the software platform and application level rather than by focusing on hardware redundancy. When applications are developed against compute, storage, and bandwidth resource pools, hardware failures are abstracted away from the application, and developers are incentivized to excel against constraints in latency, instance availability, and budget.


What a cloud should provide

 

In a hardware-abstracted environment, there is a lot of room for the data center to become an active participant in the real-time availability decisions made in the software applications.
Resilient software solves for problems beyond the physical world. However, to get there, the development of the software requires an intimate understanding of the physical in order to abstract it away.

 

In the cloud, software applications should be able to understand the context of their environment. Smartly engineered applications can migrate around different machines and different data centers almost at will, but the availability of the service is dependent on how that workload is placed on top of the physical infrastructure.
 

Data centers, servers, and networks need to be engineered in a way that deeply understands failure and maintenance domains to eliminate the risk of broadly correlated failures within the system.
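One concrete consequence of modelling failure and maintenance domains is a placement rule: never put two replicas of the same workload in the same domain. A minimal sketch, with hypothetical rack and power-feed domain names:

```python
# A minimal sketch of placement that avoids correlated failures: each replica
# lands in a distinct failure domain (here a rack/power-feed pair).
def place_replicas(replicas: int, hosts: dict) -> list:
    """hosts maps host name -> failure domain; returns chosen hosts."""
    chosen, used_domains = [], set()
    for host, domain in hosts.items():
        if domain in used_domains:
            continue                      # would share a correlated failure
        chosen.append(host)
        used_domains.add(domain)
        if len(chosen) == replicas:
            return chosen
    raise RuntimeError("not enough independent failure domains")


hosts = {"h1": "rack-A/feed-1", "h2": "rack-A/feed-1",
         "h3": "rack-B/feed-2", "h4": "rack-C/feed-1"}
print(place_replicas(3, hosts))           # -> ['h1', 'h3', 'h4']
```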

 

Additionally, we reduce the hardware redundancy in this space by focusing on TCO-driven metrics like performance per dollar per watt, and balancing that against risk and revenue. At cloud-scale, each software revision cycle is an opportunity to improve the infrastructure. The tools available to the software developers – whether it is debuggers or coding environments – allow them to understand failures much more rapidly than we can model in the data center space.


The full paper: Cloud-Scale Data Centers

 

Related images

  • Hot and cold aisle at Microsoft's Dublin data center. It looks familiar, but Microsoft says it is shifting the resiliency from the physical to the logical...
