

Microsoft's 'Software Resilient Data Centers'


Microsoft has issued a strategy paper which offers some insight into its approach to operating cloud-scale data centers, where 'service availability is increasingly being engineered at the software level rather than by focusing on hardware redundancy'.

Here are some extracts: 

At cloud-scale, equipment failure is an expected operating condition – whether it be servers, circuit breakers, power interruption, lightning strikes, earthquakes, or human error – no matter what happens, the service should gracefully failover to another cluster or data center while maintaining end-user service level agreements (SLAs).
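
The paper itself contains no code, but the failover behaviour described above can be pictured as a client working through an ordered list of clusters until one answers. A minimal sketch in Python; the region names, health-check URLs, and the call_with_failover helper are all hypothetical, not anything Microsoft publishes:

```python
import urllib.request
import urllib.error

# Hypothetical ordered list of clusters/regions to try; names are illustrative only.
REGION_ENDPOINTS = [
    "https://service.region-a.example.com/api/health",
    "https://service.region-b.example.com/api/health",
    "https://service.region-c.example.com/api/health",
]

def call_with_failover(endpoints, timeout=2.0):
    """Try each endpoint in order; treat any failure as an expected operating
    condition and fail over to the next cluster instead of surfacing an error."""
    last_error = None
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()  # first healthy cluster wins
        except (urllib.error.URLError, OSError) as exc:
            last_error = exc  # equipment/network failure is expected; move on
    raise RuntimeError(f"all clusters unavailable: {last_error}")

if __name__ == "__main__":
    try:
        print(call_with_failover(REGION_ENDPOINTS))
    except RuntimeError as err:
        print("service degraded:", err)
```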


In many companies today, data center capacity is consumed through a rigid series of processes where each element of the stack is designed and optimized in a silo. The software is developed assuming 100 percent available hardware and scale-up performance...


Resilient Software
...At Microsoft, we’ve begun to follow a different model, with a strategic focus on resilient software. We work to drive communications that are more inclusive between developers, operators, and the business. Sharing common business goals and key performance indicators has allowed us to more deeply measure the holistic quality and availability of our applications. As developers create new software features, they interact with the data center and network teams through a development operations model. This enables everyone to participate in the day-to-day incident triage and bug fixes, while also leveraging chaos-type scenario testing events to determine what is likely to fail in the future.
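
A chaos-type scenario test, in its simplest form, randomly kills instances and checks whether the service would survive. The sketch below is a toy illustration of that idea; FakeServer, the quorum rule, and the 30 percent kill probability are invented, and no real chaos-engineering tooling is involved:

```python
import random

class FakeServer:
    """Illustrative stand-in for a server instance."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

def chaos_round(servers, kill_probability=0.3, seed=None):
    """Randomly mark servers as failed, mimicking a chaos-testing event,
    then report whether the (hypothetical) service would still hold quorum."""
    rng = random.Random(seed)
    for server in servers:
        if rng.random() < kill_probability:
            server.healthy = False
    survivors = [s for s in servers if s.healthy]
    has_quorum = len(survivors) > len(servers) // 2
    return survivors, has_quorum

if __name__ == "__main__":
    fleet = [FakeServer(f"node-{i}") for i in range(5)]
    survivors, ok = chaos_round(fleet, seed=42)
    print([s.name for s in survivors], "quorum:", ok)
```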


The operations team on-boards the software applications and develops a playbook on how to operate them. Focus is placed on the capabilities that need to be provided by the underlying infrastructure; service health, compliance and service level agreements; incident and event management; and how to establish positive cost control around the software and service provided.


The software and the playbook are then layered on top of public, private, and hybrid cloud services that provide an infrastructure abstraction layer, where workloads are placed virtually, capacity is advertised, and real-time availability is communicated to the services running on top of the cloud infrastructure.
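
One way to read that abstraction layer is as a small interface on which workloads are placed, capacity is advertised, and availability is reported back to the services above it. The class and method names below are invented for illustration and are not a Microsoft or Azure API:

```python
from dataclasses import dataclass, field

@dataclass
class CloudAbstractionLayer:
    """Toy stand-in for the infrastructure abstraction layer in the extract:
    it advertises capacity, accepts virtual workload placements (with their
    playbooks), and reports availability to the services running on top."""
    capacity_slots: int
    placements: dict = field(default_factory=dict)

    def advertise_capacity(self):
        return self.capacity_slots - len(self.placements)

    def place(self, workload, playbook):
        if self.advertise_capacity() <= 0:
            raise RuntimeError("no capacity available")
        self.placements[workload] = playbook  # the playbook rides along with the workload

    def availability(self):
        # A real layer would aggregate health signals from data center,
        # network, and servers; here we only report placement counts.
        return {"placed": len(self.placements), "free": self.advertise_capacity()}

layer = CloudAbstractionLayer(capacity_slots=4)
layer.place("web-frontend", playbook={"on_incident": "fail over to secondary"})
print(layer.advertise_capacity(), layer.availability())
```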


From a hardware standpoint, the focus is on smart physical placement of the hardware against infrastructure. We define physical and logical failure domains and recognize that workload placement within the data center is a multi-disciplined skillset. We manage our hardware against a full-stack total cost of ownership (TCO) model. And we consider performance per dollar per watt, not just cost per megawatt or transactions per second. At the data center layer, we are focused on efficient performance of these workloads – how to maintain high availability of the service while making economic decisions around the hardware that is acquired to run them.
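
The 'performance per dollar per watt' metric can be made concrete with a toy comparison. The helper and the SKU numbers below are invented purely to show how such a metric can rank hardware differently from raw throughput or cost per megawatt:

```python
def perf_per_dollar_per_watt(throughput_tps, tco_dollars, power_watts):
    """Toy full-stack metric: transactions per second, normalised by total
    cost of ownership and by power draw."""
    return throughput_tps / (tco_dollars * power_watts)

# Invented example SKUs: the cheaper, lower-power box wins on this metric
# even though its raw transactions-per-second figure is lower.
skus = {
    "dense-server": perf_per_dollar_per_watt(throughput_tps=50_000, tco_dollars=12_000, power_watts=450),
    "value-server": perf_per_dollar_per_watt(throughput_tps=30_000, tco_dollars=6_000, power_watts=300),
}
print(max(skus, key=skus.get), skus)  # value-server ranks higher here
```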

We automate events, processes, and telemetry, integrating those communications through the whole stack – the data center, network, server, operations, and back into the application – to inform future software development.

A tremendous amount of data analytics is available to provide decision support via runtime telemetry and machine learning that completes the loop back to the software developers, helping them write better code to keep service availability high.
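
That closed loop can be pictured, in its most basic form, as aggregating failure telemetry per code path and surfacing the worst offenders to developers. The event format and threshold below are invented; a production pipeline would feed machine-learning models rather than a simple counter:

```python
from collections import Counter

# Hypothetical runtime telemetry events: (code_path, outcome)
events = [
    ("checkout.payment", "error"),
    ("checkout.payment", "ok"),
    ("search.query", "ok"),
    ("checkout.payment", "error"),
    ("search.query", "error"),
]

def failure_report(events, threshold=0.4):
    """Rank code paths by observed failure rate so developers can prioritise fixes."""
    totals, failures = Counter(), Counter()
    for path, outcome in events:
        totals[path] += 1
        if outcome == "error":
            failures[path] += 1
    return {p: failures[p] / totals[p] for p in totals if failures[p] / totals[p] >= threshold}

print(failure_report(events))  # flags checkout.payment (2/3) and search.query (1/2)
```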


The telemetry and tools available today to debug software are several orders of magnitude more advanced than even the best data center commissioning program or standard operating procedure. Software error-handling routines can resolve an issue far faster than a human with a crash cart. For example, during a major storm, smart algorithms can decide in the blink of an eye to migrate users to another data center because it is less expensive than starting the emergency back-up generators.
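
The storm example is, at heart, a real-time cost comparison. A toy sketch of that decision logic, with every cost figure and function name invented for illustration:

```python
def choose_storm_response(users, migrate_cost_per_user, generator_startup_cost,
                          generator_cost_per_hour, expected_outage_hours):
    """Compare the (hypothetical) cost of migrating users to another data center
    against running emergency back-up generators for the expected outage."""
    migrate_cost = users * migrate_cost_per_user
    generator_cost = generator_startup_cost + generator_cost_per_hour * expected_outage_hours
    if migrate_cost < generator_cost:
        return "migrate", migrate_cost
    return "generators", generator_cost

# Invented numbers: with a long expected outage, migration comes out far cheaper.
print(choose_storm_response(users=200_000, migrate_cost_per_user=0.002,
                            generator_startup_cost=5_000, generator_cost_per_hour=1_200,
                            expected_outage_hours=6))
```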


Hardware will fail and as cloud providers and a new generation of application developers embrace this fact, service availability is increasingly being engineered at the software platform and application level rather than by focusing on hardware redundancy. By developing against compute, storage, and bandwidth resource pools, hardware failures are abstracted from the application and developers are incented to excel against constraints in latency, instance availability, and budget.


What a cloud should provide

 

In a hardware-abstracted environment, there is a lot of room for the data center to become an active participant in the real-time availability decisions made in the software applications.
Resilient software solves for problems beyond the physical world. However, to get there, the development of the software requires an intimate understanding of the physical in order to abstract it away.

 

In the cloud, software applications should be able to understand the context of their environment. Smartly engineered applications can migrate around different machines and different data centers almost at will, but the availability of the service is dependent on how that workload is placed on top of the physical infrastructure.
 

Data centers, servers, and networks need to be engineered in a way that deeply understands failure and maintenance domains to eliminate the risk of broadly correlated failures within the system.
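
One practical reading of 'failure and maintenance domains' is replica placement that never concentrates copies in a single domain. A minimal greedy sketch, with a made-up rack topology and no real scheduler API:

```python
from collections import defaultdict

def place_replicas(replica_count, hosts_by_domain):
    """Greedy placement: spread replicas across the least-used failure domains
    (e.g. racks) so one domain failure cannot take out every copy at once."""
    placement = []
    used_per_domain = defaultdict(int)
    for _ in range(replica_count):
        domain = min(hosts_by_domain, key=lambda d: used_per_domain[d])
        hosts = hosts_by_domain[domain]
        if used_per_domain[domain] >= len(hosts):
            raise RuntimeError("not enough hosts to place another replica")
        placement.append((domain, hosts[used_per_domain[domain]]))
        used_per_domain[domain] += 1
    return placement

# Hypothetical topology: three racks with two hosts each.
topology = {"rack-1": ["h1", "h2"], "rack-2": ["h3", "h4"], "rack-3": ["h5", "h6"]}
print(place_replicas(3, topology))  # one replica per rack
```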

 

Additionally, we reduce the hardware redundancy in this space by focusing on TCO-driven metrics like performance per dollar per watt, and balancing that against risk and revenue. At cloud-scale, each software revision cycle is an opportunity to improve the infrastructure. The tools available to the software developers – whether it is debuggers or coding environments – allow them to understand failures much more rapidly than we can model in the data center space.


The full paper: Cloud-Scale Data Centers

 

Related images

  • Hot and cold aisle at Microsoft's Dublin data center. It looks familiar, but Microsoft says it is shifting the resiliency from the physical to the logical...

