Managing capacity to stave off disaster

The world runs on data centers. It is a broad statement, yet even the experts don’t always see the full picture of how critical data center operations are to the hum of modern life and enterprise. It would stand to reason, then, that planning and optimizing systems for maximum efficiency and availability should top the business priority list. This isn’t always the case.

Most organizations operating data centers tend to be reactive rather than proactive. As a result, IT departments spend nearly half their time responding to performance issues, service outages, and similar problems. Each unexpected incident requires an IT staff of seven to spend nearly 3.5 hours resolving the issue. The average number of “fires” per week is eight. In total, the average organization spends 190 hours per week reacting to performance problems. (Source: TeamQuest Global IT Management Survey)
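The survey figures above can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration: it takes the quoted "nearly 3.5 hours" as an upper bound, and the function and constant names are invented for this example rather than taken from the survey.

```python
# Figures quoted from the survey; "nearly 3.5 hours" is treated as an upper bound.
STAFF_PER_INCIDENT = 7
HOURS_PER_INCIDENT = 3.5
INCIDENTS_PER_WEEK = 8
REPORTED_WEEKLY_HOURS = 190


def weekly_staff_hours(staff, hours, incidents):
    """Total staff-hours per week spent on unplanned incidents."""
    return staff * hours * incidents


# Upper bound if every incident took the full 3.5 hours.
upper_bound = weekly_staff_hours(
    STAFF_PER_INCIDENT, HOURS_PER_INCIDENT, INCIDENTS_PER_WEEK
)

# Hours per incident implied by the reported weekly total.
implied_hours = REPORTED_WEEKLY_HOURS / (STAFF_PER_INCIDENT * INCIDENTS_PER_WEEK)

print(upper_bound)              # 196.0
print(round(implied_hours, 2))  # 3.39
```

The reported 190 hours sits just under the 196-hour upper bound, which is consistent with incidents taking "nearly" 3.5 hours each.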

We have a problem

This reactive mode of operating is highly inefficient. Most performance issues can be traced to oversight and the failure to plan for current and future demand on systems. Avoiding outages by over-provisioning is wasteful and expensive, and it adds unnecessary complexity and vulnerability. Under-utilized servers and ghost machines sit forgotten and ignored, eating up power, footprint, and rack space without producing useful work. If unmonitored, they are probably hosting outdated software, unprotected data, or malicious intruders.

On the flip side, under-provisioning from poor capacity planning and management leads to system overload, whereby application performance suffers or entire services crash. When an outage impacts end-users, it’s no tree falling in a forest. Inconvenienced customers immediately take to social media with their complaints, and beleaguered IT “firefighters” scramble to save the day. How many outages can your business absorb before customers lose faith and turn away? The global Skype outage demonstrates that brand loyalty means nothing when your apps are unusable.
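Both failure modes described above, idle over-provisioned machines and overloaded under-provisioned ones, can be surfaced from routine utilization data. The following is a minimal sketch under stated assumptions: the thresholds, server names, and the `ServerStats` and `classify` names are all hypothetical, chosen for illustration and not drawn from any particular monitoring tool.

```python
from dataclasses import dataclass

# Thresholds are illustrative assumptions, not figures from the article or survey.
IDLE_BELOW = 5.0         # percent average CPU: candidate "ghost" machine
UNDERUSED_BELOW = 20.0   # percent average CPU: likely over-provisioned
OVERLOADED_ABOVE = 85.0  # percent peak CPU: likely under-provisioned


@dataclass
class ServerStats:
    name: str
    avg_cpu: float   # average utilization over the review window, percent
    peak_cpu: float  # highest utilization over the review window, percent


def classify(server: ServerStats) -> str:
    """Return a coarse capacity label for one server."""
    if server.peak_cpu >= OVERLOADED_ABOVE:
        return "overloaded"
    if server.avg_cpu < IDLE_BELOW:
        return "ghost"
    if server.avg_cpu < UNDERUSED_BELOW:
        return "under-utilized"
    return "ok"


# Hypothetical fleet; the numbers are made up for the example.
fleet = [
    ServerStats("web-01", avg_cpu=62.0, peak_cpu=93.0),
    ServerStats("batch-07", avg_cpu=2.1, peak_cpu=4.0),
    ServerStats("db-03", avg_cpu=14.5, peak_cpu=40.0),
    ServerStats("app-02", avg_cpu=45.0, peak_cpu=70.0),
]

report = {s.name: classify(s) for s in fleet}
print(report)
```

In practice the labels would feed a review queue rather than trigger automatic action, since a quiet server may still be a standby or failover node.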

More serious outages of paid services aren’t uncommon: Plusnet left thousands in the UK without phone or broadband access for several hours. Such failures affect the service provider as well as the individuals and businesses that depend on it, causing irreversible brand and reputational damage. This is no way to conduct business.

Network outages are typically traced to DNS malfunctions (Plusnet) or vague network or configuration issues (Skype). However, most high-profile outages share a more common thread: the amount of traffic hitting the servers is greatly underestimated, leaving the service at the mercy of external events. When that is combined with untested (or nonexistent) failover protocols and incident response planning, the lack of mature capacity management is all too apparent.

Capacity management isn’t a piece of software layered onto the network stack. It’s an oft-overlooked discipline that ensures IT infrastructure is provisioned at the right time and in the right volume so the data center operates at optimum efficiency. At its heart is a continuous, enterprise-wide practice: assessing current and future capacity demand through metrics, advanced analytics, and stakeholder input. While capacity management may appear to be a simple matter of supply and demand, modern data centers, and the services running on them, are the very definition of complexity. Layers of network, storage, applications, and data hardware and software create interdependencies, and a flaw or latency in one component triggers unexpected effects throughout the ecosystem.

How can such intricacy be managed consistently enough to maximize efficiency and avoid outages? Mature capacity management and optimization require close and continuous alignment with business planning across the enterprise. Information technology objectives should be communicated in plain language to the entire staff, with feedback solicited. As awareness increases across all departments, errors decrease accordingly. Proper IT optimization and performance analysis can improve efficiency across the organization, raising overall business and workforce productivity.

Best practices

Focus on customer experience from both a historical and a forecasting perspective by collecting and analyzing utilization, response, and performance data. Use predictive analytics and previous patterns to determine capacity requirements and the potential pitfalls of planned initiatives. Advanced analytics can predict the impact of sales, promotions, seasonal surges, and similar events. When planning an application, service, or configuration rollout, include resources for running simulations to test the user experience and to discover hidden flaws or interdependencies.
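As a rough illustration of using previous patterns to anticipate capacity needs, the sketch below fits a straight-line trend to weekly peak utilization and estimates how many weeks remain before a chosen threshold is reached. It is a deliberately simple stand-in for real predictive analytics: the linear model, the 80 percent threshold, the sample data, and the function names are all assumptions made for this example.

```python
def fit_trend(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def weeks_until(values, threshold):
    """Weeks after the last observation until the trend reaches threshold.

    Returns None when the trend is flat or falling, and 0.0 when the
    trend line is already at or above the threshold.
    """
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    last_x = len(values) - 1
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - last_x)


# Hypothetical weekly peak CPU utilization (percent), not data from the article.
weekly_peak = [50.0, 52.0, 54.0, 56.0, 58.0]
print(weeks_until(weekly_peak, 80.0))  # 11.0
```

A straight line ignores seasonality and one-off events such as promotions, so a forecast like this is best treated as a first warning that prompts closer analysis rather than as a provisioning schedule.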

In any data center scenario, intelligent planning and repeated testing can help prevent service failure, thereby avoiding the cost of cleaning up afterward. Unplanned outages erode market share, revenue, customer experience, and the trust between IT and the business, all of which makes running the business harder and competing more difficult.

Data center computing power is supposed to make our lives easier. As our dependency on connected technology infrastructure deepens, careful capacity management means less firefighting and more focus on providing robust and reliable services.

As data centers become more reliant on IT optimization tools, the benefits of proper capacity planning and the ability to use analytics show throughout the enterprise, from preventing cloud outages to improving overall business productivity, and they give IT the time it needs to focus on improvements rather than incidents.

Per Bauer is the director of Global Services at TeamQuest - a software company specializing in systems management, performance management and capacity planning.
