Consider the impact of incident management in a modern IT environment. A routine incident can easily take around two hours of time and resource to find, diagnose, fix and resolve. An IT team may be able to absorb that when there are only a handful of incidents, but a manual approach simply does not scale.
A case in point is a modern managed services environment. The number of incidents can quickly escalate into the hundreds or thousands, depending on the size of the environment, the number of virtual and physical instances in play, the geographic distribution of the infrastructure and the number of individual customers being serviced.
Take Cisco, for example. As one of the world’s biggest communications technology providers, it needs to focus on efficiency in order to service its customers effectively and deliver a robust, profitable range of managed services.
The two-hour conundrum
Incident response isn’t necessarily quicker when you are a large, multinational technology business. Being large usually just means there are more incidents to respond to. In this example, incident response can run to around two hours and 20 minutes, counted from the moment an incident is seen until the issue is fully contextualized and understood, the right experts and applications are informed, and action is finally taken to resolve matters.
A typical incident response looks like this:
- 25 min: Manual Creation of Remedy Incident from MAP Case + Ticket Assignment
- 20 min: Manually Filling Out Remedy Fields, Template Entry, etc.
- 20 min: Technical Verification of Incident (Device Check, etc.)
- 25 min: Manually Open Carrier Ticket(s) & Facilitate Testing (BGP)
- 15 min: Possible Field Service Engagement & Manual Task Entry
- 30 min: Manual Resolution of Remedy INC (Task Closure, Required Fields, etc.)
- 5 min: Manual Closure of Associated MAP Cases
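The steps above add up to the two hours and 20 minutes quoted earlier. A quick check in Python, with step names shortened for illustration:

```python
# Manual incident workflow in minutes, taken from the list above.
MANUAL_STEPS = [
    ("Create Remedy incident from MAP case and assign ticket", 25),
    ("Fill out Remedy fields and templates", 20),
    ("Technical verification of incident", 20),
    ("Open carrier ticket(s) and facilitate testing", 25),
    ("Possible field service engagement and task entry", 15),
    ("Resolve Remedy incident", 30),
    ("Close associated MAP cases", 5),
]


def total_minutes(steps):
    """Sum the minutes across all (name, minutes) steps."""
    return sum(minutes for _, minutes in steps)


total = total_minutes(MANUAL_STEPS)
hours, mins = divmod(total, 60)
print(f"{total} minutes = {hours} h {mins} min")  # 140 minutes = 2 h 20 min
```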
All the steps are logical, but the time overhead is simply unsustainable. In a large, technology-centered organization, such a burden can tie up DevOps teams and front-line IT support staff for significant amounts of time. Rudimentary fault finding and resolution then crowds out innovation and the development of code that could add value to the business.
Neither Cisco nor any other business can operate efficiently if so much time and personnel resource is absorbed by incident management, so a better way is needed. The answer lies in automation.
Time and motion savings
Bringing each incident cycle down from over two hours to less than 20 minutes delivers immediate savings and benefits. In an environment producing 10,000 incidents in any given time period, the saving is in the region of 18,500 resource hours, equating to a cash saving of almost $800,000. Such an improvement also translates into a 79 percent productivity gain over the manual process that existed before.
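One way to sanity-check headline figures like these is to back out what they imply per incident. The sketch below uses only the numbers quoted in this article. The 140-minute manual cycle is an assumption taken from the step list earlier, and the hourly rate is simply what the cash figure implies, not a published Cisco number.

```python
# Figures quoted in the article.
INCIDENTS = 10_000
HOURS_SAVED = 18_500
CASH_SAVED_USD = 800_000   # "almost $800,000", so treat as an upper bound

# Assumption: manual cycle of 2 h 20 min, from the step list earlier.
MANUAL_CYCLE_MIN = 140

# Minutes saved per incident implied by the hours figure.
minutes_saved_per_incident = HOURS_SAVED * 60 / INCIDENTS

# Share of the manual cycle that the saving represents.
share_of_manual_cycle = minutes_saved_per_incident / MANUAL_CYCLE_MIN

# Cost per resource hour implied by the cash figure.
implied_hourly_rate = CASH_SAVED_USD / HOURS_SAVED

print(f"{minutes_saved_per_incident:.0f} min saved per incident")
print(f"{share_of_manual_cycle:.0%} of the manual cycle")
print(f"about ${implied_hourly_rate:.2f} per resource hour")
```

On these inputs the hours saved and the 79 percent figure line up with each other, at roughly 111 minutes saved per incident.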
Key to the productivity improvement and resource reduction achieved with Cisco was implementing machine-speed completion of many tasks. By developing and deploying 300 automation policies, a substantial amount of the incident management process can be moved away from manual processing and intervention.
Multiple systems can be aggregated automatically, or even replaced and consolidated. Such an approach is also a stepping stone to a single point of visibility across the environment, itself another time and cost saver.
Automation in a managed services environment can be used to observe and recognize an issue, report and record it, analyze and understand it, search for knowledge and solutions, implement and even execute a solution and then tidy up and close the incident, dealing with much of the admin overhead.
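As a rough illustration of that chain of steps, here is a minimal Python sketch of a staged incident pipeline. Everything in it is hypothetical: the `Incident` fields, the `KNOWN_FIXES` lookup and the stage functions are invented for this example and do not represent Cisco's or any vendor's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    device: str
    symptom: str
    status: str = "new"
    log: list = field(default_factory=list)


# Hypothetical knowledge base: symptom -> known remediation.
KNOWN_FIXES = {"bgp_session_down": "reset_bgp_peer"}


def record(incident):
    """Report and record: open a ticket for the observed issue."""
    incident.status = "open"
    incident.log.append("ticket created")


def analyze(incident):
    """Analyze and understand: note what was verified, and where."""
    incident.log.append(f"verified {incident.symptom} on {incident.device}")


def remediate(incident):
    """Search for a known fix and apply it, or hand over to a person."""
    fix = KNOWN_FIXES.get(incident.symptom)
    if fix is None:
        incident.status = "escalated"
        incident.log.append("no known fix; escalated to an engineer")
    else:
        incident.status = "resolved"
        incident.log.append(f"ran {fix}")


def close(incident):
    """Tidy up: close the ticket only if the incident was resolved."""
    if incident.status == "resolved":
        incident.status = "closed"
        incident.log.append("ticket closed")


PIPELINE = [record, analyze, remediate, close]


def handle(incident):
    for stage in PIPELINE:
        stage(incident)
    return incident


known = handle(Incident("edge-router-1", "bgp_session_down"))
unknown = handle(Incident("edge-router-2", "fan_failure"))
print(known.status, known.log)
print(unknown.status, unknown.log)
```

The point of the design is that a recognized incident runs to closure without a person touching it, while an unrecognized one is escalated with its history already attached.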
Ticket creation and completion of Remedy fields can be immediately reduced from a 45-minute process down to just two minutes, with the majority of the form filling and logging automated by policy and auto-completion.
Verification time can be cut from 20 minutes to just five, with automated pre-screening removing much of the manual intervention needed. Field service engagement can also be cut by two thirds to just five minutes, while carrier ticket creation and testing, once automated, is slashed from 25 minutes to another five.
Finally, resolution and closure, using automated processes, requires just six minutes, rather than a combined 35, giving us a largely automated incident response mechanism that now only impacts DevOps and other personnel for 19 minutes at a time.
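The per-stage figures quoted above can be laid side by side to see where the time goes. This is a sketch using only the numbers in the text, with stage names shortened for illustration. Note that the itemized automated figures as quoted sum to 23 minutes, a little above the 19 minutes cited, so the per-stage values are best read as approximate.

```python
# (stage, manual minutes, automated minutes) as quoted in the text.
STAGES = [
    ("Ticket creation and Remedy fields", 45, 2),
    ("Verification", 20, 5),
    ("Field service engagement", 15, 5),
    ("Carrier ticket and testing", 25, 5),
    ("Resolution and closure", 35, 6),
]


def reduction_pct(manual, automated):
    """Percentage of the manual time removed by automation."""
    return 100 * (manual - automated) / manual


for name, manual, automated in STAGES:
    pct = reduction_pct(manual, automated)
    print(f"{name:<36}{manual:>3} -> {automated:>2} min ({pct:.0f}% less)")

manual_total = sum(manual for _, manual, _ in STAGES)
automated_total = sum(automated for _, _, automated in STAGES)
print(f"Total: {manual_total} -> {automated_total} min")
```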
Faster also means better
Algorithms make it possible to determine which events matter most and which data is useful, and to pair that with relevant performance data. The system pulls in multiple sources, consolidates them and makes informed decisions based on the available data.
It even means that you no longer need to expose user accounts and credentials to a large team of people in order to access the information, helping to keep passwords and data more secure.
The result is not only quicker resolution and a more productive, cost-effective process; it can also yield more detailed information gathering. Working at machine speed makes it viable to capture detail that manual input would leave out on the grounds that it would take too long.
Event enrichment is about delivering meaningful diagnostics quickly, easily and automatically. If Cisco or one of its customers has simultaneous events, what are the next steps? Whether there are 10 or 10,000 incidents being logged, the process is aided by capturing more data, in more detail, in each ticket. Not only will this help with resolution further down the line, it will also help with developing new automations when patterns of recurring (and thus policy-mapped) incidents emerge.
Such an approach can benefit managed service providers and users across the whole business landscape, from mid-sized companies with fewer than a hundred managed instances to large enterprises and service providers with thousands of instances under management.
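To make the idea concrete, here is a minimal Python sketch of enriching events with context and spotting recurring patterns that might justify a new automation policy. The inventory data, field names and threshold are all hypothetical and chosen purely for illustration.

```python
from collections import Counter

# Hypothetical inventory used to add context to raw events.
INVENTORY = {
    "edge-router-1": {"site": "London", "customer": "Acme"},
    "edge-router-2": {"site": "Austin", "customer": "Globex"},
}


def enrich(event):
    """Return a copy of the event with inventory context merged in."""
    context = INVENTORY.get(event["device"], {})
    return {**event, **context}


def recurring_patterns(events, threshold=2):
    """List (device, symptom) pairs seen at least `threshold` times."""
    counts = Counter((e["device"], e["symptom"]) for e in events)
    return [pattern for pattern, n in counts.items() if n >= threshold]


events = [
    {"device": "edge-router-1", "symptom": "bgp_session_down"},
    {"device": "edge-router-2", "symptom": "high_cpu"},
    {"device": "edge-router-1", "symptom": "bgp_session_down"},
]

tickets = [enrich(e) for e in events]
candidates = recurring_patterns(tickets)
print(tickets[0])
print(candidates)
```

Here the repeated BGP event on the same device is the kind of pattern that could be mapped to a policy.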
Rajiv Patnam is VP of global solutions at ScienceLogic