IT Revolution

Helping technology leaders achieve their goals through publishing, events & research.

Using Swarming for Incident Response

June 17, 2021 by IT Revolution

The traditional model of incident management progresses a ticket through multiple tiers: L1, L2, L3. This model creates queues that lengthen response times, and each handoff between groups loses vital context. In complex systems and failures, the ticket is slow to reach the correct responders. The end result is long response times and customer frustration.

In the new Prepare/Respond/Review Incident Management Framework, we advise against using this tiered ticketing system and recommend moving toward incident swarming.

Incident Response

Swarming provides a mechanism to remove queues and handoffs for major incident handling and to quickly bring responders and dependent responders together. Incident swarming focuses accountability to drive reduction in recovery time and to share knowledge about the incident rapidly.

The Swarming Model

The incident-team swarming model is an alternative that addresses many of the challenges of the tiered approach. It is based on networked collaboration across the incident team rather than a funnel. There are very few tiers and escalation is fast, bringing the on-call members from all relevant teams onto the incident as quickly as possible.

We recommend defining a triage group for every product that includes all parties to be paged or escalated to during an outage. For example, an application that uses a popular database platform would have DBA and storage on its triage list. This model prefers that the full set of triage on-calls be paged, with members dismissed once the problem has been narrowed down. Tickets can be escalated quickly by the initial intake point (L1 help desk) or routed automatically to the owning team.
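A triage group can be represented as a simple mapping from product to the on-call roles to page. The following is a minimal sketch, not a prescribed implementation: the product name `billing-app`, the `Incident` class, and the functions `page_triage_group` and `dismiss` are all hypothetical names chosen for illustration, and a real setup would call a paging tool instead of updating a set.

```python
from dataclasses import dataclass, field

# Hypothetical triage-group definitions: product -> on-call roles to page.
# Follows the example above: a database-backed app lists DBA and storage.
TRIAGE_GROUPS = {
    "billing-app": ["Application Build/Run", "DBA", "Storage"],
}


@dataclass
class Incident:
    product: str
    # Roles currently engaged on the incident.
    active_responders: set = field(default_factory=set)


def page_triage_group(incident, groups=TRIAGE_GROUPS):
    """Page every role on the product's triage list at once (no tiers)."""
    roles = groups.get(incident.product, [])
    incident.active_responders.update(roles)
    return sorted(incident.active_responders)


def dismiss(incident, role):
    """Release a role once the problem is narrowed and it is not involved."""
    incident.active_responders.discard(role)


incident = Incident(product="billing-app")
paged = page_triage_group(incident)
dismiss(incident, "Storage")  # storage ruled out as a causal factor
```

The point of the sketch is the ordering: everyone on the list is paged first, and the group shrinks as causes are ruled out, rather than growing through escalation.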

Swarming-Team Formation

Most organizations form their swarm teams based on individuals’ areas of expertise and reputation. Combining the models seen at BMC and CSG yields these four types:

  • Severity 1 swarm teams are composed of three individuals working on a weekly rotation. Their objective is to address the highest-severity tickets as quickly as possible. Therefore, they focus on a smaller number of tickets. Domain and application SMEs are called to be part of this swarm.
  • Severity 1 triage swarm teams are composed of the on-call members for all supporting teams for an application. For example, a Windows application that uses an SQL server would have a triage team defined as Windows System Admin, SQL Server DBA, Storage Admin, Application Build/Run team. When an impact occurs to this application, the full triage group is paged. Once the causal factors are narrowed, team members are dismissed.
  • The local dispatch swarm is dedicated to a product or application. Its members meet every sixty to ninety minutes throughout the day to cherry-pick tickets that they can resolve immediately and forward the tickets they can’t solve to product-line support teams. In doing so, they reduce the resolution-cycle time of a significant number of tickets.
  • The backlog swarm addresses challenging tickets brought to it by product-line support teams. Any tickets still open after being processed by the triage swarm are brought to the backlog swarm, which brings together cross-functional team members from various departments and with various technical skills. They meet daily and work on the most challenging incidents.
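As a rough illustration of the local dispatch swarm's cherry-picking step, the sketch below splits open tickets into those the swarm resolves immediately and those it forwards to product-line support. The `Ticket` class and its `quick_fix` flag are assumptions made for this example; in practice the decision is a human judgment made in the swarm meeting, not a stored field.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    ticket_id: str
    # Hypothetical flag: the swarm judged it can resolve this right away.
    quick_fix: bool


def local_dispatch_pass(open_tickets):
    """Split tickets into (resolve now, forward to product-line support)."""
    resolve_now = [t for t in open_tickets if t.quick_fix]
    forward = [t for t in open_tickets if not t.quick_fix]
    return resolve_now, forward


tickets = [Ticket("T-1", True), Ticket("T-2", False), Ticket("T-3", True)]
resolve_now, forward = local_dispatch_pass(tickets)
```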

Benefits of Swarming

When incident swarming is well planned and given the required environment and empowerment, it delivers significant business benefits. Swarming:

  • Decreases mean time to resolve (MTTR).
  • Creates faster first-contact resolution and a higher level of issues resolved within the support swarm, without escalation to other support groups.
  • Decreases average cost per issue due to a quicker turnaround and fewer escalations.
  • Creates greater customer satisfaction and higher retention.
  • Decreases incident backlog.
  • Engages broader knowledge of the team throughout the entire resolution process.
  • Creates tremendous improvements in knowledge sharing, skill development, team growth, retention, and productivity.
  • Creates shared purpose, goals, and areas of expertise within incident swarm response teams, leading to improved collaboration.
  • Moves teams away from individual heroics by measuring swarming teams’ performance with team-level metrics.
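Team-level metrics such as MTTR are straightforward to compute from incident timestamps. The sketch below assumes each incident is recorded as a hypothetical `(opened, resolved)` pair of `datetime` values; the function name `mean_time_to_resolve` is illustrative, and real tooling may define the start and end of an incident differently.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average resolution time over (opened, resolved) datetime pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        raise ValueError("no resolved incidents to average")
    # Summing timedeltas needs an explicit timedelta start value.
    return sum(durations, timedelta()) / len(durations)


# Illustrative data only: one 30-minute and one 90-minute incident.
sample = [
    (datetime(2021, 6, 1, 9, 0), datetime(2021, 6, 1, 9, 30)),
    (datetime(2021, 6, 2, 14, 0), datetime(2021, 6, 2, 15, 30)),
]
mttr = mean_time_to_resolve(sample)
```

Tracking this figure per swarm team, rather than per individual, matches the shift away from individual heroics described above.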

Conclusion

When combined with the other patterns within the Prepare/Respond/Review Incident Management Framework, swarming can provide an effective and swift mechanism for responding to incidents and outages.

To learn about the full framework, download the white paper here.


Filed Under: DevOps Enterprise Forum Guidance Papers, Leadership, Organizational Change Tagged With: devops enterprise forum, incident management, swarming, teams
