June 17, 2021

Using Swarming for Incident Response

By IT Revolution

The traditional model of incident management moves a ticket through multiple tiers: L1, L2, L3. This model creates queues that lengthen response times and ticket handoffs that lose vital context with each group. In complex systems and failures, the ticket is slow to reach the correct responders. The end result is long response times and customer frustration.

In the new Prepare/Respond/Review Incident Management Framework, we advise moving away from this tiered ticketing system and toward incident swarming.

Incident Response

Swarming provides a mechanism to remove queues and handoffs for major incident handling and to quickly bring responders and dependent responders together. Incident swarming focuses accountability to drive reduction in recovery time and to share knowledge about the incident rapidly.

The Swarming Model

The incident-team swarming model is an alternative that addresses many of the challenges of the tiered approach. It is based on networked collaboration across the incident team rather than a funnel. There are very few tiers and escalation is fast, bringing the on-call members from all relevant teams onto the incident as quickly as possible.

We recommend defining a triage group for every product that includes all parties to be paged or escalated to during an outage. For example, an application that uses a popular database platform would have DBA and storage on its triage list. This model prefers paging the full set of triage on-calls and then dismissing members once the problem has been narrowed down. Tickets can be escalated quickly by the initial intake point (L1 help desk) or routed automatically to the owning team.
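The page-everyone-then-dismiss flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TriageGroup` and `Swarm` classes and the `billing-app` product name are hypothetical, and a real setup would call whatever paging tool the organization already uses.

```python
from dataclasses import dataclass, field


@dataclass
class TriageGroup:
    """On-call roles to page together when a product has an outage."""
    product: str
    roles: list[str]


@dataclass
class Swarm:
    """Tracks which roles are still engaged on an active incident."""
    group: TriageGroup
    active: set[str] = field(default_factory=set)

    def page_all(self) -> list[str]:
        # Page the full triage group at once instead of escalating tier by tier.
        self.active = set(self.group.roles)
        return sorted(self.active)

    def dismiss(self, role: str) -> None:
        # Release a responder once the problem is narrowed away from their area.
        self.active.discard(role)


group = TriageGroup(
    product="billing-app",  # hypothetical product name
    roles=["Application Build/Run", "DBA", "Storage"],
)
swarm = Swarm(group)
paged = swarm.page_all()          # everyone on the triage list is paged
swarm.dismiss("Storage")          # storage is ruled out and released
remaining = sorted(swarm.active)  # roles still working the incident
print(paged)
print(remaining)
```

The point of the sketch is the ordering: the whole group is engaged first, and the set of responders only shrinks as causes are ruled out.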

Swarming-Team Formation

Most organizations form their swarm teams based on individuals’ areas of expertise and reputation. A combination of the models seen at the companies BMC and CSG can be categorized into these four types:

Severity 1 swarm teams are composed of three individuals working on a weekly rotation. Their objective is to address the highest-severity tickets as quickly as possible. Therefore, they focus on a smaller number of tickets. Domain and application SMEs are called to be part of this swarm.
Severity 1 triage swarm teams are composed of the on-call members for all supporting teams for an application. For example, a Windows application that uses SQL Server would have a triage team consisting of the Windows System Admin, SQL Server DBA, Storage Admin, and Application Build/Run team. When an impact occurs to this application, the full triage group is paged. Once the causal factors are narrowed, team members are dismissed.
The local dispatch swarm is dedicated to a product or application. They meet every sixty to ninety minutes throughout the day to cherry-pick tickets that they can resolve immediately and forward the tickets they can’t solve to product-line support teams. In doing so, they reduce the resolution-cycle time of a significant number of tickets.
The backlog swarm addresses challenging tickets brought to them by product-line support teams. Any tickets still open after being processed by the triage swarm are brought to the backlog swarm, which brings together cross-functional team members from various departments and with varied technical skills. They meet daily and work on the most challenging incidents.

Benefits of Swarming

When incident swarming is well planned and given the required environment and empowerment, it delivers significant business benefits. Incident swarming:

Decreases mean time to resolve (MTTR).
Creates faster first-contact resolution and a higher level of issues resolved within the support swarm, without escalation to other support groups.
Decreases average cost per issue due to a quicker turnaround and fewer escalations.
Creates greater customer satisfaction and higher retention.
Decreases incident backlog.
Engages broader knowledge of the team throughout the entire resolution process.
Creates tremendous improvements in knowledge sharing, skill development, team growth, retention, and productivity.
Creates shared purpose, goals, and areas of expertise within incident swarm response teams, leading to improved collaboration.
Moves teams away from individual heroics and measures swarming teams’ performance using team-level metrics.
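As one example of a team-level metric from the list above, mean time to resolve can be computed as the average of resolution time minus open time across incidents. The sketch below assumes each incident is recorded as an (opened, resolved) timestamp pair; the function name and the sample timestamps are hypothetical and for illustration only.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - opened) over a list of (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one resolved incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical timestamps, for illustration only.
incidents = [
    (datetime(2021, 6, 1, 9, 0), datetime(2021, 6, 1, 9, 40)),    # 40 minutes
    (datetime(2021, 6, 3, 14, 0), datetime(2021, 6, 3, 15, 20)),  # 80 minutes
]
mttr = mean_time_to_resolve(incidents)
print(mttr)
```

Because the figure is computed over the swarm's incidents rather than per responder, it measures the team as a whole.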

Conclusion

When combined with the other patterns within the Prepare/Respond/Review Incident Management Framework, swarming can provide an effective and swift mechanism for responding to incidents and outages.

To learn about the full framework, download the white paper here.

About The Authors

IT Revolution

Trusted by technology leaders worldwide. Since publishing The Phoenix Project in 2013, and launching DevOps Enterprise Summit in 2014, we’ve been assembling guidance from industry experts and top practitioners.
