June 7, 2021

Framework for Incident Management: Prepare, Respond, Review

By IT Revolution

Learn to improve your organization’s incident management with this framework: Prepare, Respond, Review.

In this post, based on the white paper A Framework for Incident Response, Assessment, and Learning, by Shaaron A. Alvares, Josh Atwell, Jason Cox, Erica Morrison, Scott Prugh, and Randy Shoup, we present a fresh incident management framework to help you improve your overall organizational response to incidents.

In the white paper this framework is broken down into a taxonomy of dysfunctions and patterns to help you greatly improve your incident response and posture.

Why You Need an Incident Management Framework

Incidents and outages are an existential threat to businesses that build, operate, and consume technology services. Businesses and customers rely heavily on these critical systems. When they fail, customer credibility can be irreparably harmed, putting both business reputation and revenue at stake.

Your teams are already responding to incidents, but how well are they doing it? How are they adjusting as the technology landscape changes? Could they do better?

With this framework for incident management, we’ll help you set a north star for the ideal state and, over the long term, change the narrative about incidents from one of blame to one of learning.

We’ll also provide real-world, right-sized patterns and examples from some of the top practitioners and companies. They can be used for incremental improvement, changing behavior with a view to long-term investment, and they offer pragmatic, hands-on practices for a complicated topic that is hard to cover well.

The traditional ITIL-based incident-management framework gave companies a structured way of categorizing, handling, and resolving incidents. This framework, as well as adjacent processes such as problem management, has become the reference model for organizations to deal with the realities of handling incidents.

However, software systems in today’s enterprises are composed of hundreds of different systems and technologies that interact in surprising ways. As complexity has increased, the ITIL framework has not evolved to deal with the messy reality.

As such, the traditional way of thinking about incidents and dealing with them has become operational debt and can prevent organizations from evolving. There is also a dearth of practical, accessible, hands-on experience about how leading companies deal with the realities of incident management and response in this complex world.

Dysfunctions with traditional incident management include:

Focus on blame and finger pointing vs. inquiry, learning and improvement.
Application of hindsight bias.
Incidents treated as exceptions: Both incident workflows and incident work are externalized from the daily work of the teams that build and run the software.
Lack of practice, which creates a reactive rather than proactive posture.
Focus on a singular “root cause” versus understanding multiple contributing factors and making broad-based improvements.
Insufficient response protocol and structure.
Lack of tools and practices that provide visibility into the incident response process.
Narrow assessment and understanding of impact.

Benefits of Improving Incident Management

Incidents cannot be prevented. But we can greatly reduce the frequency, duration, and impact of incidents on both our customers and our employees who operate these systems.

The benefits of improving incident management and response are substantial and can yield reduced impact to customers, improved confidence from customers in the business, reduced stress on teams and employees, and increased revenue.

Overarching principles for improvement include:

Move toward a “just culture” where incidents are used as an opportunity to learn.
Integrate incident workflows and understanding into normal operating behavior across the entire service lifecycle.
Encourage ownership and accountability for customer outcomes.
Use incidents to surface true IT system behavior and true process behavior.
Recognize that complex systems fail in surprising ways.
Break down silos and build trust by incentivizing cross-collaboration and learning.
Make incremental and continuous improvements.

An Incident Management Framework

The incident problem space is very large, and our goal is to break it down, remove the mythology, and create a framework that can be evolved over time with more depth and breadth as our industry learns more.

We propose this incident management framework:

Prepare, Respond, Review.

The figure below outlines the Prepare, Respond, Review incident response pattern cycle, as well as the common patterns within the pre-incident (prepare), incident response (respond), and post-incident (review) phases.

In the full white paper, the authors dive into each of the patterns below in more detail, providing solutions in each area.

Prepare (Pre-Incident Patterns)

Make incidents visible and part of daily work
Well-defined incident roles
Well-defined incident response triggers
Well-defined on-call rotation & schedule
On-call onboarding and training
Incident command training & certification
Well-defined communication plan
Well-defined behavior protocols

Respond (Incident Response Patterns)

Periodic CAN reporting (Conditions, Actions, Needs)
Shared incident state document
Incident call recording
Incident swarming

Review (Post Incident Response Patterns)

Localized incident reviews
Global incident reviews
Post-review improvement items
Incident review template
Incident impact assessment

Before diving into the patterns, we find it is essential for organizations to measure how well their teams are currently doing at incident management.

Incident Management Assessment

Below is an incident response assessment: a collection of probing questions that allow you and your team to answer and assess your current incident-response preparedness. Take these questions to your team to see how well you are doing and where there are areas for improvement.

Pre-Incident Assessment

Make Incidents Visible and Part of Daily Work

Do you have a shared backlog across operations and engineering teams that makes pre-work, response work, and review work visible?
Do you reserve capacity across operations and engineering for pre-work, response work, and review work?
Do you share and discuss a backlog of incident work with stakeholders, including product management?
Do incidents have long handoffs between first responders (help desk), secondary responders, and tertiary responders?

Well-Defined Incident-Management Roles

Do you have clear, dedicated roles to avoid overlap, confusion, and delays?
Who on your incident-response team is responsible for driving the resolution in a timely manner while keeping everyone on the response team on track?
Do you run postmortems after each incident to help the team improve in areas they missed during the incident?

Well-Defined Incident-Response Triggers

Is your team overwhelmed with alerts and notifications?
How long does it take for incident responders to get the knowledge they need to resolve the issue?
Does everyone understand the business impact associated with reported outages?

Well-Defined On-Call Rotation & Schedule

Do your teams have a scheduled on-call rotation?
Does the on-call rotation include developers?
Can other teams easily find the right person to contact during an incident if they need help?
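As a minimal sketch of what a scheduled rotation might look like in code, the Python below assigns one primary on-call engineer per week in round-robin order and answers the question "who is on call on this date?" The function names, the weekly shift length, and the engineer names are assumptions made for illustration; they do not come from the white paper.

```python
from datetime import date, timedelta
from itertools import cycle, islice


def build_rotation(engineers, start, weeks):
    """Assign one primary on-call engineer per week, in round-robin order.

    Weekly shifts are an assumption of this sketch, not a recommendation.
    """
    schedule = []
    for i, name in enumerate(islice(cycle(engineers), weeks)):
        week_start = start + timedelta(weeks=i)
        week_end = week_start + timedelta(days=6)
        schedule.append((week_start, week_end, name))
    return schedule


def who_is_on_call(schedule, day):
    """Return the engineer on call for the given day, or None if unscheduled."""
    for first, last, name in schedule:
        if first <= day <= last:
            return name
    return None


# Hypothetical team that mixes operations and development engineers.
rotation = build_rotation(["ana", "bo", "cy"], date(2021, 6, 7), weeks=4)
print(who_is_on_call(rotation, date(2021, 6, 15)))
```

Publishing a lookup like this where other teams can reach it is one way to make the right contact easy to find during an incident.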

On-Call Onboarding & Training

Are your on-call engineers properly prepared ahead of time, before an incident occurs?
How are your on-call engineers trained to approach an incident?
What steps do you take to onboard new on-call engineering team members?

Incident-Command Training & Certification

What activities does your organization do to ensure that each incident response is managed consistently and collaboratively?
Do you have a structured training program for your incident-response leaders?
How do you communicate incident-response roles and ensure each responder is aware of those roles, including who has ownership of the incident’s resolution?

Well-Defined Communication Plan

Do you have a defined incident-communication plan?
Does your communication plan outline the communication owner, communication frequency, content, audience, and delivery?
Does each service/application have its own specific communication plan?
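One way to check a plan against the elements listed above (owner, frequency, content, audience, delivery) is a small completeness check. The Python below is a sketch only: the field names mirror the question above, while the dictionary layout and the example values are assumptions.

```python
# Field names taken from the communication-plan question above.
REQUIRED_FIELDS = ("owner", "frequency", "content", "audience", "delivery")


def missing_plan_fields(plan):
    """Return the required fields that are absent or empty in a plan."""
    return [name for name in REQUIRED_FIELDS if not plan.get(name)]


# Hypothetical, partially filled plan for one service.
checkout_plan = {
    "owner": "incident communications lead",
    "frequency": "every 30 minutes",
    "audience": "support, product management",
}
print(missing_plan_fields(checkout_plan))
```

Running the same check for each service or application shows which plans still have gaps.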

Active Incident Response Assessment

Periodic CAN Reporting

Does your incident process have a well-defined status/CAN report for stakeholders?
Have you defined a regular cadence to send status/CAN reports to stakeholders?
Do you have a dedicated scribe to manage the CAN reporting process?
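To make the idea of a well-defined CAN report concrete, here is a minimal Python sketch of a report with its three sections and a helper that computes when the next report is due. The class name, field names, text layout, and 30-minute default cadence are all assumptions for illustration; the white paper does not prescribe a format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class CanReport:
    """One status update: Conditions, Actions, Needs (hypothetical shape)."""
    incident_id: str
    sent_at: datetime
    conditions: List[str]  # what is currently known about the situation
    actions: List[str]     # what responders are doing or have done
    needs: List[str]       # what responders need from others

    def render(self) -> str:
        """Format the report as plain text for stakeholders."""
        def section(title: str, items: List[str]) -> str:
            lines = [f"  - {item}" for item in items] or ["  - (none)"]
            return "\n".join([f"{title}:"] + lines)

        header = f"CAN report for {self.incident_id} at {self.sent_at:%Y-%m-%d %H:%M}"
        return "\n".join([
            header,
            section("Conditions", self.conditions),
            section("Actions", self.actions),
            section("Needs", self.needs),
        ])


def next_report_due(last_sent: datetime, cadence_minutes: int = 30) -> datetime:
    """Return when the next report is due; the 30-minute default is an assumption."""
    return last_sent + timedelta(minutes=cadence_minutes)


# Hypothetical incident data, as a scribe might record it.
report = CanReport(
    incident_id="INC-1",
    sent_at=datetime(2021, 6, 7, 9, 0),
    conditions=["Checkout requests are failing for some customers"],
    actions=["Rolled back the most recent deployment"],
    needs=[],
)
print(report.render())
```

Keeping the three sections fixed lets stakeholders scan every update the same way, and an explicit due time gives the scribe a cadence to hold to.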

Shared Incident-State Document

Are all incident-response team members actively recording and sharing information?
How do new responders get relevant background information on the incident?

Incident-Call Recording

Do you record your incident calls so your incident-response team has the ability to review the details of an incident?
Does your incident-response team collect data that the core incident team can revisit to address missed information, confusion, or disagreements about resolution events?
What information does your team have to review during postmortems and incident-review meetings?

Incident Swarming

Do your incident responders coordinate to resolve incidents more quickly and develop domain knowledge?
Are your tickets managed by incident responders in real time or managed in a tiered approach?

Post-Incident Assessment

Local Incident Reviews

How long after an incident is resolved do your responders hold a review?
Does your environment encourage continuous improvement, learning, and accountability by conducting blameless incident-review workshops?
Does your organization capture and share the incident-review improvements and documentation across the organization?

Global Incident Reviews

How often does your organization gather to review incidents and disseminate learnings across all teams?
Do you regularly ask actionable questions to foster a culture of open incident review?
Do you invite cross-functional teams and stakeholders to build organization-wide resilience?
During Global Incident Reviews, do other teams reach out to provide assistance to help with improvement items and broad-based patterns of improvement?

Post-Review Improvement Items

Do your teams identify actionable improvements to the system after an incident?
Are those actionable improvements consistently tracked, prioritized, and implemented?
Do your teams make tradeoff and risk decisions on improvements in the backlog?

Incident-Review Template

Do you have a work-management system or central knowledge repository for storing and sharing incident-review information?
Does your incident-response team have an incident-review template?
Do you regularly evaluate how you collect incident information to identify adjustments as needed?

Incident-Impact Assessment

Do your teams have a framework to assess the impact of an incident?
Is incident-impact assessment part of the review process?
Do you leverage incident-impact assessment to discover true system behavior?
Do you use the results of incident-impact assessments to inform broad-based improvements?

Conclusion

It should be very clear that the requirements for effective incident response are broad and nuanced. In the full white paper, we aim to highlight key patterns that can be reviewed and used to assess the effectiveness of your incident-response plan.

We acknowledge that every environment contains its own priorities and constraints, but, as with any good architecture, there is typically a high degree of consistency between organizations. Key outcomes around quickly identifying and resolving incidents are universal. These patterns are developed to reflect those requirements as well as present some emerging patterns that have yielded strong results for high-performing teams.

The desired state for incident response should encompass a few key characteristics.

It should quickly identify the source of the incident and promptly give the correct responders the information they need to resolve the problem.
Incident-response teams should work collaboratively toward the common goal of resolving the issue, with transparency and clear communication, in a manner that can result in continual improvement.
Incidents should be reviewed with an emphasis on organizational learning and action toward improvement instead of assigning root cause and blame.

Please review the patterns in the full white paper for the most detailed explanation of this incident response framework.

About The Authors

IT Revolution

Trusted by technology leaders worldwide. Since publishing The Phoenix Project in 2013, and launching DevOps Enterprise Summit in 2014, we’ve been assembling guidance from industry experts and top practitioners.
