Learn to improve your organization’s incident management with this framework: Prepare, Respond, Review.
In this post, based on the white paper A Framework for Incident Response, Assessment, and Learning, by Shaaron A. Alvares, Josh Atwell, Jason Cox, Erica Morrison, Scott Prugh, and Randy Shoup, we present a fresh incident management framework to help you improve your overall organizational response to incidents.
In the white paper this framework is broken down into a taxonomy of dysfunctions and patterns to help you greatly improve your incident response and posture.
Why You Need an Incident Management Framework
Incidents and outages are an existential threat to businesses that build, operate, and consume technology services. Businesses and customers rely heavily on these critical systems. When they fail, customer credibility can be irreparably harmed, putting both business reputation and revenue at stake.
Your teams are already responding to incidents, but how well are they doing it? How are they adjusting as the technology landscape changes? Could they do better?
With this framework for incident management, we’ll help you set a north star for the ideal state and change the long-term narrative about incidents from one of blame to one of learning.
We’ll also provide real-world, right-sized patterns and examples that you can use for incremental improvement, changing behavior with a view to long-term investment. These pragmatic, hands-on practices and patterns, with examples from some of the top practitioners and companies, address a complicated topic that is hard to cover well.
The traditional ITIL-based incident-management framework gave companies a structured way of categorizing, handling, and resolving incidents. This framework, along with adjacent processes such as problem management, has become the reference model for organizations dealing with the realities of handling incidents.
However, software systems in today’s enterprises are composed of hundreds of different systems and technologies that interact in surprising ways. As complexity has increased, the ITIL framework has not evolved to deal with the messy reality.
As such, the traditional way of thinking about incidents and dealing with them has become operational debt and can prevent organizations from evolving. There is also a dearth of practical, accessible, hands-on experience about how leading companies deal with the realities of incident management and response in this complex world.
Dysfunctions with traditional incident management include:
- Focus on blame and finger pointing vs. inquiry, learning and improvement.
- Application of hindsight bias.
- Incidents treated as exceptions: Both incident workflows and incident work are externalized from the daily work of the teams that build and run the software.
- Lack of practice, which creates a reactive vs. proactive posture.
- Focus on a singular “root cause” versus understanding multiple contributing factors and making broad-based improvements.
- Insufficient response protocol and structure.
- Lack of tools and practices that provide visibility into the incident response process.
- Narrow assessment and understanding of impact.
Benefits of Improving Incident Management
Incidents cannot be prevented. But we can greatly reduce the frequency, duration, and impact of incidents on both our customers and our employees who operate these systems.
The benefits of improving incident management and response are substantial and can yield reduced impact to customers, improved confidence from customers in the business, reduced stress on teams and employees, and increased revenue.
Overarching principles for improvement include:
- Move toward a “just culture” where incidents are used as an opportunity to learn.
- Integrate incident workflows and understanding into normal operating behavior across the entire service lifecycle.
- Encourage ownership and accountability for customer outcomes.
- Use incidents to surface true IT system behavior and true process behavior.
- Recognize that complex systems fail in surprising ways.
- Break down silos and build trust by incentivizing cross-collaboration and learning.
- Make incremental and continuous improvements.
An Incident Management Framework
The incident problem space is very large, and our goal is to break it down, remove the mythology, and create a framework that can be evolved over time with more depth and breadth as our industry learns more.
We propose this incident management framework:
Prepare, Respond, Review.
The figure below outlines the Prepare, Respond, Review incident response pattern cycle, as well as the common patterns within the pre-incident (prepare), incident response (respond), and post-incident (review) phases.
In the full white paper, the authors dive into each of the patterns below in more detail, providing solutions in each area.
Prepare (Pre-Incident Patterns)
- Make incidents visible and part of daily work
- Well-defined incident roles
- Well-defined incident response triggers
- Well-defined on-call rotation & schedule
- On-call onboarding and training
- Incident command training & certification
- Well-defined communication plan
- Well-defined behavior protocols
Respond (Incident Response Patterns)
- Periodic CAN reporting (Conditions, Actions, Needs)
- Shared incident state document
- Incident call recording
- Incident swarming
Review (Post Incident Response Patterns)
- Localized incident reviews
- Global incident reviews
- Post-review improvement items
- Incident review template
- Incident impact assessment
Before diving into the patterns, we find it is essential for organizations to measure how well their teams are currently doing at incident management.
Incident Management Assessment
Below is an incident response assessment: a collection of probing questions that you and your team can use to assess your current incident-response preparedness. Take these questions to your team to see how well you are doing and where there is room for improvement.
Make Incidents Visible and Part of Daily Work
- Do you have a shared backlog across operations and engineering teams that makes pre-work, response work, and review work visible?
- Do you reserve capacity across operations and engineering for pre-work, response work, and review work?
- Do you share and discuss a backlog of incident work with stakeholders, including product management?
- Do incidents have long handoffs between first responders (help desk), secondary responders, and tertiary responders?
Well-Defined Incident-Management Roles
- Do you have clear, dedicated roles to avoid overlap, confusion, and delays?
- Who on your incident-response team is responsible for driving the resolution in a timely manner while keeping everyone on the response team on track?
- Do you run postmortems after each incident to help the team improve in areas they missed during the incident?
Well-Defined Incident-Response Triggers
- Is your team overwhelmed with alerts and notifications?
- How long does it take for incident responders to get the knowledge they need to resolve the issue?
- Does everyone understand the business impact associated with reported outages?
Well-Defined On-Call Rotation & Schedule
- Do your teams have a scheduled on-call rotation?
- Does the on-call rotation include developers?
- Can other teams easily find the right person to contact during an incident if they need help?
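One way to make the "right person to contact" question easy to answer is to derive the current on-call engineer from the schedule itself. The following is a minimal sketch, assuming a simple round-robin rotation with fixed-length shifts; the function name `on_call_engineer`, the roster entries, and the one-week shift length are all hypothetical and are not taken from the white paper.

```python
from datetime import datetime, timedelta, timezone


def on_call_engineer(roster, rotation_start, when, shift_length=timedelta(weeks=1)):
    """Return the roster member on call at `when` under a round-robin rotation.

    Assumes every shift has the same length and the roster order never changes.
    """
    if not roster:
        raise ValueError("roster must not be empty")
    if when < rotation_start:
        raise ValueError("`when` is before the rotation started")
    # Whole shifts completed since the rotation began.
    shifts_elapsed = (when - rotation_start) // shift_length
    return roster[shifts_elapsed % len(roster)]


# Hypothetical roster mixing developers and operations engineers.
roster = ["dev-alice", "ops-bob", "dev-carol"]
rotation_start = datetime(2024, 1, 1, tzinfo=timezone.utc)

current = on_call_engineer(
    roster, rotation_start, datetime(2024, 1, 10, tzinfo=timezone.utc)
)
print(f"On call: {current}")
```

A real schedule would also need overrides, swaps, and escalation paths, which this sketch deliberately leaves out.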
On-Call Onboarding & Training
- Are your on-call engineers properly prepared ahead of time, before an incident occurs?
- How are your on-call engineers trained to approach an incident?
- What steps do you take to onboard new on-call engineering team members?
Incident-Command Training & Certification
- What activities does your organization do to ensure that each incident response is managed consistently and collaboratively?
- Do you have a structured training program for your incident-response leaders?
- How do you communicate incident-response roles and ensure each responder is aware of those roles, including who has ownership of the incident’s resolution?
Well-Defined Communication Plan
- Do you have a defined incident-communication plan?
- Does your communication plan outline the communication owner, communication frequency, content, audience, and delivery?
- Does each service/application have its own specific communication plan?
Active Incident Response Assessment
Periodic CAN Reporting
- Does your incident process have a well-defined status/CAN report for stakeholders?
- Have you defined a regular cadence to send status/CAN reports to stakeholders?
- Do you have a dedicated scribe to manage the CAN reporting process?
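As an illustration of what a well-defined CAN report and cadence might look like, here is a minimal sketch. It assumes a report is simply three lists (Conditions, Actions, Needs) plus a timestamp. The class name `CanReport`, the helper `next_report_due`, the sample incident details, and the 30-minute cadence are hypothetical choices for this example, not something the white paper prescribes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class CanReport:
    """A status report with Conditions, Actions, and Needs (hypothetical shape)."""

    conditions: list[str]  # what is currently happening
    actions: list[str]     # what responders are doing or have done
    needs: list[str]       # what responders need from stakeholders
    sent_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def render(self) -> str:
        """Format the report as plain text suitable for a chat or email update."""
        lines = [f"CAN report at {self.sent_at:%Y-%m-%d %H:%M} UTC"]
        sections = [
            ("Conditions", self.conditions),
            ("Actions", self.actions),
            ("Needs", self.needs),
        ]
        for title, items in sections:
            lines.append(f"{title}:")
            # Show an explicit placeholder so an empty section is never ambiguous.
            lines.extend(f"  - {item}" for item in (items or ["none reported"]))
        return "\n".join(lines)


def next_report_due(last: CanReport, cadence: timedelta = timedelta(minutes=30)) -> datetime:
    """Return when the next report is due, assuming a fixed cadence."""
    return last.sent_at + cadence


report = CanReport(
    conditions=["Checkout API is returning errors for some customers"],
    actions=["Rolled back the most recent deployment"],
    needs=[],
    sent_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
print(report.render())
print("Next report due:", next_report_due(report))
```

Keeping the three sections fixed, even when one is empty, is a design choice that lets stakeholders scan every update the same way; the scribe only has to fill in the lists.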
Shared Incident-State Document
- Are all incident-response team members actively recording and sharing information?
- How do new responders get relevant background information on the incident?
Incident Call Recording
- Do you record your incident calls so your incident-response team has the ability to review the details of an incident?
- Does your incident-response team collect data that the core incident team can revisit to address missed information, confusion, or disagreements about resolution events?
- What information does your team have to review during postmortems and incident-review meetings?
Incident Swarming
- Do your incident responders coordinate to resolve incidents more quickly and develop domain knowledge?
- Are your tickets managed by incident responders in real time or managed in a tiered approach?
Local Incident Reviews
- How long after an incident is resolved do your responders hold a review?
- Does your environment encourage continuous improvement, learning, and accountability by conducting blameless incident-review workshops?
- Does your organization capture and share the incident-review improvements and documentation across the organization?
Global Incident Reviews
- How often does your organization gather to review incidents and disseminate learnings across all teams?
- Do you regularly ask actionable questions to foster a culture of open incident review?
- Do you invite cross-functional teams and stakeholders to build organization-wide resilience?
- During Global Incident Reviews, do other teams reach out to provide assistance to help with improvement items and broad-based patterns of improvement?
Post-Review Improvement Items
- Do your teams identify actionable improvements to the system after an incident?
- Are those actionable improvements consistently tracked, prioritized, and implemented?
- Do your teams make tradeoff and risk decisions on improvements in the backlog?
Incident Review Template
- Do you have a work-management system or central knowledge repository for storing and sharing incident-review information?
- Does your incident-response team have an incident-review template?
- Do you regularly evaluate how you collect incident information to identify adjustments as needed?
Incident Impact Assessment
- Do your teams have a framework to assess the impact of an incident?
- Is incident-impact assessment part of the review process?
- Do you leverage incident-impact assessment to discover true system behavior?
- Do you use the results of incident-impact assessments to inform broad-based improvements?
It should be very clear that the requirements for effective incident response are broad and nuanced. In the full white paper, we highlight key patterns that can be reviewed and used to assess the effectiveness of your incident-response plan.
We acknowledge that every environment has its own priorities and constraints, but, as with any good architecture, much is typically consistent across organizations. Key outcomes around quickly identifying and resolving incidents are universal. These patterns were developed to reflect those requirements and to present some emerging patterns that have yielded strong results for high-performing teams.
The desired state for incident response should encompass a few key characteristics.
- Teams should be able to quickly identify the source of the incident and promptly notify the correct responders with the information needed to resolve the problem.
- Incident-response teams should work collaboratively toward the common goal of resolving the issue, with transparency and clear communication, and in a manner that results in continual improvement.
- Incidents should be reviewed with an emphasis on organizational learning and action toward improvement instead of assigning root cause and blame.
Please review the patterns in the full white paper for the most detailed explanation of this incident response framework.