June 21, 2021
Once an incident is resolved, there is a tendency to move on and go back to normal daily work. This is a missed opportunity to gather critical learnings and understand true system behavior as well as process and system breakdowns.
As part of the new Prepare/Respond/Review Incident Management Framework, conducting effective post-incident reviews and taking clear actions based on those reviews are essential.
Post-incident reviews are a key component of an organization’s culture. They are a critical feedback loop that contributes both to system understanding and continuous learning. Hold a post-incident review within twenty-four hours of an outage’s resolution.
There should be two types of post-incident review: local and global.
Attendees of the local post-incident review (PIR) should include all engineering team members engaged on the call and full teams of those closest to the problem.
The Incident Commander and Scribe should both be present at the PIR. Senior leadership and customer representatives do not attend this meeting in order to maintain an open, safe space. The exception is engineering leadership from the team closest to the issue. They may occasionally attend and may help ensure PIRs are being run correctly.
The purpose of the review meeting is to focus on what happened and what can be learned from the incident. To do so, the team takes the following actions:
Some of the most important questions to ask are:
Immediate tactical fixes are important and should be identified in order to stabilize systems as fast as possible, but longer-term and broad-based improvements should be discussed as well to identify solutions that prevent incidents from recurring.
Taking notes during the PIR ensures that information persists beyond the meeting. These notes should be published somewhere like a wiki and made accessible throughout the organization. Filling out a form typically drives behaviors that focus on filling out the form instead of having a good discussion. We recommend letting the conversation flow to ensure you can answer the questions listed above but avoiding a form. If a form must be filled out, fill it out later based on your notes.
Action items should be captured and translated into the incident team’s existing work-tracking system. It may be valuable to also include them in the original Incident State Document.
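As a rough illustration of capturing action items once and reusing them in both places, here is a minimal Python sketch. The `ActionItem` schema, the ticket field names, and the example incident ID and team names are all hypothetical; they stand in for whatever fields your team's existing work tracker actually uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ActionItem:
    """One follow-up captured during a post-incident review (hypothetical schema)."""
    summary: str
    owner: str
    incident_id: str
    labels: List[str] = field(default_factory=list)


def to_ticket(item: ActionItem) -> Dict[str, object]:
    """Map an action item onto a generic work-item payload.

    The field names are placeholders; adapt them to the tracker
    the team already uses for daily work.
    """
    return {
        "title": item.summary,
        "assignee": item.owner,
        # Tag every ticket so it can be traced back to the incident.
        "labels": sorted(set(item.labels) | {"post-incident", item.incident_id}),
    }


def state_document_lines(items: List[ActionItem]) -> List[str]:
    """Render the same items as plain lines for the Incident State Document."""
    return [f"- [{i.incident_id}] {i.summary} (owner: {i.owner})" for i in items]


# Example data (invented for illustration only).
items = [
    ActionItem("Add alert on queue depth above threshold", "team-payments",
               "INC-1042", ["monitoring"]),
    ActionItem("Document failover runbook for the billing database", "team-data",
               "INC-1042"),
]
tickets = [to_ticket(i) for i in items]
```

Keeping a single record per action item means the tracker entry and the Incident State Document cannot drift apart, since both are generated from the same notes.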
The PIR must be facilitated in a blameless fashion to foster a psychologically safe environment, maximize understanding of the incident, and identify improvements to be made. It must keep the focus on identifying shortcomings in the systems and in the existing processes. Complex systems fail for a variety of reasons; as such, the review should not focus on people or finger pointing. Norman L. Kerth’s Agile Retrospective Prime Directive can be leveraged while facilitating a PIR:
“Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.”
Local post-incident reviews generate significant learning about localized behavior and system and process behavior, including the quality of response. But when teams capture reviews in a siloed way, the organization and other teams don’t get access to all the lessons learned.
In addition to the local post-incident review, generate global learning by making the output of the local review widely available.
This can be achieved by providing a forum and cadence for reviewing key learning and remediation approaches. Forums can build relationships across the organization, improve trust, and develop an esprit de corps that aids future response scenarios. They can also provide an opportunity for customer-support groups and tangential responders to learn and ask questions, building more trust and improving overall organizational resilience.
The following practices break down silos between teams and maximize cross-functional learning throughout the entire organization:
Global post-incident reviews should be held in the same fashion as a regular incident standup. Adopt a defined cadence; for example, CSG holds a global review two times per week for an hour. Be sure to invite teams and stakeholders from across the entire organization in order to increase awareness about the incidents discussed, to build an open culture of incident management, and to build resilience across the organization. Extend the invitation to your customer-representative teams, including Customer Liaisons, Platform Engineer teams, and engineering teams across multiple product portfolios.
During these sessions, and once specific incidents have been assessed and reviewed, it is important to capture the knowledge shared and acquired in the organization’s Major Incident Management Framework and Best Practices knowledge base. This knowledge base will increase awareness of incident response and solutions and enable continuous improvement throughout the entire organization.
Every incident surfaces opportunities for improvement, but without efforts to implement actionable change after an incident, critical learnings from the incident will be lost and customer and stakeholder confidence may suffer.
After an incident is resolved, the organization and team must improve their ability to detect, diagnose, mitigate, resolve, and prevent future incidents. They can reinforce and encourage collective ownership of system reliability and the customer experience, restore and maintain customer and stakeholder confidence, and identify broad-based system and process changes that improve system robustness as well as reduce future impact.
It is said that we should “never let a good crisis go to waste.” A post-incident review provides a chance to change the system or the processes, but the organization must seize the opportunity and institute clear actions from the post-incident review.
As part of the post-incident review, look for contributing factors to the incident and try to identify specific and actionable opportunities for improvement.
Also make sure that the improvement items identified are specific, targeted, and actionable. Suggestions that “we should test more” or “we should be more careful” are not particularly helpful because they do not lead to specific action. Specific items focus the team’s follow-up efforts and, when those follow-up efforts are completed, go a long way toward restoring the confidence of customers.
Use the same tools and processes to track post-review improvement items as you use for daily work. For example, if your team uses Jira to track daily work, use Jira to track post-review improvement items in the same way.
Not all improvement items identified in a post-incident review are worth doing, just as not all possible feature ideas are worth doing. By tracking improvement items in the backlog, teams can easily prioritize them alongside—and in the same way as—daily feature work.
Tracking improvement items in the backlog forces teams to be specific and actionable and provides a means to track, prioritize, and follow up on post-incident improvement items in the same way as daily work. Post-incident improvement items can also be connected, collapsed, and consolidated with other issues and work items.
It can be tempting to identify a very specific change that would solve the specific problem that occurred in this particular incident. Where possible, look for opportunities to solve a class of problems that might cause a set of incidents. It can be helpful to prompt the discussion with targeted questions like:
Not all improvements can—or should—be implemented, due to feasibility and effort. Make sure to prioritize the improvements that will make bigger impacts and will solve larger classes of problems.
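One way to make this prioritization concrete is to score improvement items and feature work on the same scale. The Python sketch below is only an illustration under stated assumptions: the `WorkItem` fields, the 1 to 5 impact and effort estimates, the 1.5 weighting for items that solve a class of problems, and the example backlog are all invented here, not part of the framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkItem:
    title: str
    kind: str                   # "feature" or "improvement"
    impact: int                 # 1 (low) to 5 (high), the team's own estimate
    effort: int                 # 1 (small) to 5 (large), the team's own estimate
    solves_class: bool = False  # True if it addresses a whole class of problems


def score(item: WorkItem) -> float:
    """Impact per unit of effort, weighted up for class-of-problem fixes.

    The 1.5 multiplier is an arbitrary illustrative weight.
    """
    base = item.impact / item.effort
    return base * 1.5 if item.solves_class else base


def prioritize(backlog: List[WorkItem]) -> List[WorkItem]:
    """Order features and improvement items together, highest score first."""
    return sorted(backlog, key=score, reverse=True)


# Example backlog (invented for illustration only).
backlog = [
    WorkItem("Add a retry to the one job that failed", "improvement", 2, 1),
    WorkItem("Standardize timeouts and retries across services", "improvement",
             5, 3, solves_class=True),
    WorkItem("New report export feature", "feature", 3, 2),
    WorkItem("Rewrite deployment tooling", "improvement", 3, 5),
]
ranked = [item.title for item in prioritize(backlog)]
```

In this made-up backlog the broad fix ranks first even though it costs more than the narrow one, and the expensive, moderate-impact rewrite falls to the bottom, where the team may reasonably decide not to do it at all.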
These are just a few of the post-incident patterns organizations can adopt to maximize their learning from each incident and improve their response the next time. To continue reading about post-incident reviews and the Prepare/Respond/Review Incident Management Framework, download the full white paper here.