Once an incident is resolved, there is a tendency to move on and go back to normal daily work. This is a missed opportunity to gather critical learnings and understand true system behavior as well as process and system breakdowns.
As part of the new Prepare/Respond/Review Incident Management Framework, conducting effective post-incident reviews and taking clear actions based on those reviews is essential.
Post-incident reviews are a key component of an organization’s culture. They are a critical feedback loop that contributes both to system understanding and continuous learning. Hold a post-incident review within twenty-four hours of an outage’s resolution.
There should be two types of post-incident review: local and global.
Local Post-Incident Review
Attendees of the local post-incident review (PIR) should include all engineering team members engaged on the call and full teams of those closest to the problem.
The Incident Commander and Scribe should both be present at the PIR. Senior leadership and customer representatives do not attend this meeting in order to maintain an open, safe space. The exception is engineering leadership from the team closest to the issue. They may occasionally attend and may help ensure PIRs are being run correctly.
The purpose of the review meeting is to focus on what happened and what can be learned from the incident. To do so, the team takes the following actions:
- Reviews the timeline.
- Identifies and discusses what went wrong.
- Discusses what went right.
Some of the most important questions to ask are:
- How could we have detected this sooner? Did we have the right triggers?
- How could we have diagnosed the incident more rapidly? Did responders have the information they needed to diagnose the issue?
- What would have helped resolve this faster? Do we require new triggers, data collection, tools, or processes?
- What specific actions should we take to improve?
- Where did we get lucky?
- What did we learn about how our system behaves?
- How could we have prevented the incident from occurring?
- What went well in handling this incident?
Record and Take Action
Immediate tactical fixes are important and should be identified in order to stabilize systems as fast as possible, but longer-term and broad-based improvements should be discussed as well to identify ways to prevent incidents from recurring.
Taking notes during the PIR ensures that information persists beyond the meeting. These notes should be published somewhere like a wiki and made accessible throughout the organization. Filling out a form typically drives behaviors that focus on filling out the form instead of having a good discussion. We recommend avoiding a form and letting the conversation flow, while making sure you can answer the questions listed above. If a form must be filled out, fill it out later based on your notes.
Action items should be captured and translated into the incident team’s existing work-tracking system. It may be valuable to also include them in the original Incident State Document.
Blameless is Key for Learning
The PIR must be facilitated in a blameless fashion to foster a psychologically safe environment to maximize understanding of the incident and identify improvements to be made. It must keep the focus on identifying shortcomings in the systems and in the existing processes. Complex systems fail for a variety of reasons; as such, the review should not focus on people or finger pointing. Norman L. Kerth’s Agile Retrospective Prime Directive can be leveraged while facilitating a PIR:
- If a team member made a poor decision, the conversation should be about what information they were missing that would have helped them understand the situation more clearly.
- If someone made a mistake, the conversation should be about how to make the system safer so that this type of mistake isn’t possible or is at least more easily detectable.
- No one tolerates finger pointing. If you see it in the meeting, call it out.
- Beware of “try harder” and “human error.”
- A culture of taking ownership and accountability for issues you’re involved in models behavior for others. This eventually leads to a culture where teams don’t blame but instead own their part.
Global Post-Incident Review
Local post-incident reviews generate significant learning about localized system and process behavior, including the quality of response. But when teams capture reviews in a siloed way, the organization and other teams don’t get access to all the lessons learned.
Spread the Learning Far and Wide
In addition to the local post-incident review, generate global learning by making the output of the local review widely available.
This can be achieved by providing a forum and cadence for reviewing key learning and remediation approaches. Forums can build relationships across the organization, improve trust, and develop an esprit de corps that aids future response scenarios. They can also provide an opportunity for customer-support groups and tangential responders to learn and ask questions, building more trust and improving overall organizational resilience.
Break Down Silos
The following practices break down silos between teams and maximize cross-functional learning throughout the entire organization:
- Hold a Global Incident Review if a major incident has occurred.
- During the Global Incident Review workshop, teams and stakeholders should focus on the assessment of the impact to the business first and then to the technology stack.
- Tell the story of the incident to provide the best possible context and to drive the audience’s engagement.
- Discuss remediation plans and follow-up improvement items.
- Discuss what the organization and all teams (not just the team impacted) can learn from the event.
- Identify improvements needed to diagnose the incident, including service impacted, priority level, and the correct resolver teams engaged to improve response time in the future.
- Review the repair steps and identify recommendations to reduce a future incident repair duration.
- Review the duration to initiate and complete activities to ultimately identify improvement recommendations.
- Assess whether incident communication was effective or whether anything can be improved to reduce delays, confusion, and lead time.
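As one way to review the duration of activities, the time between timeline milestones can be computed from the Scribe's notes. The following is a minimal sketch, not a prescribed tool: the milestone names (`started`, `detected`, `diagnosed`, `resolved`), the timestamps, and the `phase_durations` helper are all hypothetical and should be replaced with whatever events your own timeline records.

```python
from datetime import datetime

# Hypothetical milestones and timestamps; substitute the events your Scribe recorded.
timeline = {
    "started": datetime(2024, 1, 10, 14, 0),
    "detected": datetime(2024, 1, 10, 14, 12),
    "diagnosed": datetime(2024, 1, 10, 14, 40),
    "resolved": datetime(2024, 1, 10, 15, 5),
}


def phase_durations(timeline):
    """Return the minutes elapsed between consecutive milestones, in chronological order."""
    ordered = sorted(timeline.items(), key=lambda item: item[1])
    durations = {}
    for (prev_name, prev_time), (name, time) in zip(ordered, ordered[1:]):
        minutes = (time - prev_time).total_seconds() / 60
        durations[f"{prev_name} -> {name}"] = minutes
    return durations


durations = phase_durations(timeline)
for phase, minutes in durations.items():
    print(f"{phase}: {minutes:.0f} min")
```

Comparing these phase durations across incidents can help show whether detection, diagnosis, or repair is where most of the time goes, which in turn suggests where improvement recommendations are likely to pay off.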
Global post-incident reviews should be held in the same fashion as a regular incident standup. Adopt a defined cadence; for example, CSG holds a global review two times per week for an hour. Be sure to invite teams and stakeholders from across the entire organization in order to increase awareness about the incidents discussed, to build an open culture of incident management, and to build resilience across the organization. Extend the invitation to your customer-representative teams, including Customer Liaisons, Platform Engineer teams, and engineering teams across multiple product portfolios.
During these sessions, and after specific incidents have been assessed and reviewed, it is important to capture all the knowledge shared and acquired in the organization’s Major Incident Management Framework and Best Practices knowledge base. This knowledge base will increase awareness of incident response and solutions and enable continuous improvement throughout the entire organization.
Post-Incident Review Improvement Items
Every incident surfaces opportunities for improvement, but without efforts to implement actionable change after an incident, critical learnings from the incident will be lost and customer and stakeholder confidence may suffer.
After an incident is resolved, the organization and team must improve their ability to detect, diagnose, mitigate, resolve, and prevent future incidents. They can reinforce and encourage collective ownership of system reliability and the customer experience, restore and maintain customer and stakeholder confidence, and identify broad-based system and process changes that improve system robustness as well as reduce future impact.
It is said that we should “never let a good crisis go to waste.” A post-incident review provides a chance to change the system or the processes, but the organization must seize the opportunity and institute clear actions from the post-incident review.
As part of the post-incident review, look for contributing factors to the incident and try to identify specific and actionable opportunities for improvement.
Also make sure that the improvement items identified are specific, targeted, and actionable. Suggestions that “we should test more” or “we should be more careful” are not particularly helpful because they do not lead to specific action. Specific items focus the team’s follow-up efforts and, when those follow-up efforts are completed, go a long way toward restoring the confidence of customers.
Improvement Items in the Backlog
Use the same tools and processes to track post-review improvement items as you use for daily work. For example, if your team uses Jira to track daily work, use Jira to track post-review improvement items in the same way.
Not all improvement items identified in a post-incident review are worth doing, just as not all possible feature ideas are worth doing. By tracking improvement items in the backlog, teams can easily prioritize them alongside—and in the same way as—daily feature work.
This forces teams to be specific and actionable and provides a means to track, prioritize, and follow up on post-incident improvement items in the same way as daily work. Post-incident improvement items can also be connected, collapsed, and consolidated with other issues and work items.
It can be tempting to identify a very specific change that would solve the specific problem that occurred in this particular incident. Where possible, look for opportunities to solve a class of problems that might cause a set of incidents. It can be helpful to prompt the discussion with targeted questions like:
- How could we have detected the incident more easily?
- How could we have diagnosed the incident more rapidly?
- How could we have mitigated the effects of the incident on the customer experience?
- How could we have resolved the incident more quickly?
- How could we have prevented the incident from occurring?
Not all improvements can—or should—be implemented, due to feasibility and effort. Make sure to prioritize the improvements that will make bigger impacts and will solve larger classes of problems.
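One possible way to make this prioritization concrete is to score each improvement item for impact and effort and then rank by impact per unit of effort. The sketch below is an illustration under stated assumptions only: the `ImprovementItem` fields, the 1 to 5 scales, the example items, and the ranking rule are all hypothetical, and in practice the scores would live as fields in your existing work-tracking system.

```python
from dataclasses import dataclass


@dataclass
class ImprovementItem:
    title: str
    category: str  # detect, diagnose, mitigate, resolve, or prevent
    impact: int    # assumed scale: 1 (one-off fix) to 5 (solves a class of problems)
    effort: int    # assumed scale: 1 (small) to 5 (very large)


def prioritize(items):
    """Rank by impact per unit of effort, highest first; ties go to the higher impact."""
    return sorted(items, key=lambda i: (i.impact / i.effort, i.impact), reverse=True)


# Hypothetical improvement items from a post-incident review.
items = [
    ImprovementItem("Alert on queue depth", "detect", impact=4, effort=2),
    ImprovementItem("Rewrite deployment pipeline", "prevent", impact=5, effort=5),
    ImprovementItem("Add runbook link to alert", "diagnose", impact=2, effort=1),
]

ranked = prioritize(items)
for item in ranked:
    print(f"{item.title} ({item.category}): impact {item.impact}, effort {item.effort}")
```

A simple ratio like this is only a starting point for discussion. Teams may reasonably weight broad, class-of-problem fixes more heavily than the score alone suggests.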
These are just a few of the post-incident patterns organizations can adopt to maximize their learning from each incident and improve their response the next time. To read more about post-incident review and the Prepare/Respond/Review Incident Management Framework, download the full white paper here.