How to Make Incident Reviews Better
Now, you’re probably thinking, “I can’t do one-on-one interviews for every incident. I don’t have time for this.”
I want to go back to my earlier point: a lot of the reason you don’t have time is that today’s incident reviews are not that great. It can feel like, why should we spend more time on something that’s not that great? But you can make it better.
Now, there are some starting points. For instance: which kinds of incidents should be given more time and space to analyze?
Which Incidents to Start With
It doesn’t have to be every incident. And it doesn’t have to be just the incidents that caused customer impact or hit Twitter big time. There are certain signals you can use to see which incidents should be given more time and space.
Like if there were more than two teams involved, especially if they had never worked together before. Or if it involved engineering and a non-engineering team working together, like customer service or PR or marketing. That’s a good indication that more time and space should be given. Or if it was triggered by something that seemed trivial, like expired certs.
I think every single organization I’ve been in, someone from leadership has been like, “Why are we having all these expired certs incidents? Let’s look into them a little bit more.” Usually when it’s something seemingly trivial that is triggering a lot of incidents, it’s actually an indication of a deeper organizational problem, not someone not knowing how expired certs work.
If the incident was almost really bad, if we found ourselves going, “I’m so glad no one noticed that,” that’s usually an indication that we have a lot to learn from it and that it’s worth digging into deeper, given time and space.
If it took place during a big event, like an earnings call or if the CEO was doing something within the organization, or if we had a big demo, or if promotion packets were due, or if everyone was out of the office. Those are usually indications as well.
If a new service or interaction between services was involved.
If more people joined the incident channel than usual. Are you tracking how often there are lurkers compared to actual participants in the channel? There’s usually a lot of people wanting answers, but sometimes there’s three or four people actually debugging the incident. That ratio of lurkers to actual participants can tell you a lot about the incident as well and usually indicates there’s more to dig into there.
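As a rough illustration, the lurker-to-participant ratio is easy to compute if you can export the incident channel’s member list and message authors, for instance via Slack’s `conversations.members` and `conversations.history` APIs. The helper name and example data below are hypothetical:

```python
# A minimal sketch, assuming you have already exported the channel's
# member list and the author of each message posted during the incident.

def lurker_ratio(channel_members: set, message_authors: list) -> float:
    """Ratio of members who never posted to members who actively responded."""
    participants = set(message_authors) & channel_members
    lurkers = channel_members - participants
    if not participants:
        return float("inf")  # everyone watched, nobody debugged
    return len(lurkers) / len(participants)

# Hypothetical example: 12 people joined the channel, 3 actually posted.
members = {f"user{i}" for i in range(12)}
authors = ["user0", "user1", "user2", "user1", "user0"]
print(lurker_ratio(members, authors))  # 9 lurkers / 3 participants = 3.0
```

A high ratio on its own isn’t a verdict; it’s one more signal that there was a large audience waiting on answers from a small group of responders.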
When to Analyze an Incident
So when are we ready for incident analysis? When are we ready to level up our postmortems, beyond the standard RCA doc and the meeting people feel they wasted their time at?
You’re ready now.
Having customers means you’re ready to benefit from incident analysis in some form. And the earlier you start, the better. The earlier you can ingrain this in your organization, the better.
Ways to Improve Incident Analysis
So what can you do today to improve incident analysis?
You can give folks more time and space to come up with better analysis. And this is a skill that can be trained and coached. Use incidents that were not high profile and didn’t have a lot of emotional stakes, and give people a couple of weeks to look at them alongside their regular work. It doesn’t need to be something they drop everything for. But you can get a lot of value out of giving them some time and space to actually review the incident under a different lens.
Come up with some different metrics. Look at the people. Don’t just have MTTR and MTTD and error counts. Look at the teams. Look at if they’ve worked together before. Look at if they were pushing out a new service to production. Look at how many people they have on their team. Look at how often we’re relying on people who were not on call. Look at how many lurkers to actual incident responders you have. Look at the coordination costs of the incidents.
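As a sketch of what tracking these people-focused metrics might look like, here is a hypothetical record with a rough “worth a deeper review” heuristic. The field names and thresholds are illustrative assumptions based on the signals above, not a standard:

```python
# A hedged sketch: capture people-focused incident signals alongside
# the usual MTTR/MTTD numbers. All names and thresholds are illustrative.

from dataclasses import dataclass, field

@dataclass
class IncidentSignals:
    teams_involved: list = field(default_factory=list)
    teams_worked_together_before: bool = True
    new_service_in_production: bool = False
    off_call_responders: int = 0   # people pulled in who were not on call
    lurkers: int = 0
    active_responders: int = 1

    def lurker_ratio(self) -> float:
        return self.lurkers / max(self.active_responders, 1)

    def worth_deeper_review(self) -> bool:
        """Rough heuristic combining the signals from the talk."""
        return (
            len(self.teams_involved) > 2
            or not self.teams_worked_together_before
            or self.new_service_in_production
            or self.off_call_responders > 0
            or self.lurker_ratio() > 3
        )

signals = IncidentSignals(
    teams_involved=["payments", "infra", "support"],
    teams_worked_together_before=False,
    off_call_responders=1,
    lurkers=9,
    active_responders=3,
)
print(signals.worth_deeper_review())  # True: 3 teams, first collaboration
```

The point isn’t the exact thresholds; it’s that these coordination signals are countable at all, which makes them as trackable as MTTR.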
You can set up investigation on-call rotations and treat this like you would incident response. Have folks who were not involved in the incident do the incident review, because you get that unbiased perspective: you get someone who can ask Kieran those questions without Kieran feeling like they’re being blamed.
Having folks that weren’t involved in the incident do the incident reviews actually levels up your entire organization, because now they’re learning about a system through an incident they didn’t participate in. And that expertise is amazing to see.
And allow investigation for the big ones. You need time for this, and I know you’re getting asked for answers from your boards. I know you’re getting asked for answers from your C-suites. But giving time and space is going to help with these big ones over time, and they’re not going to seem as big over time.
My company actually offers a couple of things to help with this too. We have a Move Fast and Learn From Incidents workshop, where we give you two fake incidents you can practice some of this on, without using one of your real ones. And we also have a product that’s in closed beta today; reach out afterwards if you’d like more information.
So How Do You Know It’s Working?
There are more folks attending the incident reviews and more folks reading them, not because they’re being asked to, not because they’re required to, but because they want to. This is an indication that they’re actually learning something. I actually saw folks get promoted because of what they were learning in these incident reviews at an organization where they really invested the time to level up their people and level up their incident reviews.
You’re not seeing the same folks pop into every incident.
You’re not having to react with that Batman emoji anymore.
And folks are feeling more confident about their on-call rotations. They’re no longer hesitant about whether to ignore an alert or respond to it; they’re feeling better about those calls.
Teams are collaborating more. You’re not seeing coordination costs run as high in your incidents. And there’s a better shared understanding of the definition of an incident. Something I challenge you to do: ask a few different folks in your organization what an incident is, and see how many answers you get without them needing to pull up your sub doc guide. Getting lots of different answers is usually an indication that your coordination costs might be quite high.
The People Proof
I want to share some testimonials from people that improved incident reviews in their organizations and spent time and space to do this. Someone said, “I just changed the way I was proposing to use this part of the system in a design that I was working on as a result of reading this incident review document.” They were working on a completely separate project and were able to learn about how a piece of technology got implemented because of reading an incident review.
That’s what incident reviews should be for: they don’t need to focus just on the socio or just on the technical. They’re a training mechanism.
I had someone say, “Never have I seen such an in-depth analysis of any software system that I’ve ever had the pleasure of working with.”
He was saying that folks who read this document come away with a better and more informed understanding of services that previously only one or two people understood. And these reviews end up being educational pieces that people pull up later across the organization.
I’ve seen an incident review get published and people still pulling it up months later, not during incidents or anything, but as part of implementation work. As part of onboarding guides. As part of ramping up on a team. They can be beautiful living documents.
There are a few components I recommend as parts of a strong post-incident process.
When an incident occurs, assign an impartial investigator. The investigator does the initial analysis to identify whether there are people we need to talk to one-on-one, and then they analyze the disparate sources involved: the Slack transcripts, the Zoom transcripts, the PRs, the tickets.
Then they might do some individual chats before the incident review.
Then we might want to align and collaborate on something together. Facilitate the meeting, output the report, and then, after some soak time of a day or so, come up with action items. I promise your action items are going to be so much better if you don’t write them right away. And you’ll actually see people getting them done because they’re inspired to, not just because they feel like they have to.
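The flow above can be summarized as an ordered checklist with a built-in soak period. This is a minimal sketch; the step names and the one-day default are assumptions based on the talk, not a prescribed format:

```python
# A hedged sketch of the post-incident flow described above.
# Step wording and the one-day soak default are illustrative.

from datetime import datetime, timedelta

POST_INCIDENT_STEPS = [
    "assign an impartial investigator",
    "analyze disparate sources (Slack, Zoom, PRs, tickets)",
    "hold individual chats where needed",
    "facilitate the incident review meeting",
    "publish the report",
    "wait out the soak time",
    "collect action items",
]

def action_items_due(report_published: datetime,
                     soak: timedelta = timedelta(days=1)) -> datetime:
    """Action items are drafted only after the soak period, not immediately."""
    return report_published + soak

# Hypothetical example: report published Jan 1, action items drafted Jan 2.
print(action_items_due(datetime(2023, 1, 1)))
```

Encoding the soak period as an explicit step is the point: the delay is deliberate, not a scheduling accident.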
I realize this might feel like a lot for every incident, so think back to the signals I gave you earlier for which incidents deserve the full process. It can be condensed and consolidated for other incidents as well.
And if you’re interested in further resources on incident analysis, the learningfromincidents.io community open sources a lot of our learnings. We write about how we’re doing this in organizations, actual chop wood and carry water stories, not so much on the theory, but actually how it’s working in practice.
And if you’re interested a little bit more in the error-counting mechanisms I brought up earlier, and why they can actually hurt us sometimes, there’s a very quick paper, about two pages, called The Error of Counting “Errors” by Robert L. Wears. It comes from another industry. Software has a lot to learn from other industries, like medicine and aviation and maritime, about how we approach accident investigation. We don’t need to reinvent the wheel there, and it’s a really great paper to look at.
Nora Jones has been on the front lines as a software engineer, as a manager, and now runs her own organization, Jeli. In 2017, she keynoted AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering, purposefully injecting failure in production, and her experiences implementing it at Jet.com (now part of Walmart) and at Netflix. Most recently, she started her own company, Jeli, based on a need she saw: the importance and value of a good post-incident review to the whole business, as well as the barrier to entry in getting folks to work on it. She started an online community called Learning From Incidents in Software, full of over 300 people in the software industry sharing their experiences with incidents and incident reviews.