This post has been adapted from Nora Jones’ 2021 DevOps Enterprise Summit Virtual – Europe presentation. You can view the full presentation here.
We’ve all had incidents. They’re unexpected. They’re stressful. And sometimes in management, there are inevitable questions that creep up. What can we do to prevent this from ever happening again? What caused this? Why did this take so long to fix?
In the organizations I’ve worked in, and in the research that my team and I have done in this space, we have heard the following responses to the question of why incident reviews are important:
“I’m honestly not sure.”
“Management wants us to.”
“It gives the engineer space to vent.”
“I think people would be mad if we didn’t.”
“We have obligations to customers.”
“We have tracking purposes.”
“We want to see if we’re getting better.”
“We want to have the answers to the board’s questions.”
I think we all know that some form of post-incident review is important, but we don’t all agree on why it’s important. We want to make efforts to improve, and we want to show that we’re improving, but we’re spinning our wheels in a lot of ways. We’re not actually making efforts to improve the post-incident reviews themselves; we’re only making efforts to try to stop incidents.
But without making efforts to improve the incident reviews themselves, we’re actually not going to improve incidents on any level.
The good news is incident analysis can be trained and aided, but it has to be trained and aided to be improved upon.
At the DevOps Enterprise Summit, John Allspaw has talked about how the metrics we are tracking today, like MTTR, MTTD, and number of incidents, are actually shallow metrics. I get why we’re tracking those things: it’s an emotional release, something that can make us feel better. But he posed an open question and challenge to the audience.
He said, “Where are the people in this tracking? And where are you?”
We haven’t changed much as an industry in this regard. Gathering useful data about incidents does not come for free. You need time and space to determine it.
I’m going to talk to you about why giving this time and space to your engineers, and your organizations, to improve post-incident reviews can actually work in your favor. It can give you that ROI you’re looking for and level up your entire organization.
I’m going to tell you about this through multiple stories that I’ve experienced myself, and show you new paths on how you can do this in ways that are not disruptive to your business, as well as next steps for you to embark on.
Spoiler alert, sometimes the thorough analysis or incident review actually reveals things that we’re not ready to see, hear, or change. So as leaders, we have to be open to hearing some of these things.
Effective Incident Analysis Metrics
There’s a famous equation in a book called Seeing What Others Don’t by Gary Klein. Gary Klein is a cognitive psychologist who studies experts and expertise in organizations. The equation he came up with describes performance improvement: it’s the combination of error reduction + insight generation. You can’t have one without the other.
Yet we focus as an industry way too much on the error reduction piece and not on the insight generation piece. But we’re not actually going to improve the performance of our organizations if we’re only focusing on the error reduction piece. And I get it, that is an easy thing to measure. As software engineers, we’re taught to look for technical errors; we’re not so much taught to generate insights. We’re not so much taught to disseminate insights. And we don’t get celebrated for it.
That’s something that we can do as leaders: we can actually celebrate the generation and dissemination of insights, and the training materials created by folks in our organization.
Next are three different stories about the value incident analysis brought about in different organizations. These are based on true events I have witnessed or been a part of, but their names and details have been changed.
Nora Jones has been on the front lines as a software engineer, as a manager, and now runs her own organization, Jeli. In 2017, she keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering (purposefully injecting failure in production) and her experiences implementing it at Jet.com, which is now part of Walmart, and at Netflix. Most recently, she started her own company, Jeli, based on a need she saw: the importance and value to the whole business of a good post-incident review, as well as the barrier to entry she saw in getting folks to work on that. She started an online community called Learning From Incidents in Software. This community is full of over 300 people in the software industry sharing their experiences with incidents and incident reviews.