
IT Revolution

Helping technology leaders achieve their goals through publishing, events & research.


Incident Analysis: Your Organization’s Secret Weapon

June 28, 2021 by IT Revolution

This post has been adapted from Nora Jones’ presentation at the 2021 DevOps Enterprise Summit Virtual – Europe.


We’ve all had incidents. They’re unexpected. They’re stressful. And in management, certain questions inevitably creep up: What can we do to prevent this from ever happening again? What caused this? Why did this take so long to fix?

In the organizations I’ve worked in, and in the research my team and I have done in this space, people have given the following answers to the question of why incident reviews are important:

“I’m honestly not sure.” 

“Management wants us to.” 

“It gives the engineer space to vent.” 

“I think people would be mad if we didn’t.” 

“We have obligations to customers.” 

“We have tracking purposes.” 

“We want to see if we’re getting better.” 

“We want to have the answers to the board’s questions.” 

I think we all know that some form of post-incident review is important, but we don’t all agree on why. We want to improve, and we want to show that we’re improving, but in many ways we’re spinning our wheels: instead of making efforts to improve the post-incident reviews themselves, we’re making efforts to stop incidents.

But unless we make efforts to improve the post-mortems, or incident reviews, themselves, we’re not going to improve on incidents at any level.

The good news is that incident analysis can be trained and aided, but it has to be trained and aided in order to be improved upon.

At the DevOps Enterprise Summit, John Allspaw has talked about how the metrics we track today, like MTTR, MTTD, and number of incidents, are actually shallow metrics. I get why we track those things: it’s an emotional release, something that can make us feel better. But he posed an open question and challenge to the audience.

He said, “Where are the people in this tracking? And where are you?” 

We haven’t changed much as an industry in this regard. Gathering useful data about incidents does not come for free. You need time and space to gather it.

I’m going to talk to you about why giving your engineers and your organization this time and space to improve post-incident reviews can actually work in your favor. It can give you the ROI you’re looking for and level up your entire organization.


I’m going to tell you about this through several stories I’ve experienced myself, show you new ways to do this that are not disruptive to your business, and suggest next steps for you to take.

Spoiler alert: sometimes a thorough analysis or incident review reveals things that we’re not ready to see, hear, or change. So as leaders, we have to be open to hearing some of these things.

Effective Incident Analysis Metrics

There’s a famous equation in Seeing What Others Don’t, a book by Gary Klein, a cognitive psychologist who studies experts and expertise in organizations. The metric it describes is performance improvement, which is the combination of error reduction + insight generation. You can’t have one without the other.

Yet as an industry we focus far too much on the error reduction piece and not enough on the insight generation piece, and we won’t improve the performance of our organizations if error reduction is all we focus on. I get it: error reduction is easy to measure. As software engineers, we’re taught to look for technical errors. We’re not taught nearly as much to generate insights, or to disseminate them. And we don’t get celebrated for it.


That’s something we can do as leaders: we can celebrate the people in our organization who generate and disseminate insights and create training materials.

What follows are three stories about the value incident analysis brought to different organizations. They are based on true events I have witnessed or been part of, but names and details have been changed.

Next: Story #1: Netflix…


 

Nora Jones has been on the front lines as a software engineer and as a manager, and now runs her own organization, Jeli. In 2017, she keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering (purposefully injecting failure in production) and her experiences implementing it at Jet.com (now part of Walmart) and Netflix. She founded Jeli because of the value she saw a good post-incident review bring to the whole business, as well as the barrier to entry she saw in getting people to work on post-incident reviews. She also started an online community called Learning From Incidents in Software, where more than 300 people in the software industry share their experiences with incidents and incident reviews.

This post is based on her presentation at the 2021 DevOps Enterprise Summit Virtual – Europe, which you can watch for free in the IT Revolution Video Library.


Filed Under: DevOps Enterprise Summit, Leadership, Organizational Change, Workplace Culture Tagged With: DevOps Enterprise Summit, incident management, incident response, jeli, nora jones

Copyright © 2022 IT Revolution. All rights reserved.