The Blame Game: Story #2
This reminds me of a separate story. I was at an organization where an incident had occurred at 3:00 AM. That’s when all the bad incidents occur, right? I came into the office the next day and was tasked with leading the investigation of this highly visible incident after the fact. This was something that made the news.
But a senior engineering leader pulled me aside in the office the next morning and said something along the lines of, “Hey, Nora, I don’t know if this incident is actually all that interesting for you to analyze. I feel like maybe we should just move on.”
I asked why.
And they said, “Well, it was all…I know I’m not supposed to say this, but it was human error. Kieran didn’t know what he was doing. He wasn’t prepared to own the system. He didn’t need to respond to that alert at 3:00 in the morning. It could have waited until he was in the office and he could have gotten help with it.”
I was shocked. This was an organization that thought they were practicing blamelessness. We’ve all heard about blameless postmortems, but yet we all use it a little bit incorrectly. They thought they were practicing this without a deep understanding of it, and when something like this happens, a Kieran makes an error, it’s usually met with instituting a new rule or process within the organization without publicly saying that you thought it was Kieran’s fault. Yet everyone, including Kieran, knows that folks think that.
That’s still blameful. It’s not only unproductive, it is actually hurting your organization’s ability to generate those new insights from that equation we looked at earlier, and build expertise after incidents. And so you’re actually harming your organization’s ability to improve your performance.
Lesson: Spotting Errors versus Encouraging Insights
I get it. It’s easier to add new rules and procedures. It’s easier to add in gates. It’s easier to update a runbook and just move on. It allows us to emotionally move on, and we need that as humans, we need to feel like we’re done with the thing.
But these new rules and procedures don’t usually come from the folks on the frontline either. And that’s because it’s much easier to spot errors in hindsight, especially from a management perspective.
It’s much more difficult as leaders to encourage insights.
But unfortunately, adding in these new rules and procedures actually diminishes the ability to glean new insights from these incidents. You’re not giving people the space and time they need to glean these new insights. Because what Kieran did, someone else is going to do in the future, even if you put those guard rails up.
The Post-Incident Interview
So despite all that, I still decided I wanted to talk to Kieran and I wanted to figure out what happened. So according to the organization, Kieran had received an alert at 3:00 AM that, had he spent more time studying the system he was on call for, he would have known could have waited until business hours to fix. I came into a conversation with Kieran completely blank and I asked him to tell me about what happened.
“Well,” he said, “I was debugging a Chef issue that started at 10:00 PM, and we finally got it stabilized. I went to bed at around 1:30 AM. At 3:00 AM I received an alert about a Kafka broker being borked.”
Finding #1: Who Is On Call and Why?
Interesting finding number one, Kieran was already awake and tired and on-call from debugging a completely separate issue.
That’s interesting to me. I wonder why we have people on call like that for two systems in the middle of the night, and we’re not keeping an eye on them.
I asked him what made him investigate the Kafka broker issue.
He said, “Well, I had just gotten paged for it, my team just transferred this on-call rotation for this Kafka broker about a month ago.”
I asked if he had been alerted for it before.
He said, “No, but I knew this broker had some tricky nuances.”
Finding #2: Which Team Owns What?
That led me to interesting finding number two: Kieran’s team had not previously owned this Kafka broker. And I wondered, at this organization, why did the on-call for this Kafka broker get transferred to them? And how do on-call transfers of expertise work? Who originally held the expertise for this Kafka broker if not this team?
I then asked him how long he’s been at this organization. He said five months.
Finding #3: Truth Bubbles to the Surface
Interesting finding number three, Kieran was pretty new to the organization. And we had him on call for something like this, for two separate systems in the middle of the night, and I don’t really feel like this is Kieran’s fault so much anymore. I’m starting to think that this really wasn’t human error.
If I was in Kieran’s shoes, I would have absolutely answered this alert at 3:00 in the morning. I’m new to the organization, it’s a new team that I’m on call for, and I know this broker has tricky nuances. It makes sense.
But yet if we hadn’t surfaced all these things, and we hadn’t had the opportunity to have a good incident review with Kieran, we wouldn’t have surfaced this. We would have kept repeating those hacky on-call transfers. We would’ve kept putting new employees on call when they maybe weren’t ready yet, or maybe when we hadn’t trained them yet.
And so by digging into this a little further, we were able to surface these things. But if we had just implemented a new rule or procedure, this kind of stuff would just get repeated again. Maybe not with this Kafka broker, but with another on-call system in this org.
Importance of the One-on-One
So let’s go back to this point. Incident reviews are important, but they’re not good. And what’s worse is when an incident or event seems to have a higher severity, we actually end up giving our engineers even less time to figure out what happened. Sometimes it’s due to SLAs that we have with customers, but it’s important that time and space be given after that customer SLA is met to come up with actually good action items, to come up with the how of how things got the way they are. Give your engineers space to work through them, especially if it was an emotionally charged incident.
So when you do an analysis of an incident, whether you’re going through Slack channels or Zoom transcripts or chatting with people, you can talk to people one-on-one like I did with Kieran. We call this an interview or a casual chat. And these individual interviews, prior to the bigger incident review, can determine what someone’s understanding of the event was, what stood out for them as important, what stood out for them as confusing or ambiguous or unclear, and what they believe they knew about the event and how things work that they believe others don’t.
Especially with emotionally charged incidents, we should set up some one-on-one individual chats like this. If I had asked Kieran the questions in the incident review meeting myself, it probably wouldn’t have revealed all the things that he revealed to me in that one-on-one chat.
How to Ask the Right Questions
Now, there are certain ways we can ask questions and we call these cognitive questioning or cognitive interviews. Now knowledge and perspective gleaned in these early interviews, or the way we ask these questions, can point to new topics to continue exploring. They can point to some relevant, ongoing projects. They can point to past incidents. They can point to past experiences that are important for the organization, important historical context to know, to help level everyone else up.
There’s a bunch of sources of data that we can use to inform this incident review and we can iteratively inform and contrast the results of cognitive interviews with these other sources of data. Like pull requests and how they’re being reviewed, or how Slack transcripts are going, or docs and architecture diagrams, or even JIRA tickets where the project got created.
Nora Jones has been on the front lines as a software engineer, as a manager, and now runs her own organization, Jeli. In 2017, she keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering, purposefully injecting failure in production, and her experiences implementing it at Jet.com (which is now Walmart) and Netflix. Most recently, she started her own company, Jeli, based on a need she saw for the importance and value add to the whole business of a good post-incident review, as well as the barrier to entry she saw of getting folks to work on that. She started an online community called Learning From Incidents in Software. This community is full of over 300 people in the software industry sharing their experiences with incidents and incident reviews.