
May 10, 2022

Learning Effectively From Incidents: The Messy Details

By IT Revolution

Much has been written about “organizational learning” and “learning organizations.” The continued and growing attention to these topics in the software world is encouraging and warranted! However, creating conditions for people to genuinely and effectively learn from the incidents they experience is difficult to do, never mind sustain over time. The frequency, severity, and even absence of these events do not represent what is learned, who has learned what, or how learning might be taking place in an organization. This transcript of John Allspaw’s presentation at the 2021 DevOps Enterprise Summit is about the “messy” and practical realities of learning effectively from incidents, including a number of paradoxes and ironies that technology leaders face as they work to make progress in their organization.


My goal today is to describe one of the most effective ways we can learn from incidents. It might not be intuitive, or it might be very intuitive but unclear how to do it. Before we jump into it, I want to give you the conclusion of this talk. So here’s the “too long, didn’t watch” slide, the main gist of what I want to get across. The first point is that learning is never not happening. It is what humans do. It’s an integral human activity. The second is that learning requires remembering; learning and remembering are inextricably linked. That means that learning from incidents effectively means discovering and highlighting the aspects and qualities of the story of an incident that make it more likely to be remembered. So what are those aspects? Well, they are elements of surprise, difficulty, misunderstandings, dilemmas, and paradoxes. This is what makes for good stories. This is what makes for stories that can be remembered. If you can remember it, it’s likely that learning is going on.

So something that I want to mention is that I’m using the phrase “messy details” in the title of this talk very deliberately. You may have heard me reference this phrase before. It’s a reference to this paper, and it captures quite eloquently, in my opinion, that when it comes to working in complex domains, the details of what people do and how they do it are what matter, almost more than anything else. These details are easy to miss and they’re not often looked at closely. So that’s what I mean when I say the messy details. If you haven’t taken a look at this paper, I can’t recommend it more highly. Just because the topic is the domain of healthcare doesn’t mean that it’s not applicable to other domains.

So I want to start with a story from my time at Etsy, a company here in Brooklyn, New York, where I am right now and where I worked for a number of years. The story of this incident is that an engineer on the job for no more than a couple of weeks made a change that brought all of Etsy down. The site wasn’t just slow or degraded; it was down hard. This is sort of what a typical write-up of an incident looks like: a recently hired engineer made a change to production, blah, blah, blah. It took about an hour and 10 minutes for them to figure out what was going on and what to do. So nothing catastrophic, but not nothing, either. I want to park that for a second. We’ll come back to this story.

Let’s talk about learning in general, and learning from incidents. As I mentioned before, people are always learning. It’s difficult to prevent people from learning. The question isn’t whether they’re learning or not learning. The question is what they are learning, and how useful or productive what they’re learning is going to be in helping them do their work in the future. The challenge isn’t getting people to learn. It’s about creating conditions where a couple of things can happen. We want to create conditions where people at every level of the organization have opportunities to discover new things they didn’t know, or to revisit things that they thought they already knew but where their knowledge was either wrong or slightly dated. It’s also about creating conditions where experts are supported in describing and teaching others, telling stories about what they know and how they know it.

This is actually a lot more difficult than just getting something on the calendar and asking somebody, “Tell me what you know.” What we know from studies of expertise is that experts are not necessarily experts at describing what makes them an expert. But rich stories are valuable. You want to create conditions where they are viewed as assets, just like any other asset valuable to the success of a business. As I’ve mentioned before in some other talks, learning is not the same as fixing. Often, especially in the industry at the moment, the two seem to be confused or swapped for one another. One other thing to say about learning, and it’s important: if you can’t remember something, you can’t say you’ve learned it.

So analyzing incidents, therefore, means finding what made the incident surprising or difficult. These are what make for memorable stories. What if incident analysis is less about solving the problem that the incident responders responded to and more about understanding how the responders understood and experienced handling and working through the incident? We work with a lot of different organizations. One of the first questions that we always ask when we first talk to them is, do you have any stories about incidents? And that’s it. That’s the prompt that we give them; we don’t give them anything else. Do any incidents come to mind? Could you tell us a story? A couple of things show up for us that are always true. First, they’re really enthusiastic. They always respond: oh, yeah, oh God, let me tell you about this one. They use their hands. Sometimes they clearly almost flip into a different mode when they tell the story.

They tell the story in suspenseful ways. They know what’s going to happen. They have the story; we don’t. And whether they know it or not, they lay out the story using what scholars of narrative composition call suspense structures. Even if they don’t know what a suspense structure is, they include what was surprising. They include what was weird, strange, and difficult. They give us a backdrop: oh, so you’ve got to remember, right about this point was when our CEO took the stage at a thousand-person conference, or something along those lines. They’re giving context for when in time, and sometimes where in space, the story took place. Then they can tell it in detail. And this is the most gratifying and fascinating thing for us. Even if it’s been years since it happened, they can come up with all sorts of esoteric details. They could even write on the whiteboard what this little piece of code looked like.

So after they’re done telling these stories, we always ask them, hey, this is an amazing story. Is there a place where I could read about this? When somebody joins the company, could they go read about it? And they always respond with some form of, oh yeah, well, we’ve got a postmortem document, hold on, let me see if I can get it. And we’ll take a look at it. And here’s what’s interesting to us: the story that they tell is always different from the official write-up. And we wonder why that is. Does it have to be that way? So I’ll give you an example of how the telling of a story can or cannot reflect the richness of an event or series of events.

And here it goes, in one sentence: a high school senior in Illinois led their classmates on an 11-hour crime spree, committing fraud, grand theft auto, and cybercrimes. That’s the story. In case you didn’t pick it up, what I’ve described is Ferris Bueller’s Day Off. And to be fair, it’s not wrong. All of the statements in this sentence are true. It’s just incomplete. And not only that, it may or may not take some liberties in some of the descriptions. It can be true and also be pretty anemic as far as the story is concerned. If I give you that sentence and you go see the movie, you’ll see that they’re quite different. This is that richness, those messy details, that I want to get across, such as Abe Froman, for those who’ve seen the movie.

So let’s revisit the incident that I was telling you about before. It’s an afternoon in September 2012, and this is a tweet from the Etsy status account saying that there’s an issue on the site. To give you a little flavor of what was going on in chat: people started noticing that the site was down. A couple of observations were coming in: out-of-memory errors all over the place. More observations and signals said that there was something about memory going on. It seemed like some templates were rebuilt on the last deploy. Interestingly, there was a deploy, but it was actually spaced in time from the outage. Usually, at least back then, if a deploy had an issue, some sort of bug or anomalous behavior, it would not take very long to show up. As soon as the code was out there, you would see it relatively quickly, not five minutes later. Here there were roughly five minutes where things seemed fine. So that was of interest. And anyway, whatever was in the deploy still wasn’t clear.

People said, oh, well, maybe there was some sort of template-related thing, and people said, well, it looks like we need to actually do a restart. And somebody said it’s really hard to even connect to some of the web servers. Meanwhile, the people who were making changes and trying to work out what was going on said, oh, well, we can deploy this, we can deploy that. And people said, well, actually it’s going to be hard to even deploy because we can’t even get to the servers. We can barely get them to respond to a ping. We’re going to have to get people on the console, the integrated lights-out, for hard reboots. And because we’re talking about hundreds of web servers, people even asked whether it would be faster to just power cycle them. This is a big deal here. Whatever it was in the deploy that caused the issue, it made hundreds of web servers completely hung, completely unavailable.

People said that deploying anything, even if we knew what was going on, was going to be pretty hard to do until we could power cycle everything. Somebody pointed out, well, we’re going to have to disable the load balancer for a bit, or disable traffic coming in, because we don’t want the servers taking traffic when they come back up after we power cycle them. They’re still going to have the code, so whatever’s going on is only going to happen again. So we’ll block all the traffic, reboot all the boxes, and deploy the change, whatever that is. We don’t even know what that is yet. But we had hundreds of web servers, so people were fanning out: oh, you get this number and you get this number. I’ll get web one through 10, you get 11 through 21, and so on and so on. They were in a reboot fest. At some point, they had walked through all of those steps. A lot of people ran a lot of commands in a very short period of time to get these boxes up and running.
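The fan-out described above, where responders each take a numbered slice of the fleet, can be sketched in a few lines of Python. This is only an illustration: the host names (`web1`, `web2`, ...), the responder names, and the helper functions `batches` and `assign` are hypothetical and are not taken from Etsy’s tooling. The batch size of 10 follows the talk.

```python
def batches(hosts, size):
    """Split a list of hosts into consecutive batches of at most `size`."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def assign(hosts, responders, size=10):
    """Hand out batches to responders round-robin.

    Returns a dict mapping each responder to the list of batches
    they are responsible for power cycling.
    """
    plan = {name: [] for name in responders}
    for index, batch in enumerate(batches(hosts, size)):
        owner = responders[index % len(responders)]
        plan[owner].append(batch)
    return plan


if __name__ == "__main__":
    # Hypothetical fleet of 25 web servers and two responders.
    fleet = [f"web{n}" for n in range(1, 26)]
    for person, work in assign(fleet, ["alice", "bob"]).items():
        for batch in work:
            print(person, "takes", batch[0], "through", batch[-1])
```

Splitting by fixed-size batches keeps each person’s slice easy to say out loud in chat, which matters when the coordination is happening by hand under time pressure.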

They finally got it up. What’s interesting about this? Well, let’s go back to one of the changes. We’d seen that there was something about templates. What they worked out afterward was that there was a ticket, and the ticket was for this newly hired engineer who was in boot camp. At Etsy, in your first weeks you’d spend a week on one team, then a week on another team, and then a week on another. We called it boot camp; lots of organizations do this. Then you’d finally land on the team that you were going to be part of more permanently. So it’s like getting a bit of a tour. One of the tasks was with the performance team, and the issue was old browsers. You always have these workarounds because the internet didn’t fulfill the promise of standards. So: let’s get rid of the support for IE version seven and older. Let’s get rid of all the random stuff.

Now Etsy, if you don’t know, was written in PHP, and might still be. We used a template engine called Smarty to help put together and compose pages. In this case, we had this base template, used, as far as we knew, by everything, and this little header-ie.css was the actual workaround. So the idea was, let’s remove all the references to this CSS file in the base template, and we’ll remove the CSS file. And this had been tested and reviewed by multiple people. It’s not all that big of a change, which is why it was a task slated for the next person who came through boot camp on the performance team. So they made this change.

And like I said, some time passed. What would happen? They figured it out later. A request would come in for something that wasn’t there; 404s happen all the time. The server would say, well, I don’t have that, so I’m going to give you a 404 page. So then I’ve got to go and construct this 404 page, but it includes this reference to the CSS file, which isn’t there, which means I have to send a 404 page. You might see where I’m going: back and forth, 404 page, fire a 404 page, fire a 404 page. Pretty soon the 404s were keeping all of the Apache processes across hundreds of servers hung, and nothing could be done. The team looked at how many servers they had, and once it became clear that they had to power cycle, they split up. They’d take 10 at a time so that lots of folks could reboot them more quickly, in parallel. That’s a little bit more of the story than what I first gave you. So, I just want to be clear on something.
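To make the loop in that description concrete, here is a toy Python model of it. It is a sketch of the mechanism as told in the talk, not a reconstruction of Etsy’s PHP, Smarty, or Apache setup. The function `handle`, the paths other than `/header-ie.css`, and the `max_depth` safety cap are all assumptions made for illustration.

```python
def handle(path, files, error_page_assets, depth=0, max_depth=50):
    """Count the nested requests needed to answer one request for `path`.

    If `path` exists, it is served directly (one request). If it does not,
    the server builds a 404 page, and building that page means resolving
    every asset the 404 template references. If one of those assets is
    itself missing, resolving it builds another 404 page, and so on.
    `max_depth` is an artificial cap so this model stops instead of hanging.
    """
    if depth > max_depth:
        raise RuntimeError(f"404 loop detected while resolving {path}")
    if path in files:
        return 1
    nested = sum(
        handle(asset, files, error_page_assets, depth + 1, max_depth)
        for asset in error_page_assets
    )
    return 1 + nested


if __name__ == "__main__":
    assets = ["/header-ie.css"]  # the 404 template still references this file

    # Before the change: the CSS file exists, so a 404 is cheap.
    before = {"/index.html", "/header-ie.css"}
    print(handle("/no-such-page", before, assets))

    # After the change: the CSS file is gone, so every 404 feeds itself.
    after = {"/index.html"}
    try:
        handle("/no-such-page", after, assets)
    except RuntimeError as err:
        print(err)
```

The model also shows why the problem did not appear the instant the deploy finished: nothing goes wrong until requests for missing resources arrive, and each one then ties up a worker.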

I’m hoping that many people who worked at Etsy at that time see this talk. This is a story where one CSS change could not only break the site but break it so spectacularly that the entire fleet of web servers required hard power cycling. I’m going to go out on a limb: it’s one that I don’t believe anyone who was there will ever forget. It is very memorable. A side note on this particular case: this is the case that led us to build an award that we gave every year called the three-armed sweater award. I’ll leave that for a different talk; there are other talks about it. What I’m trying to get across in this story is that we need to make an effort to highlight these messy details. What was difficult for people to understand? What was surprising for people about the incident? How did people understand the origins of the incident? In the CSS case, when people first went looking, they dismissed the change that had just been made as not being relevant because some time had passed, and that was a very reasonable thing to do.

What mystery remained for people? There are some details of this story that I’m still not clear on, and I was there. The goal of effective incident analysis is to capture the richest understanding of the event, represented for the broadest audience possible. This means multiple tradeoffs at different levels. You don’t want to capture in written form something that’s so technically detailed that you’ve lost a whole bunch of readers. You also don’t want it to be so vague and hand-wavy as to tell you basically nothing, like that first slide that I showed you of the CSS case. It didn’t really say much. There are many barriers and many challenges to getting this done well, so just one quick note on what I would say is the toughest: hindsight and hindsight bias, or the I-knew-it-all-along effect. This is a tendency to simplify the complex, messy details of the event down to the one true story.

As a result, this tendency can produce a story where all of these details, these multiple perspectives, get wiped away in favor of a story that makes sense to me, the person looking back. We want to do it to be efficient and crisp, but that’s lossy. It means smoothing out this messiness and boiling it down to how long the incident took. An hour and 10 minutes: is that the most interesting part of that story? You’ll note that I haven’t told you how to do the capturing, because that’s much more than a talk can cover. But regardless of how you do it, you want to support the reader. You want to write incident descriptions to be read, not just to be filed. You want to describe the data that you relied on in your analysis.

Was it just Joe who responded to the incident? And, I don’t know, did it take 10 or 15 minutes to fill out a template? Maybe there’s more than just Joe’s view on it. You want to make it easy for readers to understand terms or acronyms that they’ve not seen before, and here’s a proprietary knowledge trick: you could use hypertext linking technology. Look it up. It’s amazing. You want to have connections. Incidents are not these extra side distractions. They are a part of the work that you are doing. Remember, you’re preventing them all the time, and you want to do more of that. Use diagrams or other graphics to describe complex phenomena. Don’t be afraid of using pictures. Make it easy for others to link to the write-up document. So how can you know if you’re making progress? Well, I’ve described some of these before.

Here are some signals that can tell you that you’re making progress in the right direction. More people will actually read post-incident write-ups, and you’ll know because you’re tracking that. More people will voluntarily attend post-incident group review meetings, and they’ll participate. They’ll talk about their view, their perspective, what happened to them, and what was surprising to them. More people will link to these write-ups from code comments and commit messages, architecture diagrams, other related incident write-ups, and new hire onboarding materials. I can say now, after working with a number of organizations for a couple of years, that this happens. I know of one organization where 80 engineers voluntarily showed up to a group review meeting, and a huge majority of them added to, calibrated, and helped modify their collective understanding of the incident. Months after the incident, after a write-up has been written, people are still commenting on it. People are still linking to it. People are still reading it. Tens of people a day are reading it and sharing it with their colleagues.

I mean, this is difficult. But organizations that we know of are doing it. I will say this: your competitors are hoping that you won’t pay attention to any of it. These are markers of progress, I just wanted to point out here. I literally asked you and challenged you to pay attention to these things last year, in the talk that I gave at the DevOps Enterprise Summit. So my snarky response now is, how’s that going? So here’s the help that I would like. In the conference Slack channel, I want people to offer up their stories. I want people to challenge me on things that I’ve said in this talk, and I want people to keep the conversation about these messy details alive and moving and evolving forward. This is how we become better at learning from incidents.


- About The Authors

IT Revolution

Trusted by technology leaders worldwide. Since publishing The Phoenix Project in 2013, and launching DevOps Enterprise Summit in 2014, we’ve been assembling guidance from industry experts and top practitioners.
