John Allspaw is Co-founder of Adaptive Capacity Labs and former Chief Technology Officer of Etsy. As an engineering leader and researcher with over 20 years of experience in building and leading teams engaged in software and systems engineering, Allspaw has spent the last decade bridging insights from Human Factors, Cognitive Systems Engineering, and Resilience Engineering to the domain of software engineering and operations.
Also the author of two books, “The Art of Capacity Planning: Scaling Web Resources” and “Web Operations” (O’Reilly Media), Allspaw continues to contribute to the IT and DevOps communities through speaking and collaboration on new, exciting research.
In fact, we were lucky enough to host John at the last DevOps Enterprise Summit in San Francisco, where he took to the stage to talk about “How Systems Keep Running Day After Day.”
Below, we’ve transcribed the key takeaways and main highlights of Allspaw’s presentation. Enjoy!
John Allspaw at DOES17 San Francisco
How Your Systems Keep Running Day After Day
What I want to talk about is new. It is different, and I feel very, very strongly about this.
To help set the stage, my thesis for my degree in Human Factors and System Safety was “Trade-Offs Under Pressure: Heuristics and Observations Of Teams Resolving Internet Service Outages.”
Some of you may have heard of what’s called the Stella Report.
At a high level, this report is the result of a year-long project by a consortium of industry partners: IBM, Etsy, and IEX, a trading exchange in Manhattan. Over this year, folks from the Ohio State University Cognitive Systems Engineering Lab (David Woods, Richard Cook, and a number of others) looked deeply at an incident in each of those organizations.
They found these six themes, which were common across all of them.
Certainly the results are quite important. It’s how that research was done that I want you all to take a look at.
Here are my main takeaways from the report:
- We have to start taking human performance seriously in this industry. If we don’t, we will continue to see brittle systems with ever-increasing impacts on our businesses and on society.
- We can do this by looking at incidents in ways that go beyond what we currently do in postmortems, post-incident reviews, or after-action reviews.
- There do exist methods and approaches from the study of resilience in other domains, but they require real commitment to pursue. Doing this is both necessary and difficult, but it will prove to be a competitive advantage for businesses who do it well.
First, I want to start with a little bit of a baseline, a bit of a vocabulary that’s going to be important as I sort of walk you through this. I’m going to describe a sort of picture, a representation, like a mental model of your organizations, and it’s going to have an above-the-line region and a below-the-line region.
If you imagine what we have depicted here, this is your product, your service, your API, or whatever your business derives value from and gives to customers. Okay? Inside there, what you see is your code. You see your technology stack. You see the data and some various ways of delivering this, right? Presumably over the internet or some other sort of way. But if we stay here, nobody’s going to believe me that that’s what we call the system, because it’s fine, but it’s not really complete.
What’s really connected, and what a lot of people have been talking about here in the DevOps Enterprise Summit community is all the stuff we do to manipulate what goes on in there, and so we have testing tools. We’ve got monitoring tools. We’ve got deployment tools and all of the stuff that’s sort of wired up. These are the things that we use. You could say that this is the system, because many of us spend our time focused on those things that are not inside the little bubble there, but all of the things that are around it, but if we were to stay just with this, we won’t be able to see where real work happens.
What we’re going to do here is, we’re going to draw a line that we call the line of representation, and then dig a little deeper. What we see here is you. All the people who are getting stuff ready to add to the system, to change the system. You’re doing the architectural framing. You’re doing monitoring. You’re keeping track of what it’s doing, how it’s doing it, and what’s going on with it.
Now, you’ll notice that each one of these people has some sort of mental representation of what that system is. If you look at it a little bit more closely, you’ll see that none of them are the same. By the way, that’s very characteristic of these types of roles. Nobody has the same representation of what is below the line.
To summarize, this is our model of the world, and it includes not just the things that are running there, but all of you, the kinds of activities you’re performing, the cognitive work that you’re doing to keep that world functioning. If we play with this a little bit more, we end up with this kind of model. This model has a line of representation going through the middle, and you interact with the world below the line via a set of representations.
Your interactions are never with the things themselves. You don’t actually change the systems.
What you do is that you interact with the representation and that representation is something about what’s going on below. You can think of those green things as the screens that you’re looking at during the day, but the only information that you have about the system comes from these representations. They’re just a little keyhole. Right?
What’s significant about that is that all the activities that you do, all of the observing, inferring, anticipating, planning, correcting, all of that sort of stuff has to be done via those representations. So there’s a world above the line and a world below the line, and although we mostly talk about the world below the line as if it’s very real, as if it’s very concrete, as though that’s the thing, here is the surprise.
Here is the big deal – you never get to see it.
It doesn’t exist. In a real sense, there is no below the line that you can actually touch. You never, ever see code run. You never, ever see the system actually work. You never touch those things.
What you do is that you manipulate a world that you cannot see via a set of representations, and that’s why you need to build those mental models, those conceptions, those understandings about what’s going on. Those are the things that are driving that manipulation. It’s not the world below the line that’s doing it. It’s your conceptual ability to understand the things that have happened in the past, the things that you’re doing now and why you’re doing those things, what matters, and why what matters actually matters.
Once you adopt this perspective, once you step away from the idea that below the line is the thing you’re dealing with, and understand that you’re really working above the line, all sorts of things change.
What you see in the Stella Report, in that project and in other projects that we’ve been engaged with, is taking that view and understanding what it really means to take the above-the-line world seriously. This is a big departure from a lot of what you’ve all seen in the past, but I think it is a fruitful direction that we need to take.
In other words, these cognitive activities (see below), both in individuals and collectively in teams up and down the organization, are what make the business actually work. Now, I’ve been studying this in detail for quite a while, and I can tell you this: it doesn’t work the way we think it does.
Finally, to set this frame up, the most important part of this idea is that all of this changes over time. It is a dynamic process that’s ongoing. This is the unit of analysis. Once we take that frame, we can ask some questions. We can ask some questions about above the line like this.
“How does our software work really, versus how it’s described in the wiki and in documentation and in the diagrams? We know that those aren’t comprehensively accurate.”
“How does our software break really, versus how we thought it would break when we designed safeguards and circuit breakers and guardrails?”
“What do we do to keep it all working?”
Question: Imagine your organization. What would happen if today at six o’clock everyone in your company took their hands off the keyboard? They don’t answer any pages. They don’t look at any alerts. They do not touch any part of it, application code or networks or any of it. Are you confident that your service will be up and running after a day?
The question then is how to discover what happens above the line. Well, there are a couple of things. We can learn from the study of other high-tempo, high-consequence domains, and if we do, we can see that we can study incidents. (Note: when I say “incidents,” I mean outages, degradations, breaches, accidents, near-misses, and glitches – basically untoward or unexpected events).
What makes incidents interesting? Well, the obvious one is lost revenue and reputation impacts on a particular business. I want to assert a couple of other reasons why incidents are interesting. One is that incidents shape the design of new components, subsystems, and architectures. In other words, incidents of yesterday inform the architectures of tomorrow. That is, incidents help fuel our imaginations on how to make our systems better. What I mean is, incidents below the line drive changes above the line.
That’s the thing. This can cost real money. Incidents can have effects that are sometimes almost tacit or invisible, sometimes significant. Right now, a lot of people are splitting up a monolith into microservices. A lot of people do that because it provides some amount of robustness that you don’t have. Where do you get that?
You’re informed by incidents.
Another reason to look at incidents is that they tend to give birth to new forms of regulations, policies, norms, compliance, auditing, constraints, etc. Another way of saying this is that incidents of yesterday inform the rules of tomorrow, which influence staffing, budgets, planning, roadmaps and more. Let me give you an example: In financial trading, the SEC has put into place Regulation SCI. SCI is probably the most comprehensive and detailed piece of compliance in the modern software era. The SEC has been very explicit. We have this as a reaction to the flash crash of 2010, to Knight Capital, to the BATS IPO, to the Facebook IPO. It is a reaction to incidents.
Even if you go back a little bit further, it’s often cited that PCI DSS came about when MasterCard and Visa compared notes and realized they had lost about $750 million over 10 years. By the way, as a former CTO of a public company, I can assure you that this is a very expensive, distracting, and inevitably burdensome albatross for all of your organizations. Incidents are significant in this way too.
But think about incidents as opportunities. Think about incidents as encoded messages that below the line is sending above the line, and your job is to decode them. Think about incidents as things that actively try to get your attention to parts of the system that you thought you had a sufficient understanding of but didn’t. These are reminders that you have to continually reconsider how confident you are about how it all works.
Now, if you take this view, a whole bunch of things open up. There’s an opportunity for new training, new tooling, new organizational structures, new funding dynamics and possibly insights that your competitors don’t have.
Incidents help us gauge the delta between how your system works and how we think your system works, and this delta is almost always greater than we imagine. I want to assert perhaps a different take than you might be used to, and it’s this. Incidents are unplanned investments in your company’s survival. They are hugely valuable opportunities to understand how your system works, what vulnerabilities in attention exist, and what competitive advantages you are not pursuing.
If you think about incidents, they burn money, time, reputation, staff, etc. These are unavoidable sunk costs. Something’s interesting about this type of investment, though. You don’t control the size of the investment, so therefore the question remains, how will you maximize the ROI on that investment?
When we look at incidents, these are the types of questions that we hear, and it’s quite consistent with what researchers find in other complex-systems domains. What’s it doing? Why is it doing that? What will it do next? How did it get into this state? What is happening? If we do Y, will it help us figure out what to do? Is it getting worse? It looks like it’s fixed, but is it? If we do X, will it prevent it from getting worse, or will it make it worse? Who else should we call that can help us? Is this our issue, or are we being attacked? This is consistent with many other fields, such as aviation and air traffic control, especially in automation-rich domains.
Another thing that’s notable is that at the beginning of any incident, it’s often uncertain or ambiguous whether this is the one that sinks us. At the beginning of an incident, we simply don’t know, especially if it contains huge amounts of uncertainty and huge amounts of ambiguity. If it’s uncertain and ambiguous, it means that we’ve exhausted our mental models. They don’t fit with what we’re seeing, and those questions arise. Only hindsight will tell us if that was the event that brought the company down or if it was a tough Tuesday afternoon.
Incidents provide calibration about how decisions are focused, about how attention is focused, about how coordination is focused, about how escalation is focused. They provide calibration about the impact of time pressure, the impact of uncertainty, the impact of ambiguity, and the consequences of consequences. Research validates these opportunities.
We should look deeply at incidents as “non-routine challenging events, because these tough cases have the greatest potential for uncovering elements of expertise and related cognitive phenomena.”
– Gary Klein, the originator of naturalistic decision-making research.
There’s a family of well-worn methods, approaches and techniques: cognitive task analysis, process tracing, conversational analysis, the critical decision method. How we think postmortems have value looks a little bit like this:
An incident happens. Maybe somebody will put together a timeline. We have a little bit of a meeting. Maybe you’ve got a template, and you fill that out, and then somebody might make a report or not, and then you’ve got, yeah, action items, finally. We think that the greatest value, perhaps maybe the onliest value, is where you’re in a debriefing and people are walking through the timeline and you’re like, “Oh, my God. We know all this.”
This is not what the research bears out. The research bears out that if we gather subjective and objective data from multiple places (behavioral data: what people said, what people did, where they looked, which avenues of diagnosis they followed that weren’t fruitful), and if well-facilitated debriefings get people to compare and contrast their necessarily flawed mental models, you can produce different results, including things like bootcamp, onboarding materials, new hire training. You can have facilitation feedback if you build a program to train facilitators. You might make roadmap changes, really significant changes based on what you learn.
I can tell you this from some experience. There is nothing more insightful to a new engineer or an engineer just starting out in their career than being in a room with a veteran engineer who knows all of the nooks and crannies explaining things that they may not have ever said out loud. They have knowledge. They may draw pictures and diagrams that they’ve never drawn before because they think everybody else knows it. Guess what? They don’t. The greatest value is actually here, because the quality of these outcomes depends on the quality of that recalibration. This is an opening to recalibrate mental models.
From the Stella Report, it “informs and recalibrates people’s models of how the system works, their understandings of how it’s vulnerable and what opportunities are available for exploration.”
In all of the research contained in the Stella Report, and it fits with my experience at Etsy as well, one of the strongest reflections from people who do this comparing and contrasting in a facilitated way is, “I didn’t know it worked that way.” Then there’s always the other: “How did it ever work?” Which is funny until you realize it’s serious. What that means is, not only did I think it worked a different way; now I can’t even draw a picture in my mind of how it could have possibly worked. That should be more unsettling. By the way, I want to say this is not alignment. Like I said, via representations, we necessarily have incomplete mental models. The idea is not to have the same mental models, because they’re always incomplete, because things are always changing, and because they’re going to be flawed. We don’t want everybody to have the same mental model because then everybody’s got the same blind spots.
“Blameless” is table stakes. It’s necessary, but it’s not sufficient. You could build an environment, a culture, a sort of welcoming organization that supports and allows people to tell stories in all of the messy details, sometimes embarrassing details, without fear of retribution, so that you could really make progress in understanding what’s happening. You can set that condition up and still not learn very much. It’s not sufficient. It’s necessary, but not sufficient. What I’m talking about is much more effort than typical post-incident reviews. Right? This is where an analyst, a facilitator, can prep by collating, organizing, and analyzing behavioral data: what people say, what people do. There’s a raft of data that they can sift through to prep for debriefings, a group debriefing, or a one-on-one debriefing, going beyond … Postmortems hint at the richness of incidents. Following up on this takes a lot of work.
By the way, everyone’s generally so exhausted after a really stressful outage or incident or event that sometimes everything becomes crystal clear. That’s the power of hindsight, and because it seems so crystal clear, it doesn’t seem productive to have a debriefing, because you think you already know it all. The other issue is that postmortem briefings are constrained by time as well. You only have the conference room for an hour or two. Everybody is really busy, and the clock is ticking, so this is a challenge for doing this really well, even given those research methods.
The other issue is that, even if you build a debriefing facilitation training program like I did at Etsy, there are still challenges that show up. What I like to call it is, “Everyone has their own mystery to solve,” or, “Don’t waste my time on details I already know.” In a cartoonish way, you can think about it this way:
Because you may only have an hour, you need to extract as much learning as you can. All work is contextual. Your job to maximize ROI is to discover, explore, and rebuild the context in which work is done in an incident: how work was done and how people thought above the line.
Assessments are trade-offs, and those are contextual.
In closing, all incidents can be worse. A superficial view is to ask, “What went wrong? How did it break? What do we fix?” These are very reasonable questions. If we were to go a level deeper, we could ask, “What are the things that went into making it not nearly as bad as it could have been?” Because if we don’t pay attention to those things and don’t identify those things, we might stop supporting those things.
Maybe the reason why it didn’t get worse is because somebody called Lisa, and Lisa knows her stuff. Something from research is that experts can see what is not there. Suppose you don’t support Lisa, and you don’t even identify that the reason it didn’t get worse is that Lisa was there. Forget about action items for fixing something for a moment. Imagine a world where Lisa goes to a new job.
A more useful question at a strategic level is: “How can we support, encourage, advocate, and fund the continual process of understanding in our systems, and really take ‘above the line’ seriously in a sustained way?”
Where do we go from here? I’ve got some challenges for you:
- Circulate the Stella Report in your company and start a dialogue. Even if you’re too busy or you’re not in a position to read it yourself, give it to people who are. Ask them what resonates. Ask them what doesn’t make sense. Start a dialogue.
- Look deeply at how you’re handling post-event reviews. Most importantly, go find the people who are the most familiar with the messy details of how work gets done and ask them this: “What value do you think our current post-incident reviews really have?” and listen.
- Take the responsibility to learn more and faster from incidents than your competitors. You’re either building a learning organization or you’re losing to one that is.
- We need to take human performance seriously. This discussion is happening. It’s happening in nuclear power. It’s happening in medicine. It’s happening in aviation, air traffic control, in firefighting.
The increasing significance of our systems, the increasing potential for economic, political, and human damage when they don’t work properly, and the proliferation of dependencies and associated uncertainty all make me very worried. If you look at your own system and its problems, I think you’ll agree that we have to do a lot more than acknowledge this problem. We have to embrace it. Here is what you can help me with: please spread this information, these ideas, and my presentation from DevOps Enterprise Summit San Francisco 2017.
I want to hear from you. What resonated with you about this? What didn’t? What challenges do you face in your org along these lines? Come tell me. I’m on Twitter.