The following is an excerpt from a presentation by Dr. Richard Cook, Research Scientist at The Ohio State University, titled “Working at the Center of the Cyclone.”
You can watch the video of the presentation, which was originally delivered at the 2018 DevOps Enterprise Summit in London.
There are some problems that we have in the center of the cyclone, that being the place where people are trying to keep the systems that you build running.
And in relation to those, here are the three things that I want to talk about today:
- Clarifying the system. By describing what the system actually is, we get a completely different view of the world. I'm hoping that this will change everything for you; it's something we've been working on for a while.
- The idea that our experience with incidents doesn't fit our paradigm. The problematic nature of the incidents we're having should inform how we approach systems differently. Unfortunately, there's a whole bunch of problems with that.
- A theoretical question. I’m going to ask you to think about whether or not one of the things that we derive from this view of the system might actually be true.
What I would like to share with you is a set of research results that have been going on since we began this work around 1987. We’ve done this work in a whole bunch of different areas. We’ve done it in nuclear power plants, aviation, military systems, semiconductor wafer fabrication and a whole bunch of different places.
The results are basically the same across all these industries, including yours.
It’s not surprising that your system sometimes fails.
What’s surprising is that it ever works at all.
If you were to look inside the systems that you folks build, the conclusion that we would come to is that none of it should work.
The way you do it, the way you build it, what you’re building it on, it should never work at all. In fact, it’s quite surprising that it works at all. So, it’s not surprising that it sometimes fails. Failure is the normal function of your systems, not the abnormal one.
Systems fail — they fail all the time. For each celebrated failure that you read about in the press, there are hundreds or even thousands of failures happening all the time. In fact, I've been unable to find any large enterprise that doesn't experience at least one significant outage a day.
That doesn’t mean that the customer notices it. But it does mean that there are people scrambling around in the background trying to keep the thing alive before the customer does. Anybody in here who doesn’t think that is happening in your system should probably look a little more closely at what’s going on.
There is some good news.
The good news is that every outage could’ve been worse. Even the outages where you’re down for several hours. Those could’ve been worse.
The not so good news is something is always broken.
There’s never a time when you don’t have broken stuff. Rather than imagining that you could somehow build a system that doesn’t break down and is somehow protected against that, what you might do is think that the systems that you’re dealing with are themselves intrinsically failing and failure prone.
And of course, all of the bits of your system are constantly changing. It’s never static. It never stops. It’s complex, but complexity and change are actually the same things. There is no difference, truthfully, between complexity and change. Anything that is complex will necessarily be changing. Anything that is continuously changing will necessarily be complex. They’re synonyms.
The big problem is that nobody knows what’s going to matter next. Everybody is trying to figure out what is going to be the problem next. Nobody knows. People who stand up and say ‘this is what the next problem is going to be’ are inevitably proven wrong.
For example, a lot of you are very focused on continuous deployment. I think continuous deployment is a great idea, but it requires continuous attention, scrutiny, and recalibration. That's the downside, the dark side of continuous deployment.
But recalibration is actually the critical idea; it goes to the core of what your system is. I'm going to talk a little bit about recalibration: why we need it, how it's being done, how we can afford to do it (because it's very expensive), and how we can do it better.
Other domains have similar problems.
There are lots of domains other than IT that are confronting complexity, change, and constant failures, like medicine. However, there's also a lot of resilience in your system that's already present and being used. Resilience is not something you're going to go out and buy; it's already there.
Then there are diverse efforts underway to try and make systems work better. There's a group of us who have been studying this; we call ourselves the 'SNAFU Catchers.' It includes companies like Salesforce, IBM, New Relic, KeyBank, IEX (which is an exchange company), Etsy, etc. 'SNAFU Catchers' comes from a term used in the Second World War: SNAFU stands for Situation Normal, All Fouled Up.
Quick backstory: In World War II there was a group of people, called the SNAFU Snatchers, who flew flying boats out as rescue planes to pick up downed sailors and airmen.
This idea of SNAFU snatching, or catching, is a very important one because it illustrates very well the theme that I'm trying to get at. The boat itself is a kind of tooling, but the important thing about SNAFU snatching was that in order to make use of this tooling, it was necessary to have a whole organization and group that was tuned to this mission. It was built over the space of six months, functioned very well for about three years, and recovered huge numbers of downed aircrew in the Pacific Ocean. It was enormously successful.
But here’s the key, you build things differently when you expect them to fail. Let me be very clear about this.
You build things differently when you expect them to fail.
Failure is normal. The failed state is the normal state. The way you deal with failure is not by pretending that you're going to build systems that are completely immune to failure, but by building an organization that is able to recover from these failures and restore the functionality.
The bottom line here is…
There’s a lot of stuff that’s happening on the other side of deploy.
There’s always another story which is on the other side of deploy. What’s happening over there? What’s going on after deploy that makes these systems functional?
The first thing that you realize when you look at this is that we don't have a very good description of what our system is. Most of the time when you say to someone, 'Tell me what your system is,' they'll give you a description like this:
“We’ve got some internally sourced stuff: the software that we run, our applications. We’ve got some delivery stack, the stuff that delivers this thing to the outside world. And we’ve got this externally sourced stuff: databases and things like that. That’s the system.”
But, if you step back from it for just a moment, what you see is there’s a lot of other stuff going on.
There are code-generating tools, code libraries, test cases, deployment tools, organizing tools, container tools, as well as some external services and monitoring. Really we ought to include all of that in the system because that's how you make it happen, so that is part of The System.
As soon as you do this, though, you realize that you have other things going on as well. You have people who are writing the code. They’re making it and getting it ready to run. They’re framing and building architectures. They’re keeping track of what the system is doing. They’re engaged in all of this, and they’re doing it continuously and intimately in ways that don’t allow you to really separate that from the code.
By the way, there’s lots of communication going on between those people. They’re engaged in a bunch of thinking about what’s going on. They’re building mental models of the system.
Those mental models are unique. Sometimes they’re deep, sometimes they’re shallow. They’re always incomplete. They’re constantly being tested and they’re always being recalibrated.
Another thing that's of interest here is that these people are seeing the system through a bunch of representations. If you think about the code-generating tools and all the rest of these, those are all representations of the system.
They’re the screens, they’re what people are actually seeing. They’re not actually seeing the system. What are they doing? They’re observing, they’re inferring, they’re anticipating, they’re planning and diagnosing. That’s the work which we see as a group of activities when you and I look at them, but what they’re doing is something else.
What I want the key message here to be is, these representations form a kind of line. I call it the Line of Representation.
There’s what is ‘above the line’ and ‘below the line.’ What’s significant about this is you never get to see what’s below the line. ‘Below the line’ is completely invisible. The only things that you can see are the representations of what is down there. This is a challenge to you because many of you believe that you know what’s down there, but you never see it, you can never touch those, they are invisible to you. All you can see are the representations.
All the action in your system is ‘above the line.’ Everything that happens, is in fact, ‘above the line.’ This is a really difficult thing to get your mind around. But from this, you will realize that incidents, when they occur, are occurring up here, not down here. An incident is something that occurs in the mind of the person who’s looking at the representation. Not something that’s occurring in the system ‘below the line.’
This is a really different way of looking at the world.
I think if you test it for a moment, you’ll realize that it’s correct. We never ever see what’s ‘below the line.’ We simply infer its presence from the representations that we make and manipulate.
As a consequence, the mental models that people have are not the same model. They're very much different, because people are looking at different representations of the system and forming those models in different ways. There is no privileged mental model. There's no model that is The System. There are only different mental models, which can be compared, contrasted, and tested in a variety of ways.
This brings us to this idea about incidents not fitting the paradigm.
The way we understand systems is not by asking you how they work; it's by watching how you deal with incidents.
The study of incidents is the thing that reveals how systems actually work and what's going on. We use incidents to tell us, and there are a few different kinds of incidents.
- There’s the ‘this might be an incident,’ incident.
- Then there’s the ‘incident,’ incident.
- Then there’s the incident, which might be the ‘big incident.’
- Then there’s the ‘OMG incident,’ which is thankfully fairly rare.
And all the incidents look the same at the beginning. The ‘OMG incident’ looks like the ‘this might be an incident,’ incident at the very beginning.
Ordinary firms are experiencing one to five acknowledged incidents per day.
That’s one to five events that occur where people have to figure out, is this a minor thing, is this an incident, is this an important incident, is this an OMG incident, and respond to it. This is normal.
And managing incidents has become a thing. In some cases, we found as many as 40 people responding to an incident, all checking into the Slack channel after the incident is declared to say, 'I don't know what's going on.' You have multiple channels in your IRC now. It's not just the war room channel. It's the 'customer channel.' It's the 'What are we going to tell the boss?' channel. It's the 'Well, we don't really know what's going on but we have to put something out for the media' channel.
In fact, the structure of your incident channels is very much a map of the functional organization.
- You’ve got rules of behavior. ‘Don’t talk about this in the channel. Talk about that in the other channel.’
- You’ve got formal escalation policies. ‘If the incident has lasted for more than 36 minutes, it’s going to be declared a category one severity and then the heat will really be on. Let’s get it fixed before then.’
This is the way it’s really being done out there.
All the while, almost everybody is involved in some sort of automation experiment. Everybody is building bots or applications to manage the incidents. Automatically when you declare an incident of a particular severity, if a certain period of time goes by without it being fixed, the robot will say, ‘oh, well, time to go up to the next level.’ And all of a sudden, you’re at the next level.
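As a hedged sketch of the auto-escalation bot described above (not any particular vendor's tooling — the severity ladder, the `Incident` type, and the 36-minute threshold echoing the policy quoted earlier are all illustrative assumptions), the core logic might look like this:

```python
from dataclasses import dataclass

# Hypothetical severity ladder, from least to most urgent.
SEVERITIES = ["sev3", "sev2", "sev1"]

@dataclass
class Incident:
    declared_at: float  # epoch seconds when the incident was declared
    severity: str = "sev3"

def maybe_escalate(incident: Incident, now: float,
                   threshold_seconds: float = 36 * 60) -> bool:
    """If the incident has been open past the threshold without being
    resolved, bump it one severity level -- the robot saying
    'time to go up to the next level.' Returns True if it escalated."""
    open_for = now - incident.declared_at
    level = SEVERITIES.index(incident.severity)
    if open_for >= threshold_seconds and level < len(SEVERITIES) - 1:
        incident.severity = SEVERITIES[level + 1]
        return True
    return False
```

In real incident tooling this check would run on a timer and post the escalation into the relevant channel; the point is that the escalation decision is mechanical, made on elapsed time rather than on anyone's understanding of what is actually wrong.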
The interesting thing about this is that, unlike the deploy side, there are very few pieces of automation over here. It's as though you ignored the other side of deploy.
Why? Well, because you keep imagining that at some point you’re going to have this perfect development process which is going to just make all that stuff go away.
Learning from incidents is actually essential but it’s really quite hard.
Maybe you do some kind of Post Incident Review, which is usually regarded as a chore by the people in your organization. People don't want to be in those incident reviews; they only go to them because they have to. The focus is also what I call microfracture repair: 'This is broken, fix it, now go on.' There are virtually no deep lessons learned from this.
It doesn't have to be this way, but this is the way that it functionally is.
It's largely because the pace of incidents is so high that nobody has time to do anything except the microfracture repair.
“As the complexity of the system increases, the accuracy of any agent’s model of that system decreases.” This is a statement from David Woods. It’s a really important one because I think it goes to the core problem.
Our ability to understand the systems that we have is being stripped away by the increased complexity of the systems that we are building. That means that no individual agent can have an accurate model of the system. All we have are these models, and we have to exert constant effort in order to keep those models up to date and keep them coherent across the group.
It turns out that this is a really fundamental problem, it’s what we call the recalibration problem. Which is that as we have experience with this system, as it generates behaviors that we can understand ‘below the line’ we have to be constantly updating those mental models ‘above the line’ so that we get a representation that is useful to us in the next round of dealing with the system.
There is a wisdom in incidents.
The wisdom in incidents is that they point to the specific places where the mental models of the system are out of phase with what is functioning ‘below the line.’
If I tell somebody who's coming into the world that I work in, 'I want you to know the system by going and reading all these manuals,' we're never going to see that person again. In fact, it's an impossible task. The most important information about where people are uncalibrated about what the system is doing is the incidents that you are having. They are unsigned pointers that point to areas of the system that are of interest. They're untyped. When you point to that area, you're not pointing to anything specific. Your job is to figure out what is happening there in order to interpret that result.
Incidents are messages sent from this thing 'below the line' about how the system really works. They're the only important pieces of information that can lead to a recalibration of a stale and inaccurate mental model. They are also the props for engaging in calibration and recalibration activities, but we're not using them that way. We're using them as signals of microfractures that need to be repaired.
The problem that we encounter now is that you’re shifting the kind of role that you’re in and making it harder in a variety of ways. There are ever more dependencies that aren’t yours to deploy. What I’ve diagrammed as ‘external services’ is becoming a bigger and bigger part of the world that you work in. Many of the incidents that we see now are not incidents related to the code that was written by people who work for you, but the functions of the systems that are provided by things outside. The most obvious one that people had trouble with is the single sign-on or identity management, but there are lots and lots of others.
We are finding fewer and fewer deploy-related events. That is, this idea we had — that if we keep these changes small, or make them fast enough, they'll break the system immediately, and we'll immediately know what was wrong and be able to back it out — is no longer true.
What we’re doing is injecting new modes of vulnerability.
New kinds of failures into the system that play out over longer periods of time.
As a consequence, we’re finding it more and more difficult to build mental models of the system that accurately represent what’s going on. The kind of things that used to work when we were responsible for the whole picture and could break the system in these little tiny bits are no longer playing out. It turns out that less than half of the incidents that we see are related to the last deploy, which completely throws the idea of roll back out the window. If you think rollback is keeping you safe, you better think again.
The step change in complexity is not being met by the monitoring tools that we have been building into the systems. We are building in complexities at a rate that is much higher than would be the case if we were writing the code ourselves.
The ‘look back, roll back’ technique is becoming less useful. You’re reaching a crisis state. “As the complexity of a system increases, the accuracy of any agent’s model of that system decreases.” It is becoming harder and harder to keep the mental models refreshed.
Complexity is change.
This is a hypothesis that I have for you.
- Structure and function above and below the line are interwoven. They're not separate things. The system includes 'above the line' and 'below the line.' If you don't believe that, you are absolutely sunk. You won't be able to get anywhere. The system is what's 'above the line' and 'below the line,' as they're interacting constantly.
- Any analysis that is based on looking at one side or the other will fail. Regardless of whether it's a management-focused analysis or a technical-system-focused analysis, it will fail. Only analysis that takes both and looks at the interaction will be successful.
- The changing pattern of incidents is pointing to the places where recalibration is likely to be valuable. That's the value of incidents. Incidents are bits of wisdom, and you ignore them at your peril.
- Change and complexity are the same entity. Complexity is impossible without change. Change flows from complexity. They are not distinct things. This is a dynamic world. Any kind of static analysis you do is wrong. Anything you do that is not dynamic is wrong.
- All the distributed system behaviors that you find ‘below the line,’ you will also find ‘above the line.’ All the distributed system qualities that you know about that exist ‘below the line,’ you’ll find ‘above the line.’