The following is an excerpt from a presentation by Sidney Dekker, titled “The Human Factor: Inspiring the Pursuit of Success & Averting Drift into Failure.”
You can watch the video of the presentation, which was originally delivered at the 2017 DevOps Enterprise Summit in San Francisco.
I’m going to show you my career in a snapshot.
Now, this is not all my fault, but I was on the backend of a lot of this, so I know what chaos looks like. This is chaos, pain, hurt, suffering, dead people, lots of dead people. And so, what do I do about it?
Lots of dev.
I write a lot of books, I made a film, and then I decided, “You know what? That’s all dev and no ops.”
After you produce so much stuff, they make you a professor, and you go, “Oh my word, I have to do this for another 34 years!” Which created a kind of existential issue, and so I went all ops.
This is all ops. Somebody develops the airplane, like 50 years ago. Somebody maintains it, throws it over the wall and says “go fly!”
But I learned a couple of things in this pilot seat, which I’m going to share today, and it’s been confirmed by a lot of the research.
- The first is, anything that puts downward pressure in your organization on honesty, disclosure, openness, and learning is bad for your business. You’re going to ask for trouble.
- The second one is that any field of practice has a sweet spot when it comes to rules and standardization. But where is that sweet spot, and where is it for you? Where are you relative to it?
- And the third one, and it is really quite an important one: this fascination with counting and tabulating little negative events, as if they were predictive of a big bad event over the horizon, is an illusion. We should be doing something quite different if we want to understand how your complex system is going to collapse and fail.
Now, when it comes to driving this jet, one of the things that I’ve become sensitive to, being ops in this, is that you can throw stuff over the wall, but if you’re ops, you own the problem.
As a pilot, you have the privilege of arriving as the first person at the scene of the accident. It focuses the mind somewhat.
Who’s got this hanging off the wall?
Well, who does C++ anymore, you know? This is a really bad idea, all right.
But then I go into a warehouse, and I see this nonsense, you know.
Somebody’s counting this stuff, right? You go, oh, ‘but that’s really good, because we pursue excellence and we hold our people accountable for excellent outcomes, and excellent outcomes are the same as zero errors, and zero screw-ups and zero missed days,’ or whatever your KPI might be.
But THIS is an invitation to a big blow-up just over the horizon, and I’ve got lots of data to show that that’s the case.
Here is one example
This is DuPont, a company that is supposedly very, very safe. But they killed four people in a gas release in La Porte, Texas, not that long ago. Two of them were brothers. You’ve got three families, but one family has to bury two sons in the same week.
Now, the issue is how have they been managing their safety, their errors. If you look around the picture, you might see what they’re concerned about. It says, “Please take extra precaution when driving and walking.”
But, who got killed driving and walking? Somebody got killed in a gas release.
This focus, this obsession on “C++ mishap free days” doesn’t predict disaster at all.
Now here are a lot of statistics from MIT
They are very intelligent people, or want to come across as such. They make their statistics very complicated, but it shows something really interesting, which is that the airline that seems to be less safe actually won’t kill you. The airline that reports more incidents has a lower passenger mortality risk.
Now, that’s fascinating. We replicated this data across various domains, construction, retail, various other domains, and we see that there is this inverse correlation between the number of incidents reported, the honesty, the willingness to take on that conversation about what might go wrong, and things actually going wrong.
If you’re going to fly back from San Francisco, find the airline that’s got the most incidents, and you get on the other end alive, okay. That’s the lesson.
Now let me take you to another really dangerous world
You think you’re sort of wonky, right? And you go to the hospital, but at the hospital 1 in 13 patients that walks in the door, comes out sicker than they went in if they come out at all.
You get harmed by seeking care. 1 in 13, that’s about 7%.
Now the question is, what do you do with that one?
Well, you investigate it, and what do people find?
Typical, human errors, guidelines not followed, communication failures, miscalculations, procedural violations. And you go, “okay, fair enough. If that is how you get hurt, then let’s try not to do that.” Very simple.
What do we do? We declare a war on error. Clearly, error is a dangerous thing, so we need to declare a war on error, which is both cute, and riling, and vexing at many levels.
But then we asked, do you know why the other 12 go right? Because it’s nice to find out why the one goes wrong, but why did the other 12 go right? Do you know that?
And the answer was, ‘Uh, no. We have no idea.’
Probably because there are no human errors, and no communication failures, and people follow all the procedures, and all the guidelines are followed, and they don’t miscalculate. The decimal point goes exactly where it needs to be. That’s why we get 12 good outcomes.
We started studying this, and there is no substitute, you have to go native. You have to go to ops. And this is what we did, we went ops in this hospital. And you know what we found? In the 12 that go right…
There was no difference. Human errors, guidelines not… the same stuff shows up. And yet, you survive. What’s the difference?
And this cuts across domains, and we’re discovering that in these patterns it’s not in what people aren’t doing… It is in what they are doing.
The distinction between screwing something up, and not screwing it up, is in the presence of positive capacities, not in the absence of negatives. That’s the distinction.
- Are there people who say, this is not a good idea, stop, even in the face of acute production pressures?
- Is the team taking past success as a guarantee of today’s success? In a dynamic complex system, that is an extraordinarily bad idea. Because past success is not predictive of success today.
- How diverse is the team?
- Is there a willingness to accommodate dissent? People saying, “I disagree.”
- Is there an ability to listen to that?
Having sat in that cockpit, I have experienced that the reward for not speaking up is much greater than the reward for speaking up. The reward for speaking up is uncertain and delayed. I may not even know that speaking up saved my own life.
The reward for shutting up is immediate, direct and certain. As in, I don’t get into social and reputational trouble with my captain.
But this is not the only thing we need to talk about
Let’s talk about rules and regulations.
Now, as I said, there is a sweet spot when it comes to rules.
This is a square that used to have all these lights; thankfully it’s all gone now. But something fascinating happened here. This square had pretty bad accidents, about 10 a year, with some pretty bad outcomes. Then a traffic engineer said, let’s take out all the rules, all the lights, everything. After doing that, they’re down to one accident, because this has become a system in which people actually try to divine each other’s intentions.
And spontaneously, they all slow down to the slowest common denominator on the square. No rules. Because nobody’s telling them to do that, there’s no sign.
This is literally horizontal coordination, the beauty that we see in a complex system. I just wanted to show you those pictures to get you thinking about the sweet spot. Full autonomy, like here, or completely clogged with rules. There is a sweet spot. Aviation has long overshot that sweet spot, clogging itself with more and more rules.
If we want to understand, in complex systems, how things really are going to go badly wrong, what we shouldn’t do is try to glean predictive capacity just from the little bugs, and incidents, and the error counts that you do.
No, we need to understand how success is created, and I’m going to try to explain how that works.
A good colleague of ours in the trade tries to paint it like this. He says, “Much more goes right than goes wrong.” And this is probably true for all the work that we do. Much more goes right. And then when things go wrong, we do post-mortems, we send in the hordes to try to find out what happened, but we can also learn from what goes right.
Now, my claim is going to be that not only should we be doing that because there’s lots of data to learn how things go right, but also because for us to know how things will really go wrong, we need to understand how they go right, and here’s how.
let’s first talk about Abraham Wald
Abraham Wald the father of operations research.
He was born in 1900, Austrian Hungarian Empire, and which for Jewish boys, not a good place to be to try to go to university in the 20s, 30s. And so, he immigrated, got to the US. Armed forces understood that he was very good at math, in particular statistics. They send him back to England, to solve the following problem. Bombers are coming back from Germany and they’ve got holes in them. Let’s call them ‘bugs’, all right. These bombers come back with lots of bugs. If you’re a pilot, that’s not cool.
And so, what they wanted was some predictive capacity for where to put some extra armor on these airplanes. Now, armor and airplanes are not good bedfellows. It’s not payload, it’s not fuel, and so it’s just dead weight, literally. You want to be very judicious with where you put it.
The question to Abraham was, where should we put the armor? Well, he says, let me get the data. So he measures, and he calculates, and months go by. People get very impatient.
They want to put the armor where the holes are most likely to show up, right?
But Abraham says “Nein, we need to put the armor where there are no holes because those are the ones that don’t make it back.”
It’s such an important lesson, colleagues.
Put the armor where there are no holes because those are the ones that don’t make it back, those are the ones where the server will not come back.
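Wald’s reasoning can be made concrete with a toy simulation. This is my illustration, not something from the talk: the section names, hit rates, and loss probabilities are all invented for the sketch. Damage lands uniformly across the airframe, but hits to the engine section are far more likely to down the plane, so the returning fleet shows almost no engine holes, exactly the survivorship bias Wald saw through.

```python
import random

# Toy illustration of survivorship bias (invented numbers, not real WWII data):
# hits land uniformly across four sections, but an engine hit usually downs
# the bomber, so engine holes are rarely seen on the planes that return.
random.seed(42)

SECTIONS = ["fuselage", "wings", "tail", "engine"]
LOSS_PROB = {"fuselage": 0.05, "wings": 0.05, "tail": 0.05, "engine": 0.8}

returned_holes = {s: 0 for s in SECTIONS}
lost_holes = {s: 0 for s in SECTIONS}

for _ in range(10_000):
    hit = random.choice(SECTIONS)          # damage is uniform across sections
    if random.random() < LOSS_PROB[hit]:   # engine hits rarely make it home
        lost_holes[hit] += 1
    else:
        returned_holes[hit] += 1

# The returning fleet shows few engine holes -- which is precisely where the
# armor belongs, because those are the hits that never made it back.
print("holes seen on returning bombers:", returned_holes)
print("holes on bombers that were lost:", lost_holes)
```

Counting only the survivors, the engine looks like the safest place on the airplane; counting the whole fleet, it is the deadliest. That is the inversion Wald’s “put the armor where there are no holes” captures.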
Then I go out to the Australian outback.
His name is Nick, and he’s from Nottingham. There are lots of stickers on his hard hat, and there’s a little sticker on the back of his hard hat that says GSD.
I’m walking to another guy for the Midlands, and I say, “what’s this GSD thing?”
And he says, “Oh no, that just means get stuff done.”
And so, when I want to understand where the next fatality in that world is going to come from, you think that I’m going to look at the incidents, and the errors, and the bugs?
No, I’m going to look at the place where there’s no bugs and holes because that might be the one that won’t make it back.
I want to understand how Nick creates success because that is where failure is going to hide. Death will hide in his successes. I need to understand how Nick gets stuff done under resource constraints, goal conflicts, limited time, pressures, supervisory pressures, things that he needs to manage and control every day.
How does he still get it done? Because something has to give, and what is that?
I want to understand how Nick creates success.
Now let me take you to Greece
Here’s a diagram of one of their runways, which brings up the question of how does this airport look to you?
So, the big black thing down the middle is the runway. And there is that little white bar across the top near B, that is actually what we call the displaced threshold. You’re not supposed to land before that thing.
Because there is a lot of geology north of the runway: rocks. Lots of rock, which you need to overfly in order to make it to the runway safely.
However, and here’s the issue, you want to pull off at Charlie, at the taxiway, because you want to go to the terminal building because there’s always pressure on turnaround times. Airplanes on the ground, lose money. Airplanes in the air, make money.
And so, you want to get them up. But taxi time is the worst thing in the world in order to get airplanes to turnaround quickly, and you know this from all the airports you fly out from.
You have to taxi all the way down to the little turning circle at the end of the runway, all the way back up, and then into the ramp.
Now, that takes five, six, seven minutes right there. If you have to turn around a 737 in 30 minutes, 25 minutes, that’s going to hurt so bad.
You designed the system like this, you put that pressure on, and you know what behavior you get?
This is not photoshopped, and the other beautiful thing is, the traffic light is not controlled by the tower. It is pure chance.
What’s really cute is the little warning light on top of the traffic light!
But, do you think that this will be reported as an incident by the people up front? No. This is normal work, this is GSD.
They might have that imprinted on the back of their little caps. “I’m a GSD pilot, I get stuff done.”
However, if I want to understand how people are going to die, I am not going to look at the bugs, I’m going to listen to Abe Wald, and not look at the holes, and the bugs, and the little incident reports about this, that or the other irrelevant thing.
I need to study success. I need to understand how stuff gets done, and that goes for you, as well, both in dev and in ops. How does stuff get done despite the constraints?
How can this be safe for years? What is it that these people are doing to make it work despite the constraints and obstacles? That’s the question.
Understand how success is created, and it will take you to where failure is going to come from.