The following is an excerpt from a presentation by Brent Chapman, Principal at Great Circle Associates, titled “Mastering Outages with Incident Command for DevOps: Learning from the Fire Department.”
You can watch the video of the presentation, which was originally delivered at the 2018 DevOps Enterprise Summit in Las Vegas.
A little bit about my background
I’m a sysadmin, network engineer, programmer, architect, etc. I’ve been in Silicon Valley for about 30 years, and worked at places like Xerox PARC, Telebit, Silicon Graphics, Covad, and Tellme Networks.
I spent six years at Google as a Site Reliability Engineer (SRE) and an SRE manager, managing the Google Fiber SRE team.
One of the other things I did at Google was to develop their incident management protocol practice, which is what I plan on sharing with you here.
Throughout this time, though, I’ve always worked on the side as a volunteer in emergency services. I started off as a search and rescue pilot with the Civil Air Patrol, which is the civilian auxiliary of the U.S. Air Force. By the time I left that organization, about 10 or 15 years later, I had worked my way up to being one of the 30 people in the state of California who were considered fully qualified to manage a search for a missing aircraft.
Afterward, I was involved with community emergency response teams in various cities in California: Mountain View (where I lived and worked for many years), San Francisco, and Alameda. These days, most of the emergency volunteer work I do is with the Black Rock City Emergency Services Department.
If you don’t know, Black Rock City is Burning Man, which means I go out a month before the event to help build the 911 call center for a city of 75,000 people that only exists for a month a year, in the middle of a dry lake bed 120 miles from the nearest major city.
In the end, we handle about as many calls per year as any other 70,000-person city in Nevada does. We just do it all within a two-week period, when the event is fully up and running. At our busiest, we’re running at about the same calls per hour as the city of Boston.
Once the 911 center is built out there every year, I switch over to being one of the 911 supervisors. This year, for the first time, I was honored to serve as the Battalion Chief for one of our shifts, which is the on-duty operational lead of a 150-200 person department of firefighters, paramedics, etc.
So emergency services are what I do for fun.
What I’m here to share about are some lessons from the fire and emergency services world that we can apply to the tech world.
1. Incident response is still a critical capability
We have building codes. We have sprinkler systems. We have fire alarm systems. We have fire escape ladders. We have building inspectors. And yet, we still need a fire department to respond to the unexpected. To respond when things get beyond what those automated systems can do.
It’s the same thing in the tech world. We still need somebody on call for when things go wrong. Because things are going to go wrong, no matter how much we automate, no matter how much we apply DevOps principles, and SRE principles, etc. That just means that what goes wrong is going to be that much more complicated, and trickier to deal with, so we need those capabilities for responding to emergencies.
Let’s think about your typical incident.
When something happens or goes wrong, it takes you a little bit of time after that to detect that something has happened. It’s not just a glitch, it’s not just a blip, we’re going to raise an alert about it.
Then it takes a little bit of time after that for someone to respond to the alert, to acknowledge the page, and get out their laptop. Then it takes even more time from when they respond to the alert until they’ve mitigated the problem. Mitigation means it’s no longer a problem, from the end user’s point of view. We may still be dealing with it as an emergency behind the scenes, but it’s no longer a problem for them.
Then, finally, you repair the problem. You return the system to some normal operational state. Maybe the same state it was in before the problem, maybe a different state, but things are back to normal. The emergency is over.
There are several parts of this process that I want to point out.
First, your customers, your end users, whether they’re internal or external, see the duration of the impact all the way through to the mitigation. That’s their perception of this outage, or of this problem.
You on the other hand, as the person responding to it, didn’t even know there was a problem until somebody was alerted. So there is already a mismatch between your perception of the duration of the emergency, and the user’s, or the customer’s perception.
There’s a mismatch on the back end, as well. From the user’s point of view, the issue is over when you’ve mitigated the problem. But you may still be responding to it as an emergency for a much longer amount of time, until you’ve managed to repair it. This mismatch between your perception and your customers’ or users’ perception is very important to be aware of.
What happens from when you respond until when you resolve the incident is where incident management comes into play. The reason I think it’s important is that if you do incident management well, you shorten those mitigation and resolution times. You can make the outage shorter and less impactful, both from your users’ and customers’ point of view, and for yourself and your team.
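The timeline above can be made concrete with a small sketch. This is a hypothetical example (the timestamps and variable names are mine, not from the talk); it just shows how the user-visible outage and the responders’ emergency differ in both start and end point:

```python
from datetime import datetime, timedelta

# Hypothetical timestamps for the four milestones described above.
occurred = datetime(2018, 10, 1, 9, 0)        # problem begins
detected = occurred + timedelta(minutes=5)    # alert fires
responded = detected + timedelta(minutes=10)  # on-call engineer engages
mitigated = responded + timedelta(minutes=30) # no longer user-visible
repaired = mitigated + timedelta(hours=2)     # system back to normal

# Users experience the outage from occurrence through mitigation...
user_visible_outage = mitigated - occurred
# ...while responders work the emergency from response through repair.
responder_effort = repaired - responded

print(user_visible_outage)  # 0:45:00
print(responder_effort)     # 2:30:00
```

Note that neither party sees the whole picture: users count time you didn’t yet know about (before detection), and responders count time users never notice (after mitigation).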
2. Draw a distinction between normal operations and emergency operations
Make it clear when you are dealing with an emergency, and when you are following emergency procedures, emergency rules, etc. Make it clear not only to yourselves but to anybody you might normally interact with.
Think about this: if you’re in your city, and you see a fire truck at the grocery store with the firefighters getting groceries, it’s perfectly okay to walk up and say, “Hey, can my kid sit in the fire engine?” They love that stuff, it’s great.
On the other hand, if you see them somewhere with the red lights on, and their jackets and helmets on, then they’re working. And 99.9% of the general public knows better than to bother them then, because they’re dealing with an emergency. They give people a clear visual distinction between when they’re dealing with an emergency and when they’re not.
They also get to follow different rules during an emergency. They can have the lights and sirens on, they can blow through traffic. Everybody else has to pull over and get out of the way.
In the IT world, we could benefit from adopting similar practices, and similar distinctions between emergencies and non-emergencies. We should be doing things differently in an emergency, in order to get through them as quickly and effectively as possible and to get back to our normal way of doing things.
Now, I’m not criticizing our normal way of doing things, because it’s great for day-to-day work. It allows our companies to do the amazing things we do, but it doesn’t work well during an emergency. There are different ways to work together in an emergency that will get you through it more quickly, and with less impact, so that you can get back to your normal way of operating.
There’s something developed by the fire department in the early 1970s in Southern California called the Incident Command System.
In the United States, fire departments are typically a local government function, and every city has its own fire department with its own budget, politics, policies, practices, terminology, sets of equipment, etc. Each department is, in many ways, totally incompatible with another, or at least they used to be. We’re getting better about it, and the Incident Command System is a very large reason why.
In the 1960s, all of these cities in Southern California, around Los Angeles and San Diego would have to come together to fight wildfires every summer and fall. But they realized, they were not acting as effectively as they could.
For example, the terminology used in one department would not be used by another. What one department would call a truck, another would call an engine, and another a pumper; they didn’t even have compatible terminology for the types of equipment. That’s a problem when you call over the radio and say, “Hey, I need a truck. Emergency at Fourth and Main,” expecting to get something with hoses and a water tank, and instead you get a hook-and-ladder truck, because the department that answered the call didn’t use the same terminology as you.
3. Incident Command Systems work well for IT
This is one of the problems they addressed with the Incident Command System. They created a set of principles that these departments could use to work together better when they needed to in an emergency.
One of the most important points here is that it had to be modular and scalable, and that applies very directly to what we do. You often don’t know at the start of an incident how big it’s going to get, how many teams are going to be involved, how many different experts, or how many different specialties you’ll need.
You need a way of managing that incident that will scale, as your understanding of the incident unfolds, as your resources assigned to the incident grow and shrink over time. You need a flexible mechanism for managing all of those people, and the Incident Command System gives you that.
4. A few tweaks make ICS work even better for IT incidents
Going through these ICS principles, there are a few tweaks that we can adopt, to make it work even better for our types of incidents or outages.
Start with the standard ICS-style org chart, as applied to IT incidents.
You start with two people: the Incident Commander, or IC, and the Tech Lead (some organizations call this the Ops Lead).
The Tech Lead’s job is to solve the problem at hand. The Tech Lead is usually the on-call engineer, who got paged when this incident occurred. They do a little bit of investigation. They decide that it’s more than they can deal with on their own, and they need to launch a full-blown incident response.
The Incident Commander’s job is to deal with everything else so that the Tech Lead can focus on solving the problem. They get more help and inform the executive team about what’s going on.
Now, when the Tech Lead themselves needs more help, just more hands on the keyboards, or more specialized knowledge, where do they go? They pull in a series of Subject Matter Experts. Your database team, your app team, your networking team, whatever they need.
One of the IC’s responsibilities is communicating with the rest of the organization, and if necessary, the rest of the world, which can take on a life of its own and turn into a big job. In this case, the Incident Commander may designate a Communications Lead to help them with that.
Likewise, the responders typically communicate pretty well with each other, but they often communicate poorly with the rest of the organization while the incident is in progress, so having somebody designated to handle communications can help there.
Another role you may need is what some organizations call a Scribe: basically, somebody to gather all of the documents and the data, make sure the recording is turned on in the Slack channel, and collect all of the artifacts that other people involved in the response are producing.
Then, finally, the other role that’s often helpful in these incidents is the Liaison. A Liaison is a representative of some other group that is affected by the response but is not necessarily part of it. For example, your call center, or your executive team. These are people who want to know what’s happening with the response, who are impacted by it, and who have input on it, but they’re not part of the response itself. The Liaison is their representative.
It’s typically someone from that outside group who represents the group within the response, and it’s a two-way street: they carry information from the responders back to their group, and they take information from their group and feed it into the response as appropriate.
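The roles described above form a simple tree, with everything reporting up to the IC. Here’s a minimal, hypothetical sketch of that org chart as a data structure; the role titles follow the talk, but the names and the `Role` class are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Role:
    """One box in the incident org chart: a role, its current holder,
    and the roles reporting to it."""
    title: str
    person: str
    reports: list["Role"] = field(default_factory=list)

    def add(self, role: "Role") -> "Role":
        self.reports.append(role)
        return role

# Build the chart from the talk: IC at the top, Tech Lead and the
# supporting roles under the IC, and SMEs under the Tech Lead.
ic = Role("Incident Commander", "alice")
tech = ic.add(Role("Tech Lead", "bob"))
ic.add(Role("Communications Lead", "carol"))
ic.add(Role("Scribe", "dave"))
ic.add(Role("Liaison (Call Center)", "erin"))
tech.add(Role("SME: Database", "frank"))
tech.add(Role("SME: Networking", "grace"))

def print_chart(role: Role, depth: int = 0) -> None:
    """Print the chart with indentation showing reporting lines."""
    print("  " * depth + f"{role.title}: {role.person}")
    for r in role.reports:
        print_chart(r, depth + 1)

print_chart(ic)
```

Because the chart is just data, it can grow or shrink during the incident, which is exactly the modular, scalable property the next section describes.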
5. Org chart grows as an incident unfolds
Now, following the principles of unity of command and span of control, you can just keep scaling this up as the incident unfolds and as you get more people, to where you have multiple DBAs, multiple networking teams, both local- and wide-area storage, customer care, etc.
You can develop an org chart on the fly, for that particular incident, based on the situation and the resources available at that time. Every incident is going to have a different org chart. People may play a different role in one incident than they do in the next. I might have been the Incident Commander yesterday, but today I may be one of the Subject Matter Experts. It’s important that people are clear on that.
It’s also important, when developing your incident management protocol in this way, to focus on the roles, and on developing a pool of people who can fulfill each of those roles. Talk to the role, not to the individual. Talk to whoever today’s Incident Commander is, and get people into the habit of seeing that the role varies from incident to incident. Then, you need to train everybody, and I mean everybody. Not just the responders, but anybody who wants to interact with the response.
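“Talk to the role, not the individual” can even be encoded in your tooling. Here’s a hypothetical sketch (the incident IDs, names, and `page_role` helper are all illustrative assumptions, not anything from the talk): each incident carries its own role assignments, so anyone who needs the IC looks up the current holder rather than paging a fixed person:

```python
# Per-incident role assignments; roles rotate between incidents.
incident_roles = {
    "incident-1234": {
        "incident_commander": "alice",
        "tech_lead": "bob",
        "comms_lead": "carol",
    },
    "incident-1235": {
        "incident_commander": "bob",  # yesterday's Tech Lead is today's IC
        "tech_lead": "alice",
    },
}

def page_role(incident_id: str, role: str) -> str:
    """Return who to contact for a given role on a given incident."""
    person = incident_roles[incident_id].get(role)
    if person is None:
        raise LookupError(f"{role} is not staffed on {incident_id}")
    return person

print(page_role("incident-1234", "incident_commander"))  # alice
print(page_role("incident-1235", "incident_commander"))  # bob
```

The point of the indirection is the habit it builds: nobody hardcodes “ask Alice,” because on the next incident Alice may be an SME.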
Finally, your role on an incident may only have a loose relationship to your everyday status in the organization. Your Incident Commander may be a mid-level engineer or a project manager, and the Comms Lead may be a senior vice president.
Everybody needs to know that our day to day organizational structures and ranks go out the window for the purposes of incident response. That we deal with the response independently of all of that. People need to get comfortable with that idea, and that takes, in some organizations, a lot of doing.
6. Practice, practice, practice and then practice some more
This next lesson is from the fire department, and it’s a foreign way of working for most people and most organizations. It also takes some getting used to. The terminology is different, the principles are different, the protocols are different, and so it’s a little uncomfortable. The first few times you go through it, it’s awkward, so the best thing you can do to be prepared is to practice.
If you can do an actual live drill, that’s great, but don’t be afraid to work your way up to that. One of my favorite ways of practicing is to take a planned event that’s not an emergency, like a building move or a data center bring-up, and organize it as if it were one.
Use the org chart, use the terminology, use the communication tools that you would use in an emergency. Go through the process, as if it were in an emergency, and get people used to these concepts.
7. Senior management can inadvertently disrupt incident response
One last point I want to leave you with is that it’s really easy for senior managers, directors, VPs, etc. to totally disrupt an incident response in progress, by accident. They don’t mean to, and they may not even realize they’re doing it. But just by showing up, and especially by asking questions, they can totally derail a response in progress.
If the senior vice president shows up in the Slack channel that the responders are all using to talk to each other about what’s going on and says, “Hey! What’s going on?” all that debugging activity grinds to a halt while everybody pivots to this very important person to figure out what they want. Because they’re the ones who set the bonuses, the performance reviews, etc., you’ve just totally derailed that incident response’s progress.
My advice to senior managers and executives is: if you want to interact with an incident response in progress, do it behind the scenes. Do it directly with the Incident Commander, and do it in a private channel or a direct message, not in the main comms channel or phone bridge or whatever your organization uses for incident response. Things will go much, much more smoothly, and you’ll get better responses out of it.