An outage can be one of the most horrifying and stressful experiences for many tech organizations. Most ops engineers have at least a few horror stories in their pocket. But does an incident have to be this way?
We’ve seen the proposed Incident Management Framework presented in a recent white paper and explored on our own blog. In this post, we share a case study of one organization that used some of these new methodologies to turn its biggest outage into a powerful learning opportunity.
CSG is North America’s largest SaaS-based customer care and billing provider, with over 65 million subscribers and a tech stack that covers everything from Java to mainframe.
At DevOps Enterprise Summit Virtual – US 2020, Erica Morrison, Vice President of Software Engineering, shared the story of CSG’s worst outage—the result of a complex system failure that pushed CSG beyond the limits of its response systems, processes, and culture.
But in the face of that adversity, they were able to find opportunity and use the lessons they learned to improve how they understand incidents, respond to them, and prevent them in the first place.
The Incident: The 2/4 Outage
The story starts on February 4th, 2019 with what later came to be known as the 2/4 Outage. The outage lasted thirteen hours. It started abruptly, and large portions of CSG’s product were unavailable.
On the initial calls as the outage began, the team was troubleshooting blind: they had trouble accessing the tools they normally rely on, including their system health monitoring and server access. With the number of vendors and customers involved, the initial calls were particularly chaotic.
As the hours went on, the teams kept testing different theories for the outage, but again were hampered by the tool access problems. They’d see a little relief for a few minutes, only for everything to fail again. This led to a feeling of hopelessness within the teams. As the day continued, they took more and more drastic action, shutting down VLANs one by one. When killing one particular VLAN showed instant results, they knew they were finally onto something.
In the end, it would take several days of reproducing the outage in their lab to figure out what had actually happened. The issue started with routine maintenance on a server running a different OS than most of their fleet. When that server rebooted, it put an LLDP packet out on the network. Due to a bug, CSG’s network software picked it up, interpreted it as a spanning tree packet, and broadcast it out to the network, where it was picked up by their load balancer. Due to a misconfiguration, the load balancer rebroadcast it to the network, creating a network loop and taking the network down.
This is a great example of complex system failure: multiple failures in the system had to happen, there were latent failures (some had been in the system for months), and the failures kept changing throughout the day. In fact, when they first looked at this particular maintenance, its timing fit. But after troubleshooting it, they initially decided it was just a victim of the larger outage. Only days later, by recreating it in the lab, were they able to pinpoint it as the cause.
The Incident Aftermath
The aftermath was severe. The extent of angry customers required leadership to pivot their focus from their planned work (strategic initiatives, etc.) to focus just on this outage. Throughout the company, there was also a large sense of loss and heartbreak over having failed their customers so severely. Morale was extremely low, and everyone had open wounds and strong emotions. Hurtful things were said, like “DevOps doesn’t work.”
They knew they wanted to respond to this failure differently. They needed to maximize learnings while also reducing the likelihood of an incident like this happening again.
Their first step was incident analysis. Their standard incident analysis was a structured process to help them understand what happened and identify opportunities for improvement. They did this through a series of questions:

- What happened, and what was the timeline of the incident?
- How can we detect sooner? How can we recover sooner?
- What went well?
- How did the system behave?

Throughout, they maintained a blameless culture by avoiding finger-pointing.
But with the severity of this incident, they knew they needed to up their game. They reached out to Dr. Richard Cook and John Allspaw of Adaptive Capacity Labs to analyze the incident. Through two weeks of intense interviews and research, they gained a more thorough understanding of the events. And in particular, they learned the different perspectives of the people who were working on the outage.
They created an operation improvements program broken into four categories: incident response, tool reliability, datacenter/platform resiliency, and application reliability.
Incident Response Training and Learning
First, they adopted the National Incident Management System (NIMS) used by government agencies like FEMA. Its key components include the following:
- Clear, established roles, including Incident Commander, Scribe, Subject Matter Expert, and LNO (Liaison Officer, responsible for communication). These roles go into effect during an incident; a person’s day-to-day role no longer matters in the face of an active incident.
- Clear communication cadence and format.
- Set of expected behaviors for participants.
- Common terminology.
- Management by objective: the call is focused solely on restoring service, not on root cause or on what to change later.
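The role structure above can be sketched as a small data model. This is a minimal Python sketch for illustration only; the names (`Role`, `Incident`, `missing_roles`) are hypothetical and not drawn from CSG’s actual tooling:

```python
from dataclasses import dataclass, field
from enum import Enum

class Role(Enum):
    """The four NIMS-style roles described above."""
    INCIDENT_COMMANDER = "Incident Commander"
    SCRIBE = "Scribe"
    SME = "Subject Matter Expert"
    LNO = "Liaison Officer"

@dataclass
class Incident:
    # Maps Role -> person; a person's day-to-day title is irrelevant here.
    assignments: dict = field(default_factory=dict)

    def assign(self, role: Role, person: str) -> None:
        self.assignments[role] = person

    def missing_roles(self) -> list:
        """Roles still unfilled; a call shouldn't proceed without them."""
        return [r for r in Role if r not in self.assignments]

incident = Incident()
incident.assign(Role.INCIDENT_COMMANDER, "on-call lead")
print([r.value for r in incident.missing_roles()])
```

Making the roles explicit like this mirrors the key idea: on an active call, everyone knows which seat they are filling, independent of their org-chart title.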
To roll out this new Incident Management System at CSG, they brought in a team to help train more than 130 incident commanders through a series of training sessions. But in addition to training incident commanders, they wanted to make sure all key personnel were trained in this new way of incident management, including executive leadership, internal client reps, and even some customers and technicians: anyone who would be involved in an incident.
“Several senior leaders said this had been the best training they had been through in their entire career,” Erica Morrison said in her 2020 presentation.
After training, they started with a pilot group of just 14 incident commanders, their most experienced people at running outages. From those learnings, they then iterated and trained the whole organization in incident command.
Along with training, CSG also found they needed to update their whiteboard tooling. They ended up going with a simple Excel spreadsheet that shows who’s filling each role, the current status report, all previous status reports, and a timeline of what’s going on. This way when you join an incident call, you have all the information in front of you.
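That shared spreadsheet can be approximated with a tiny status-board structure. The following Python sketch is illustrative only, with hypothetical names (`StatusBoard`, `post_status`), not a representation of CSG’s actual tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class StatusBoard:
    """Illustrative stand-in for an incident 'whiteboard' spreadsheet."""
    roles: dict = field(default_factory=dict)     # role name -> person filling it
    statuses: list = field(default_factory=list)  # every status report, newest last
    timeline: list = field(default_factory=list)  # (UTC timestamp, event) pairs

    def post_status(self, text: str) -> None:
        # Statuses are appended, so earlier reports stay visible to late joiners.
        self.statuses.append(text)
        self.log(f"status: {text}")

    def log(self, event: str) -> None:
        self.timeline.append((datetime.now(timezone.utc), event))

    @property
    def current_status(self):
        return self.statuses[-1] if self.statuses else None

board = StatusBoard()
board.roles["Incident Commander"] = "on-call lead"
board.post_status("Investigating suspected network loop")
print(board.current_status)
```

The design choice worth noting is that nothing is ever overwritten: keeping all previous status reports and a running timeline is exactly what lets someone joining the call mid-incident catch up without interrupting.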
Even before the whole organization had been through the training, people started to see observable improvements in incident management and how outage calls were run.
First, clutter on the calls has been removed. Previously, calls were chaotic. Now, participants behave better and understand when to talk, what to speak about, and what to take offline. This is largely a result of keeping these calls singularly focused on restoring service, not on finding the root cause of the outage.
Also, status reports have a known, steady cadence. People no longer have to ask for status information on the call. They know when a new status will be put out.
Having an LNO (liaison officer) was key in avoiding interruptions on the incident calls. Instead of the incident commander having to jump off to go talk to customers, the LNO now has all the information necessary and can take those calls.
The second major improvement was a sense of control amid chaos. The simple act of following predictable cadences and patterns helps everyone feel more confident and in control. It also allows activities to run in parallel until the set time for a status update, letting each activity proceed without interruption.
Decision-making was also unclear in the old system. Now the incident commander takes clear command and authority so there’s no question about who can make decisions.
Today, CSG has a stronger organizational ability to perform incident management. They’ve reinforced and broadened culture norms around safety, and most impactfully, they have implemented the incident management system that changed how they run outage calls. You can watch Erica Morrison’s full presentation on the CSG outage in the IT Revolution Video Library here.