June 23, 2021
An outage can be one of the most horrifying and stressful experiences for many tech organizations. Most ops engineers have at least a few horror stories in their pocket. But does an incident have to be this way?
We’ve seen the proposed Incident Management Framework presented in a recent white paper and explored on our own blog. In this post, we share a case study of one organization that used some of these new methodologies to turn their biggest outage into a powerful learning opportunity.
CSG is North America’s largest SaaS-based customer care and billing provider, with over 65 million subscribers and a tech stack that covers everything from Java to mainframe.
At DevOps Enterprise Summit Virtual – US 2020, Erica Morrison, Vice President of Software Engineering, shared the story of CSG’s worst outage—the result of a complex system failure that pushed CSG beyond the limits of its response systems, processes, and culture.
But in the face of that adversity, they were able to find opportunity and use the lessons they learned to improve how they understand incidents, respond to them, and prevent them in the first place.
The story starts on February 4th, 2019 with what later came to be known as the 2/4 Outage. The outage began abruptly, lasted thirteen hours, and left large portions of CSG’s product unavailable.
On the initial calls as the outage began, the team was troubleshooting blind as they had trouble accessing the tools they normally use, including their system health monitoring system and server access. With the number of vendors and customers involved, the initial calls were particularly chaotic.
As the hours went on, the teams kept testing different theories for the outage, but again they were hampered by the tool-access problems. They’d see a little relief for a few minutes, only for everything to fail again, which bred a feeling of hopelessness within the teams. As the day continued, they took more and more drastic action, shutting down VLANs one by one. When killing one VLAN finally produced instant results, they knew they were onto something.
In the end, it would take several days to figure out what had actually happened by reproducing the outage in their lab. The issue started with routine maintenance on a server running an OS different from most of the servers they ran. When that server rebooted, it put an LLDP packet out on the network. Due to a bug, CSG’s network software picked it up, interpreted it as a spanning tree packet, and broadcast it out to the network, where it was picked up by their load balancer. Due to a misconfiguration, the load balancer rebroadcast it to the network, creating a network loop and taking the network down.
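The failure mode described above, a packet endlessly rebroadcast between misconfigured devices, can be illustrated with a toy simulation. The topology and device names below are hypothetical, not CSG’s actual network; the point is only to show how rebroadcasting turns one packet into exponential traffic.

```python
# Toy broadcast-storm simulation: each misconfigured device rebroadcasts
# every packet it receives to all of its neighbors, so a single packet
# multiplies hop after hop until something stops the storm.

def broadcast_storm(links, start, max_hops=5):
    """Count packet transmissions caused by one packet entering a loop.

    links: dict mapping device -> list of neighbors it rebroadcasts to.
    max_hops: how many rebroadcast rounds to simulate.
    """
    transmissions = 0
    frontier = [start]  # packets "in flight" at the current hop
    for _ in range(max_hops):
        next_frontier = []
        for device in frontier:
            for neighbor in links.get(device, []):
                transmissions += 1
                next_frontier.append(neighbor)
        frontier = next_frontier
    return transmissions

# Hypothetical three-device loop (e.g. two switches and a load balancer),
# each rebroadcasting to the other two: traffic doubles every hop.
loop = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(broadcast_storm(loop, "a", max_hops=5))  # 62 transmissions after 5 hops
```

With only one misbehaving rebroadcast path the traffic grows linearly; with two or more, it grows exponentially, which is why a single looped packet can saturate a network in seconds.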
This is a great example of complex system failure: multiple failures in the system had to happen, some of the failures were latent (a few had been in the system for months), and the failures kept changing throughout the day. In fact, when they first looked at this particular maintenance, its timing fit. But after investigating and troubleshooting, they decided it was just a victim of the larger outage. Only by recreating the outage in the lab days later were they able to pinpoint it as the cause.
The aftermath was severe. The extent of angry customers required leadership to pivot their focus from their planned work (strategic initiatives, etc.) to focus just on this outage. Throughout the company, there was also a large sense of loss and heartbreak over having failed their customers so severely. Morale was extremely low, and everyone had open wounds and strong emotions. Hurtful things were said, like “DevOps doesn’t work.”
They knew they wanted to respond to this failure differently. They needed to maximize learnings while also reducing the likelihood of an incident like this happening again.
Their first step was incident analysis. Their standard incident analysis was a structured process to help them understand what happened and identify opportunities for improvement. They did this through a series of questions: understanding the timeline of the incident; asking what happened, how can we detect sooner, how can we recover sooner, what went well; understanding system behavior; and maintaining a blameless culture by avoiding finger pointing.
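The structured questions above can be captured as a reusable review template. This sketch is illustrative (the structure and field names are assumptions, not CSG’s actual tooling); the questions themselves come from the text.

```python
# A sketch of the standard incident-analysis questions as a reusable
# review template. Structure is hypothetical; questions are from the text.
INCIDENT_REVIEW_TEMPLATE = {
    "timeline": "Reconstruct the timeline of the incident.",
    "questions": [
        "What happened?",
        "How can we detect this sooner?",
        "How can we recover sooner?",
        "What went well?",
    ],
    "system_behavior": "Describe how the system actually behaved.",
    "ground_rules": ["Blameless: no finger pointing."],
}

def blank_review(incident_id):
    """Create an empty review record with one answer slot per question."""
    return {
        "incident": incident_id,
        "answers": {q: None for q in INCIDENT_REVIEW_TEMPLATE["questions"]},
    }
```

A template like this keeps every post-incident review asking the same questions, which makes improvement opportunities comparable across incidents.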
But with the severity of this incident, they knew they needed to up their game. They reached out to Dr. Richard Cook and John Allspaw of Adaptive Capacity Labs to analyze the incident. Through two weeks of intense interviews and research, they gained a more thorough understanding of the events. And in particular, they learned the different perspectives of the people who were working on the outage.
They created an operational improvements program broken into four categories: incident response, tool reliability, datacenter/platform resiliency, and application reliability.
First, they adopted the National Incident Management System (NIMS) used by government agencies like FEMA. Its key components include clearly defined roles, such as the incident commander and the liaison officer, and a standardized structure for communication during an incident.
To roll out this new incident management system at CSG, they brought in a team to help train more than 130 incident commanders through a series of training sessions. But beyond incident commanders, they wanted to make sure all key personnel were trained in this new way of incident management: executive leadership, internal client reps, and even some customers and technicians. In short, anyone who would be involved in an incident.
“Several senior leaders said this had been the best training they had been through in their entire career,” Erica Morrison said in her 2020 presentation.
After training, they rolled out a pilot group of just 14 incident commanders, a small group of their most experienced people at running outages. From their learnings, they then iterated and trained the whole organization in incident command.
Along with training, CSG also found they needed to update their whiteboard tooling. They ended up going with a simple Excel spreadsheet that shows who’s filling each role, the current status report, all previous status reports, and a timeline of what’s going on. This way when you join an incident call, you have all the information in front of you.
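CSG’s Excel “whiteboard” is essentially a small shared data model: roles, a current status, a history of statuses, and a timeline. A minimal sketch of the same structure in code might look like the following; the field and role names are illustrative, not CSG’s actual spreadsheet columns.

```python
from dataclasses import dataclass, field

@dataclass
class StatusReport:
    timestamp: str
    summary: str

@dataclass
class IncidentWhiteboard:
    # Who currently fills each incident-management role (role -> person).
    roles: dict
    # Newest-first history of status reports; the first entry is current.
    status_reports: list = field(default_factory=list)
    # Free-form timeline of notable events during the incident.
    timeline: list = field(default_factory=list)

    def post_status(self, timestamp, summary):
        self.status_reports.insert(0, StatusReport(timestamp, summary))

    def current_status(self):
        return self.status_reports[0] if self.status_reports else None

# Hypothetical usage: someone joining the call sees roles, the latest
# status, and the full history at a glance.
board = IncidentWhiteboard(roles={"incident_commander": "A. Lee",
                                  "liaison_officer": "on-call LNO"})
board.post_status("14:00", "VLAN isolation in progress")
board.post_status("14:30", "Suspect VLAN identified; service partially restored")
print(board.current_status().summary)
```

The design choice mirrors the spreadsheet’s value: one shared artifact answers “who is doing what, what is the latest status, and what has happened so far,” so newcomers to a call never have to interrupt to ask.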
Even before the whole organization had been through the training, people started to see observable improvements in incident management and how outage calls were run.
First, clutter on the calls was gone. Previously, calls were chaotic. Now, participants better understand when to talk, what to speak about, and what to take offline. This is largely a result of keeping these calls singularly focused on restoring service, not on finding the root cause of the outage.
Also, status reports have a known, steady cadence. People no longer have to ask for status information on the call. They know when a new status will be put out.
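A fixed cadence means the next status time is always computable rather than asked for. A small sketch of that idea follows; the 30-minute interval is an assumption for illustration, as the source does not state CSG’s actual cadence.

```python
from datetime import datetime, timedelta

def next_status_time(incident_start, now, cadence_minutes=30):
    """Return the next scheduled status update after `now`, given a fixed
    cadence measured from the start of the incident.

    The cadence value is illustrative, not CSG's documented interval.
    """
    elapsed = now - incident_start
    intervals_done = elapsed // timedelta(minutes=cadence_minutes)
    return incident_start + (intervals_done + 1) * timedelta(minutes=cadence_minutes)

# Hypothetical example: incident starts at 09:00, it is now 09:40,
# so the next 30-minute status goes out at 10:00.
start = datetime(2019, 2, 4, 9, 0)
print(next_status_time(start, datetime(2019, 2, 4, 9, 40)))
```

Because everyone can derive the next update time the same way, nobody needs to break into the call to ask for status, and parallel work can continue uninterrupted until that time.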
Having an LNO (liaison officer) was key in avoiding interruptions on the incident calls. Instead of the incident commander having to jump off to go talk to customers, the LNO now has all the information necessary and can take those calls.
The second biggest improvement was a sense of control over chaos. The simple act of having predictable cadences and patterns that are followed helps everyone feel more confident and in control. It also allows activities to run in parallel until the set time for a status update, allowing that activity to run without interruption.
Decision-making was also unclear in the old system. Now the incident commander takes clear command and authority so there’s no question about who can make decisions.
Today, CSG has a stronger organizational ability to perform incident management. They’ve reinforced and broadened culture norms around safety, and most impactfully, they have implemented the incident management system that changed how they run outage calls. You can watch Erica Morrison’s full presentation on the CSG outage in the IT Revolution Video Library here.