This case study has been excerpted from the second edition of The DevOps Handbook by Gene Kim, Jez Humble, Patrick Debois, John Willis and Nicole Forsgren, PhD.
After successfully improving releases between 2012 and 2015, CSG further evolved their organizational structure to improve their day-to-day operational stance. At the DevOps Enterprise Summit in 2016, Scott Prugh, Chief Architect and VP Software Engineering at the time, spoke about a dramatic organizational transformation that combined disparate development and operations teams into cross-functional build/run teams.
Prugh described the start of this journey:
We had made dramatic improvements in our release processes and release quality but continued to have escalations and conflicts with our operations team members. The development teams felt confident about their code quality and continued to push releases faster and more frequently.
On the other hand, our ops teams complained about production outages and the rapid changes breaking the environment. To battle these opposing forces, our change and program management teams ramped up their processes to improve coordination and to attempt to control the chaos. Sadly, this did little to improve production quality, our operational teams’ experience, or the relationship between the development and operations teams.
Figure: How Structure Influences Behavior and Quality (Image courtesy of Scott Prugh)
To understand more of what was going on, the team dug into the incident data, which revealed some surprising and alarming trends:
- Release impact and incidents had improved almost 90% (from 201 incidents to 24).
- Release incidents represented 2% of the occurring incidents (98% were in production).
- And, 92% of these production incidents were quick restorations that were fixed by operations.
Prugh further observed, “We had basically improved our development stance significantly but had done little to improve the production operations environment. We got the exact result we had optimized for: great code quality and poor operations quality.”
In order to find a solution, Prugh asked the following questions:
- Were different organizational goals working against system goals?
- Did development’s lack of operations understanding result in hard-to-run software?
- Did a lack of shared mission create a lack of empathy across teams?
- Did our handoffs contribute to elongated lead time?
- Did a lack of engineering skills in operations prevent improvements and encourage duct-tape engineering?
Figure: From Siloed Approach to Cross-Functional Teams (Image courtesy of Scott Prugh)
At the same time Prugh was making these discoveries, the customer escalations of production issues had repeatedly been raised to the executive leadership. CSG’s customers were irate, and the executives asked what could be done to improve CSG’s operational stance.
After several passes at the problem, Prugh suggested creating “Service Delivery Teams” that build and run the software. Basically, he suggested bringing together Dev and Ops onto one team.
At first, the proposal was viewed as polarizing. But after presenting previous successes with shared operations teams, Prugh further argued that bringing Dev and Ops together would create a win-win for both teams by:
- improving understanding so the team could improve the entire delivery chain (development to operations)
- improving flow and knowledge efficiency and creating unified accountability for design, build, test, and operations
- making operations an engineering problem
- bringing other benefits, like improving communication, reducing meetings, creating shared planning, improving collaboration, and creating shared work visibility and a shared leadership vision
Figure: Conventional vs. Cross-Functional Structure (Image courtesy of Scott Prugh)
The next steps involved re-creating a new team structure of combined development and operations teams and leaders. Managers and leaders of the new teams were selected from the current pool, and team members were re-recruited onto the new cross-functional teams. After the changes, development managers and leaders got real experience in running the software they had created.
It was a shocking experience. The leaders realized that creating build/run teams was only the first step in a very long journey. Erica Morrison, VP Software Engineering, recalls:
As I got more involved with the Network Load Balancer team, I quickly started to feel like I was in The Phoenix Project. While I had seen many parallels to the book in previous work experiences, it was nothing like this. There was invisible work/work in multiple systems: one system for stories, another for incidents, another for CRQs, another for new requests. And TONS of email. And some stuff wasn’t in any system. My brain was exploding, trying to track it all.
The cognitive load from managing all the work was huge. It was also impossible to follow up with teams and stakeholders. Basically whoever screamed the loudest went to the top of the queue. Almost every item was a priority #1 due to the lack of a coordinated system to track and prioritize work.
We also realized that a ton of technical debt had accumulated, which prevented many critical vendor upgrades, leaving us on outdated hardware, software, and OS. There was also a lack of standards. When we did have them, they were not universally applied or rolled out to production.
Critical people bottlenecks were also prolific, creating an unsustainable work environment.
Finally, all changes went through a traditional CAB [change advisory board] process, which created a massive bottleneck to get things approved. Additionally, there was little automation supporting the change process, making every change manual, not traceable, and very high risk.
To address these issues, the CSG team took a multi-pronged approach. First, they created a bias for action and culture change by applying the learnings from John Shook’s Model of Change: “Change Behavior to Change Culture.” The leadership team understood that in order to change the culture they had to change behavior, which would then affect values and attitudes and result in eventual culture change.
Next, the team brought in developers to supplement the operational engineers and demonstrate what would be possible with great automation and engineering applied to key operational problems. Automation was added to traffic reporting and device reporting. Jenkins was used to orchestrate and automate basic jobs that were being done by hand. Telemetry and monitoring to CSG’s common platform (StatHub) were added. And finally, deployments were automated to remove errors and support rollback.
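The idea of an automated deployment that supports rollback can be illustrated with a minimal sketch. This is not CSG's tooling: the in-memory device dictionary, the function deploy_with_rollback, and the health check pool_not_empty are all hypothetical names chosen for illustration, assuming a deployment amounts to replacing a device's configuration and then verifying it.

```python
import copy


def deploy_with_rollback(device, new_config, health_check):
    """Apply new_config to device; restore the previous config if the check fails.

    Returns True when the new config stays in place, False when it was rolled back.
    All names here are hypothetical, for illustration only.
    """
    previous = copy.deepcopy(device["config"])      # snapshot taken before the change
    device["config"] = copy.deepcopy(new_config)    # the "deployment" step
    if health_check(device):
        return True
    device["config"] = previous                     # automatic rollback
    return False


def pool_not_empty(device):
    """Hypothetical health check: the device must still have at least one pool member."""
    return len(device["config"].get("pool", [])) > 0


# A stand-in for a lower-environment device, not a real load balancer.
device = {"name": "lb-test-01", "config": {"pool": ["10.0.0.1"]}}

ok = deploy_with_rollback(device, {"pool": ["10.0.0.1", "10.0.0.2"]}, pool_not_empty)
bad = deploy_with_rollback(device, {"pool": []}, pool_not_empty)

print(ok, bad, device["config"])
```

In this sketch the second deployment fails its check, so the device keeps the configuration from the first, successful one. Taking the snapshot before every change is what makes the rollback a routine step rather than a manual recovery.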
The team then invested in getting all the config in code and version control. This included CI practices as well as lower environments that could test and practice deployments to devices that would not impact production. The new processes and tools allowed easy peer review since all code went through a pipeline on the way to production.
Finally, the team invested in bringing all work into a single backlog. This included automation to pull tickets from many systems into one common system that the team could collaborate in and thus prioritize the work.
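Pulling tickets from several systems into one backlog might look something like the sketch below. The source does not describe how CSG implemented this; the WorkItem fields, the merge_backlog function, the sample tickets, and the numeric priority scale (1 is most urgent) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkItem:
    source: str      # system the ticket came from (hypothetical labels)
    ticket_id: str
    title: str
    priority: int    # assumed scale: 1 = most urgent


def merge_backlog(*sources):
    """Combine tickets from several systems into one prioritized list.

    Items are deduplicated by (source, ticket_id) and sorted by priority,
    with source and ticket_id as tie-breakers so the order is stable.
    """
    seen = {}
    for items in sources:
        for item in items:
            seen.setdefault((item.source, item.ticket_id), item)
    return sorted(seen.values(), key=lambda i: (i.priority, i.source, i.ticket_id))


# Invented sample data standing in for stories, incidents, and change requests.
stories = [WorkItem("stories", "S-1", "Automate traffic report", 2)]
incidents = [WorkItem("incidents", "I-7", "Load balancer failover", 1)]
change_requests = [WorkItem("crq", "C-3", "Vendor OS upgrade", 3)]

backlog = merge_backlog(stories, incidents, change_requests)
for item in backlog:
    print(item.priority, item.source, item.ticket_id, item.title)
```

With every item in one list and one explicit priority field, ordering no longer depends on who escalates the loudest.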
Erica Morrison recalls her final learnings:
We worked really hard in this journey to bring some of the best practices we know from Development to the Ops world. There have been many things in this journey that went well, but there were a lot of misses and a lot of surprises. Probably our biggest surprise is how hard Ops really is. It’s one thing to read about it but quite another to experience it first-hand.
Also, change management is scary and not visible to developers. As a development team, we had no details about the change process and changes going in. We now deal with change on a daily basis. Change can be overwhelming and consume much of what a team does each day.
We also reaffirmed that change is one of the key intersection points of opposing goals between Dev and Ops. Developers want their changes to make it to production as soon as possible. But when Ops is responsible for putting that change in and dealing with the fallout, it creates a natural reaction to want to go slower. We now understand that getting Dev and Ops working together to both design and implement the change creates a win-win that improves both speed and stability.