
July 27, 2018

More Culture, More Engineering, Less Duct-Tape

By IT Revolution

The following is an excerpt from a presentation by Scott Prugh and Erica Morrison from CSG, titled “More Culture, More Engineering, Less Duct-Tape.”

You can watch the video of the presentation, which was originally delivered at the 2017 DevOps Enterprise Summit in San Francisco.


CSG is the largest SaaS-based customer care and billing provider in the United States. So, if you get your cable bill from one of the major providers, we produce it, and we also run the software on the back end for that customer care. We’ve been doing that for about 35 years.

We’re really proud of our heritage, but we’re also really proud about continuing to innovate and improve.

Just recently, we hit about 61 million subscribers. We have about 150,000 call center seats in the US, and there are about 40 development and operations teams that we’ll be talking about today. We run everything from mainframes with assembly code all the way to JavaScript and Node.js. So we’ve got, across the board, all the different types of technologies you’ll find in traditional organizations.

We have the same challenges as everyone else, too — getting things faster to market and higher quality, both on the software and on the operations side.

Last year, we introduced what we called our DevOps teams model, where we collapsed development and operations together into build-run teams. We’re going to share about that today:

  • I’ll re-open, and hopefully close for good, this concept of bimodal IT.
  • We’ll go through our DevOps journey, and business and culture metrics, looking at the improvements we’ve put in.
  • Then, we’ll introduce what we’re calling the service owner model, where we’ve begun to challenge some of the traditional thought processes around separating SDLC, ITIL, and ITSM processes.
  • We’ll look at post-incident reviews.
  • And finally, we’ll look at our targeted DevOps culture leadership series, where we continue to really reinforce our culture, about how we’re leading our DevOps transformation.

Closing the concept of Bimodal IT

This slide shows Mode 2.

Mode 2, which lines up pretty much with the definition of bimodal, is where you run your servers and apps safely, with speed and quality. It’s your obligation to do that whether or not they’re systems of record, and whether or not they’ve been deemed innovative. They all need to be run that way.

So then the question becomes, what does Mode 1 look like?

Here’s my definition: Servers are destroyed with a sledgehammer in a parking lot.

Before we destroyed our servers, there were five years when we were transitioning them out of production.

But it took about 40 minutes to recycle those servers. Imagine patching that infrastructure, when you’re running transactions for all the major customers in the US and you’re taking 40 minutes per server to cycle. It’s extremely dangerous to do that.

Now that we’ve transitioned, ported, and strangled off those servers, they restart in a few seconds, making it a much safer environment.

Really, there is no Mode 1. It has to be Mode 2 across everything you run. The bad guys don’t care that those systems have been deemed not innovative, because they still want your data, and it’s your responsibility to protect it.

Now, we’ll take a look at our DevOps journey.

First up are the metrics.

We went from 48 million subscribers to about 61 million between 2012 and the end of 2017. As for the growth of our API platform, we went from 750 TPS to about 4,000 TPS. That’s over 400% growth.

Basically, our customers just continued to consume our APIs. They have this insatiable appetite to use our services. And that’s a good thing: you want these green lines going up, because that means more value for your customers and more value for your company.

Next, we’ll look at quality.

The maroon line at the bottom represents what we did with release quality through our agile transformation and what we call ‘early DevOps.’ By putting in things like continuous integration and automation, we improved by about an order of magnitude.

But the one at the top is really interesting. It’s the incidents coming onto the platform during those years. For the most part, they hovered between 1,400 and 1,600 incidents per month.

In 2017, just a few months after we introduced those DevOps teams, we saw an incredible drop in the number of incidents. That was a 60% improvement, achieved just by having the same teams that build the software operate it, see all the incidents coming in, and fix the root causes.

The next thing is the service owner model.

So for us, the service owner model looks like this: it’s a transformational leader who is accountable for the end-to-end construction, along with:

  • The operation
  • The SLAs
  • The customer experience
  • The stewardship of business value for a product or a set of services

They really have the whole thing.

Here’s the traditional resource efficiency model from IT. Now, this is very project-centric. If you are still doing this, you have to question what efficiencies you are getting, and how you need to change to work differently.

On the other side is what we have, our DevOps teams, and the service owner model.

With that, we’re organizing to run and build a service across all the resources that are required. So we take T-shaped teams and T-shaped resources, and we put T-shaped leaders in place that are able to transform, and lead teams that both build and run the software. I think you’ll see more and more of this model in place, as teams look to get more efficiency and provide more value.

Now I want you to look at the combination of SDLC and ITIL.

On the left, you’ll see the traditional model. In our case, on one side we use SAFe, the Scaled Agile Framework, and on the other side, we use ITIL.

You usually have your feature board, and it has all kinds of great features on it. And then you have your operations board, which has all the other stuff. And generally, the flow goes from left to right. You dump things over to operations, and they run it, and things break, and then they try to fix it.

There isn’t great feedback to improve, operate, and design the service.

So what we are suggesting is this: as you combine the teams, combine the processes too, bringing your software development lifecycle, service operations, and service design processes into one backlog.

Talk about work visibility, it’s all right here! You can now really see all of the work that is going into that service.

So, teams at stand-up are looking at the issues. They see the incidents they’re having. They see the changes that are coming. They can integrate security into the construction process. It’s an incredibly powerful model that allows you to make continuous service improvement an everyday activity, with that feedback.
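One way to picture a single backlog that mixes SDLC and ITSM work is as one prioritized queue of typed work items. This is a minimal, hypothetical sketch; the class and field names are illustrative, not CSG’s actual tooling:

```python
from dataclasses import dataclass, field
from enum import Enum

class WorkType(Enum):
    FEATURE = "feature"    # SDLC work
    INCIDENT = "incident"  # service operations work
    CHANGE = "change"      # ITSM change work
    SECURITY = "security"  # integrated security work

@dataclass
class WorkItem:
    title: str
    work_type: WorkType
    priority: int  # lower number = more urgent

@dataclass
class ServiceBacklog:
    items: list = field(default_factory=list)

    def add(self, item: WorkItem) -> None:
        self.items.append(item)

    def prioritized(self) -> list:
        # One ordered view across feature, incident, change, and security work
        return sorted(self.items, key=lambda i: i.priority)

backlog = ServiceBacklog()
backlog.add(WorkItem("New billing API endpoint", WorkType.FEATURE, 3))
backlog.add(WorkItem("P1: latency spike on /accounts", WorkType.INCIDENT, 1))
backlog.add(WorkItem("Patch TLS configuration", WorkType.SECURITY, 2))
for item in backlog.prioritized():
    print(item.work_type.value, item.title)
```

The point of the sketch is simply that everything competes for priority in one place, so the team sees all the work going into the service.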

Now, this isn’t without its challenges.

My product managers tell me this all the time: “Well, great, Scott, but how do I get more blue features? My dev teams are spending all of their time fixing the service that they built.”

Well, that’s exactly the idea. You need to get that investment into the service operations to improve it, and this is a great forcing function to do that.

On the inverse, how do we keep this transformation from crushing our development capacity?

In the beginning, it can really seem that that’s occurring, as a lot of your backlog gets eaten up improving the service. But remember the incident numbers: doing things like this is what gave us that 60% improvement.

Change

Now, let’s talk about change. Who doesn’t love a CAB, a change advisory board? You get a whole bunch of senior people who are really smart, put them in a room, and have them advise on all change going into the system. Let me tell you, it doesn’t work very well, for a lot of reasons:

  • One, it puts the approval furthest from the knowledge.
  • It also creates really large batches, because those senior folks can only get together once in a while, so we have to batch up a whole bunch of work.
  • It also increases the risk that your change is going to fail.

Instead, we actually recommend this: Let’s decentralize all the change.

We’re going to push that change into the backlogs of the teams doing it, and have them manage it. They understand the most about that change, so why wouldn’t they be the ones to understand the risk and approve it? You also get other benefits: they can now start to make the system safer for change. Let’s redesign how that change is going to work. Let’s automate it. Let’s create things like standardized work.

And I really view change as a feature that has very low variability. And over time, you can get that automation in place.
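Treating change as standardized, low-variability work might be sketched like this: every change runs the same pre-checks, apply, post-checks, and rollback-on-failure pattern. This is a generic illustration under assumed names, not CSG’s actual automation:

```python
# A minimal sketch of change-as-standardized-work; all names are illustrative.
def run_change(pre_checks, apply_change, post_checks, rollback):
    """Run a standardized change: verify, apply, validate, roll back on failure."""
    for check in pre_checks:
        if not check():
            raise RuntimeError("pre-check failed; change not started")
    apply_change()
    for check in post_checks:
        if not check():
            rollback()
            raise RuntimeError("post-check failed; change rolled back")
    return "succeeded"

# Example: a hypothetical load-balancer pool resize
state = {"pool_size": 4}
result = run_change(
    pre_checks=[lambda: state["pool_size"] > 0],
    apply_change=lambda: state.update(pool_size=6),
    post_checks=[lambda: state["pool_size"] == 6],
    rollback=lambda: state.update(pool_size=4),
)
```

Because every change flows through the same template, the variability lives in the checks and the apply step, not in how the change is executed.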

Now, this isn’t to say there is no CAB. We still have one, but only for changes that have really large blast areas and cross a lot of teams.

The next thing is support — Here is a slide of the traditional model of support.

The standard model is three tiers of support, where you hand off from a help desk to product operations, and maybe to level-three development, and it creates a lot of problems.

One, it creates queues and hand-offs, but it also creates organizational boundaries. And those queues don’t learn. People learn. When you put those boundaries between them, they don’t generate the new knowledge needed to actually fix the system.

What we’re recommending is the SWARM model.

In this case, you bring everyone together for a major incident on a shared bridge.

You have all of them involved and sharing information to resolve the issue as fast as possible.

You still have a help desk, but the help desk facilitates the call and keeps a shared whiteboard. It annotates, and it works a timeline. But all those teams are now active in resolving the issue.

The people with the expertise SWARM it, because they have the knowledge to fix the problem, and it removes those queues and hand-offs. It also removes the frustration from the customer. Imagine the customer calls, they’re having an issue. And what they realize is, they’re going to have to wait in three queues before they actually talk to the people involved. This removes that problem.

Not only has our service owner model changed how we respond to issues as they occur…

It’s also changed how we respond to them after they’re resolved.

In the old way of doing things, often our operations team would be the one that would be resolving the production issue. Then, they would have an after-action summary, or AAS, to talk through the issue after the fact.

I came from the development world. We were often blissfully unaware there was even a production issue, with no idea what an AAS was. Our infrastructure team may or may not have been involved in the process. Obviously, we weren’t applying systems thinking and continuous feedback to come up with holistic solutions.

So what do we do now? We have a post-incident review at the team level. This is one or more DevOps teams, along with infrastructure teams if necessary, talking through the timeline of what happened. It’s an informal discussion. We’re brainstorming on different ideas to make the system better. We’re asking: can we avoid this altogether? Can we get it resolved faster? What sort of knowledge and training can we share with our team members?

And then we have our after-action summary. This is targeted at a different level. It’s more of a summary of the issue for our senior leaders and our business partners, and our DevOps teams participate here as well. It’s really a summary of what the impact to the customer was, and how we’re going to get better.

So our service owners really are doing a lot.

We really only highlighted a few key areas of responsibility that they have. This slide shows additional responsibilities our service owners take on, things like performance, monitoring, tech debt, and people operations.

Change at a team level

We’ve been talking about changes that we’ve made at the organizational level. And we really have made great strides in a number of areas. However, we’ve also learned that this is a continual journey, and it’s not easy.

I’d like to take it to the team level for a minute and spotlight a few specific teams, to see how success can sometimes bring unexpected challenges, and how it’s not a straight-line journey. We’ll talk through these teams, and the successes and challenges that they’ve had.

The first team that I’d like to spotlight is the team that manages our load balancer.

Last year, there were great successes for this team in laying DevOps foundations: making work visible, automating manual changes, integrating with our telemetry system, and getting better visibility into changes we were making in our production environment. This year, infrastructure as code has been a big focus for this team.

We’ve developed a framework, and now we’re porting product by product into this framework, and we’ve got about 20 products converted over so far.

This has allowed us to make change in a much safer fashion. 

Now, with a click of a button, we can deploy to production exactly what we deployed and tested in our QA environment. To give you the scope of our manual changes: our largest took up to six hours. That was a lot of clicking in a UI and doing a lot of work. Those weren’t necessarily standard, but we were doing a lot of changes in this area.
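The “deploy exactly what we tested” guarantee can be sketched as promoting artifacts by content digest: production only accepts bytes whose hash passed QA. The `Promoter` class and its in-memory store are hypothetical stand-ins for a real deployment pipeline:

```python
# A sketch of digest-based artifact promotion; the Promoter class and its
# in-memory store are hypothetical stand-ins for a real pipeline.
import hashlib

def artifact_digest(data: bytes) -> str:
    """Content-address an artifact so QA and prod deploy the same bytes."""
    return hashlib.sha256(data).hexdigest()

class Promoter:
    def __init__(self):
        self.qa_tested = {}  # digest -> artifact name

    def record_qa_pass(self, name: str, data: bytes) -> None:
        self.qa_tested[artifact_digest(data)] = name

    def deploy_to_prod(self, name: str, data: bytes) -> str:
        digest = artifact_digest(data)
        if digest not in self.qa_tested:
            raise ValueError(f"{name} was never tested in QA; refusing to deploy")
        return f"deployed {name} ({digest[:8]})"

promoter = Promoter()
promoter.record_qa_pass("lb-config", b"frontend vip pool v2")
print(promoter.deploy_to_prod("lb-config", b"frontend vip pool v2"))
```

Any drift between what was tested and what is being deployed changes the digest, so the deploy is refused rather than silently shipping untested configuration.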

With these new, safer models, we not only have smoother deployments, but also fewer outages. However, there have been some complexities along this journey, as we’ve learned our way through it.

1 — The intake process takes longer now. It’s safer, and it’s more maintainable, but there is a larger up-front cost when we do new setups.

2 — Other issues have to do with production outages and our ability to respond to them. In the old way, once we knew what to change, we could go into a UI and change it in, say, 30 seconds. Now, we’ve got to check it out of source control, build it, and deploy it. We had to revisit our continuous integration system and our module layout, and optimize the flow through the system so that we could get things through very quickly.

3 — Next, we developed a stopgap for teams where, with the click of a button, they could do some basic things like enabling and disabling servers. But when we went to source code as the source of truth, what happened a couple of times is, we let them keep the button, they would use it to change the state of what was on the server, we would deploy over the top of that, and cause issues. So we’ve streamlined that.

4 — Another thing we had was rollback, and confusion around what to roll back to. Rollback is awesome with infrastructure as code; it’s very easy, but we need to know what to roll back to. So we integrated with our telemetry system here. We said, ‘Hey, we’re already writing what version we’re deploying. Let’s just write what the previous version was, too,’ and that’s really smoothed things for us as well.
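Recording the previous version alongside each deploy might look like the sketch below, where the rollback target is always one lookup away. The in-memory `deploy_log` is an assumed stand-in for a real telemetry system:

```python
# Sketch: on every deploy, also record the version being replaced, so the
# rollback target is always one telemetry lookup away. The in-memory
# deploy_log stands in for a real telemetry system.
deploy_log = []

def record_deploy(product: str, new_version: str) -> None:
    previous = next(
        (e["version"] for e in reversed(deploy_log) if e["product"] == product),
        None,
    )
    deploy_log.append({"product": product, "version": new_version, "previous": previous})

def rollback_target(product: str):
    """Return the version to roll back to, or None if nothing to roll back."""
    for entry in reversed(deploy_log):
        if entry["product"] == product:
            return entry["previous"]
    return None

record_deploy("edge-router", "2024.1")
record_deploy("edge-router", "2024.2")
print(rollback_target("edge-router"))  # prints 2024.1
```

The design choice is small but important: the rollback target is captured at deploy time, when it is unambiguous, rather than reconstructed during an outage.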

5 — We’ve created a synthetics framework, so now we’ve got a dashboard of about a thousand endpoints, and we can ping them and get a red/green status every five minutes. We had hesitated to do this. ‘Hey, we’re routing on behalf of these other applications. These application teams know their product better than we do. They know the feature functionality. They’ve got tests here.’ But what kept happening is, we’d have change windows, and we’d be blind during the window. We’d say, “Well, how is it looking? How is it looking?”

And then, we had multiple times where we implemented a change, it validated cleanly, only to find that there was an issue the next day.

We said we’re going to take control back ourselves and simply ping these endpoints, which gives us far better visibility than what we had. This has greatly helped with our changes: when we have an issue, we can have a post-incident review, we can talk about what changes we need for our dashboard, and we’ve even had other teams use this to troubleshoot some of their own issues.
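A toy version of such an endpoint sweep follows. The URL names and parallelism are assumptions, a real dashboard would run this on a schedule and feed telemetry, and the `fetch` parameter exists so probes can be stubbed out:

```python
# A toy synthetic-check sweep; real endpoints, scheduling, and dashboards are
# out of scope for this sketch.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def check_endpoint(url, fetch=None, timeout=5):
    """Return (url, 'green' | 'red') based on a simple HTTP probe."""
    fetch = fetch or (lambda u: urllib.request.urlopen(u, timeout=timeout).status)
    try:
        status = fetch(url)
        return url, "green" if 200 <= status < 400 else "red"
    except Exception:
        return url, "red"  # any failure shows as red on the dashboard

def sweep(urls, fetch=None, workers=50):
    """Probe many endpoints in parallel, as a dashboard might every five minutes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda u: check_endpoint(u, fetch), urls))
```

Even a simple probe like this turns a blind change window into a live red/green view across every routed application.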

6 — Another thing we’ve done is introduce a release cadence, where we’re deploying this infrastructure as code in small batches, just like we would any other sort of code. In my early days of involvement with this team, we touched production as infrequently as possible. It’s fragile and high-risk to do a lot of these changes. If someone explicitly requested something like an IP change, we would do it.

However, even in our early days of infrastructure as code, as we were learning our way through this, if we were doing something like changing the underlying standards, that wouldn’t go to prod until the next time someone requested a change.

As you can imagine, a couple of times things went to production that teams didn’t realize they were getting. But we’ve finally turned the corner to where it’s safer and lower risk to do these changes more frequently. We’ve now developed a risk analyzer, so we can show teams exactly what’s changed since the last time we went to production, and they’re getting used to the cadence.
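Conceptually, a risk analyzer like this diffs the proposed configuration against what was last deployed to production. The flat dictionary of settings below is a simplification for illustration:

```python
# Conceptual risk analyzer: diff the proposed configuration against what was
# last deployed to production. A flat dict of settings is a simplification.
def analyze_risk(last_deployed: dict, proposed: dict) -> dict:
    added = {k: proposed[k] for k in proposed.keys() - last_deployed.keys()}
    removed = {k: last_deployed[k] for k in last_deployed.keys() - proposed.keys()}
    modified = {
        k: (last_deployed[k], proposed[k])
        for k in proposed.keys() & last_deployed.keys()
        if proposed[k] != last_deployed[k]
    }
    return {"added": added, "removed": removed, "modified": modified}
```

A report like this is what lets a team see, before the change window, that a standards update is riding along with the change they actually requested.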

A great metric is how much sleep I get on the nights we make these changes.

When we first started doing these, and I first started getting involved, I can tell you I never slept through the night on any of these major changes. I was often on the call when we were doing the change itself. If I wasn’t on the call, I was checking my phone every hour, waking up, seeing how things were going.

The first time that we deployed everything to production with infrastructure as code, I slept through the night.

We’re also evolving toward self-service, coming up with lighter-weight solutions where teams can do more themselves, and supporting the cloud.

The next team I’d like to talk about is the team that manages our monitoring and alerting solution.

This team has really experienced unprecedented growth this year. We’ve got lots of users of these systems: DevOps teams, internal business partners, our help desk, and our customers. And as more and more people have seen the value of this centralized telemetry system, the requests to get products on here have really gone through the roof, so much so that we’ve outpaced our ability to scale our capacity.

We’ve made huge improvements this year in growth and in the capacity we can handle, with lots of improvements to the system.

We’ve done things like:

  • At the foundational layer, adding and separating out the infrastructure.
  • At the software layer, making similar changes, separating things out so we can scale individual components.
  • Partnering with our third-party vendors.
  • Looking at our ever-changing operational footprint: node allocation, indexing strategy, et cetera.
  • Introducing blue-green deployments.

The biggest value of this, for us, is that if we get into an unhealthy state, we can flip over to an environment that’s healthy.

That allows us to focus on current data moving forward, and improve fault tolerance between the components.
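The flip-over described above can be sketched as a router that refuses to switch to an unhealthy standby. This is a generic illustration of the blue-green pattern, not CSG’s implementation; in practice the health flags would come from real monitoring:

```python
# Sketch of a blue-green flip that refuses to switch onto an unhealthy
# standby; the health flags would come from real monitoring in practice.
class BlueGreenRouter:
    def __init__(self):
        self.active = "blue"
        self.healthy = {"blue": True, "green": True}

    def standby(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def flip(self) -> str:
        """Switch traffic to the standby environment if it is healthy."""
        target = self.standby()
        if not self.healthy[target]:
            raise RuntimeError("standby environment unhealthy; refusing to flip")
        self.active = target
        return self.active
```

Because the guard lives in the flip itself, an operator can always cut over to the healthy side during an incident without first reasoning about which environment is safe.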

We had tight coupling here, so we’ve decoupled that to the best of our abilities. We’re focusing on infrastructure as code in the cloud, so we developed cookbooks here, and this is likely to be the first product that we take to the public cloud.

Finally, we’ve improved visibility around system usage. One of the biggest things that’s bitten us is huge, unexpected volumes. Now we can better see abuse of the system, new products coming on, etc.

This team’s been an interesting use case this year. They’re one of our more mature DevOps teams. They were really doing DevOps before we’d fully embraced it as a company. They’re just a bunch of developers who happened to support an operational environment.

However, they’ve been victims of their own success a little bit, with this product that so many people like and want to have. We’re constantly operating at capacity.

We’ll make an architectural change to get more capacity, more breathing room. And then, two or three weeks later, we can’t keep the next product from coming onto the system, so we’re right back at capacity.

As you can imagine, when you’re operating at capacity, you’re fine as long as nothing goes wrong. As a result, we firefight more than I’d like. We’ve got post-incident reviews, and we’ve identified many improvements we’ve made to the system, which is great.

But it has detracted from our roadmap of where we want to go to get to true enterprise scale. So we do have a good plan, it’s just been a matter of getting the work prioritized.

The last thing I want to talk about…

is our targeted DevOps culture focus.

18 months ago, when we reorganized around DevOps, we jumped right in, and just started doing, and let the methods speak for themselves. This was intentional.

However, we took a step back and said, “Hey, we also need to follow this up with the why and the how and set the vision for the entire organization, so everybody gets it.”

First, we came up with, “What does DevOps mean to CSG?”

Some key areas for us were customer focus and delight, people, and modernization of process.

We knew we also needed to have a forum for discussion, and a venue to share our success stories. So one of the things we did, we introduced a DevOps leadership series. This is a monthly meeting where we meet with our leaders. We have a topic, and we also celebrate our wins.

We’ve extended our DevOps community of practice. This is a meeting for our practitioners where we talk about things like tooling. For instance, this month’s topic is a COBOL unit testing framework. We’re holding book clubs at the team level, and we’re also participating in our local DevOps community.

In summary, you’ve seen the graph that showed the tie-in between our business and culture metrics along our DevOps journey.

Our service owner model has really changed how we approach a number of things: having one person take a system view across the software development lifecycle, change, and incidents, both when they occur and after they’re resolved.

We’ve walked through a few examples of some teams that have shown the ups and downs of this journey along the way, and we’re starting to have a better targeted DevOps culture focus.

The help we’re looking for really centers around this culture space. We’re looking for your best practices, ideas, and success stories around these topics: things like continuing to build DevOps culture and cross-skilling, penetrating the next level of leadership, and building consistency of message.
