The following is an excerpt from a presentation by Stephanie Gillespie and John Rzeszotarski from KeyBank, titled “Augmenting the Org for DevOps.”
You can watch the video of the presentation, which was originally delivered at the 2017 DevOps Enterprise Summit in San Francisco.
Stephanie is head of digital channel technology at KeyBank for its community bank. Her team aligns directly with the retail line of business, and they design, develop, and deliver online banking solutions for retail clients.
John is the director of the continuous delivery and feedback organization. He manages part of the infrastructure for Stephanie, as well as their code release management teams and monitoring teams.
Who is Key?
Well, we’re probably the biggest bank you’ve never heard of.
We are the 13th largest bank in the United States. We were founded in 1825 and we are headquartered in Cleveland, Ohio. We have about 1,200 branches across 15 states currently, and we have about 20,000 employees, of which about 25%, or 5,000, are in our IT and ops organization.
We have three million clients and about 135 billion dollars in assets. That puts us at the highest standard of regulatory requirements from an OCC perspective. Not much different than our friends over at Bank of America. So, it makes our lives in IT just that much more interesting, but also complex.
Here’s a map of the states that we’re currently located in.
The ones that are highlighted in black are where we have our retail presence, and the ones highlighted in gray are where we have our corporate and private banking presence.
We’ve grown over the years through acquisitions and mergers to kind of scale our footprint across the United States, but there is a lot of technical debt when you grow that way. There are a lot of systems that we have to upkeep and maintain, and we did that primarily through a very traditional model.
We had a separate development team that was really focused on the line of business. We had a separate security and EA team focused on standards, and we never fired anyone at Key for adding another layer of security.
Then we had an operations team that was very siloed by technology platform.
Then these teams all had to come together to develop solutions, but they all had different priorities.
- The development team’s priority is speed to market, trying to get the features out as fast as possible. Sometimes, cutting corners, maybe.
- The security and EA team is really focused on governance and standardization.
- The operations team is focused on reliability. They don’t want too much change because they’re just trying to keep the lights on.
So, designing solutions kind of becomes a downward spiral, and you end up designing point-to-point solutions.
This was most notable when we had a significant outage in 2015.
Here is a representation of our online banking platform, and you’ll see one red user and one green user, and just one transaction where they’re clicking the login button.
One login transaction effectively meant 200 network hops behind the scenes. We could ping-pong back and forth across our data center anywhere from 7 to 30 times for one transaction. We also had single points of failure built within both data centers. We weren’t any more reliable by having two different data centers by any means; we were just more complex.
So, in 2015, we had a network outage, and it was a catastrophic event. We tried to fail around the network outage, and because we didn’t have this diagram (which would have been nice), we made it worse. We were down for the better part of a day, and our CIO and our chief architect demanded change, not just from the infrastructure teams, but also from the application side.
Enter Digital17, or d17 as we called it at Key, which was really the name of the project that we were going to implement in order to get rid of all that complexity and craziness.
Digital17 was intended to be a two-year project for the 2 million clients that we had in our online banking digital applications at the time, but the question wasn’t necessarily how could we create an online banking application that didn’t suck…
The real question was how could we build the digital framework that allows us to test and learn inexpensively, and the answer was to change everything.
We looked to change the architecture, getting away from that monolithic, single application that everything ran within, and breaking it out into a three-tiered architecture separating the user interface from the channel services, from the core processes, and the enterprise services.
Then we looked at the user experience. How do we redefine the user experience and break it up into a widget-based design that we could then extend to both our web and our mobile applications?
Then we focused on people, and we had to reskill the team that we had to learn the new technologies and to learn the new frameworks.
And finally, we changed the way we worked and collaborated together by leveraging agile-based practices.
Everything needed to be built to change and change quickly.
Good thing for that, though, because a few months into the project, Key announced we were going to purchase First Niagara.
And oh, by the way, those 1 million clients that we had just purchased were going to be migrated into Key’s environment before our digital17 application was going to launch.
Wow, what do we do?
I think any reasonable organization would say “Hey, let’s stop the project, let’s focus on this acquisition.” At the time, it was the largest banking acquisition in the country since the 2008 crash. We needed it to be successful.
But that meant bringing 1 million clients onto that crazy complex platform only to then migrate them to a new experience and a new platform a few months later.
We made the decision to accelerate, and that 24-month project became an 18-month project, and we had to scale for an additional 30 percent user base.
We kept the name digital17, even though we were now going to implement in Q2 of 2016.
Why did we think we could accelerate, and if that was the case, why wouldn’t we have just done that to begin with? Why wouldn’t we have just had an 18-month project to start?
Well, really, to make this work, we had to change the traditional approach to application development and infrastructure management at Key.
There were two things that we wanted to focus on, and both involved speed.
- We needed speed in decision making on the left side of the equation. We identified decision owners, we held them accountable, answers were needed in 24 hours or less, or we were escalating.
- We also needed a way to deploy our application quickly. And we needed a way to scale our infrastructure quickly. If we were making quick decisions, chances are we weren’t going to get everything right the first time, and when those scenarios cropped up, we wanted to be able to change them fast.
There’s actually a great quote from Kurt Bittner of Forrester Research that I think sums this up perfectly. It basically says, “If agile was the opening act for a great performance, then continuous delivery would be the headliner.”
So, really, you need both. You need to consider the full end-to-end spectrum.
And hence, our exploration into DevOps.
When we came to the DevOps Enterprise Summit in 2015, we got to hear from Target and Capital One, and we were kind of like “We would never do that, not at our bank,” but after we went, we basically came out with three big things.
- We had to have executive support. You’re going to have to get your CIO’s buy-in.
- It’s got to be metrics driven. For us, it was mean time to resolution, it was release frequency, and it was some of our service levels for infrastructure services.
- The focus of removing bottlenecks had to be at the forefront of every problem that we came across.
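To make the metrics bullet concrete, here is a minimal Python sketch of how two of the measures named above, mean time to resolution and release frequency, could be computed. The talk does not describe how Key actually tracked these numbers, so the function names and the shape of the input data are assumptions made only for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - opened) over a list of (opened, resolved) pairs.

    The (opened, resolved) tuple format is a hypothetical record layout.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


def releases_per_week(release_count, period_days):
    """Release frequency expressed as releases per 7-day week."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return release_count / (period_days / 7)


# Hypothetical sample data, not figures from the talk.
sample_incidents = [
    (datetime(2017, 1, 1, 8, 0), datetime(2017, 1, 1, 10, 0)),
    (datetime(2017, 1, 2, 8, 0), datetime(2017, 1, 2, 12, 0)),
]
print(mean_time_to_resolution(sample_incidents))
print(releases_per_week(release_count=4, period_days=28))
```

Tracking both numbers over time is what lets a team see whether removing a bottleneck actually moved anything.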
We had to figure out where we were going to start, and what you see in the slide below is our traditional waterfall enterprise software and development lifecycle framework, which I’m sure many of you have in some form within your organizations.
But we couldn’t wait for a top-down corporate initiative to tell us how to get started with DevOps. We had to purposefully pick certain areas of that framework where we thought we could be successful in order to get this movement started.
What you see highlighted in blue with the circles is where we chose to start. We chose to start with the installation of the infrastructure and the configuration, and that’s where we brought in containers and Kubernetes.
We also looked at automated testing, and how do we automate the regression testing? How do we automate testing so that we can identify defects earlier in the cycle at the time that the builds are being committed? And how do we leverage continuous delivery within our development pipeline?
Then we layered in agile practices throughout the life cycle as it made sense in order to enable those quick decisions. The point was that it was kind of a grassroots, bottom-up initiative. We chose where we wanted to start, and then we took the pitch to our executive leadership team.
Here’s John riding up the elevator with our CIO, Amy Brady, but trust me, Amy is a lot less scary and a lot more attractive than the picture.
The crux of the conversation was “look, we are in the middle of this major program to re-platform our application and now, we have to accelerate for the First Niagara acquisition. We need to think differently about the way we run and operate our platforms to bring the speed.”
Obviously, the conversation went well or we wouldn’t be here today.
I would encourage you, as you look to carry these movements forward in your own organization, to be thoughtful about who you’re making your pitch to, because DevOps is not an easy thing to explain to people, especially outside of the technical community.
Now, let’s talk briefly about containers.
The way I like to explain it is that when we want to build a specific product, we start with CPU, RAM, and disk resources.
- We virtualize that to get more bang for our buck.
- We install operating systems on top of it.
- We install frameworks, we install more frameworks because one’s never enough.
- Then, we install a platform so we can get vendor support and vendor lock-in sometimes, and then we install applications on top of that.
- We have to go through and configure it.
- Some have some security vulnerabilities. We’ve got to get them fixed.
- I have to operationalize it and make sure I have the right alerting and the right monitoring all put in place.
Now, I’m done.
No. I’ve got to test.
- I’ve got to test the dependencies between all those different layers.
- I’ve got to test and validate to make sure the application’s working.
Now, I’m done.
Nope. It’s time to start all over again because I’ve got to go patch, upgrade, and fix the application. Oh boy!
So, at Key, each of these different lines and boxes in the slide above is typically a different team. So, a project manager is having to cross-coordinate across dozens of teams in order to put a system together, and by the way, we don’t just have to do that once. We have a dev environment, an IT environment, and a QA environment, et cetera.
Containers, game changer.
I build it once as an image, I build it through code, and I just deploy it on top of the infrastructure.
The patching, upgrading, fixing all goes away. I deploy it with the application. I’m also giving more responsibility back to the developers. They control more of their configuration for what they need to operate with. But I also have to protect the developers from themselves because sometimes, they want more resources than I’m willing to give them.
This is kind of where Kubernetes enters.
We didn’t go to Kubernetes because we wanted to build a platform as a service and let developers go out and build innovative apps. We went there for reliability.
We went there because Google is always up, and we want to emulate that. Their rolling deployments, their autoscaling, the high availability they offer: that’s exactly what we wanted, and that’s what we put together.
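As a rough illustration of what autoscaling means in practice, the Kubernetes documentation describes the core rule of its Horizontal Pod Autoscaler as desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a small tolerance of 1. The Python sketch below reproduces that arithmetic. The talk does not say how Key configured autoscaling, so the function name, the replica bounds, and the sample numbers are assumptions.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the replica calculation documented for the Horizontal Pod Autoscaler.

    min_replicas, max_replicas, and the sample values used below are
    hypothetical; 0.1 mirrors the documented default tolerance.
    """
    ratio = current_metric / target_metric
    # Close enough to the target: leave the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


# Four pods averaging 90% CPU against a 60% target scale out.
print(desired_replicas(current_replicas=4, current_metric=90, target_metric=60))
```

The clamp is the part that protects the platform from a runaway workload, which echoes the earlier point about not handing out more resources than you are willing to give.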
So, one other area we had to focus on was continuous delivery. Because we’re a very large bank, with 700 applications, hundreds of project teams, and lots of teams doing things differently, we wanted to standardize those pipelines while still offering really good flexibility. XebiaLabs XL Deploy and XL Release came in and have done a great job for us.
The other thing to mention is that our two separate teams had to collaborate to help build reliability within the application, and we implemented something called the circuit breaker pattern.
All banks have to use third-party services, whether it’s FIS, Visa, or MasterCard, and when they’re down, we don’t want that to affect us. So, this is where the circuit breaker pattern comes in. We use Netflix’s Hystrix framework to help safeguard our application, and it’s come in handy countless times.
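For readers who have not met the pattern, here is a minimal Python sketch of a circuit breaker. It is not Hystrix and not Key’s implementation (Hystrix is a Java library with many more features); the class name, thresholds, and fallback are hypothetical and only show the idea: after repeated failures, stop calling the third-party service for a while and serve a fallback instead.

```python
import time


class CircuitBreaker:
    """Illustrative breaker: closed -> open after N failures -> one trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the remote call until the reset timeout has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Timeout elapsed: fall through and let one trial call go out.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or re-open after a failed trial
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def card_service_down():
    # Stand-in for a third-party call that is currently failing.
    raise ConnectionError("third-party service is down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0)
for _ in range(3):
    print(breaker.call(card_service_down, lambda: "card data temporarily unavailable"))
```

The design point is that once the circuit is open, the failing dependency is no longer called at all, so its outage cannot tie up the application’s own threads and connections.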
What’s DevOps without great tools, right?
And what’s DevOps without automation?
For Key, with the speed we were trying to sustain with d17, we needed to really focus automation around testing because you’re only as fast as your slowest bottleneck, and manual testing takes a long time.
Given the fact that we had one million logins per day, we weren’t willing to risk quality for the sake of speed and shortchange testing.
We took time to build automated test scripts, which got us much more coverage than we were used to and also took much less time, from 20 hours down to less than 12 minutes.
It really helped to increase our confidence in what we were deploying because those defects were caught earlier in the cycle at the time the code was being built and migrated through the environments. That was a big win for us.
So, how did everything go?
We talked about the tools, we talked about testing. D17 was actually very successful. We met our accelerated timeline, we migrated Key clients into the new platform, and we got prepped and ready for First Niagara.
When that day came to migrate those one million clients into Key’s environment, of which 500,000 were online banking clients, things didn’t go so well.
But it wasn’t the technology. The technology actually performed successfully. What it ended up being was a decision around the user experience in the first-time login. Where we thought we were going to be making it easier for clients, we actually made it very confusing, and they were locking themselves out left and right.
Calls started flooding into the contact center. We had wait times of over two hours. There was a social media frenzy about how badly KeyBank screwed this up. Those were not the headlines we were looking for, given all the planning and preparation that went into the migration, especially since the technologies were performing.
So, although it was a firestorm, it actually ended up being kind of a blessing in disguise for the DevOps movement.
At Key, this is when the real beauty of DevOps came into play.
Envision a command center, and envision you’ve got senior executives, you’ve got your business, you’ve got your developers, you’ve got your operations team, you’ve got your testers all hunkered down in a room talking, working, collaborating, and we made changes to the user experience on the fly.
We had 10 changes within four business days, all done during the day, and not a single one impacted our clients or brought the system down.
We were able to quickly make changes to the way the application was working and the experience that the users had in order to manage their first-time login.
With that, DevOps basically sold itself that day. It really helped those people at Key realize that we were well on our way to becoming that 190-year-old digital bank that our CEO Beth Mooney so often refers to.
Where do we go from here?
With all that success, there was a real appetite to get more DevOps into the enterprise. Like, where else can we put it?
We started looking at other areas that could leverage this framework. Where can we grow one team at a time? We know it’s going to be an iterative, continual process, but it’s also not just about expanding DevOps across other teams. It’s also about how you go deeper with DevOps in the teams that are currently using it.
As an example, in digital, we’re running in containers, we’re leveraging a lot of these frameworks, but our releases, they’re still pretty big. So, how do we break those up into smaller chunks? And how do we leverage containers more effectively so that we can deploy changes to small subsets of users and test those out before we roll them out to the broader user base?
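One common way to expose a change to a small subset of users is to hash each user ID into a stable bucket and send only the lowest buckets to the new version. The Python sketch below shows that idea under stated assumptions: the function name, the salt, and the percentages are hypothetical, and the talk does not say which rollout mechanism Key chose.

```python
import hashlib


def in_canary(user_id: str, percent: int, salt: str = "release-1") -> bool:
    """Return True if this user falls in the canary cohort for this release.

    The same user always lands in the same bucket for a given salt, so the
    experience stays consistent across logins. The salt is a hypothetical
    per-release label used to reshuffle cohorts between releases.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in 0..99
    return bucket < percent


# Route roughly 5% of a made-up user list to the new version.
users = [f"user-{i}" for i in range(1000)]
canary_users = [u for u in users if in_canary(u, percent=5)]
print(len(canary_users), "of", len(users), "users see the new version")
```

Because a user’s bucket is fixed, raising the percentage only adds users to the cohort; nobody who already has the new version is moved back.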
D17 was actually the flagship application that started running within this framework at Key. Since then, our corporate banking online application is now running in this environment as of Q3 into Q4 of 2018, and our online account opening application is moving to this framework in December 2018.
There’s a lot of excitement. We’re going to keep moving, but we’re also going to need to expand and continue to go deeper into some of these principles.
But, as we started to scale this, not all the engineers got onboard.
Kind of a shocker, right? We definitely have some passionate change agents that want to change, sometimes even change a little bit too much, but we also have the engineers that have been able to keep reliable systems up for a very long period of time.
But when these two groups get together, there can be a little bit of animosity.
It’s not necessarily a bad thing, right, because you wouldn’t stand up for what you believe in if you didn’t strongly believe it. There are right and wrong points on either side of this. You have to be empathetic. That’s a must-have.
The way we’re handling this is through three different types of leadership.
- Leadership at the engineering level to actually show how to do the changes. Yes, you can do this in a large enterprise, and I’m going to show you how.
- We do think leadership at the middle management layer is also required. That’s to sell the business case, put the plan together, and prioritize it as part of the continuous improvement backlog.
- And then lastly, you still need that executive leadership that’s going to give you the funding and has the strong beliefs that you have to continuously invest in your systems long term.
Also, we’re changing the mix of contractors to employees.
It’s not because contractors don’t bring value and expertise, but because we also want to reduce risk by keeping critical subject matter expertise internal to the organization. As we’re growing our employee base, we’re not growing the same type of talent that we’ve always grown. We’re not necessarily looking for a developer anymore who only knows how to write their application code.
We’re looking for an engineer who knows how to write application code, but also understands how that code gets deployed and how it works within the broader ecosystem, how caching can come into play, how proxy settings can come into play: just a full end-to-end engineer.
And we continue to grow the team.
One of the things that happened in July 2017 is that Key announced the purchase of HelloWallet. Here we are, a bank, buying a software company, and HelloWallet is a startup.
They have about 32 employees in the Washington, D.C. area. We purchased their capability, which is going to be a core strategic capability for Key’s strategy going forward, but we also purchased the talent, and this is a group of engineers who have new ways of thinking.
They’re challenging the historical traditional approach that we’ve got in our 190-year-old bank, but really embracing open source technologies, embracing agile, and challenging us in the way we’ve been thinking. So, we’re really excited about having this team onboard and bringing them into our thought leadership.
We also brought in a new CTO this year, and he definitely is behind a lot of the practices and principles of DevOps. In his first town hall, he came out and made a big statement that we need more technology generalists if we think we’re going to be able to move at the speed the line of business wants. So, Linux engineers start using a mouse, Windows engineers start using a keyboard.
But his other analogy that he uses that I really like, I think it hits home, is that we’re actually air traffic controllers.
We have to land 99 planes. If we land 98 out of 99, it’s a failed day. That is not acceptable.
Our job now is to get those planes out faster, get more of them out, and make sure that they’re always on time. In doing so, we have to reevaluate our organizational structure to try to optimize for speed. We were a vendor that very much focused on keeping the lights on versus continuously improving the infrastructure. Now, we’re very eager to start taking baby steps.
We followed our DevOps books on Conway’s law and said, “Alright, how do we get work through the infrastructure pipeline today?”
There needed to be a much bigger focus on planning. We planned within the silos, and we planned to do the minimum amount of upgrades, etc.
Infrastructure development was also at the forefront. It’s kind of funny because we’ve had engineers push back on automation because they said “well, we don’t do it that often,” and I was like,
“Well, wait a second. That’s exactly why you would like to develop it and automate it. If you’re not touching it very much, you’re probably going to make a mistake the next time you do it. It’s going to be more accurate. It’s going to be traceable because it’s going to be sitting in source control. It’s going to be versioned, and it’s going to be much easier for that next person to pick that up. You want to automate it if it’s not a very common task, just as you would want to automate it if it is a common task.”
We do think that we want to organize the infrastructure development by our lines of business to help share the prioritization, and then we still have 700 applications that we have to support and keep up and running. Some of them are very specific. We still have specialists, but we’re striving to make those specialists more generalized.
Now, throughout our entire technology organization, there is a huge focus on continuous learning. One of the things that our CTO also did was implement something we call our 8 a.m., which is a post-mortem call. He’s very vocal and very adamant that it’s not a blame session; we all need to learn together. We all need to be able to figure out why something happened and how it is affecting our customers, and then continuously improve everywhere within the bank.
Most of the entire technology organization calls into this 8 a.m., and it’s actually inspired a lot of people and driven a lot of passion.
Ultimately, we still have to make sure that each one of our organizations is able to federate to support the Dev team because, obviously, being able to hit our line of business objectives is the most important thing.