
February 1, 2022

Iterative Enterprise SRE Transformation at Vanguard

By IT Revolution

Around 2016, Vanguard, a global asset management company, began adopting SRE best practices to make their DevOps teams more effective, and they shared their experiences at the 2021 DevOps Enterprise Summits.

At the time Vanguard began its transformation, they had not yet begun public cloud migration. All monolithic applications were hosted in a privately owned data center. All deployments were released on a quarterly schedule by deployment and operations teams—not development teams. These deployments were monitored via an “alert-only visibility” policy, under which it was assumed that if there were no alerts, the application was up and running. Ownership of the alerts was centralized. In order to get an alert configured, the application team had to submit a request to a central team and wait for them to have a spare cycle to set up an alert.

In order to migrate from a data center to a public cloud, Vanguard had to “break down the monolith” by slowly carving out microservices, which reduced the duration of the regression cycle. They began running microservices on a platform-as-a-service (PaaS) private cloud that was run from their data center. They also introduced a test-automation engineer role to create tests for smaller slices of functionality covered by the microservices. This allowed Vanguard to increase deployment frequency, to automatically generate change records, and to attach automated test evidence to increase the velocity of the change-management process.

These changes helped Vanguard focus on lifting and shifting the PaaS into the public cloud. Because the platform was the same, the microservices already running on the PaaS in the on-premise data center needed few changes to complete the migration to the public cloud. The lift and shift was difficult for the infrastructure teams doing that work, however: it expedited the cloud migration for microservices but left Vanguard with an unnecessary abstraction layer.

Next, Vanguard needed to take the microservices that had been running in the public cloud on the PaaS and move them into more cloud-native solutions. This would reduce the operational complexity for the infrastructure teams that had put the PaaS in the cloud initially and allow them to leverage out-of-the-box resources provided by public cloud providers. Because the carved-out microservices worked a little differently from one another, it made sense to put some of them into AWS Lambda (for a serverless compute option), while Amazon EKS was a good option for other services that could benefit from the Kubernetes control plane.

Now that Vanguard was exploring cloud-native solutions outside of the PaaS, much of the infrastructure responsibility rested on the microservice application teams. These teams were able to leverage new features (autoscaling, automated task replacement), but they needed to test new configurations, which were no longer managed centrally through on-premise or PaaS configurations. They needed to adopt new day-to-day processes, such as using chaos engineering to validate the teams’ hypotheses about how their systems behave in times of stress. Performance testing was another way to test systems, but as deployment frequency increased, the teams also needed to increase the frequency and flexibility of performance testing. To address this, they built a performance-testing-as-a-service application so that all individual product teams had access to the hosted load generators they needed to test their own applications.
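To make the idea concrete, a hosted load generator of this kind boils down to issuing concurrent requests against a team’s endpoint and reporting latency percentiles. The sketch below is a minimal illustration, assuming a hypothetical target URL, request count, and concurrency rather than anything from Vanguard’s actual platform.

```python
# Minimal sketch of a load generator a performance-testing-as-a-service platform
# might run on a team's behalf. The endpoint, request count, and concurrency are
# hypothetical placeholders.
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET_URL = "https://example.internal/api/health"  # hypothetical service under test
REQUESTS = 200
CONCURRENCY = 20

def timed_request(url: str) -> float:
    """Issue one request and return its latency in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return time.perf_counter() - start

def run_load_test() -> None:
    # Fan requests out across a thread pool, then summarize latency.
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        latencies = sorted(pool.map(timed_request, [TARGET_URL] * REQUESTS))
    p95 = latencies[int(len(latencies) * 0.95) - 1]
    print(f"p50={statistics.median(latencies):.3f}s  p95={p95:.3f}s")

if __name__ == "__main__":
    run_load_test()
```

A shared service like this mainly saves each product team from provisioning and maintaining its own load-generation infrastructure; the test logic itself stays simple.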

What seemed like significant additional cognitive load on the application teams actually yielded several successes. Teams used chaos game days (in which they developed a hypothesis about system resilience and then tested that hypothesis by purposefully causing crashes to different components of the application and validating scaling behaviors and self-healing) and chaos fire drills (in which they intentionally injected faults they knew would raise system alarms in order to test new observability tools before they went live). In addition to testing systems and tools, these processes were also helpful for onboarding and training.
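A chaos game day of the kind described above can be expressed as a small experiment script: state a hypothesis, remove a component on purpose, and check that the system recovers within a time budget. The sketch below is a hypothetical illustration against a Kubernetes deployment; the namespace, label selector, and health endpoint are placeholders, not Vanguard’s environment.

```python
# Minimal sketch of a chaos game-day check: delete one pod and verify the service's
# health endpoint recovers within a budget. All names and URLs are hypothetical.
import random
import subprocess
import time
import urllib.request

NAMESPACE = "orders"                                   # hypothetical
LABEL_SELECTOR = "app=orders-service"                  # hypothetical
HEALTH_URL = "https://example.internal/orders/health"  # hypothetical
RECOVERY_BUDGET_SECONDS = 120

def list_pods() -> list[str]:
    # Ask kubectl for the pod names behind the deployment.
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", NAMESPACE, "-l", LABEL_SELECTOR,
         "-o", "jsonpath={.items[*].metadata.name}"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.split()

def is_healthy() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def run_experiment() -> None:
    victim = random.choice(list_pods())
    print(f"Hypothesis: the service stays available after losing pod {victim}")
    subprocess.run(["kubectl", "delete", "pod", victim, "-n", NAMESPACE], check=True)
    deadline = time.time() + RECOVERY_BUDGET_SECONDS
    while time.time() < deadline:
        if is_healthy():
            print("Hypothesis confirmed: service recovered within budget")
            return
        time.sleep(5)
    print("Hypothesis rejected: service did not recover in time")

if __name__ == "__main__":
    run_experiment()
```

The same structure works for a fire drill: instead of deleting a pod, the script would inject a fault known to trip an alarm and then confirm the new observability tooling surfaces it.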

In a unique process that blurred the line between chaos experimentation and performance testing, Vanguard also performed brake testing on the CI/CD pipeline. As a large IT organization, they ran into growing pains while onboarding microservices onto the CI/CD pipeline: they observed recurring instability during high-traffic periods, and the resulting crashes prevented thorough investigation because they wiped out the critical logs before the logs could be offloaded to log-aggregation tools.

To troubleshoot this issue, they performance-tested the pipeline over a weekend, creating builds and deployments that reproduced the specific resource-intensive conditions. They recreated the crashes while someone watched the log files as they were written, then captured the relevant log files and thread dumps, allowing them to identify and address the bottleneck. By the following Monday, they saw immediate improvements to the pipeline.

Vanguard also had to find the right tools for their observability journey. The “alert-only visibility” policy had left behind legacy alert consoles that were still almost exclusively used by operations teams. As microservices were carved out, the key benefit of the PaaS was that all the applications were operating in the same containerized environment. Their logs were filtered in the same ways and flowed into the same places, so Vanguard was able to create standard microservice platform dashboards, originally intended for use by the platform owners.

They soon realized how beneficial these dashboards were for application teams, who began working with infrastructure teams to make them more customizable by cloning the dashboards and adding their own panels that were specific to their use cases. Application teams began adding alerts, which helped them to move quickly now that they were no longer submitting requests to central teams. This allowed them to make data-driven decisions as a unit; however, some of the consequences of these customizable dashboards included dashboard clutter, ignored alerts, and alert fatigue, because teams had access to alert customization but not all of the necessary information about best practices for alerting and dashboarding.

Because there were so many dashboard and alert queries running, “everything was logs.” Everything was in one tool and everyone used one querying language, but the scope and utilization were increasing. Costs were also rapidly increasing, and the performance of the tool was degrading, meaning that troubleshooting was held up by dashboard performance.

To solve this problem, Vanguard pulled metrics and traces out of the central tool, utilizing tools like Amazon CloudWatch for metrics and Honeycomb for traces. The distributed tracing functionality offered by Honeycomb was used in chaos fire drills to see how much easier it would be to identify sources of latency within a complex web of microservices. A single request from the user interface all the way to the datastore and back could traverse many different microservices, depending on each investor’s account structure.

As part of the move to distributed tracing and the adoption of Honeycomb, Vanguard also standardized around OpenTelemetry, a move they saw as investing in the future by learning from past mistakes with other tools. Because standardizing on the OpenTelemetry framework for sending telemetry data to backend collectors is now common practice in the industry, Vanguard believed doing the same would allow them to more effectively avoid vendor lock-in. Many observability tools offer integration with OpenTelemetry collectors out of the box, which makes it simple to swap out backends for logs, metrics, or traces in the future if it becomes necessary.
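The value of that standardization shows up in how little application code needs to know about the backend. The sketch below, a minimal example using the OpenTelemetry Python SDK with an OTLP exporter, illustrates the general pattern; the collector endpoint, service name, and span attributes are hypothetical placeholders, not Vanguard’s configuration.

```python
# Minimal sketch of vendor-neutral tracing with OpenTelemetry: application code
# uses only the OpenTelemetry API, while the backend (Honeycomb or any other OTLP
# destination) is chosen purely by exporter configuration.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Backend choice is isolated here; swapping vendors means changing only the exporter.
provider = TracerProvider(
    resource=Resource.create({"service.name": "account-service"})  # hypothetical name
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="https://otel-collector.internal:4317")  # hypothetical
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def get_account_summary(account_id: str) -> dict:
    # Application code touches only the vendor-neutral API.
    with tracer.start_as_current_span("get_account_summary") as span:
        span.set_attribute("account.id", account_id)
        return {"account_id": account_id, "balance": 0}
```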

As part of this investment, central teams at Vanguard began creating shared libraries for the benefit of application teams, e.g., to extract common fields that teams regularly want to add to their trace context.
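As a rough illustration of what such a shared library might provide, the sketch below shows a helper that stamps common business fields onto the active span. The field names and the context object are hypothetical, not Vanguard’s actual library.

```python
# Minimal sketch of a shared helper that attaches common fields to the current trace
# context so every application team tags spans consistently. Field names are
# hypothetical examples.
from dataclasses import dataclass
from opentelemetry import trace

@dataclass
class RequestContext:
    client_channel: str  # e.g., "web" or "mobile"
    account_type: str    # e.g., "brokerage" or "retirement"
    region: str

def annotate_current_span(ctx: RequestContext) -> None:
    """Copy the common business fields onto the active span."""
    span = trace.get_current_span()
    span.set_attribute("app.client_channel", ctx.client_channel)
    span.set_attribute("app.account_type", ctx.account_type)
    span.set_attribute("app.region", ctx.region)
```

Centralizing this kind of glue keeps attribute names consistent across teams, which is what makes cross-service queries in a tracing backend useful.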

More recently, Vanguard has continued adopting SRE best practices by changing the way they measure availability, injecting nuance into what was previously a binary discussion. They have been rolling out the SRE practice of using SLIs and SLOs to talk about availability and latency in terms of reasonable thresholds based on client expectations, as opposed to saying that an application must be as fast as possible and always available, a mindset that leads to burnout since 100% uptime is impossible to achieve.
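In practice, an SLO turns that conversation into arithmetic: pick a target, measure the SLI over a window, and track the remaining error budget. The sketch below shows the basic calculation with illustrative numbers, not Vanguard’s actual targets.

```python
# Minimal sketch of an SLI/SLO check with an error budget. All numbers are
# illustrative placeholders.
SLO_TARGET = 0.999          # 99.9% of requests succeed within the latency threshold
total_requests = 1_250_000  # requests observed in the SLO window
good_requests = 1_249_100   # requests that met the availability/latency threshold

sli = good_requests / total_requests              # measured service level
error_budget = (1 - SLO_TARGET) * total_requests  # allowed bad requests this window
budget_spent = total_requests - good_requests     # bad requests so far
budget_remaining = error_budget - budget_spent

print(f"SLI: {sli:.4%} (target {SLO_TARGET:.1%})")
print(f"Error budget remaining: {budget_remaining:.0f} of {error_budget:.0f} requests")
```

With these numbers the service is meeting its target (99.928% measured against 99.9%), with 350 of 1,250 budgeted failures left, so the team can keep shipping rather than freezing changes.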

Moving forward, Vanguard is aiming to strike the right balance between efficiency and flexibility, and between time spent training and upskilling and time spent delivering. They are also seeking to decrease on-premise workloads by moving more applications to the public cloud, and to become fully observable by developing a single visualization tool that aggregates their telemetry data. Finally, they are striving to mature their blameless post-incident reviews by sharing knowledge throughout the IT organization, which, when done correctly, maximizes learning for the whole organization and improves their ability to operate effectively as DevOps teams.

About The Authors

IT Revolution

Trusted by technology leaders worldwide. Since publishing The Phoenix Project in 2013, and launching DevOps Enterprise Summit in 2014, we’ve been assembling guidance from industry experts and top practitioners.
