
April 7, 2022

Jason Cox’s Site Reliability Engineering (SRE) Playlist

By Jason Cox

What is Site Reliability Engineering?

Site Reliability Engineering (SRE) is often considered an expression of DevOps. Like DevOps, SRE seeks to bridge the traditional gap between development and operation teams with the goal of improving the organization’s ability to ship and operate software better, faster, safer and happier. DevOps is the philosophy that informs the practice and SRE is a practice that helps deliver on the vision.

This close linkage between DevOps and SRE has resulted in growing numbers of talks and publications produced in the DevOps community about SRE. DevOps Enterprise Summit is no exception.

As a particular interest area for me, I have started curating a collection of talks that I found to be especially informative and helpful. Whether you have a mature SRE organization in place already or are just starting out in your SRE journey, I believe you will find these talks to be compelling and inspiring.

The Origin of SRE

I would be remiss if I didn’t point out the origin of SRE.

Google, specifically Ben Treynor Sloss, is credited with creating SRE as a functional discipline in 2003. Google SRE grew out of the company’s effort to help product teams get features to market quickly without jeopardizing the reliability or correctness of its services.

At Google, SRE is organized centrally. Presented as a scarce resource, SRE is selectively deployed into the most mission-critical products, at the request of those product teams. Google SRE leaders Jennifer Petoff and Christof Leng give a great overview of how SRE is deployed at Google in this 2021 DOES talk:

How Google SRE and Developers Work Together (US 2021) *
Jennifer Petoff, PhD, Director, SRE Education, Google
Christof Leng, PhD, SRE Engagements Engineering Lead, Google


Google has also written several books on SRE, including a workbook. These are available online.

The Language of Reliability

As the name implies, SRE focuses on reliability. But what is reliability?

There is a tendency to simply focus on traditional availability that comes packaged in a neat little binary bucket of “up” or “down”. Service interruptions record total minutes of downtime, and postmortems are full of lore about fights waged over the beginning and ending of the “outage”. Unfortunately, that approach completely misses the mark for our customers.

Understanding true reliability means understanding the true expectation of the customer or business that uses the system. A full-service outage that is restored within 24 hours may be acceptable, but getting otherwise undetected error responses more than 0.1% of the time may not be. A delay of more than 15 ms in a response may be unacceptable for one service, while a two-day data pipeline update may be fine for another.

SRE understands the complexity of reliability and uses a specific language to crisply define and operate reliability at scale. The vocabulary of this language is based on “service levels”, specifically service level indicators (SLI) and service level objectives (SLO).
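
To make the vocabulary concrete, here is a minimal sketch of how an availability SLI, an SLO and the resulting error budget relate. It assumes a request-based SLI (good requests divided by total requests) and reuses the 0.1% error figure from above as a 99.9% target. The names `SloReport` and `evaluate_slo` are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float              # measured fraction of good requests in the window
    slo_target: float       # the objective, e.g. 0.999 for "99.9% of requests succeed"
    budget_requests: float  # number of failed requests the SLO tolerates in this window
    budget_consumed: float  # share of that budget already used (1.0 means all of it)

    @property
    def slo_met(self) -> bool:
        return self.sli >= self.slo_target


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare a request-based availability SLI against an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")

    # SLI: the fraction of requests that were served successfully.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the failures we are allowed before the SLO is missed.
    budget_requests = (1 - slo_target) * total_requests
    budget_consumed = failed_requests / budget_requests
    return SloReport(sli, slo_target, budget_requests, budget_consumed)


# Hypothetical numbers: one million requests, 400 of them failed.
report = evaluate_slo(total_requests=1_000_000, failed_requests=400, slo_target=0.999)
print(f"SLI={report.sli:.4%}  budget used={report.budget_consumed:.0%}  met={report.slo_met}")
```

With these made-up numbers the service is inside its objective and has used roughly 40% of its error budget for the window, which is the kind of statement the language of SLIs and SLOs is meant to make possible.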

David Stanke of Google and Adam Shake of MediaMath talk about the mission of SRE: to protect and improve availability, latency, performance and capacity, and to do so in a principled way using the language of SLIs, SLOs and error budgets. I particularly enjoyed Adam Shake’s experience report on moving systems administrators into SRE and the impact that had on the rest of the business.

WATCH: Service Level Objectivity: Improving Mutual Understanding Through the Language of SRE (MediaMath and Google) (Las Vegas 2020)
Adam Shake, Director of Site Reliability Engineering, MediaMath Source
David Stanke, Developer Advocate, Google

Summary: great introduction to SRE and how MediaMath went from SREs called SysAdmins to real SREs

Adam references a DevOpsDays Chicago workshop, “Site Reliability Engineering (SRE) and the Art of SLOs” by Jennifer Petoff and Nathen Harvey from Google. Adam describes how this workshop was instrumental in their adoption of SRE and SLOs. If you are just starting on your SLO journey, this could be a helpful tool for you as well.

The Scarcity of SRE

Anyone who starts on the SRE journey will quickly discover the challenge of meeting demand. With any degree of success, you will quickly have more need for SRE than you will have SREs. Recruiting and training efforts tend to lag behind the growing demand for more SRE help.

So what do you do? How do you deploy for greatest effect?

Stephen Thorne, Staff SRE at Google, provides some guidance on how to prioritize the next engagements for SRE.

His simple framework helps you identify application targets based on being Mission Critical, Operable and Mutable. This allows the scarce resource of SRE to be applied to the highest value targets. Stephen gives several examples and provides additional advice on how to scale your SRE team, like lowering SLO targets when you can.
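
To see why lowering an SLO target can free up SRE capacity, it helps to translate targets into allowed downtime. The following is a minimal sketch, assuming a time-based availability SLO measured over a 30-day window. The function name `allowed_downtime` is illustrative and is not taken from Stephen’s talk.

```python
from datetime import timedelta


def allowed_downtime(slo_target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Return how much full downtime a time-based availability SLO tolerates per window."""
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in the range (0, 1]")
    # The error budget is the complement of the target, applied to the window length.
    return window * (1 - slo_target)


# Compare a few common targets over the assumed 30-day window.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime(target).total_seconds() / 60
    print(f"{target:.2%} target allows about {minutes:.1f} minutes of downtime")
```

Each additional “nine” cuts the budget by a factor of ten: roughly 43 minutes per 30 days at 99.9% versus about 4 minutes at 99.99%. A target that is tighter than the customer actually needs leaves very little room for change and consumes SRE attention that could go elsewhere.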

WATCH: When Is SRE Right For You? (London 2019)
Stephen Thorne, Staff Site Reliability Engineer, Google

The Adoption of SRE

How do organizations get started with SRE?

Christina Yakomin and Robbie Daitzman at Vanguard give their story of adopting SRE. Their tale begins with a massive cloud migration effort that demanded serious refactoring of legacy monolithic services into more cloud-native platforms and microservices. Automation and measurement were key to being successful. By shifting to SRE-driven SLIs, SLOs and error budgets, they were able to answer questions like, “What is healthy?” This allowed them to better comprehend and use failure modes and effects analysis, self-service, performance testing, and chaos engineering.

I found their organization model of SRE coaching and champions to expand the impact of SRE extremely compelling. I was also particularly intrigued by their approach of using Chaos Engineering fire drills to build muscle memory for more junior engineers.

WATCH: Iterative Enterprise SRE Transformation (US 2021) – Vanguard
Christina Yakomin, Site Reliability Engineer, Vanguard
Robbie Daitzman, Vanguard Intermediary Platform – Delivery Lead, Vanguard


The Cure for Toil

I often hear complaints that teams are too busy doing the daily work to improve the daily work. They are not wrong. It is a reality that infects many technology and product teams across the industry. Most teams are buried in a mountain of work that is manual, repetitive and automatable… in other words, teams are overloaded with “toil”.

SRE can help teams reach escape velocity away from the gravity of toil. They do this by chipping away at toil through automation. Each hour of toil removed is an hour of work that can be spent on feature development, automation and other infrastructure improvements.

Michael Winslow at Comcast addresses this sticky issue in his 2021 DevOps Enterprise Summit talk on “Building Confidence in Your SRE Team”. He gives his experience report of onboarding SRE teams into the business. Along the way he watches confidence grow in SRE as they persistently identify and remove toil through automation.

I’m particularly fond of Michael’s visual representation of the erosion of toil over time. Toil is converted into automation blocks that become part of the automation library. The time that automation offsets is handed back to the team to do more strategic and valuable engineering work.

WATCH: Building Confidence in Your SRE Team (US 2021) – Comcast
Michael Winslow, Senior Director, Software Development & Engineering, Comcast

The Clarity of Chaos

As SRE teams buy back time by investing in automation and other toil-reducing efforts, they can undertake more advanced work to increase the reliability of the whole system.

One way to achieve this is to better understand the weaknesses and safety boundaries of a system through Chaos Engineering, which is itself part of SRE.

Chaos Engineering is not about testing broken parts of the system. If you already know something is broken, you should prioritize and fix it. No, Chaos Engineering is about discovering otherwise unknown weaknesses and limits of your working system. By introducing various degrees of failure (chaos) into your system, you learn new ways the service levels (SLOs) can be impacted. This gives you the opportunity to improve the system and gain even higher levels of reliability.

Chaos Engineering applies the scientific method: form a hypothesis about a failure scenario (e.g. network outage, component failures, errors), conduct a test in a deliberate and safe way, take measurements (SLIs), and review the results for learning.
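
The loop of hypothesis, controlled failure, measurement and review can be sketched in a few lines. This is a toy simulation under stated assumptions, not a real chaos tool: the service, its fallback path that sheds every tenth request, and all function names are hypothetical and exist only to show the shape of an experiment.

```python
from typing import Callable


def make_service(primary_up: bool, fallback_enabled: bool) -> Callable[[int], bool]:
    """Build a toy request handler that returns True when a request succeeds."""
    def handle(request_id: int) -> bool:
        if primary_up:
            return True
        # Primary is down. Assumption for this toy model: the fallback path
        # sheds every tenth request, and without a fallback everything fails.
        return fallback_enabled and request_id % 10 != 0
    return handle


def measure_sli(handler: Callable[[int], bool], requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    good = sum(1 for request_id in range(requests) if handler(request_id))
    return good / requests


def run_experiment(slo_target: float, fallback_enabled: bool, requests: int = 1000) -> dict:
    """Hypothesis: the SLI stays at or above slo_target while the primary is down."""
    baseline = measure_sli(make_service(True, fallback_enabled), requests)
    # Inject the failure in a controlled way: take the primary out of service.
    during = measure_sli(make_service(False, fallback_enabled), requests)
    return {
        "baseline_sli": baseline,
        "sli_during_failure": during,
        "hypothesis_held": during >= slo_target,
    }


result = run_experiment(slo_target=0.95, fallback_enabled=True)
print(result)
```

In this made-up scenario the hypothesis does not hold: the fallback keeps the service running, but not well enough to stay within a 95% objective. That is the kind of previously unknown limit an experiment is meant to surface, and the review step is where the team decides what to improve.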

Troy Koss at Capital One and Courtney Nash at Verica dispel some myths about Chaos Engineering and provide practical help for adding Chaos Engineering to your SRE toolbox.

I particularly like the guidance they provide around the prerequisites for running a successful Chaos Engineering program. They also share the tight linkage between Chaos Engineering and SLOs. Specifically, one helps the adoption of the other.

WATCH: Chaos and Reliability: A Surprising Friendship in the Enterprise (US 2021) – Capital One
Troy Koss, Director, Site Reliability Engineering (SRE), Capital One
Courtney Nash, Senior Research Analyst, Verica

SRE is Making a Difference

As you can see from the talks I have shared, SRE continues its expansion outside the orbit of Google and is being adopted by businesses in many different areas. Over and over, these businesses report that SRE is helping them fulfill their DevOps mission: to bridge the gap between development and operations teams and ultimately meet the organization’s need to deliver and operate software-powered systems and products better, faster, safer and happier.

About the Author

Jason Cox

Jason Cox is a champion of DevOps practices, promoting new technologies and better ways of working. His goal is to help businesses and organizations deliver more value, inspiration and experiences to our diverse human family across the globe better, faster, safer, and happier. He currently leads SRE teams at Disney and is the coauthor of the book Investments Unlimited. He resides in Los Angeles with his wife and their children.
