April 7, 2022
Site Reliability Engineering (SRE) is often considered an expression of DevOps. Like DevOps, SRE seeks to bridge the traditional gap between development and operation teams with the goal of improving the organization’s ability to ship and operate software better, faster, safer and happier. DevOps is the philosophy that informs the practice and SRE is a practice that helps deliver on the vision.
This close linkage between DevOps and SRE has resulted in growing numbers of talks and publications produced in the DevOps community about SRE. DevOps Enterprise Summit is no exception.
Because this is an area of particular interest to me, I have started curating a collection of talks that I found especially informative and helpful. Whether you have a mature SRE organization in place already or are just starting out on your SRE journey, I believe you will find these talks compelling and inspiring.
I would be remiss if I didn’t point out the origin of SRE.
Google, specifically Ben Treynor Sloss, is credited with the creation of SRE as a functional discipline in 2003. Google’s pursuit to help product teams get features to market quickly but in a way that didn’t jeopardize the reliability or correctness of the services gave rise to Google SRE.
At Google, SRE is organized centrally. Presented as a scarce resource, it is selectively deployed into the most mission-critical products at the request of those product teams. Google SRE leaders Jennifer Petoff and Christof Leng give a great overview of how SRE is deployed at Google in this 2021 DOES talk:
How Google SRE and Developers Work Together (US 2021)
Jennifer Petoff, PhD, Director, SRE Education, Google
Christof Leng, PhD, SRE Engagements Engineering Lead, Google
Google has also written several books on SRE, including a workbook, all of which are available online.
As the name implies, SRE focuses on reliability. But what is reliability?
There is a tendency to focus simply on traditional availability, which comes packaged in a neat little binary bucket of “up” or “down”. Service interruptions are recorded as total minutes of downtime, and postmortems are full of lore about fights waged over when the “outage” began and ended. Unfortunately, that approach completely misses the mark for our customers.
Understanding true reliability means understanding the true expectation of the customer or business that uses the system. A full-service outage that is restored within 24 hours may be acceptable, but otherwise undetected error responses more than 0.1% of the time may not be. A delay of more than 15 ms in a response may be unacceptable for one service, while a two-day data pipeline update may be fine for another.
SRE understands the complexity of reliability and uses a specific language to crisply define and operate reliability at scale. The vocabulary of this language is based on “service levels”, specifically service level indicators (SLI) and service level objectives (SLO).
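To make the vocabulary concrete, here is a minimal Python sketch of an SLI measured against an SLO. The `Slo` class, the helper names and the request counts are hypothetical, chosen only for illustration. The 0.1% error threshold and the 15 ms latency threshold echo the examples above, and the 99.9% latency target is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    """A service level objective: the target an SLI is expected to meet."""
    name: str
    target: float  # minimum acceptable fraction of good events, e.g. 0.999


def sli(good_events: int, total_events: int) -> float:
    """A service level indicator expressed as the fraction of good events."""
    if total_events == 0:
        return 1.0  # no traffic in the window, so nothing failed
    return good_events / total_events


def meets(slo: Slo, measured: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return measured >= slo.target


# Hypothetical numbers for one measurement window.
total_requests = 1_000_000
failed_requests = 1_500   # returned an error response
slow_requests = 400       # took longer than 15 ms

availability = sli(total_requests - failed_requests, total_requests)
latency = sli(total_requests - slow_requests, total_requests)

error_slo = Slo("successful responses", 0.999)      # at most 0.1% errors
latency_slo = Slo("responses within 15 ms", 0.999)  # assumed target

for slo, measured in ((error_slo, availability), (latency_slo, latency)):
    status = "met" if meets(slo, measured) else "missed"
    print(f"{slo.name}: SLI={measured:.4%}, SLO={slo.target:.1%} ({status})")
```

With these made-up numbers the service was never “down”, yet the error SLO is missed while the latency SLO is met, which is the kind of distinction a binary up/down view hides.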
David Stanke of Google and Adam Shake of MediaMath talk about SRE's mission to protect and improve availability, latency, performance and capacity, and to do so in a principled way using the language of SLIs, SLOs and error budgets. I particularly enjoyed Adam Shake’s experience report on moving systems administrators into SRE and the impact that had on the rest of the business.
WATCH: Service Level Objectivity: Improving Mutual Understanding Through the Language of SRE Accepted (MediaMath and Google) (Las Vegas 2020)
Adam Shake, Director of Site Reliability Engineering, MediaMath Source
David Stanke, Developer Advocate, Google
Summary: A great introduction to SRE and how MediaMath went from SREs called SysAdmins to real SREs.
Adam references a DevOpsDays Chicago workshop, “Site Reliability Engineering (SRE) and the Art of SLOs” by Jennifer Petoff and Nathen Harvey from Google. Adam describes how this workshop was instrumental in their adoption of SRE and SLOs. If you are just starting on your SLO journey, this could be a helpful tool for you as well.
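For readers new to error budgets, the arithmetic is simple: the budget is whatever the SLO leaves over, so a 99.9% target tolerates 0.1% bad events in the window. The sketch below is an illustration under assumed numbers, not something taken from the talk or the workshop; the function names and the request counts are hypothetical.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates in the window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% target leaves no budget at all.
        return 0.0 if bad_events == 0 else float("-inf")
    return (budget - bad_events) / budget


# Hypothetical window: 1,000,000 requests against a 99.9% SLO, 250 of them bad.
remaining = budget_remaining(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"error budget remaining: {remaining:.0%}")

# The same idea in time: minutes of full outage a 99.9% SLO allows in 30 days.
minutes_in_30_days = 30 * 24 * 60
print(f"allowed downtime: {error_budget(0.999, minutes_in_30_days):.1f} minutes")
```

A team that still has budget left can afford to take more risk with releases; a team that has overspent it has a clear signal to prioritize reliability work.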
Anyone who starts on the SRE journey will quickly discover the challenge of meeting demand. With any degree of success, you will quickly have more need for SRE than you will have SREs. Recruiting and training efforts tend to lag behind the growing demand for more SRE help.
So what do you do? How do you deploy for greatest effect?
Stephen Thorne, Staff SRE at Google, provides some guidance on how to prioritize the next engagements for SRE.
His simple framework helps you identify application targets based on being Mission Critical, Operable and Mutable. This allows the scarce resource of SRE to be applied to the highest value targets. Stephen gives several examples and provides additional advice on how to scale your SRE team, like lowering SLO targets when you can.
WATCH: When Is SRE Right For You? (London 2019)
Stephen Thorne, Staff Site Reliability Engineer, Google
How do organizations get started with SRE?
Christina Yakomin and Robbie Daitzman at Vanguard tell their story of adopting SRE. Their tale begins with a massive cloud migration effort that demanded serious refactoring of legacy monolithic services into more cloud-native platforms and microservices. Automation and measurement were key to being successful. By shifting to SRE-driven SLIs, SLOs and error budgets, they were able to answer questions like, “What is healthy?” This allowed them to better comprehend and use failure modes and effects analysis, self-service, performance testing, and chaos engineering.
I found their organization model of SRE coaching and champions to expand the impact of SRE extremely compelling. I was also particularly intrigued by their approach of using Chaos Engineering fire drills to build muscle memory for more junior engineers.
WATCH: Iterative Enterprise SRE Transformation (US 2021) – Vanguard
Christina Yakomin, Site Reliability Engineer, Vanguard
Robbie Daitzman, Vanguard Intermediary Platform – Delivery Lead, Vanguard
I often hear complaints that teams are too busy doing the daily work to improve the daily work. They are not wrong. It is a reality that infects many technology and product teams across the industry. Most teams are buried in a mountain of work that is manual, repetitive and automatable… in other words, teams are overloaded with “toil”.
SRE can help teams reach escape velocity away from the gravity of toil. SREs do this by chipping away at toil through automation. Each hour of toil removed is an hour of work that can be spent on feature development, automation and other infrastructure improvements.
Michael Winslow at Comcast addresses this sticky issue in his 2021 DevOps Enterprise Summit talk on “Building Confidence in Your SRE Team”. He gives his experience report of onboarding SRE teams into the business. Along the way he watches confidence grow in SRE as they persistently identify and remove toil through automation.
I’m particularly fond of Michael’s visual representation of the erosion of toil over time. Toil is converted into automation blocks that become part of the automation library. The time that automation offsets is handed back to the team to do more strategic and valuable engineering work.
WATCH: Building Confidence in Your SRE Team (US 2021) – Comcast
Michael Winslow, Senior Director, Software Development & Engineering, Comcast
As SRE teams are able to buy back time through the investment of automation and other toil reducing efforts, more advanced efforts can be undertaken to increase the reliability of the whole system.
One way to achieve this is to better understand the weaknesses and safety boundaries of a system through Chaos Engineering, which is part of SRE.
Chaos Engineering is not about testing broken parts of the system. If you already know something is broken, you should prioritize and fix it. No, Chaos Engineering is about discovering otherwise unknown weaknesses and limits of your working system. By introducing various degrees of failure (chaos) into your system, you learn new ways the service level objectives (SLOs) can be impacted. This gives you the opportunity to improve the system and gain even higher levels of reliability.
Chaos Engineering follows the scientific method: form a hypothesis about a failure scenario (e.g. network outage, component failures, errors), conduct a test in a deliberate and safe way, take measurements (SLIs) and review the results for learning.
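As a rough illustration of that loop, the toy Python simulation below states a hypothesis, injects failures into a fake dependency, measures an SLI and compares it with an SLO. Everything in it is hypothetical: the retrying "service", the failure rates and the 99% target are assumptions made for the sketch, and a real experiment would run against a real system with a limited blast radius rather than a random number generator.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    hypothesis: str
    measured_sli: float
    slo_target: float

    @property
    def hypothesis_held(self) -> bool:
        """The hypothesis holds if the SLI stayed at or above the SLO."""
        return self.measured_sli >= self.slo_target


def call_with_retry(dependency: Callable[[], bool], attempts: int) -> bool:
    """Toy service: a request succeeds if any attempt on the dependency does."""
    return any(dependency() for _ in range(attempts))


def run_experiment(failure_rate: float, attempts: int, requests: int,
                   slo_target: float, seed: int = 0) -> ExperimentResult:
    """Inject dependency failures at `failure_rate` and measure the success SLI."""
    rng = random.Random(seed)  # seeded so the simulated run is repeatable

    def flaky_dependency() -> bool:
        return rng.random() >= failure_rate  # the injected fault

    good = sum(call_with_retry(flaky_dependency, attempts) for _ in range(requests))
    return ExperimentResult(
        hypothesis=(f"With {failure_rate:.0%} dependency failures and "
                    f"{attempts} attempt(s), the success SLI stays >= {slo_target:.1%}"),
        measured_sli=good / requests,
        slo_target=slo_target,
    )


# Review the results: compare the same fault with and without retries.
for attempts in (1, 3):
    result = run_experiment(failure_rate=0.2, attempts=attempts,
                            requests=10_000, slo_target=0.99)
    print(result.hypothesis)
    print(f"  measured SLI: {result.measured_sli:.2%}, held: {result.hypothesis_held}")
```

The point of the sketch is the shape of the experiment rather than the numbers: the hypothesis is written down before the fault is injected, and the SLI decides whether it held.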
Troy Koss at Capital One and Courtney Nash at Verica dispel some myths about Chaos Engineering and offer practical help for adding Chaos Engineering to your SRE toolbox.
I particularly like the guidance they provide around prerequisites for running a successful Chaos Engineering program. They also share the tight linkage between Chaos Engineering and SLOs. Specifically, one helps the adoption of the other.
WATCH: Chaos and Reliability: A Surprising Friendship in the Enterprise (US 2021) – Capital One
Troy Koss, Director, Site Reliability Engineering (SRE), Capital One
Courtney Nash, Senior Research Analyst, Verica
As you can see from the talks collected here, SRE continues its expansion outside the orbit of Google and is being adopted by businesses in many different areas. Over and over, these businesses report that SRE is helping them fulfill their DevOps mission to bridge the gap between development and operations teams and ultimately meet the organization’s need to deliver and operate software-powered systems and products better, faster, safer and happier.
Jason Cox is a champion of DevOps practices, promoting new technologies and better ways of working. His goal is to help businesses and organizations deliver more value, inspiration and experiences to our diverse human family across the globe better, faster, safer, and happier. He currently leads SRE teams at Disney and is the coauthor of the book Investments Unlimited. He resides in Los Angeles with his wife and their children.