Jason Cox is a Director of Systems Reliability Engineering (SRE) and DevOps Enterprise Summit programming committee member. He is also a coauthor of Investments Unlimited: A Novel About DevOps, Security, Audit Compliance, and Thriving in the Digital Age (available September 13, 2022).
What is Site Reliability Engineering?
Site Reliability Engineering (SRE) is often considered an expression of DevOps. Like DevOps, SRE seeks to bridge the traditional gap between development and operations teams with the goal of improving the organization’s ability to ship and operate software better, faster, safer and happier. DevOps is the philosophy that informs the practice, and SRE is a practice that helps deliver on the vision.
This close linkage between DevOps and SRE has resulted in growing numbers of talks and publications produced in the DevOps community about SRE. DevOps Enterprise Summit is no exception.
Because SRE is a particular interest of mine, I have started curating a collection of talks that I found especially informative and helpful. Whether you already have a mature SRE organization in place or are just starting out on your SRE journey, I believe you will find these talks compelling and inspiring.
The Origin of SRE
I would be remiss if I didn’t point out the origin of SRE.
Google, specifically Ben Treynor Sloss, is credited with the creation of SRE as a functional discipline in 2003. Google’s pursuit to help product teams get features to market quickly but in a way that didn’t jeopardize the reliability or correctness of the services gave rise to Google SRE.
At Google, SRE is organized centrally. Presented as a scarce resource, it is selectively deployed into the most mission-critical products at the request of those product teams. Google SRE leaders Jennifer Petoff and Christof Leng give a great overview of how SRE is deployed at Google in this 2021 DOES talk:
WATCH: How Google SRE and Developers Work Together (US 2021)
Jennifer Petoff, PhD, Director, SRE Education, Google
Christof Leng, PhD, SRE Engagements Engineering Lead, Google
Google has also published several books on SRE, including a workbook, all of which are available online.
The Language of Reliability
As the name implies, SRE focuses on reliability. But what is reliability?
There is a tendency to simply focus on traditional availability that comes packaged in a neat little binary bucket of “up” or “down”. Service interruptions are recorded as total minutes of downtime, and postmortems are full of lore about fights waged over when the “outage” began and ended. Unfortunately, that approach completely misses the mark for our customers.
Understanding true reliability means understanding the true expectation of the customer or business that uses the system. A full-service outage that is restored within 24 hours may be acceptable, but getting otherwise undetected error responses more than 0.1% of the time may not be. A delay of more than 15ms in a response may be unacceptable for one service, while a two-day data pipeline update may be fine for another.
SRE understands the complexity of reliability and uses a specific language to crisply define and operate reliability at scale. The vocabulary of this language is based on “service levels”, specifically service level indicators (SLI) and service level objectives (SLO).
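To make the vocabulary concrete, here is a minimal sketch in Python of how an SLI can be measured against an SLO. It borrows the 0.1% error threshold from the example above and expresses it as a 99.9% success objective. The class name, method names and request counts are hypothetical and chosen only for illustration; they do not come from any of the talks or from a particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvailabilitySLO:
    """A hypothetical SLO on the fraction of requests that must succeed."""

    target: float  # 0.999 means at most 0.1% of requests may fail

    def sli(self, good: int, total: int) -> float:
        """The SLI: the measured fraction of successful requests."""
        return good / total if total else 1.0

    def is_met(self, good: int, total: int) -> bool:
        """True when the measured SLI reaches the objective."""
        return self.sli(good, total) >= self.target

    def allowed_failures(self, total: int) -> float:
        """How many failed requests the objective tolerates for this volume."""
        return (1.0 - self.target) * total


# Illustrative numbers only: one million requests, 1,500 of which failed.
slo = AvailabilitySLO(target=0.999)
total, good = 1_000_000, 998_500
print(f"SLI: {slo.sli(good, total):.4%}")
print(f"Failures tolerated: {slo.allowed_failures(total):.0f}")
print("SLO met" if slo.is_met(good, total) else "SLO missed")
```

The point of the sketch is that the service is never simply “up” or “down”: the indicator is a measured ratio, and the objective is the threshold the customer actually cares about. The number of tolerated failures is the quantity usually referred to as an error budget.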
David Stanke of Google and Adam Shake of MediaMath talk about the mission of SRE to protect and improve availability, latency, performance and capacity, but do so in a principled way using the language of SLI, SLO and Error Budgets. I particularly enjoyed Adam Shake’s experience report of moving systems administrators into SRE and the impact that made on the rest of the business.
WATCH: Service Level Objectivity: Improving Mutual Understanding Through the Language of SRE (MediaMath and Google) (Las Vegas 2020)
Summary: A great introduction to SRE and how MediaMath went from SREs called SysAdmins to real SREs
Adam Shake, Director of Site Reliability Engineering, MediaMath Source
David Stanke, Developer Advocate, Google
Adam references a DevOpsDays Chicago workshop, “Site Reliability Engineering (SRE) and the Art of SLOs” by Jennifer Petoff and Nathen Harvey from Google. Adam describes how this workshop was instrumental in their adoption of SRE and SLOs. If you are just starting on your SLO journey, this could be a helpful tool for you as well.
The Scarcity of SRE
Anyone who starts on the SRE journey will quickly discover the challenge of meeting demand. With any degree of success, you will quickly have more need for SRE than you will have SREs. Recruiting and training efforts tend to lag behind the growing demand for more SRE help.
So what do you do? How do you deploy for greatest effect?
Stephen Thorne, a Staff SRE at Google, provides some guidance on how to prioritize the next engagements for SRE.
His simple framework helps you identify application targets based on being Mission Critical, Operable and Mutable. This allows the scarce resource of SRE to be applied to the highest value targets. Stephen gives several examples and provides additional advice on how to scale your SRE team, like lowering SLO targets when you can.
WATCH: When Is SRE Right For You? (London 2019)
Stephen Thorne, Staff Site Reliability Engineer, Google
The Adoption of SRE
How do organizations get started with SRE?
Christina Yakomin and Robbie Daitzman at Vanguard tell their story of adopting SRE. Their tale begins with a massive cloud migration effort that demanded serious refactoring of legacy monolithic services into more cloud-native platforms and microservices. Automation and measurement were key to being successful. By shifting to SRE-driven SLIs, SLOs and error budgets, they were able to answer questions like, “What is healthy?” This allowed them to better comprehend and use failure modes and effects analysis, self-service, performance testing, and chaos engineering.
I found their organization model of SRE coaching and champions to expand the impact of SRE extremely compelling. I was also particularly intrigued by their approach of using Chaos Engineering fire drills to build muscle memory for more junior engineers.
WATCH: Iterative Enterprise SRE Transformation (US 2021) – Vanguard
Christina Yakomin, Site Reliability Engineer, Vanguard
Robbie Daitzman, Vanguard Intermediary Platform – Delivery Lead, Vanguard
The Cure for Toil
I often hear complaints that teams are too busy doing the daily work to improve the daily work. They are not wrong. It is a reality that infects many technology and product teams across the industry. Most teams are buried in a mountain of work that is manual, repetitive and automatable… in other words, teams are overloaded with “toil”.
SRE can help teams reach escape velocity away from the gravity of toil. They do this by chipping away at toil through automation. Each hour of toil removed is an hour of work that can be spent on feature development, automation and other infrastructure improvements.
Michael Winslow at Comcast addresses this sticky issue in his 2021 DevOps Enterprise Summit talk on “Building Confidence in Your SRE Team”. He gives his experience report of onboarding SRE teams into the business. Along the way he watches confidence grow in SRE as they persistently identify and remove toil through automation.
I’m particularly fond of Michael’s visual representation of the erosion of toil over time. Toil is converted into automation blocks that become part of the automation library. The time that automation offsets is handed back to the team to do more strategic and valuable engineering work.
WATCH: Building Confidence in Your SRE Team (US 2021) – Comcast
Michael Winslow, Senior Director, Software Development & Engineering, Comcast
The Clarity of Chaos
As SRE teams buy back time by investing in automation and other toil-reducing efforts, they can undertake more advanced work to increase the reliability of the whole system.
One way to achieve this is by better understanding the weaknesses and safety boundaries of a system through Chaos Engineering, which is part of SRE.
Chaos Engineering is not about testing broken parts of the system. If you already know something is broken, you should prioritize and fix it. No, Chaos Engineering is about discovering otherwise unknown weaknesses and limits of your working system. By introducing various degrees of failure (chaos) into your system, you learn new ways the service levels (SLOs) can be impacted. This gives you the opportunity to improve the system and gain even higher levels of reliability.
Chaos Engineering uses the scientific method of creating a hypothetical scenario (e.g. network outage, component failures, errors), conducting a test in a deliberate and safe way, taking measurements (SLIs) and reviewing the results for learning.
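The loop described above (hypothesis, controlled failure, measurement, review) can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not a real chaos tool: the service, its retry behavior, the injected failure rate and the 99.9% objective are all hypothetical, and the "failure" is just a seeded random draw so the experiment is repeatable and safe to run.

```python
import random


def call_dependency(rng: random.Random, failure_rate: float) -> bool:
    """Simulated downstream call that fails with the injected probability."""
    return rng.random() >= failure_rate


def handle_request(rng: random.Random, failure_rate: float, retries: int) -> bool:
    """Toy service: the request succeeds if any attempt at the dependency does."""
    return any(call_dependency(rng, failure_rate) for _ in range(retries + 1))


def run_experiment(failure_rate: float, retries: int,
                   requests: int = 10_000, slo: float = 0.999,
                   seed: int = 42) -> dict:
    """Inject failures, measure the success-rate SLI, and compare it to the SLO."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    good = sum(handle_request(rng, failure_rate, retries) for _ in range(requests))
    sli = good / requests
    return {"sli": sli, "hypothesis_held": sli >= slo}


# Hypothesis: with 10% of dependency calls failing, the service still meets
# its SLO. Test it with and without retries and review what was learned.
for retries in (0, 3):
    result = run_experiment(failure_rate=0.10, retries=retries)
    print(f"retries={retries}: SLI={result['sli']:.4f}, "
          f"hypothesis held: {result['hypothesis_held']}")
```

In a real system the injected failure would be an actual fault, such as a dropped network link or a stopped component, and the SLI would come from production telemetry. The shape of the experiment stays the same: state what you expect, introduce the failure deliberately, measure, and learn from the gap.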
Troy Koss at Capital One and Courtney Nash at Verica dispel some myths about Chaos Engineering and offer practical help for adding Chaos Engineering to your SRE toolbox.
I particularly like the guidance they provide around the prerequisites for running a successful Chaos Engineering program. They also share the tight linkage between Chaos Engineering and SLOs. Specifically, one helps the adoption of the other.
WATCH: Chaos and Reliability: A Surprising Friendship in the Enterprise (US 2021) – Capital One
Troy Koss, Director, Site Reliability Engineering (SRE), Capital One
Courtney Nash, Senior Research Analyst, Verica
SRE is Making a Difference
As you can see from the talks I have shared, SRE continues its expansion outside the orbit of Google and is being adopted by businesses in many different areas. Over and over, these businesses report that SRE is helping them fulfill their DevOps mission to bridge the gap between development and operations teams and ultimately meet the organization’s need to deliver and operate software-powered systems and products better, faster, safer and happier.
About the Author
Jason Cox is a Director of Systems Reliability Engineering (SRE). He majored in Computer Science and Electrical Engineering at the University of Tulsa and earned a Bachelor’s of Computer Science at the American Institute of Computer Sciences in Birmingham, Alabama. After graduation, Jason transitioned into Civil Engineering and helped transition from manual engineering and drafting to computer-aided design (CAD) building commercial and residential subdivisions, infrastructure, ponds and bridges for nearly 7 years. He later co-founded FamilyNet, Inc., a local internet service provider and web hosting startup, managing datacenters and business operations until it was sold. He relocated with his family to sunny Los Angeles to work on an M.Div. degree and shortly after took a job as a Systems Engineer in 2005. He is currently responsible for leading software engineers and SRE teams to provide platforms and embedded DevOps support for physical and cloud-based infrastructure, systems and applications for business segments.
Jason is a champion of DevOps practices, collaboration, curiosity, automation, agile and lean methodologies. He has had the privilege of speaking at several tech conferences and enjoys writing on leadership and DevOps topics. He is the author of iCurlHTTP, an iOS app for those who want to cURL on the go. He currently resides in Los Angeles with his wife and their children.
He is also a coauthor of Investments Unlimited: A Novel About DevOps, Security, Audit Compliance, and Thriving in the Digital Age, coming out on September 13, 2022.