
April 5, 2022

How Google SRE and Developers Work Together

By IT Revolution

One of the prominent themes in so many DevOps Enterprise Summits has been Site Reliability Engineering, which I’ve always been dazzled by. There are so many interesting things about these principles and practices that Google pioneered back in 2003. I think it’s one of the most incredible examples of how one can create a self-balancing system that helps product teams get features to market quickly, but in a way that doesn’t jeopardize the reliability and correctness of the services they create. For over a decade, I’ve wanted to better understand why Google chose a functional orientation for this: Site Reliability Engineers.

To this day, thousands of Google SREs are still in one organization reporting to Ben Treynor Sloss, VP of 24/7 Engineering, which includes SRE, very purposefully outside of the product organizations. Undoubtedly, this creates certain benefits as well as certain complications. I can’t think of two better people to explain the implications of this to technology leaders. Dr. Jennifer Petoff is the global director of SRE education and is one of the most widely cited authors in this space. She will be co-presenting with Dr. Christof Leng, SRE engagements engineering lead, who has managed and worked on various parts of Google services, including Cloud, Ads, and internal developer tooling. Here are Dr. Petoff and Dr. Leng.

Dr. Christof Leng:

Welcome to our talk about the collaboration between Google’s Site Reliability Engineers and Developers. I’m Christof Leng, and I lead three teams in a horizontal function called SRE engagements, based in Google’s Munich office. These teams work on improvements for all of Google SRE. One of my responsibilities is to maintain the SRE engagement model, the collection of policies that define how collaboration between SRE and its Developer partners works, which is what we’re going to talk about today. I’ve been working at Google since 2014 on various products. And before Google, I was a distributed systems researcher. Over to Jennifer.

Dr. Jennifer Petoff:

Great. Thanks so much, Christof. Hi, everyone. I’m Jennifer Petoff but my friends actually call me Dr. J. I’m Google’s director of SRE education and I’m based in Dublin, Ireland. I’ve actually been at Google for 14 years. I like to think that time flies when you’re having fun. You may be wondering, why Dr. J? Well, I have a PhD in Chemistry and started my career in the lab working on things that could start on fire or explode if you expose them to air. I currently lead the global SRE education team or SRE EDU as we affectionately call it. And I’m also one of the co-authors of the original SRE book that we published back in 2016. Of course, when we aren’t living in a pandemic, I love to travel and I’m a part-time travel blogger at Sidewalk Safari.

All right. So now that we’ve been properly introduced, let’s start with some context on complexity. So if you think about it, at more than 2 billion lines of code, it’s not an exaggeration to say that Google’s production environment might be one of the most complex integrated systems ever created. It’s also highly interconnected, which is a key enabler but also creates many challenges at this scale. The systems that you run may be smaller than that, but they probably depend a lot on third-party code, external dependencies, and maybe even a cloud platform or two, of course. While this is a case study about Google and shouldn’t be applied verbatim to your organization, the challenges of scaling with complexity are universal.

So given the scale and complexity, that raises the question, how do you run a planet-scale system? How do you keep it stable? How do you add new functionality to it? So let’s look at balance. Reliability and velocity are oftentimes at odds with each other in the traditional software development model. Developers push for velocity to quickly launch new features and Ops pushes for reliability to slow things down. This is inherent to the way the incentive structure is set up. But the reality is that you need both aspects. You need reliability and you need to also move fast. So how do you find the right balance and actually resolve this potential conflict?

So one way to do it is to sidestep the problem. And what we’ve done at Google is actually create a discipline that balances these competing concerns of reliability and feature velocity. Ben Treynor Sloss, the founder of SRE at Google, actually describes SRE as what happens when you ask a software engineer to design and run an operations function. So SREs come from many different backgrounds, but what they have in common is a mix of software engineering and systems engineering backgrounds. We know how to build and how to run systems, and we deeply care about both aspects.

So SREs focus on reliability to meet the availability targets our users need while maximizing long-term feature velocity. SREs also focus on maintainability, so ensuring that we aren’t feeding the machines with human toil, where we define toil as work that’s manual, repetitive, automatable, tactical (so interrupt-driven and reactive), and gives no enduring value. Toil also grows linearly with service growth, which can be problematic. Of course, efficiency is also important, so using engineering time and machine resources as efficiently as possible. Now, let me turn it over to Christof to tell you a little bit more about the scope of what we work on.

Dr. Christof Leng:

Thank you, Jennifer. So SRE at Google, unlike many similar functions at other companies, is one central organization that takes care of many different areas across Google engineering, from user-facing products that everyone knows like Search and YouTube to internal infrastructure like network or developer tools that our users never interact with directly. It has been tested, applied, and adapted to many different contexts across Google for almost 20 years. The overall organization is more than 3,000 SREs nowadays, which is big. However, Google SRE is grouped into what we call product areas, groups of related services and products, and each has, let’s say, 50 to 300 SREs.

And each of these product areas partners with a developer organization that develops these products, but the developer organization is typically much, much larger. And this asymmetry is intentional: it keeps SRE focused on its core mission. It also means that you cannot offload all operational work from Dev to SRE because that would easily overwhelm the much smaller SRE teams. The important part here is that SRE receives its headcount from its Dev partner org, typically in the context of individual engagements. These engagements happen at a team level, but the engagement planning and funding is done at the Product Area (PA) level.

An engagement is a peer relationship between an SRE and a Dev team. It’s typically scoped around a specific service or product and relevant production assets or end-user interactions, so a related group of things from an engineering perspective. And such an engagement is not a one-way street: it requires significant contributions from both sides, from both SRE and Dev. A common misconception here is that SRE only comes in after the service is implemented and launched. But actually, an SRE engagement can happen at any time in the service life cycle or even cover it from start to finish. Because each service is different (we talked about network, developer tools, and customer-facing products) and every lifecycle stage has different needs, the types of engagements are diverse. We’ll cover that in more detail later. However, what they all have in common is that they are scoped around SRE’s mission of reliability, velocity, maintainability, and efficiency, and a shared set of principles. Over to you, Jennifer.

Dr. Jennifer Petoff:

Thanks so much, Christof. So of course these principles are called the SRE engagement model, and it really describes how SRE and Dev collaboration actually works. Let’s walk through that in a little bit more detail. So the first thing to note is that SRE support is not automatic. SRE is a scarce resource by design. And in fact, many services at Google are built and operated by their Dev team with no SRE support at all. A few things to call out. SRE teams are funded by Dev, so it’s their choice whether to invest in SRE or not. And once transferred, SRE is responsible for that headcount. Production excellence is a multi-year investment, so engagements are not considered in isolation, but at the SRE product area level. Building an SRE team takes a minimum size, typically two sites with at least six SREs each, and time to build up that deep understanding of the services that the team is responsible for.

The service itself and its reliability are ultimately owned by the Dev team, even if the day-to-day production authority rests with SRE. Responsibility for having a reliable service is not offloaded onto SRE or thrown over the fence, so to speak. SRE’s job is to help the Dev team to meet their reliability and velocity goals and to meet the needs of our users first and foremost. So starting and continuing with an SRE engagement is a joint decision for both the Dev and SRE organizations and the teams involved. Both sides need to agree to start an engagement, and either side can end that partnership. Dev can’t force the service onto SRE, and SRE has to give the service back when the Devs want that. It’s important to understand that at Google, both sides are doing this because they want it: they recognize the value of that partnership.

So what should SRE work on? If SRE support is not universal, how do we decide what they should work on? So first and foremost, it always needs to align with SRE’s mission. SRE is a specialist role. It does specific things. What SREs work on should have a clear value proposition. So the idea is that SRE should only take on work that SRE can do significantly more efficiently than anyone else. What’s the value add? So if the work can also be done by the Dev team, and just as easily, Dev should keep that headcount to staff additional developers, giving them more flexibility and less overhead. The work that SREs do must also be impactful, interesting, and challenging for the SRE teams. So there have been circumstances where Devs think SRE support is someone holding the pager for you, but SRE is not an Ops team.

Our mission is not to handle operations, but to improve the inherent reliability of systems through engineering. Doing Ops work is a means to an end. So we want to understand what breaks, how to fix it, and how to fix it once and for all. And of course, it’s important to point out that Ops is not a zero-sum game anyway. Instead of moving operational responsibilities from one place to another, an SRE engagement should focus on reducing the overall Ops workload. So it’s a win for everyone. So overall we aim at finding engineering opportunities that lead to sustained long-term value in terms of service health. And these are often not obvious to a typical engineer; you basically need a specialist.

So how do we make SRE’s work more impactful? Now that you’ve found what to work on, how do you make sure you’re successful? It’s important to think of managing a service as a shared endeavor. SRE and Dev bring different expertise, but they work towards a common goal, the success of that service. And to avoid conflicts, it’s important to agree on what success means beforehand. So what is success in our users’ eyes? SRE and Dev maintain a shared roadmap with goals that can be objectively measured and tracked, and this includes regular reviews of both service health and priorities. Using Service Level Objectives, or SLOs, and error budgets, which are fundamental SRE principles, is a standard technique for ensuring both objectivity and the balancing of reliability and velocity, and for getting everybody rowing in the same direction. SLOs and error budgets promote a common understanding of reliability goals and a common language, and they are basically a tool to measure success.
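To make the idea of an error budget concrete, here is a minimal sketch in Python. It is not Google's implementation. It assumes a request-based SLO, and the class name `ErrorBudget`, its fields, and the example numbers are all hypothetical. The budget is simply the share of requests that the SLO allows to fail in a measurement window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one measurement window."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that missed the SLO in the window

    @property
    def allowed_failures(self) -> float:
        # The budget: the fraction of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Positive: budget left to spend. Negative: the SLO was missed.
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        return self.remaining < 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"remaining budget: {budget.remaining:.0f}")
    print(f"exhausted: {budget.exhausted}")
```

With a 99.9% target and one million requests, about 1,000 failures are allowed, so 400 failures leave roughly 600 in the budget. What a team does when the budget runs out, for example pausing feature launches, is a policy that SRE and Dev would agree on beforehand; the sketch does not encode one.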

So once you’ve achieved your defined goals, it’s time to think about adjusting investments. Is this still the most impactful area for SRE to work on? Maybe the team can engage with new services, maybe new topics have come up with this particular service. Maybe the scope is broadened and you need more headcount. It’s also possible that sometimes it doesn’t work out for whatever reason. We’ll talk a little bit more about that later, but depending on the situation you may want to double down on the investment or change SRE’s focus to a different topic. Whatever happens, engagements and their funding should be regularly reviewed. Headcount and resources should always be allocated to the most impactful work. All right, so SRE is about focusing on the important stuff.

For SRE there’s always more to do than there is time. So it’s really critical to focus on what matters most. SRE engagement should be scoped to a set of services with clear correlation and boundaries. You can’t boil the ocean, of course. SREs don’t work on production health for the sake of engineering merit. They’re an advocate for the user. They’re a champion for the user and for the user’s experience. So look at reliability end to end with customer-centric SLOs. However, there are also infrastructure improvements that may not be directly visible to the user. Things like converging towards standard platforms. Standard production platforms are important because they help you to move faster and really increase that feature velocity. It reduces the cost of implementing horizontal operating services and moving them between teams. Cognitive load from needing to know many different tools, and many different architectures is a major bottleneck for SRE teams to scale.

So standardization really helps with all of that. And finally, highly customized infrastructure also makes it harder for Devs to understand production. But to be able to build a reliable system, the Devs need decent production knowledge, of course. So SRE should always teach teams to fish rather than providing fish. Otherwise, there’s a risk that SRE will become a human abstraction layer for production. And behind that wall that’s an invitation for complexity to flourish. You can’t build a wall and then complain about a throw-it-over-the-wall mentality. Now I’m going to turn it over to Christof to talk about Engagement Types.

Dr. Christof Leng:

Thank you, Jennifer. Okay. Now that we’ve learned about the core principles, how do we apply them in practice? As I said earlier, SRE can engage in any phase of the service lifecycle. We’ve all seen how important it is to integrate testing, security, and other topics earlier in the lifecycle, what they call “shift left”. The same applies to reliability. During design and implementation, you make many decisions that are incredibly hard or practically impossible to change later: architecture, technology, failover capabilities, and so on. When a production expert has a voice at the table, you can fix problems before they actually happen. So it’s super important to have SREs in the design discussions early on. For example, SLOs are often only discussed once the implementation is done. But does the architecture you picked scale to the number of nines of reliability your customers expect? If not, you either have to redesign your whole system or disappoint your users. Not a great decision to make.
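The question about nines can be turned into simple arithmetic. The following Python sketch is an illustration, not something shown in the talk. The function names are hypothetical, and `serial_availability` assumes that dependencies fail independently and that a request needs all of them, which is a simplification.

```python
import math
from typing import Sequence


def availability_from_nines(nines: int) -> float:
    """Convert a number of nines to an availability, e.g. 3 -> 0.999."""
    return 1.0 - 10.0 ** (-nines)


def allowed_downtime_minutes(availability: float, days: int = 30) -> float:
    """Minutes of full outage the target tolerates over a window of `days`."""
    return (1.0 - availability) * days * 24 * 60


def serial_availability(dependencies: Sequence[float]) -> float:
    """Best-case availability of a request path that needs every dependency.

    Assumes independent failures, so the availabilities multiply.
    """
    return math.prod(dependencies)


if __name__ == "__main__":
    target = availability_from_nines(3)
    print(f"target: {target:.3%}")
    print(f"downtime per 30 days: {allowed_downtime_minutes(target):.1f} minutes")
    # Three hard dependencies that each offer three nines.
    path = serial_availability([0.999, 0.999, 0.999])
    print(f"path availability: {path:.4%}, meets target: {path >= target}")
```

Three nines leave about 43 minutes of downtime per 30 days. A request path that needs three independent dependencies at three nines each lands at roughly 99.7%, so under these assumptions that architecture cannot meet a three-nines SLO no matter how well it is operated. That is the kind of mismatch that is cheap to catch in a design review and expensive to fix after launch.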

Or your architecture is actually much more sophisticated than what would be needed to satisfy your users. Then you have probably wasted precious time and resources that could have been invested in time to market or additional features. So having these conversations early on, with the user lens on and with production knowledge, can really help you to be more effective in development. But every life cycle stage and service is different. We can’t have a one-size-fits-all approach to engagements. We categorize our engagement types into three broad buckets: baseline, assisted, and full. They require different levels of what we call commitment, not only in terms of headcount funding for the SRE team, but also in terms of project time invested by the developer team, compliance with best practices, and coordination overhead between those two teams, developers and SREs. Higher is not always better, as it comes at a higher cost. Especially for the earlier life cycle phases with a higher rate of change, a lower-tier engagement can be more effective.

An engagement can transition from baseline to assisted to full support eventually, but not all services do. It depends on the business priority and the need for SRE involvement. Services can also transition back to an engagement type with lower commitment. Sometimes that is because they don’t meet the bar for SRE support anymore, and sometimes because they’re mature enough that the SRE work can be scaled back. The developers are happy to take on these additional responsibilities because they’re not very demanding anymore. There is no expectation that an SRE team has a specific mix of engagement types from these three types, or is even using all of them. It really depends on the situation. If you’re running the core infrastructure for your organization, you may focus on a handful of full support engagements. If you’re working with a new business area with many experimental services, you could do many baseline and assisted engagements instead.

Let me walk you through the different engagement types and give concrete examples. The baseline is the entry-level engagement. It’s tactical and reactive. It’s open to everyone in scope for a given SRE team. It’s included in the price of funding that SRE team. It consists of ad hoc support, for example, office hours or consulting projects. It provides easy access to production expertise, but the execution lies with the developers. It can also apply to incident response. For example, we could provide an escalation on-call that the Dev on-calls can ask for help during a major outage. These SRE on-calls may not have detailed knowledge about the particular service, but they can often help with generic production knowledge, which the developers might not have to the same extent. Additionally, the SRE on-call can handle escalations to backend dependencies or communication with stakeholders and allow the developer on-call to focus on debugging and mitigation.

One possible example of how to implement baseline is what we call SRE Love. It’s a program for time-boxed consulting projects, typically two to four hours per week for, say, a quarter. The developers submit a proposal for someone from the SRE team to help them execute it. So typically there is a call every quarter, you submit your proposal, and the SRE team takes on whatever they can. And the focus here is on knowledge transfer, not on an SRE developing the project for the developers. It helps to improve production early in the life cycle for services that don’t have any other type of SRE support. So the SRE mentors the Devs to do the work themselves and makes it a lot easier for them.

It’s also often the first time that the Dev team interacts with SRE. It helps to build a personal relationship between both sides, and put faces to the function. The Devs gain a better understanding of what SRE can do for them. And SREs learn about a new service that may become relevant to them at a later point in time, either because it moves to a different engagement type or it’s a dependency during an outage. So it’s good to know about it.

Next up is the Assisted Engagement. This type provides a longer-term strategic engagement. There is a dedicated point of contact on the SRE side and typically also on the Dev side, and a shared roadmap is defined. The focus is on engineering projects that improve production health, obviously. It can include core redesign work, productionization, infrastructure migrations, and many other things. It does not usually include operations. The service is still operated by the Dev team. Sometimes individual SREs may join the Dev rotation temporarily to gain a deeper understanding of the service, which then helps them with their project work. When this type of engagement is applied at the right time, it can provide huge value that pays off for years to come even though it does not usually include any kind of operations.

One example of Assisted Engagement is Embedded SRE. Typically, one or two SREs join a Dev team for a particular project. They’re still part of their SRE home team and participate in its on-call rotation, but they spend all of their project time on this Dev team. As this is a significant investment, it is reserved for truly critical projects and for situations where these SREs can be force multipliers for the Dev team. For example, it can be used to apply SRE principles during the design and implementation of a major new service, or to prepare the onboarding of the service into the full support tier.

Full support [inaudible 00:23:45] is the most expensive type of engagement. It requires substantial headcount investment and continuous contributions from the Dev team. The services that get there need to meet the high bar of production excellence, and SRE becomes the effective owner of production. SRE runs production for the Dev team. The development team is still responsible for the service being reliable, so responsibility is not offloaded, but SRE will do the bulk of the work. The most obvious attribute is that SRE is on call, but the goal is not to keep a broken service afloat with on-call heroics.

Any straightforward production issue can be fixed much more efficiently in Assisted Engagement. So if the service is broken, do not try to onboard it to full support; fix the obvious stuff first. Once the low-hanging fruit has been dealt with, the SRE on-call work provides the additional comprehension necessary to solve the less obvious, complex production problems. That is when you really need skin in the game: you have to see the service go up in flames in production to be able to analyze it and give good recommendations on how to re-architect it. Both sides should work together towards simplification to reduce the cost of operating the service, up to a point when neither side cares who carries the pager, and then it might be a good point in time to go back to the Assisted Engagement and save some SRE time.

A good example of this is SRE’s mantra of automating yourself out of your job every 18 months. Sounds ambitious, right? That can be done either through incremental improvements to service infrastructure or through pivotal changes to a new approach. Either way, it requires an intimate understanding of the service, which is typically built through day-to-day involvement in production and engineering. The goal is to reduce the need for continuous SRE involvement to make more time for more exciting projects, in some cases to hand the service back to Devs by moving to an Assisted Engagement, or sometimes simply to keep up with the growing support load of a rapidly scaling system. If you’re in hyper-growth, the production work, the Ops work, will go through the roof.

So you have to cut down on it just to be able to keep up. It’s essential to reduce the cost of complex high-touch systems before they grow out of control and become what we call haunted graveyards. As the time investment, both for operations and for project work, is high, full support should be reserved for mature and business-critical systems, the most important things that the organization runs. But even for those, there’s typically plenty of opportunity for improvement, no matter how mature the system already is.

Okay, but what do you do when it doesn’t work out? For example, the operational load gets out of control, the service is unstable, SREs and Devs don’t see eye to eye anymore, or Dev has completely disengaged. Don’t panic. Apply the best practices from incident management at the strategic level. Don’t fight the symptoms, but understand the root cause and prevent a recurrence. What you would typically do during a production outage also applies to the engagement level. Try to come up with a strategic plan to fix the identified issue. Get buy-in from your Dev partners and potentially critical dependencies that you rely on. And if there isn’t enough engineering time to execute that, ask senior leadership to declare that the work required to fix the problem trumps all other project work. We call this a code yellow. If you can’t agree with your Dev partners, escalate up on both sides of the management chain, on the SRE side and on the Dev side.

And if that doesn’t help, perhaps it’s time to disengage, to hand back the pager. And if your leadership doesn’t want to do that either, mobility for SREs is typically high inside of Google, so then it’s maybe time to start looking for a different SRE team. And if there are no SREs left on the team anymore, well, the pager is handed back anyway. This is not what typically happens. Everybody understands that the SREs need to be kept happy as well. You can’t throw them under the bus, and the developers understand the value that they get out of it. So normally you don’t end up there; you agree on a solution. Whatever you do, remember that heroics are not sustainable. You can’t firefight production forever. Neither can you work day and night. It’s not sustainable. Solve the problem through smart engineering, not brute force. It also helps to remember the engagement principles. It is a shared endeavor. You must set a reasonable scope and adjust investments when needed. Invest in your Dev-SRE relationships. Spend more time together. United you stand, divided you fall. Okay. Let’s wrap.

Dr. Jennifer Petoff:

All right. Thanks, Christof. So here’s Google SRE in a nutshell. Google SRE is a specialist organization that takes a principled approach to balancing reliability and feature velocity while keeping maintainability and efficiency in mind. SREs partner with Dev teams to solve hard engineering problems that would be a lot more difficult for the Devs on their own. SREs and Devs don’t work against each other with conflicting incentives, but together towards shared goals, which are codified through things like service level objectives and error budgets. SRE helps Dev to build their production muscle. And ideally, the collaboration should begin early in the software development life cycle. Each engagement is different and we pick the approach that fits the needs of the service best, but of course, they all share the same principles. And because the service, the business, and outside factors are constantly changing, the engagement needs to be adapted regularly to stay impactful. So it’s not one and done. You have to keep revisiting it and keeping it fresh.

As we wrap up here, I just wanted to call out that if you want to learn more about Google SRE, we’ve actually published three books now with plenty of content that you can check out. So feel free to read them for free online at SRE.google: the original SRE book, the Site Reliability Workbook, and Building Secure and Reliable Systems. And to wrap things up as well, we’d like to engage with you of course, and puns are unintended in this particular case. So I’m interested in knowing how SRE works at your organization. We’ve talked about how SRE works at Google, but we recognize that different places do SRE differently and there’s no single right way. So we would, of course, love to learn from you and what you’ve found to work. Personally, I’d also love to know what other SRE topics you’d like to hear about.

So we’re always looking for inspiration for conference talks, blog posts, and publications. And of course, if you’re wondering if we’ve already published on a topic you’re interested in, feel free to check out SRE.google and cloud.google.com/sre for the latest, and let us know what the gaps are and where we can potentially publish more. And finally, the great thing about conferences is the chance to connect with people. That’s certainly harder when everything is virtual, but we’d still love to try. Christof, I’d love to see you in person at some point. It’s been a while.

And Christof and I, of course, would welcome the chance to connect on Twitter or LinkedIn. We’ve included our coordinates on this particular slide. For me on LinkedIn, if you send a note with your invitation saying that you saw our DevOps Enterprise Summit Talk, that would be helpful as well. Because we do get quite a few requests to connect and it’s nice to have that context. So thank you, everyone. Thanks for tuning in today. And we’re looking forward to chatting with you all on Slack and getting your feedback and hearing your questions. Christof, it’s been great, and thanks so much.

 

