Engineering Lead, Google
Sr. SRE Manager, Google
Senior Director, SRE, Google
Engineering Manager, (SRE) Google
Technical Writer, Google
with help from Salim Virji, Site Reliability Engineer, Google
Google’s Site Reliability Engineering (SRE) team is a specialist engineering organization focused on designing, building, and maintaining large-scale production services. SREs can be software engineers or systems engineers but usually bring a blend of both skill sets.
Google SRE’s mission is to:
1. Ensure that Google’s products and infrastructure meet their availability targets.
2. Subject to (1), maximize long-term feature velocity.
3. Use software rather than human toil to accomplish (1) and (2).
4. Engage only when (1) through (3) are accomplished more efficiently by SRE than by developers.
Reliability and velocity are not mutually exclusive. Often, velocity can benefit from improved reliability and vice versa. However, when a tradeoff between reliability and velocity is necessary, SRE prioritizes reliability over velocity, but only until the product or service in question reaches the desired SLO. When the SLO is not met, working on reliability is more important for user satisfaction than feature velocity. When the product is within SLO, additional reliability at the expense of feature velocity is counterproductive. Rather than taking a brute-force approach to its mission, SRE optimizes operations through engineering and automation in place of repetitive human work (“toil”).
As a specialist organization, Google SRE is in high demand by product development (hereafter shortened as “Dev”) teams; the opportunities for SRE to provide additional value are plentiful. SRE can be a force multiplier in many situations, but when a problem can be solved just as well by an engineer in the Dev organization, hiring a Dev instead of an SRE is a more flexible approach that creates less cross-organizational overhead. The goal is to staff just enough SREs to maximize the ratio of impact to overhead.
Google SRE should not be taken as a blueprint for implementing SRE elsewhere but rather as a case study. Every organization is unique; its needs and goals are unlikely to be exactly the same as Google’s. But almost twenty years of practical experience have provided many lessons that can help others fast-track their own SRE journeys.
SRE at Google is not static: it is constantly evolving, and different parts of Google apply the model differently according to their needs. Unlike similar functions at many other organizations, SRE is a centralized group at Google. From its humble beginnings, the SRE group has grown to a few thousand engineers. As shown in Figure 1, teams of SREs are dedicated to a specific Product Area (PA) and work closely with their Dev counterparts in that PA. SRE PAs vary in size and can consist of up to a few hundred SREs. Each SRE PA is funded by its Dev partner organization and collaborates with it at every organizational level. An SRE PA is typically an order of magnitude smaller than the Dev partner organization, but the ratio can vary widely. Most SRE teams are dual-homed: split across two locations, with six to eight SREs in each and a time zone difference of five to nine hours between them to enable a follow-the-sun on-call rotation.
Figure 1: The Google SRE Organizational Structure
SRE work is based on engagements with their Dev counterparts. An engagement is a collaboration between both sides, typically around a specific service or product. Most often, an engagement is a partnership between SRE and Dev to improve the reliability, infrastructure, and operations of a specific production system. Other engagements might be focused on the end-to-end user experience of a product or a horizontal infrastructure topic; either can span numerous production systems. A typical SRE team maintains a set of engagements with the Dev teams that develop the systems in scope.
The Engagement Model is one of Google SRE’s foundational concepts. It describes the principles for engagements and a set of best practices to facilitate efficient allocation of resources, communication, coordination, and cooperation between SRE and Dev. While not a strict rules-based model, it aims to provide clarity, set mutual expectations for the involved parties, and allow easy identification of outliers or degraded engagements.
This section describes the principles of the Engagement Model. We’ll then discuss the categories of engagements (“engagement types”) and how to apply them in practice.
Aligned with SRE’s Mission
SRE’s mission, as indicated earlier, is to improve the reliability, efficiency, and velocity of Google’s products while maintaining high team health. This mission should be at the core of every engagement, and each engagement should have a measurable positive impact on these goals.
Advocate for the User
SRE is an advocate for the user and for the user’s experience, whether that user is external or internal. The fact that SRE’s engagements may be enumerated by systems (or groups of systems) should not diminish SRE’s focus on how the user perceives reliability (or the lack thereof). This focus is reflected in an emphasis on end-to-end, or customer-centric, SLOs, as well as in SRE’s responsibility to highlight reliability gaps and risks to Dev partners even when these are outside the SRE team’s immediate areas of responsibility. It may also mean aligning first at the product level and then focusing SRE teams on particular critical user journeys (CUJs) or end-to-end experiences, even if their immediate area of responsibility is delineated by a (possibly wide) group of services.
Clear Value Proposition
SRE should only take on work that SRE can perform significantly more efficiently than anyone else. Adding a specialized team to partner with the Dev team introduces additional organizational complexity and increases the risk of silos. If the work can be done with similar quality and efficiency inside of the Dev team, that solution is preferred—it is not only simpler but also allows teams to shift work more flexibly when requirements change.
SREs are skilled, specialized engineers who are highly sought after and paid comparably to their Dev counterparts. In order to justify adding SRE headcount, an engagement should involve substantial reliability engineering work of enduring value, rather than mostly on-call work. Otherwise, adding Dev headcount makes more sense. A certain amount of exposure to on-call work is valuable because it provides insight into which engineering streams provide the highest value, but giving mostly on-call work to a team of highly trained engineers is likely to lead to dissatisfaction within the SRE team. The fact that a Dev team is too small, or too concentrated in a single location, to provide its own on-call coverage is not prima facie a sufficient reason to justify an SRE engagement.
SRE teams should be scoped to a set of services (or a set of CUJs) with clear correlation and boundaries. SRE does not have an obligation to take accountability for a specific service, but typically provides a base level of support to all products within the Dev team’s scope. Dev and SRE leadership regularly negotiate engagement scope.
Funded by Dev
SRE PAs receive headcount grants from their respective Dev orgs. SRE does not receive headcount through its own management chain or carry its own unallocated headcount. While SRE teams are funded by Dev, once headcount is transferred, SRE has responsibility for that headcount. The SRE PA lead has an obligation to use that headcount efficiently and effectively in consultation with the funding Dev partner. Headcount should be returned to the funding Dev org if it cannot be used to deliver substantially more value via enduring SRE work than the funding Dev partner could deliver.
Funding should be long term (but not permanent). It takes a long time both to hire SREs and to onboard them to a service. For that reason, Google SRE plans for headcount funding on a time horizon of two or more years and does not tie funding to short-term, time-bounded activities. Swings in headcount level create inefficiencies and prevent the SRE team from engaging deeply with the product.
The level of funding on an engagement (or a group of engagements) should be regularly reviewed by SRE and Dev leadership, e.g., annually. The review should consider whether the engagement type is correct and whether to reduce or increase funding—either via a grant or return of headcount or by reallocation within SRE. Decisions should be made by consensus. However, SRE leadership ultimately owns reallocation of project priorities within existing headcount limits. Otherwise, factors like adequate staffing for an engagement might not be fully accounted for.
Production excellence is a long-term investment. Engagements are not considered in isolation but at the SRE PA level. The SRE PA as a whole should have a strategic vision that is aligned with and complementary to that of the Dev org. Merely executing a series of unconnected engagements is an anti-pattern. The SRE PA lead owns the SRE PA vision and the task of priority negotiation with the Dev org lead.
Each individual engagement is built according to a multi-year planning horizon. Service engagements are expected to yield a shared road map between Dev and SRE. Work should move in both directions between SRE and Dev. SRE is not simply a repository for work handed to it by Dev.
Expectations should be set before there are issues in the arrangement; under duress it is more difficult to form a written agreement. Systems and their components change, merge, and diverge. SRE needs to carefully move with the product and can’t pivot instantly to support a new system without sufficient ramp-up time.
Irrespective of the type of engagement, the service itself and its reliability are ultimately owned by the Dev team, even if day-to-day production authority rests with SRE under some forms of engagement. This means responsibility for having a reliable service is not off-loaded onto the SRE team; rather, the SRE team members are specialists in reliability engineering who can help the Dev team attain its reliability objectives by working in partnership under one of the engagement types (which in turn set out SRE’s responsibility to Dev).
An active, robust Dev engagement is part of a healthy service. Since SRE doesn’t control headcount allocation, it cannot be solely responsible for a service, and historical cases where Dev engagement ended while the service was still live and SRE-supported (“abandoned services”) have ended poorly. Accordingly, Dev teams intending to sunset their staffing of a service also need to plan to sunset the service itself and migrate remaining users to other services. SRE’s engagement with a service will cease once it no longer has Dev support, and any assigned headcount will be returned by the time Dev engagement ends.
Starting and continuing with an SRE engagement is a joint decision for Dev and SRE. SRE cannot be forced to take an engagement, Dev cannot be forced to fund one, and either Dev or SRE can end one.
Corollary: If either Dev or SRE wishes to end an engagement, it should end, and the headcount position should be reexamined (either redeployed or returned) in a manner compliant with the funding principles discussed above. Ending an engagement by a means other than consensus is something that both parties should seek to avoid.
SRE and Dev bring different expertise: SRE focuses on reliability principles, system architecture, and best practices for production, while the Dev org is typically more experienced in its business domain. The success of a service is a shared endeavor. Despite being separate teams and having different roles, both sides work toward a common goal. This includes joint OKRs (objectives and key results) where appropriate and adhering to an error budget policy (for example, freezing feature releases when a service/CUJ is out of SLO). Dev and SRE have a shared interest in operating a service within SLO in the most cost-efficient manner possible, so SLO violations are a critical issue for Dev and SRE to address together.
SLOs and error budgets promote a common understanding of reliability goals and provide an objective tool for measuring success. This allows SRE and Dev to jointly make informed decisions about whether the balance between reliability and velocity needs adjustment. Freeze policies provide a simple way to adjust that balance toward reliability when customer/user trust is in danger of being broken.
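As a rough illustration of how an error budget can serve as an objective measure, the sketch below derives a budget from an SLO target and applies a freeze rule. It is a minimal example under stated assumptions, not Google's actual policy or tooling: the request-based SLO, the `SloWindow` class, its field names, and the "freeze once the budget is overspent" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical shape)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int

    def allowed_failures(self) -> float:
        # The error budget: how many requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    def budget_remaining(self) -> float:
        # Fraction of the error budget still unspent; negative when overspent.
        allowed = self.allowed_failures()
        if allowed == 0:
            # No traffic or a 100% target: there is no budget to spend.
            return 1.0 if self.failed_requests == 0 else float("-inf")
        return 1.0 - self.failed_requests / allowed

    def should_freeze_features(self) -> bool:
        # Assumed policy: freeze feature releases once the service is out of
        # SLO, i.e. more requests failed than the budget allows.
        return self.failed_requests > self.allowed_failures()


# Example: a 99.9% target over one million requests allows about 1,000 failures.
healthy = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
burned = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
```

With these numbers, `healthy` has spent about a quarter of its budget and feature work can continue, while `burned` has overspent and would trigger the assumed freeze. A real policy would be negotiated between Dev and SRE and would likely use several windows and softer thresholds.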
Operational and on-call responsibilities are also a shared endeavor, and as a service becomes more mature, the bulk (but not 100%) of operational responsibilities are often carried by SRE.
SRE Is Not an “Ops Team”
SRE’s mission is not to handle operations but to improve the inherent reliability of systems through engineering. For SRE, being on call is a means to an end; it often provides valuable insights that wouldn’t be available otherwise. However, on-call work has no long-term value in and of itself. On-call coverage is not at the core of SRE work, and it alone does not justify the formation of an SRE team. SRE has strict limits on ops work; toilsome work (interrupts, production clean-up, etc.) should not exceed 50% of the SRE team’s time. If toil exceeds this threshold, Dev must handle the excess ops work. This mechanism ensures that SRE has enough time to work on projects that reduce the ops workload.
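The 50% limit lends itself to a simple check. The following sketch assumes that toil is tracked in hours per planning period; the function name, its parameters, and the hour-based accounting are illustrative assumptions, since the text does not say how toil is measured.

```python
def excess_toil_hours(toil_hours: float, total_hours: float, cap: float = 0.5) -> float:
    """Return the hours of ops work above the cap, which Dev would absorb.

    `cap` is the maximum share of team time that may go to toil (50% here).
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return max(0.0, toil_hours - cap * total_hours)


# A team with 240 working hours in the period and 130 hours of toil is
# 10 hours over the cap; at 100 hours of toil nothing is handed back.
over_cap = excess_toil_hours(toil_hours=130, total_hours=240)   # 10.0
under_cap = excess_toil_hours(toil_hours=100, total_hours=240)  # 0.0
```

Tracking the overflow explicitly, rather than only a yes/no flag, makes it easier to agree on how much ops work moves back to Dev in a given period.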
It is expected that Dev always carries at least some of the operational responsibilities. Typical examples include a secondary on-call rotation for escalations, ownership of non-production environments, and/or handling noncritical ops work. The exposure to ops is essential to maintain and foster production knowledge in the Dev team. The split of responsibilities should be tracked in writing to avoid misunderstandings.
Ops Is Not a Zero-Sum Game
Instead of simply moving operational responsibilities from one place to another, an SRE engagement should focus on reducing the overall ops workload. A successful engagement reduces the ops load to the point where it no longer matters much which team holds the pager. Regardless of who officially holds the pager, Dev is generally expected to maintain a 24/7 on-call escalation path.
Teach to Fish
SRE should not serve as a human abstraction layer for production. This approach is not scalable, reinforces silos, undermines critical feedback loops, and turns production complexity into an existential need to justify SRE’s existence. Instead, SRE helps Dev gain a deeper understanding of the production aspects of the service.
Promote Production Standardization
SRE should promote the use of common production platforms and standardized infrastructure. Such platforms have several advantages:
- Provide a consistent service management infrastructure, which reduces the cost of implementing cross-service requirements (“horizontals”) in production.
- Reduce the ongoing cost of operating individual services in production (e.g., onboarding time, engineer training time, toil).
- Reduce the cost of supporting all services in production in aggregate; by making skills portable, it is simpler for engineers to work on disparate services.
- Reduce the cost of moving services between Dev and SRE as well as between different SRE teams.
- Improve the mobility of engineers between teams.
- Simplify and reduce risk in production.
- Improve engineering velocity.
- Improve resource efficiency of production services in aggregate.
SRE should promulgate standards for production platforms at an SRE PA level—a principle that’s applicable to services irrespective of the level of SRE support.
Quality of work must be a priority. SREs at Google have the same mobility opportunities as their Dev counterparts and therefore need a novel, challenging, and interesting environment that allows for personal development. SRE aligns closely with Dev on OKR planning but ultimately owns its own OKRs.
Success Must Be Tracked
SRE engagements are a significant investment and require structured planning and success tracking. SRE and Dev maintain a shared road map and track progress toward goals. They regularly review service health, criticality, business justification, and priority. This can be facilitated through business reviews, quarterly reports, and production health reviews.
SRE engagements are possible at any phase of the service life cycle—not only after the production launch. Often, they are most impactful and efficient when they happen early in the life cycle (are “shifted left”)—for example, during design and implementation. Fundamental architecture and infrastructure decisions can be changed easily during the design phase but are often extremely hard or prohibitively expensive to revise for a fully productionized system. An early engagement with SRE can prevent significant headaches later.