The following is an excerpt from a presentation by Stephen Thorne, Senior Site Reliability Engineer at Google, titled “Getting Started with SRE.”
You can watch the video of the presentation, which was originally delivered at the 2018 DevOps Enterprise Summit in London.
Who am I?
My name is Stephen Thorne and this is Getting Started with Site Reliability Engineering. I am a site reliability engineer at Google.
I've worked as an IC on both the Ads side and the Cloud side of the business, and I ran App Engine for a number of years. My current team, Customer Reliability Engineering, approaches our Cloud customers and helps them understand how their products and applications are running on our platform, how our products impact them, and how they can move fast and do things better on our platform.
Now I’d like to give you a little bit of an introduction.
Google has found that site reliability engineering has been a very successful model for operating applications in production without being overwhelmed. Our VP, Ben Treynor has quipped:
“SRE is what happens when you ask a software engineer to define an operations function.”
So what I’m going to try to do today is take you through some of the philosophy of how SRE works. We believe that it’s a versatile model, which allows you to operate mission-critical systems, no matter if they’re small, medium, large, huge, or very, very fast growing.
What I’m not going to do is try to convince you that SRE is the right thing for your organization.
What I’m actually going to try and do is show you how it works, and the philosophy and principles behind it. Whether or not you think that’s the right thing for you, is up to you to decide.
In 2016, we published a book on the subject of SRE. This book covers a lot about how Google invented SRE, where it comes from, and how Google runs systems in production. What we've learned since publishing it is that the way Google runs things isn't very relevant to a lot of people, so we want to talk to you about how you might implement SRE and how you would be able to implement some of the same things we've done.
That's also why we published a new book, The Site Reliability Workbook, which came out in July 2018.
This is my agenda for today.
We’re going to talk about Service Level Objectives, error budget policy, making tomorrow better than today. In essence, I’m going to share with you a very quick tour of the principles behind site reliability engineering, which are:
- SRE needs Service Level Objectives, with consequences.
- SREs have time to make tomorrow better than today.
- SRE teams have the ability to regulate their workload.
You can see very visibly that this is a progression. You can do site reliability engineering without actually having a site reliability engineer. And you can have site reliability engineers before you actually have a site reliability engineering team. Each of these builds upon the previous. You can’t run an effective site reliability engineering org unless you’re monitoring and reporting on your SLOs and actually worrying about the reliability of your system. It just doesn’t make any sense.
To begin, what are Service Level Objectives?
I believe they’re fundamental to the SRE practice.
Service Level Objectives define a goal for how reliable an application needs to be to meet the needs of its users and the business. We measure how well the application is performing and aim to keep it above that level of reliability. There's an ideal I apply when defining an SLO: if the customers are happy, then the SLOs are being met. This, of course, leads to situations where you're not meeting your SLOs and your customers are perfectly happy. That means you probably want to revise your SLOs.
These are typical, simple SLOs.
- Your site is up enough.
- Your HTTP server responds with success often enough, fast enough.
- Your log processor processes enough log entries, fast enough.
Defining a good SLO means picking goals for measures that track your customers' actual experience. Ideally, you want to set goals that your customers really care about. To give you a counterexample, you don't want to set an SLO on the CPU utilization of your backend, or your network throughput, because your customers don't see those as errors. Your customers see HTTP errors. It might be that you have all these causative things, and you want to monitor them, of course, but your SLO is how you think about the reliability of the system and how you account for it.
SLOs can get very complex, especially when you start to involve your user’s actual expectations. So these are just simple examples.
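To make the simple examples above concrete, here is a minimal sketch of computing an availability SLI (the measured value you compare against the SLO). The function name, counts, and the 99.9% target are hypothetical, not from the talk:

```python
# Sketch: computing an availability SLI from request counts and
# checking it against an SLO target. All numbers are hypothetical.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the window."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the objective as met
    return good_requests / total_requests

SLO_TARGET = 0.999  # "often enough": 99.9% of requests succeed

sli = availability_sli(good_requests=999_532, total_requests=1_000_000)
print(f"SLI = {sli:.4%}, meeting SLO: {sli >= SLO_TARGET}")
```

The point is that the SLI tracks something customers actually see (successful responses), not a cause like CPU utilization.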
Every time I talk about SLOs, the immediate conversation jumps to the SLA. In my view and definition, SLAs are your legal agreements between organizations and with penalties for not meeting them, typically monetary.
My definition of an SLO, or at least a good SLO, is that when your application stops meeting it, your customers are starting to get unhappy. This is well before the point where an SLA should kick in, which is when they're so unhappy they deserve their money back. You want to be well on this side of that. I also believe SLOs are best used within your company, not between organizations. You can use them between organizations, but you have to have a very deep understanding that you're not held accountable to them in terms of refunds, monetary responses, etc.
ATTEND THE DEVOPS ENTERPRISE SUMMIT
Much like an SLA, an SLO has to have consequences. Otherwise, it's just a metric that can be met or not met, and that's fine, but it's not going to have any useful impact. So I strongly advocate using SLOs as the primary way that your on-call engineers get alerted for emergencies.
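One common way to wire SLOs into paging, sketched here with illustrative thresholds rather than any prescribed implementation, is burn-rate alerting: page only when errors are consuming the error budget fast enough to exhaust it long before the window ends.

```python
# Sketch of a burn-rate check. The threshold below is illustrative,
# not prescriptive: alert when errors consume the error budget at a
# large multiple of the sustainable rate.

def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the budget burns relative to exactly exhausting it
    over the full SLO window (1.0 = on pace to use it all)."""
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

SLO_TARGET = 0.999
PAGE_THRESHOLD = 14.4  # fast burn: roughly 2% of a 30-day budget per hour

recent_error_ratio = 0.02  # 2% of requests failing right now
if burn_rate(recent_error_ratio, SLO_TARGET) >= PAGE_THRESHOLD:
    print("page the on-call engineer")
```

A check like this alerts on what the SLO actually measures, so every page corresponds to budget being spent.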
The thing that I want to talk to you about today, which is actually much more fundamental to how site reliability engineering works is…
Error Budget Policy.
If you ask “How reliable do you want to be”, the answer is always, “More. The most reliable. 100%”. But we know that the only truly reliable system is one that does nothing at all and can never change. Therefore, you need to have some budget for failure. You have to know what an acceptable level of failure is and you have to balance that against the needs of your business, the needs to do releases, have development velocity, to reduce costs, etc.
What we do is, much like you have a budget for how much you're going to spend on your Cloud services or your development, you have a budget for failure.
Now, I define an error budget as the difference between 100% reliability, the most reliable system you could ever have, and your SLO. Take 99.9% as an example SLO for uptime. Putting that into much more real terms, 0.1% of a month is about 43 minutes. Now, if you have a 20-minute outage, that's a terrible thing. The VP is on the phone. Everybody's angry. But once you've got it fixed and the system back up, you still have 23 minutes of error budget left for the rest of the month, if you're counting monthly.
That’s essentially what an error budget is, and you can have error budgets on a percentage of request succeeded, failed, too slow, etc. But you end up having this gap between 100% reliability and your SLO and that’s budget to spend.
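The arithmetic from the uptime example can be sketched directly. A 30-day month is assumed here:

```python
# Error budget as wall-clock minutes: the gap between 100% and the SLO.
MINUTES_PER_MONTH = 30 * 24 * 60  # assuming a 30-day month

def error_budget_minutes(slo: float) -> float:
    """Monthly downtime allowed by an availability SLO."""
    return (1.0 - slo) * MINUTES_PER_MONTH

budget = error_budget_minutes(0.999)  # ~43.2 minutes for 99.9%
remaining = budget - 20               # after a 20-minute outage
print(f"budget: {budget:.1f} min, remaining: {remaining:.1f} min")
```

The same subtraction works for request-based budgets: replace minutes with a count of allowed failed requests.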
The idea is that you're allowed to have errors, and you account for them. You say, "Okay, we agreed that this is the level at which our customers are going to notice how bad the system is and get angry at us. Before that, they're probably just going to hit Reload on their web browser because they believe it's their home network and not your website." Legitimately, that's something you should think about. For instance, who wants to run a four-nines-reliable mobile backend when mobile operators are having trouble reaching three nines?
Now, what do you do when you spend your whole budget?
It's a budget. You've run out of it. You've spent it. What do you do? It means you didn't meet your reliability goals and something needs to change. Now, this isn't an SLA, so the response isn't to say "Sorry" or to give anybody money. You have to do something to make your system more reliable. The short-term goal is to stop the problem from getting worse; the long-term goal is to make sure it doesn't happen again.
I’ve worked with a product that has had excellent operations teams, great incident managers, heroes all around, but that wasn’t enough. Something had to change because our users were simply not happy with the reliability. What we had to do was institute our error budget policy.
Now, it’s always better to agree to these policies beforehand, but sometimes you don’t get that. So here are some example policies. The one that suits your business is probably very specific to what you need. As a policy, it has to be supported from the highest level of your organization. It’s not good having a policy if nobody follows it.
In my team's case, we found the biggest cause of instability was pushing new versions of the code. We had a situation where it was a monolithic release: dozens of development teams all went into the same release, there was at least one problem, and we had to roll back. It was a big issue; we were burning through our budget all the time. So we reduced the number of releases until we could break the release down into smaller releases, and then moved forward.
This is our first principle: "SRE needs Service Level Objectives with consequences." And if you have good SLOs, and you have an error budget policy, and you follow that error budget policy, I would legitimately say that's SRE. You can do that today in your organization without having a single site reliability engineer, but you are doing site reliability engineering.
But, I’ve got to point out, this goes the other way. I’ll tell you another story. I was engaging with a cloud customer that was running an ad system and it had a long processing pipeline. This long processing pipeline was the major cause of all of their concern. They had on-call engineers dealing with it whenever it had problems. Mostly it had throughput issues, latency would spike, it would go from 10 minutes to 20 minutes to 30 minutes, and they were getting woken up in the middle of the night to deal with this.
So I did a workshop with these folks and we worked out what an appropriate SLO was for this pipeline. It was a pipeline that regenerated the recommendation model, so the next time somebody came to the website, they would see much better ads. But you don't come back to the website within 30 minutes; you typically come back the next day or the day after. So we thought, "What's the shortest delay at which somebody would experience pain?" In this case, we defined pain as lower-quality ads. We picked six hours. Then suddenly we were never exceeding the error budget, because the latency was only ever spiking to 30 minutes.
With an SLO of six hours, we had so much error budget that we could reduce the resources we spent on the pipeline and let it run at higher latency more often, because the throughput didn't matter. We knew it was acting reliably, and we'd respond if it ever went over that threshold. They haven't been paged about it since. So it does go both ways: you can be comfortably within your error budget as well as outside it.
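The pipeline example can be framed as a freshness SLO. In this sketch (with hypothetical run latencies), only runs that exceed the six-hour objective count against the budget, so a spike from 10 to 30 minutes consumes nothing:

```python
# Sketch: a freshness SLO for a batch pipeline. Only runs whose
# end-to-end latency exceeds the objective count as SLO violations.

SLO_MAX_LATENCY_HOURS = 6.0

# Hypothetical per-run latencies in hours (10-30 min spikes plus
# one genuinely slow run).
run_latencies_hours = [0.17, 0.33, 0.5, 0.25, 7.2, 0.4]

violations = [t for t in run_latencies_hours if t > SLO_MAX_LATENCY_HOURS]
print(f"{len(violations)} of {len(run_latencies_hours)} runs violated the SLO")
```

Measured this way, the on-call engineer is only woken up when the customer-visible objective is actually at risk.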
On to our next principle.
“Making tomorrow better than today.”
SLOs and error budgets are fine, but the next step is defining what your first site reliability engineers will actually go and do. So I believe SREs must have real responsibility. Meaning they must be both engaged with operating the system, but also empowered to do something about it when it goes wrong.
The first thing that your SREs should work on is defining and refining your Service Level Objectives. Even if you have them, they can probably be improved. That's my experience with most error budget excursions: first, you check whether the Service Level Objective itself was legitimately the problem, and whether you want to improve it.
They’re the best person to actually enact your error budget policy, and they need to be accountable and responsible to the application and the fact that it’s meeting its reliability expectations.
A major part of what SRE does is toil, because they're operating a system in production. Everybody knows you've always got more to do with systems running in production. You might have to get more servers, you might have to go into a new Cloud zone, and you've got your weekly releases. Not everybody has CI/CD releases; I know everybody wants that, but legitimately, somebody has to set it up.
One system I worked on was reporting its SLOs quarterly, and every quarter it was consistently meeting every single one of them. That meant the team was freed from reactionary operational work, and in the spirit of 'if you do a good job, your reward is more work,' they kept being given more and more work to do: more systems to maintain, more responsibility, etc. They spent a lot of time burning down this toil because they had so much of it, but they had it capped. The reason we cap it is that if you're just doing toil, you can't improve anything. We say that at least 50% of your time should be spent on project work and at most 50% on toil. Of course, this is very qualitative, and essentially if you ask somebody, "Are you spending too much time on toil?" they'll tell you.
SRE does project work, and the focus is on making things better for themselves. So, first, they might do whatever is required to actually meet the SLO. That's their first responsibility: any project work required to do that. It might be, "Oh, we're having too much instability in one Cloud region. Let's run in two Cloud regions." It might be, "Our releases have caused too much downtime. Let's do progressive releases." Do the project work required to address the SLO, and then move on. They might improve the monitoring. They might work on automation. They might hold folks accountable to the post-mortem action items. Whatever is required to make running your systems less toilsome, more productive, and more reliable. And that's our second principle. It's very simple.
"SREs must have time to make tomorrow better than today," because if you're not capping that toil and allowing them to actually go and implement that monitoring work, then all they're doing is getting overloaded with toil, and they won't be able to do any project work. The next time they need to do something to improve the reliability of the system, they're too overloaded. I think any org, whether it has one SRE or a thousand, must be able to apply this principle. The SREs must have the ability to address the toil and do the project work.
On to our third principle.
Now, this is about SRE teams. I've got to emphasize here that this is just how I see Google doing things. I don't want to be prescriptive and say, "First of all, you have to restructure the way you do things to match the way we do things." Think about this in relation to your own company and how you might implement it. This is how I see it being possible for a team to be responsible for running mission-critical software in production and keeping it reliable.
I don’t think you can just “create” a site reliability engineering team, by taking an existing team and changing the name on the door. Getting started with SRE will work better if you…
- Start with SLOs
- Build up some principled approaches to engineering to meet those SLOs
- Prioritize giving the most mission-critical systems to your SREs to support in production
Note that I say mission-critical, and giving. That means there's plenty of room for other teams to run things. SREs don't implicitly hold the only keys to production; they're there to force-multiply, not to gatekeep. Your SREs are there to take responsibility for the reliable running of your systems. They're not there to take over the running of every system.
By sharing the responsibility of running your applications in production, with some products being partially or entirely owned by your Dev or DevOps org prior to SRE taking them on, you're probably going to have much better synergy between your teams. We see SRE as a team that needs to do a good job with the most important pieces, and, as it grows and matures, take responsibility for more production systems.
In order to be able to continue to do a good job, to make tomorrow better than today and engineer to the SLO (those two principles), there have to be ways this team can push back when their workload becomes too much to handle. We apply that pressure in many different ways. For example, giving developers 5% of the operational work is an excellent way of making them understand what it means to do that operational work. It stops things from getting put onto the backlog if they have to participate. You can, of course, also do the usual project management thing and track completion.
SRE teams are often very good at analyzing new production systems, because they've been running all the old production systems for so long that they see the same patterns coming up again and again, and we all know that fixing things up front, before they're deployed, is much, much cheaper than fixing them afterward.
I believe that this is required for every aspect of SRE.
- It's required for the error budget policy. If your leadership doesn't back up your error budget policy, you probably have to go all the way back to principle one, again and again.
- It’s required for a cap on toil.
- It’s especially important for shared responsibility because this is where your SRE team will interface with other parts of the organization and you need management coverage.
When your application misses its SLOs, it puts a large amount of load on the SRE teams, who, by necessity, have to spend much more time fighting fires and accrue more tasks to improve the system. Your leadership is additionally best placed to help here: either by funding more focus on reliability, by funding the software development that's required (not necessarily by the SREs, but with Dev buy-in), or, you can always loosen your SLO.
It might be that when you actually go to your CTO and your CFO, and you say “We need to spend this much extra money to meet this SLO, this target that we have” and they say “Oh well, that’s a lot of money. Why don’t we just do a worse job and eat that unreliability?” That’s actually perfectly reasonable. They’re the best-placed people to make that decision.
I have seen a team literally fall apart and be disbanded because the application was consistently out of SLO, and the development team refused to commit to actually addressing the problem. This is a direct quote:
“If the pager is going off so often, can’t you just hire more people to answer the pager?”
That doesn't actually address the fact that the system is chronically unreliable and there's nothing the SREs can do about it. So there is such a thing as leadership helping in the wrong way. The correct thing to do there was to freeze the pushes, fund the effort on reliability, burn down the backlog, actually do the things the SREs were telling Dev to do (which were being ignored), and improve the health of the application. I don't have to tell you that fixing issues after release is always more expensive.
This is another great thing that SRE can bring. If your SRE team is responsible for running many, many systems, then the more alike those systems are, the more of them the team can run with less toil. SRE is often well placed to drive consistency across an organization.
I was talking to a computer gaming company who said that the one big thing they enjoyed about SRE was that they now had a team that could say, "We're running all these systems in production, but they're all different. Every time you iterate, make them more consistent," so they could drive down the toil and drive up the productivity of the SRE teams. It's quite a small cost if you do it up front.
I’m sure you can appreciate how much good automation can benefit the running of a system in production. Automation doesn’t have to be written by SRE, I should emphasize that. SRE might do a lot of automation work, but it could be done by development teams, by DevOps teams, or whoever you have. Your company should specifically address automation where it helps drive down toil and prioritize it. This is how you’ll be able to get your SRE teams to be responsible for more services and yet not suffer from operational overload. This is how we scale up our team non-linearly with the number of services they support.
And that’s our third principle, “SRE teams have the ability to regulate their workload.” Once you have a team up and running, this is a principle you can apply to allow for growth and better harmony with the rest of your organization, scale up your operations, and develop even larger systems.
In summary, these are our principles, which I’ll restate for you:
- SRE needs Service Level Objectives, with consequences, and you can do this in any organization.
- SREs have time to make tomorrow better than today, and even your very first SRE needs to be able to do this. They need to be able to both run your systems and make them better.
- SRE teams have the ability to regulate their workload so that your teams can flourish and grow.