In this post, I will summarize the two hours of interviews I did with Randy Shoup to fill in the gaps in my knowledge of the causal model of how organizations doing DevOps sustain their amazing performance.
Randy Shoup has helped lead engineering teams at eBay and Google, and is one of the best people I’ve seen at articulating the leadership traits needed to replicate the DevOps outcomes of fast flow and world-class reliability. My two favorite examples of his work are his 2013 Flowcon presentation (where I met him) and his amazing work transforming the eBay architecture in the early 2000s.
To summarize, the four capabilities in Dr. Spear’s model are the following:
- Capability 1: Seeing problems as they occur
- Capability 2: Swarming and solving problems as they are seen to build new knowledge
- Capability 3: Spreading new knowledge throughout the organization
- Capability 4: Leading by developing
This was the basis of my interview with Randy Shoup, to uncover some of the practices at Google and eBay that are not discussed as widely.
(It’s difficult to overstate how much I learned from Randy Shoup. If you’re interested in learning more of this and putting this in place in your organization, Randy is currently doing consulting. You can get his contact information on his LinkedIn profile.)
Capability 1: Seeing problems as they occur
Dr. Spear writes:
High-velocity organizations specify and design to capture existing knowledge and build in tests to reveal problems.
Whether the work is to be done by an individual or a group, with or without equipment, high-velocity organizations are uncomfortable with ambiguity. They specify in advance what (a) outcomes are expected; (b) who is responsible for what work in what order; (c) how products, services, and information will flow from the person performing one step to the person performing the next step; and (d) what methods will be used to accomplish each piece of work.
GK: In the DevOps space, I contend that one of the paragons is Google, especially in the area of automated testing.
Eran Messeri from the Google SCM team spoke at GOTOcon Aarhus in 2013 in a session called "What goes wrong when thousands of engineers share the same continuous build?". (My notes from his session are here).
Here are some of the remarkable statistics (as of 2013) that he presented, showing how they give developers the fastest, earliest, and cheapest feedback possible:
- 15,000 engineers (both Dev and Ops)
- 4,000 simultaneous projects
- all source code checked into a single repository (billions of files!)
- 5,500 code commits per day from those 15K engineers
- 75MM automated tests run daily
- 0.5% of all engineers are dedicated to Dev Tooling
(More amazing statistics on how Dev works at Google can be found in this 2010 QConSF slide deck by Ashish Kumar.)
Q: Google is likely the exemplar of automated testing — tell me more about your experiences there.
It’s true. Google does a ton of automated testing — it had an order of magnitude better test discipline than any other place I’ve worked. You test “everything” — not just getter/setter functions; you test all your assumptions on anything that might break.
For humans, designing the tests is often the challenging part. You don’t want to spend time writing tests that you know will always pass. Instead, you want to test the hard things, where something could actually go wrong.
In practice, that meant we were testing for resilience. Our desire was to test our component in isolation, mocking out everything else. That allowed us to test our components in a semi-real world, but more importantly, we could inject failures into the mocks.
This enabled us to continually test scenarios where components we depended upon went down. Failures that might occur only once in a million (or, more likely, a billion) operations, we were actually testing every day. (For example, when both replicas of a service go down, when something fails between the prepare and commit phases, or when an entire service goes down in the middle of the night.)
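As an illustrative sketch of this kind of failure injection (the component, its fallback behavior, and all names here are hypothetical, not Google’s actual frameworks), using Python’s standard mock library to make a rare dependency outage happen on every test run:

```python
from unittest import mock

class ProfileService:
    """Hypothetical component: reads from a storage backend, and serves
    the last known value from its cache when the backend is unavailable."""
    def __init__(self, store):
        self.store = store
        self._cache = {}

    def get_profile(self, user_id):
        try:
            profile = self.store.read(user_id)
            self._cache[user_id] = profile
            return profile
        except ConnectionError:
            # Dependency is down: degrade gracefully with stale data.
            return self._cache.get(user_id)

# Mock out the dependency, then inject the once-in-a-million failure.
store = mock.Mock()
store.read.return_value = {"name": "alice"}
svc = ProfileService(store)
svc.get_profile("u1")  # warm the cache while the backend is "up"

store.read.side_effect = ConnectionError("backend down")
assert svc.get_profile("u1") == {"name": "alice"}  # survives the outage
```

The point is that the outage path is exercised deterministically in an ordinary unit test, rather than waiting for it to happen in production.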
All of these practices enabled us to build resilience testing into our daily work, and enabled those tests to be run all the time.
It was huge.
Q: Where did the automated testing principles at Google come from?
You know, I have no idea how it evolved at Google, but it was all there when I arrived. It was astonishing to see. Every component in this massive distributed system that is Google was constantly being tested in all these sophisticated ways.
As the new person, I didn’t want to be the guy who wrote something crappy that wasn’t adequately tested. And as a director, I particularly didn’t want to set a bad example for my team.
Here’s a concrete example to show how good some of these groups were. The common Google infrastructure services that people have probably read about in all the famous papers (Google File System, BigTable, Megastore, etc.) are each run by their own team — usually a surprisingly small one.
They not only write the code, but they also operate it. And as those components matured, they not only provided the service to anyone who wanted to use it, but they also provided client libraries that made it easier to use the service. With the client libraries, they could mock out the backend service for client testing, as well as inject various failure scenarios. (E.g., you could get the BigTable production library, as well as a mock that actually behaved as it would in production. You want to inject a failure between the write and ack phases? It was in there!)
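A minimal sketch of what such a team-provided mock might look like (hypothetical names; the real BigTable client is far richer) — here the injectable failure sits between the write landing on the server and the ack returning to the caller:

```python
class MockTableClient:
    """Hypothetical mock a service team might ship alongside its real
    client library, with the same interface plus failure-injection knobs."""
    def __init__(self, fail_after_write=False):
        self.rows = {}
        self.fail_after_write = fail_after_write

    def write(self, key, value):
        self.rows[key] = value              # the write lands on the "server"...
        if self.fail_after_write:
            raise TimeoutError("ack lost")  # ...but the ack never arrives

def write_with_retry(client, key, value, attempts=3):
    """Hypothetical caller-side retry logic under test."""
    for _ in range(attempts):
        try:
            client.write(key, value)
            return True
        except TimeoutError:
            continue
    return False

# The caller's code must tolerate a lost ack: the write may have succeeded
# even though every attempt raised (a classic idempotency concern).
client = MockTableClient(fail_after_write=True)
ok = write_with_retry(client, "k", "v")
assert ok is False and client.rows["k"] == "v"
```

Because the mock ships with the service, every consumer can exercise these edge cases without needing to stand up (or break) the real backend.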
I suspect where these principles and practices came from was the "school of hard knocks." I'm guessing it is the emergent behavior from when you keep asking, "how could we have avoided this outage?”
Done with discipline, over time, you end up with an architecture that is extremely resilient.
Capability 2: Swarming and solving problems as they are seen to build new knowledge
Dr. Spear writes:
“High-velocity organizations are adept at detecting problems in their systems at the time and place of their occurrence. They are equally adept at (1) containing those problems before they have a chance to spread and (2) diagnosing and treating their causes so the problems cannot reoccur. In doing so, they build ever-deeper knowledge about how to manage the systems for doing their work, converting inevitable up-front ignorance into knowledge.”
GK: In my research, two of the most startling examples of swarming are:
- The Toyota Andon cord (where work is stopped when it deviates from known patterns). It has been documented that in a typical Toyota plant, the Andon cord is pulled 3,500 times per day.
- At Alcoa, the CEO, the Honorable Paul O’Neill, in order to drive down workplace accidents, instituted a policy that he must be notified of any workplace-related accident within 24 hours of the incident. Who needed to do the reporting? The business unit president.
Q: Was the culture at Google remotely similar to those that would support swarming behaviors, such as the Toyota Andon cord and the Alcoa CEO requirement of notification upon workplace accidents?
Absolutely. Both practices definitely resonate with me. At both eBay and Google, there was the cultural practice of the blame-free post-mortem.
(GK: Or as John Allspaw calls it, the blameless post-mortem.)
The blame-free post-mortem is such an important discipline — any time there is a customer-impacting outage, we held a post-mortem. As widely written by John Allspaw and others, the goal of the post-mortem is not to punish anyone, but instead, to create learning opportunities and communicate them broadly across the organization.
I’ve found that if the organizational culture makes it safe to do post-mortems, an amazing dynamic is created: engineers start competing with each other to publicize even bigger screwups. Like, “hey, we discovered that we never tested our backup recovery procedures” or “and then we found out that we didn’t have active/active replication working”. Or this is probably familiar to lots of engineers: “I wish we hadn’t had that outage, but now we finally have the opportunity to fix that broken system I’ve been complaining about for months!”
This creates massive organizational learning, and matches what Dr. Steven Spear describes: doing this enables us to constantly find things that go wrong, and then fix them, long before something catastrophic happens.
The reason I think this works is that we’re all engineers at heart, and we love building and improving systems. This environment of exposing problems creates a very exciting and satisfying work environment.
Q: What were the artifacts created by the post-mortems? It can’t just be a document that gets written, and then gets tossed into the landfill, right?
You may find it difficult to believe, but I think the most important part is merely holding the post-mortem meetings. We know that the most important part of DevOps is culture, and simply being able to hold the meeting, even if no outputs are generated, improves the system.
It becomes a kata — a part of our daily discipline that demonstrates our values and how we prioritize our work.
Of course, after the post-mortem meeting, you’ll almost always end up with a list of what went right and what went wrong. Whatever you call it, you have a list of action items that need to be put into some work queue (e.g., the backlog, a list of desired features or enhancements, improvements to documentation, etc.).
When you discover new improvements that need to be made, you eventually need to make changes somewhere. Sometimes it's the documentation, the processes, the code, the environment, or whatever.
But even without that, writing those post-mortem documents has tremendous value — you can imagine that at Google, everything is searchable. All the post-mortem documents are in places where other Googlers can see them.
And often those post-mortem documents are the first documents that are being read when a similar incident happens in the future.
Interestingly, post-mortem documents also serve another purpose. Google has had a long-standing tradition of requiring that all new services be self-managed by developers for at least six months. When service teams request to be “graduated” (i.e., get a dedicated team of SREs, or operations engineers), they’re essentially beginning a negotiation with the SREs, asking them to consider taking over the service delivery responsibilities for their application.
(Gene: Here’s a video of Tom Limoncelli describing the “hands off readiness review” process, where the SREs will review the documentation, deployment mechanisms, monitoring profiles, etc. It’s a fantastic talk.)
The SREs often would first examine the post-mortem documentation, which they put considerable weight upon when deciding whether an application was ready to “graduate.”
Q: Do you see analogues between what Paul O’Neill and team did at Alcoa and Google? Are there examples of how the threshold for notifications/escalations being constantly reduced?
(GK: Dr. Spear wrote about how Paul O’Neill and team at Alcoa famously decreased workplace-related injuries in their aluminum manufacturing plants (involving incredibly high heat, high pressure, and corrosive chemicals) from 2% of the workforce per year to 0.07%, making them the safest in the industry.
Amazingly, once workplace accidents dropped below a certain point, O’Neill required that he be notified whenever there was a near-miss.)
Absolutely. Of course, our equivalent for a workplace accident would be a customer-impacting outage. And trust me, people were notified when we had a big outage that affected our customers.
When incidents occurred, two things would happen. First, you’d mobilize whoever was required to restore service, and they’d work until it was restored (of course, that’s very standard).
Second, we also held a weekly standing incident meeting for management (on my App Engine team, that was the engineering directors, of which I was one of two, our boss, our team leads, as well as the support team and the product owner). We would review what we learned from the post-mortems, review any follow-up actions required, make sure we resolved the issue properly, and, if necessary, decide whether we needed to do a customer-facing post-mortem or blog post.
Some weeks, we’d have nothing to talk about. But on teams that had this under control, there was always a desire to treat lesser incidents with the same level of scrutiny and desire for improvement.
For instances where we had a problem that wasn't "customer impacting," we’d call the incidents “team impacting."
Most of us have experienced these… the “near misses,” where there were six safeguards in place, all designed to prevent the customer from being affected by a failure, and all but one failed.
On my team (Google App Engine), in a given year, we might have one customer outage that hit the public radar screen. But you can safely assume that for every one of those, we had several near misses.
This is why we conducted our Disaster Recovery exercises, which Kripa Krishnan talked about here.
While Google did a good job, and we learned a lot (there’s a reason we have three production replicas), the people who did this better than anyone were at Amazon. They were doing things five years ahead of everyone else. (Jason McHugh, who was an architect for Amazon S3 and is now at Facebook, gave this QCon 2009 talk about failure management at Amazon.)
Q: At Alcoa, workplace accidents had to be reported to the CEO within 24 hours. What were the escalation timelines like at Google?
At Google App Engine, we had a small enough team (around 100 engineers globally) that we only had two layers: the engineers who were actually doing the work, and management.
We’d wake people up in the middle of the night for customer-affecting incidents. For every incident like that, about one out of ten would escalate to the directors.
Q: How would you describe how swarming takes place?
Like in Toyota plants, not every problem warrants everyone dropping what they’re doing when something goes wrong. But it is true that culturally, we prioritized reliability and quality as Priority 0 issues.
That shows up in many ways, some not entirely obvious. It’s a bit more subtle than a work-stoppage.
When you check in code that breaks a test, you don’t work on anything else until you’ve fixed it. And you’d never go work on something else that would cause even more tests to break.
Similarly, if someone broke a test and they needed help, you’d be expected to drop whatever you’re doing to help. Why? Because that’s how we prioritize our work — it’s like the “Golden Rule.” We want to help everyone be able to move their work forward, which helps everyone.
And of course, they’ll do the same for you, when you need help.
From a systems perspective, I view it like a ratchet or the middle rack rail on a mountain train — these practices keep us from slipping backwards.
It wasn’t formalized in a process, but everyone knew that if there was a significant deviation from normal operations, like a customer-affecting incident, we’d sound the alarm, send out something on the mailing list, etc.
Basically, the message would be, “Hey, everyone, I need your help,” and we’d all go help.
I think the reason it worked, even without a lot of formality or policy was that everyone knew that our job was not just “write code,” but it was to “run a service.”
It got to the point where, even for problems with global dependencies (e.g., load balancers, misconfigured global infrastructure), the fix would be made within seconds, and the incident would be resolved in 5–10 minutes.
Capability 3: Spreading new knowledge throughout the organization
Dr. Spear writes:
“High-velocity organizations multiply the power of their new knowledge by making it available, not only to those who discovered it, but also throughout the organization. They do this by sharing not only the solutions that are discovered, but the processes by which they were discovered—what was learned and how it was learned. While their competitors allow problems to persist and propagate into the larger system because the solutions, if they are found at all, remain contained where they were found, the high-velocity leaders contain their problems and propagate their discoveries. This means that when people begin to do their work, they do so with the cumulative experience of everyone in the organization who has ever done the same work. We’ll see several examples of that multiplier effect.”
Q: When something goes wrong, how is knowledge spread? How do local discoveries turn into global improvement?
Part of it, although by no means the largest, is the documentation coming from the post-mortems. One indication is that Googlers are just as much into “disaster pron” as anyone else. When there was a high-profile outage at Google, you can be sure that almost everyone at the company would be reading the post-mortem report.
Probably the most powerful mechanism for preventing future failures is the single code repository for all Google properties.
(GK: mentioned earlier in this blog post. Billions of files!)
But more than that, because the entire code base is searchable, it’s very easy for knowledge to be reused. Regardless of how formal and consistent the documentation is, even better is seeing what people are doing in practice — “just go look in the code.”
However, there is a downside. Often, the first person who uses a service will likely use some random configuration setting, which then propagates wildly. Suddenly, for no good reason, arbitrary settings like “37” are found everywhere.
(GK: this is hilarious…)
Whenever you make knowledge easy to propagate and discoverable, it will spread, and hopefully converge to some optimum setting.
Q: Besides single source code repositories and blameless post-mortems, what other mechanisms were there that converted local learning into global improvements? How else was knowledge spread?
One of the best things about the Google source code repo was that you could search everything. The best answer to almost every question was “look at the code.”
We also had great documentation that you could access just by searching for it.
We also had fantastic internal groups. Like any external service, say “foo”, you’d have an internal mailing list called “foo-users.” You just ask a question on the list. Having access to the developers was great, but even better was that in most cases, fellow users would often come back with the answers. This is just like successful open source projects in the rest of the industry, by the way.
Capability 4: Leading by developing
Dr. Spear writes:
“Managers in high-velocity organizations make sure that a regular part of work is both the delivery of products and services and also the continual improvement of the processes by which those products and services are delivered. They teach people how to make continual improvement part of their jobs and provide them with enough time and resources to do so. Thus, the organization’s ability to be both reliable and highly adaptive becomes self-reinforcing. This is a fundamental difference from their also-ran competitors. High-velocity managers are not in place to command, control, berate, intimidate, or evaluate through a contrived set of metrics, but to ensure that their organizations become ever more self-diagnosing and self-improving, skilled at detecting problems, solving them, and multiplying the effect by making the solutions available throughout the organization.”
GK: I also love this quote from David Marquet (author of “Turn This Ship Around”), who said, “The mark of a true leader is the number of leaders left in his/her wake.” This is from the person who captained a submarine crew that created more leaders than any other submarine on record.
The gist of his work is that some leaders fix problems, but then when they leave, the problems come right back, because they didn’t leave behind a system of work that could keep improving without them.
Q: How was leadership developed at Google?
Google had practices that you’d find in almost every healthy organization. We had two career tracks: an engineer track and a management track. Anyone with “manager” in their title had, as their primary role, to enable and encourage others to lead.
I viewed my role as creating small teams, where everyone had an important role. Each team was a symphony, as opposed to a factory — everybody was capable of being a soloist, but more importantly, they all were capable of working with each other. (We’ve all been in those awful situations where our teams would scream at each other, or not listen to each other.)
At Google, I think the most powerful influence on leaders was the cultural expectation that we do serious engineering. One of the big cultural norms was, “everyone writes awesome tests; we don’t want to be the group that had crappy tests.” In the same way, there is a culture of “we only hire A players” — that was emotionally important to me.
At Google, some of this was codified in their evaluation and promotion process — that sounds lousy, because it implies we only do a good job because that’s what we need to get promoted. But, on the other hand, the evaluation process was extremely good and almost universally accepted as objective — people got promoted because they deserved it and were really good at what they did. I never heard of anyone getting promoted because they “sucked up to the right person.”
For manager and director positions, the primary criterion was leadership — in other words, did that person make an outsized impact, ideally beyond their own team, beyond someone “just doing their job.”
The Google App Engine service was created seven years ago by a bunch of amazing engineers in the cluster management group — they thought, “Hey, we have all these techniques on how to create scalable systems. Can we encode this in a way that other people can use?”
The title “founding engineer for App Engine” is given as much respect internally as, say, the founder of Facebook.
Q: How are first-time managers onboarded? If leaders must create other leaders, in what ways are the risks of first-time or front-line managers learning on the job mitigated?
At Google, you’re only given a job by “retrospective recognition that you’re already doing that job” — this is the opposite of most organizations, where they do “prospective hoping that you can do the job.”
In other words, if you want to be a principal engineer, then do the job of a principal engineer.
And at Google, like many large organizations, there are lots of training resources.
But in most cases, the cultural norms of how things are done are so strong that they are probably the primary force ensuring those norms continue. It’s that, plus a self-selection process that reinforces the cultural norms and technical practices.
And of course, it also comes from the top. At Google, two quirky engineers founded the company, and the culture is continually reinforced at the top.
If you’re at a command and control company, where leaders hate people, well that’ll get reinforced, too.
Holy cow. Again, it’s difficult to overstate how much I learned from Randy Shoup. If you’re interested in learning more of this and putting this in place in your organization, Randy is currently doing consulting. You can get his contact information on his LinkedIn profile.