Netflix: Story #1
When I was at Netflix, I was on a team with three other amazing software engineers. We’d spent years building a platform to safely inject failure in production, to help engineers understand and ask more questions about areas of their systems that behaved unexpectedly under the turbulent conditions we see in everyday engineering, such as failures or added latency. It was amazing. And we were happy to be working on such an interesting problem that could ultimately help the business understand its weak spots.
But there was actually a problem with the way we implemented the tooling and the way it was being used. When I took a good look at it, I realized that most of the time the four of us were the ones using the tooling. We were using it to create chaos experiments, to run chaos experiments, and to analyze the results. Which meant, what were the teams doing?
Well, they were receiving our results, and sometimes they were fixing the issues we found and sometimes they weren’t. I’m sure all of us have been part of an incident where the action items don’t get completed. It was kind of a similar situation, and it wasn’t the teams’ fault.
So why is this a problem? Well, it was a problem because we were the ones doing most of the experimentation and generating the results. But we weren’t the ones on the teams for the chaos experiments we were running. We weren’t on the search team, we weren’t on the bookmarks team, but we were running experiments for them.
We weren’t the ones whose mental models needed refining or understanding, but we were the ones getting that refinement and understanding. Which actually didn’t provide much benefit to the organization. We were leading this horse to water, but we were also pretending to be the horse. We were also drinking the water. Sometimes teams would use what we were creating, but that would actually last only for a couple of weeks. And then we’d have to remind them to use it again.
Learning: Incident Analysis Isn’t About the Incident
So we approached this problem like any good software engineer would approach it: we started trying to automate away the steps that people weren’t taking, in order to give them easier access to the harder parts of the tooling. But that part isn’t what this talk is about. It’s about one of the other things we did.
We wanted to give them more context on how important it was to fix a particular vulnerability that we had found with the chaos tooling. So to know whether something was or wasn’t important to fix, I started looking at previous incidents. I started digging through some of them to try to find patterns: systems that were underwater, incidents that involved a ton of people, or incidents that cost a lot of money, so we could help prioritize the results we were finding with these chaos experiments.
I wanted to use this information to feed back into the chaos tooling to help improve the usage of the tooling. But I found something that was much greater. Incident analysis had a much greater power in the organization than just helping them create chaos experiments and prioritize the results better. And spending time on it opened my eyes up to so much more, things that could help the business far beyond the technical.
And so here’s the secret I found. Incident analysis is not actually about the incident, it’s this opportunity we have to see the delta between how we think our organization works and how it actually works. Yeah, most of the time we’re not good at exposing that delta.
It’s a catalyst to understanding how your org is structured in theory versus how it’s structured in practice.
It’s a catalyst to understanding where you actually need to improve the socio of your socio-technical system: how you’re organizing teams, how people in different time zones are working together, how many people you need on each team, and how folks are dealing with their OKRs given all the technical debt that they’re working through as well.
Incident as a catalyst is showing you what your organization is good at and what actually needs improvement.
Nora Jones has been on the front lines as a software engineer and as a manager, and now runs her own organization, Jeli. In 2017, she keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering (purposefully injecting failure in production) and her experiences implementing it at Jet.com, which is now part of Walmart, and at Netflix. Most recently, she started her own company, Jeli, based on the value she saw a good post-incident review adding to the whole business, as well as the barrier to entry she saw in getting folks to work on it. She also started an online community called Learning From Incidents in Software, which has over 300 people in the software industry sharing their experiences with incidents and incident reviews.