June 28, 2021
When I was at Netflix, I was on a team with three other amazing software engineers. We’d spent years building a platform to safely inject failure in production to help engineers understand and ask more questions about areas of their systems that behaved unexpectedly when presented with the turbulent conditions we see in everyday engineering, like injected latency. It was amazing. And we were happy to be working on such an interesting problem that could ultimately help the business understand its weak spots.
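To make that concrete, here is a minimal sketch of what latency injection like this can look like. It is an illustration only, not Netflix’s actual platform; the service name, configuration values, and sampling mechanism are all hypothetical.

```python
import random
import time
from functools import wraps

# Hypothetical experiment configuration: which downstream call to target,
# how much latency to inject, and what fraction of traffic to affect.
EXPERIMENT = {
    "target": "bookmarks-service",
    "latency_seconds": 2.0,
    "sample_rate": 0.05,   # only a small slice of production traffic
    "enabled": True,       # kill switch so the experiment can be stopped safely
}

def inject_latency(service_name):
    """Wrap a downstream call and add artificial latency to a sampled
    fraction of requests, simulating turbulent conditions."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if (
                EXPERIMENT["enabled"]
                and EXPERIMENT["target"] == service_name
                and random.random() < EXPERIMENT["sample_rate"]
            ):
                time.sleep(EXPERIMENT["latency_seconds"])
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_latency("bookmarks-service")
def fetch_bookmarks(user_id):
    # The real call to the downstream service would go here.
    return {"user_id": user_id, "bookmarks": []}
```

The point of an experiment like this is to see whether callers tolerate the added latency gracefully (timeouts, fallbacks) or behave in ways nobody expected.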
But there was actually a problem with the way we implemented the tooling and the way it was being used. When I took a good look at it, I realized that most of the time the four of us were the ones using the tooling. We were using it to create chaos experiments, to run chaos experiments, and to analyze the results. Which raised the question: what were the teams doing?
Well, they were receiving our results, and sometimes they fixed the issues and sometimes they didn’t. I’m sure all of us have been part of an incident where the action items never get completed. It was a similar situation, and it wasn’t the teams’ fault.
So why is this a problem? Well, it was a problem because we were the ones doing most of the experimentation and generating the results, but we weren’t on the teams we were running the chaos experiments for. We weren’t on the search team, we weren’t on the bookmarks team, yet we were running experiments for them.
We weren’t the ones whose mental models needed refining, but we were the ones getting that refinement and understanding. Which didn’t provide much benefit to the organization. We were leading the horse to water, but we were also pretending to be the horse; we were drinking the water ourselves. Sometimes teams would use what we were creating, but that would last only a couple of weeks, and then we’d have to remind them to use it again.
So we approached this problem like any good software engineer would: we started trying to automate away the steps people weren’t taking, to give them easier access to the harder parts of the tooling. But that part isn’t what this talk is about. It’s about one of the other things we did.
We wanted to give them more context on whether a particular vulnerability we found with the chaos tooling was or wasn’t important to fix. To figure that out, I started looking at previous incidents. I dug through them to find patterns: systems that were underwater, incidents that involved a ton of people, incidents that cost a lot of money, so we could help prioritize the results we were finding with these chaos experiments.
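As a rough illustration of that kind of pattern mining, here is a small sketch that scores services by how often they appear in large or expensive incidents. The incident fields, example records, and weighting are hypothetical assumptions for illustration, not the actual analysis described in the talk.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Incident:
    services: list[str]      # systems implicated in the incident
    responders: int          # how many people were pulled in
    cost_estimate: float     # rough dollar impact, if known

# Illustrative records; in practice these would come from whatever
# incident tracker the organization uses.
incidents = [
    Incident(["search"], responders=12, cost_estimate=50_000),
    Incident(["bookmarks", "search"], responders=4, cost_estimate=5_000),
    Incident(["search"], responders=20, cost_estimate=120_000),
]

def service_risk_scores(incidents):
    """Aggregate a crude risk score per service: services that show up in
    big, expensive incidents float to the top, which is one way to decide
    which chaos-experiment findings deserve attention first."""
    scores = defaultdict(float)
    for incident in incidents:
        weight = incident.responders + incident.cost_estimate / 10_000
        for service in incident.services:
            scores[service] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for service, score in service_risk_scores(incidents):
    print(f"{service}: {score:.1f}")
```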
I wanted to use this information to feed back into the chaos tooling and improve how it was used. But I found something much greater. Incident analysis had far more power in the organization than just helping us create chaos experiments and prioritize the results better. And spending time on it opened my eyes to so much more: things that could help the business far beyond the technical.
And so here’s the secret I found. Incident analysis is not actually about the incident. It’s an opportunity to see the delta between how we think our organization works and how it actually works. And most of the time, we’re not good at exposing that delta.
It’s a catalyst to understanding how your org is structured in theory versus how it’s structured in practice.
It’s a catalyst to understanding where you actually need to improve the socio of your socio-technical system: how you’re organizing teams, how people in different time zones are working together, how many people you need on each team, and how folks are dealing with their OKRs given all the technical debt they’re working through as well.
Incident as a catalyst is showing you what your organization is good at and what actually needs improvement.
Next: Story #2: The Blame Game…
Nora Jones has been on the front lines as a software engineer, as a manager, and now as the founder of her own company, Jeli. In 2017, she keynoted AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering (purposefully injecting failure in production) and her experiences implementing it at Jet.com (now Walmart) and Netflix. She started Jeli based on a need she saw: the importance and value a good post-incident review adds to the whole business, as well as the barrier to entry in getting folks to work on it. She also started an online community called Learning From Incidents in Software, where over 300 people in the software industry share their experiences with incidents and incident reviews.
This post is based on her 2021 presentation at DevOps Enterprise Summit - Virtual Europe, which you can watch for free in the IT Revolution Video Library.