June 28, 2021
Now my last story is one that we might all be familiar with a little bit as a software industry. I was in an organization where promotion packets were due. Now promotion packets in this organization consisted of an engineering manager putting together a little packet for someone on their team that they thought deserved to be promoted.
As this organization was growing larger and larger, it became harder to read all the packets, so they became very number driven. Did this person complete the things that they said they were going to complete at the beginning of the quarter?
And so people were losing promotions when they hadn’t completed the things they had committed to at the beginning of the quarter. But I know we’ve all been at organizations where we’ve committed to something at the beginning of the quarter, then gotten midway through the quarter and realized that it’s not the most important thing anymore. Yet this is what we were judging people on.
So what do you all think happened? Well, people would commit to things at the beginning of the quarter, realize they weren’t relevant anymore but knew that that’s what they were getting judged on for their promotions. So they’d rush to complete those things just before promotion packets were due.
Now we saw certain upticks in incidents in this organization during the year. As I was analyzing the incidents for this organization, I was analyzing individual incidents, but I was also analyzing historic themes and whether we could correlate them with certain events: traffic spikes, heavy usage of the application, etc.
And I saw spikes in incidents around the time promotion packets were due, just a few weeks after, because we would see an uptick in things getting merged to production, maybe things that weren’t ready. I would sit in some of these incident reviews, and engineers would say, “Yeah, I wasn’t going to get promoted unless I pushed this in.”
And so this engineering organization thought they were incentivizing the right things, but they were actually ending up creating poor incentive structures. This was the organization they were creating.
But without actually looking into incidents, and without actually looking into incident analysis, they weren’t able to figure out that this is what was happening. And this is why this kind of stuff is important: it can help you structure your organization better.
A good incident analysis should tell you where to look. And I mentioned this before, we’re not trained as software engineers to analyze incidents, we’re trained in different pieces of software and distributed systems. We can figure out technically what happened, but we’re not really trained to figure out socially what happened.
It can be awkward, sometimes, figuring out what questions to ask, figuring out what people to talk to, but as leaders, we can help not make it awkward and we can help make it psychologically safer.
Now I mentioned a good incident analysis should tell you where to look. Well, you can see a heat map of all the chatter on the team. You can see a heat map of when the Slack conversations were going off, or when PagerDuty alerts were storming, or when certain pull requests were going through.
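To make the heat map idea concrete, here is a minimal sketch of how you might bucket activity by day and hour. It assumes you have already exported message or alert timestamps as ISO 8601 strings; the function name `activity_heatmap` and the sample timestamps are hypothetical, not something from a particular tool.

```python
from collections import Counter
from datetime import datetime


def activity_heatmap(timestamps):
    """Count events per (weekday, hour) bucket.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    counts = Counter()
    for ts in timestamps:
        moment = datetime.fromisoformat(ts)
        counts[(moment.weekday(), moment.hour)] += 1
    return counts


# Hypothetical timestamps exported from a chat or paging tool.
events = [
    "2021-06-25T23:10:00",  # Friday night
    "2021-06-25T23:40:00",  # Friday night
    "2021-06-26T06:05:00",  # early Saturday morning
]

heatmap = activity_heatmap(events)
for (weekday, hour), count in sorted(heatmap.items()):
    print(weekday, hour, count)
```

Running the same count separately per team or role is one way to spot the gaps, such as a window where only one group appears to be online.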
You might be interested in the absence of chatter on early Saturday morning, where it looks like management was the only one online. Maybe that’s a sign of actually good management taking one for their team there.
You might be interested in the fact that customer service seemed to be the only one online late Friday night. I wonder if they were getting supported.
You might be interested in the tenure of folks on the team and in their participation level. Are we relying solely on folks that have been here for a while? What about folks that are fully vested? Are we relying on them a little bit too much? What happens when they leave?
You might be interested if we relied on folks that weren’t actually on call. That can tell us if we need to unlock tribal knowledge, and if we have knowledge islands in the organization.
You might be interested if people were on call for the first time ever and how we’re supporting them.
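Questions like "did we rely on folks that weren’t on call?" can be answered with a simple tally once you have participation data. The sketch below assumes each incident record lists who participated and who was on call; the record shape, the names, and the helper functions are all hypothetical, made up for illustration.

```python
from collections import Counter


def off_call_responders(participants, on_call):
    """Return the people who joined an incident without being on call."""
    return sorted(set(participants) - set(on_call))


def off_call_reliance(incidents):
    """Count how often each person responded while not on call."""
    counts = Counter()
    for incident in incidents:
        for person in off_call_responders(
            incident["participants"], incident["on_call"]
        ):
            counts[person] += 1
    return counts


# Hypothetical incident records; real data would come from your own tooling.
incidents = [
    {"participants": ["ana", "raj", "kim"], "on_call": ["ana"]},
    {"participants": ["raj", "lee"], "on_call": ["lee"]},
    {"participants": ["raj", "ana"], "on_call": ["raj"]},
]

reliance = off_call_reliance(incidents)
for person, count in reliance.most_common():
    print(person, count)
```

A name that keeps showing up at the top of this tally is a hint about where knowledge is concentrated. It is a starting point for a conversation, not a verdict on anyone.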
A good incident analysis should tell you where to look, but it can also help you with head count. If you’re always relying on people from a certain team, or people that weren’t on call, that can help you understand if you actually need to spin up a team there, or if you need to spin up training there.
It can help you with planning promotion cycles, as we talked about earlier. It can help with quarterly planning, unlocking that tribal knowledge, and figuring out what people know.
I was in an organization once where every time a certain guy came into the incident channel, everyone would react with the Batman emoji in Slack. And he was amazing, but it was actually a poor thing in this organization, because we relied on him a little bit too much. Those engineers are expensive, and they usually leave organizations quickly because they burn out and they take all that knowledge with them.
Incident analysis can help you see how you’re actually supporting that. You can see how much coordination efforts are costing you during incidents.
As an industry, we pay a lot of attention to the customer costs of incidents and the repercussions of the incidents. We don’t pay a lot of attention to our coordination costs, for example when we’re working with a team we’ve never worked with before, with people we’ve never worked with before, in the midst of an incident. Incident analysis can help you understand your bottlenecks, not just in your technical system, but in your people system.
Next up: How to Make Incident Reviews Better…
Nora Jones has been on the front lines as a software engineer and as a manager, and now runs her own organization, Jeli. In 2017, she keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering (purposefully injecting failure in production) and her experiences implementing it at Jet.com, which is now part of Walmart, and at Netflix. Most recently, she started her own company, Jeli, based on the value she saw a good post-incident review adding to the whole business, as well as the barrier to entry she saw in getting folks to work on that. She also started an online community called Learning From Incidents in Software. This community is full of over 300 people in the software industry sharing their experiences with incidents and incident reviews.
This post is based on her 2021 presentation at DevOps Enterprise Summit-Virtual Europe, which you can watch for free in the IT Revolution Video Library.