Story #3: The Promotion Packet Paradox
Now my last story is one that we in the software industry might all be a little familiar with. I was in an organization where promotion packets were due. Promotion packets in this organization consisted of an engineering manager putting together a little packet for someone on their team that they thought deserved to be promoted.
As this organization grew larger and larger, it became harder to read all the packets, so they became very numbers-driven. Did this person complete the things that they said they were going to complete at the beginning of the quarter?
And so people were losing promotions when they hadn't completed the things they'd committed to at the beginning of the quarter. But I know we've all been at organizations where we've committed to something at the beginning of the quarter, only to get midway through and realize that it's not the most important thing anymore. Yet this is what we were judging people on.
So what do you all think happened? Well, people would commit to things at the beginning of the quarter, realize they weren’t relevant anymore but knew that that’s what they were getting judged on for their promotions. So they’d rush to complete those things just before promotion packets were due.
Now we saw certain upticks in incidents in this organization during the year. And as I was analyzing the incidents for this organization, I was analyzing individual incidents, but I was also analyzing historic themes and whether we could correlate them with certain events: traffic spikes, big uses of the application, etc.
And I saw spikes in incidents around the time promotion packets were due, just a few weeks after, because we would see an uptick in things getting merged to production, maybe things that weren’t ready. I would sit in some of these incident reviews, and engineers would say, “Yeah, I wasn’t going to get promoted unless I pushed this in.”
And so this engineering organization thought they were incentivizing the right things, but they were actually creating poor incentive structures. This was the organization they were creating.
But without actually looking into incidents, and without actually doing incident analysis, they weren't able to figure out that this is what was happening. And this is why this kind of work is important: it can help you structure your organization better.
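As a rough sketch of what that kind of correlation might look like in practice, here is a minimal example. The function names and data are hypothetical, not from the talk: it buckets incident dates into ISO weeks and flags whether the weeks right after some calendar event (like promotion packets being due) are noticeably busier than the baseline.

```python
from datetime import date, timedelta
from collections import Counter

def weekly_incident_counts(incident_dates):
    """Bucket incident dates into counts per ISO (year, week)."""
    return Counter(d.isocalendar()[:2] for d in incident_dates)

def spike_near_event(incident_dates, event_date, window_weeks=3, ratio=1.5):
    """Return True if the weeks right after event_date (e.g. promotion
    packets due) see noticeably more incidents than the overall baseline.
    The 1.5x ratio is an arbitrary illustrative threshold."""
    counts = weekly_incident_counts(incident_dates)
    baseline = sum(counts.values()) / max(len(counts), 1)
    after = [
        counts.get((event_date + timedelta(weeks=w)).isocalendar()[:2], 0)
        for w in range(1, window_weeks + 1)
    ]
    avg_after = sum(after) / window_weeks
    return avg_after > ratio * baseline
```

In a real analysis you would pull incident dates from your incident tracker and compare against several candidate events, not just one, but the shape of the question is the same.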
What Is a Good Incident Analysis?
A good incident analysis should tell you where to look. And I mentioned this before, we’re not trained as software engineers to analyze incidents, we’re trained in different pieces of software and distributed systems. We can figure out technically what happened, but we’re not really trained to figure out socially what happened.
It can be awkward, sometimes, figuring out what questions to ask and which people to talk to, but as leaders, we can make it less awkward and we can help make it psychologically safer.
Where to Look
Now I mentioned a good incident analysis should tell you where to look. Well, you can see a heat map of all the chatter on the team: when the Slack conversations were going off, when PagerDuty alerts were storming, or when certain pull requests were going through.
You might be interested in the absence of chatter on early Saturday morning, where it looks like management was the only one online. Maybe that's a sign of good management taking one for the team there.
You might be interested in the fact that customer service seemed to be the only one online late Friday night. I wonder if they were getting supported.
You might be interested in the tenure of folks on the team and in their participation level. Are we relying solely on folks that have been here for a while? What about folks that are fully vested? Are we relying on them a little bit too much? What happens when they leave?
You might be interested if we relied on folks that weren't actually on call. That can tell us whether we need to unlock tribal knowledge, or whether we have knowledge islands in the organization.
You might be interested if people were on call for the first time ever and how we’re supporting them.
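A chatter heat map like the one described above can be sketched very simply. This is an illustrative example with hypothetical names, assuming you can export message timestamps from your chat tool: it buckets messages into (weekday, hour) cells, and the quiet cells can be as interesting as the busy ones.

```python
from collections import Counter
from datetime import datetime

def chatter_heatmap(message_timestamps):
    """Bucket message timestamps into (weekday, hour) counts, a simple
    'heat map' of when incident chatter happened. Empty cells (e.g. only
    management online early Saturday) can be as telling as busy ones."""
    return Counter((ts.strftime("%a"), ts.hour) for ts in message_timestamps)
```

The same bucketing works for PagerDuty alert timestamps or pull-request merge times, so the different signals can be laid side by side.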
Head Count & Coordination Efforts
A good incident analysis should tell you where to look, but it can also help you with head count. If you’re always relying on people from a certain team, or people that weren’t on call, that can help you understand if you actually need to spin up a team there.
Or whether you need to spin up training there. It can help you with planning promotion cycles, as we talked about earlier, with quarterly planning, and with unlocking that tribal knowledge and figuring out what people know.
I was in an organization once where every time a certain guy came into the incident channel, everyone would react with the Batman emoji in Slack. And he was amazing, but it was actually a poor thing in this organization, because we relied on him a little bit too much. Those engineers are expensive, and they usually leave organizations quickly because they burn out and they take all that knowledge with them.
Incident analysis can help you see how you’re actually supporting that. You can see how much coordination efforts are costing you during incidents.
As an industry, we pay a lot of attention to the customer costs of incidents and the repercussions of the incidents. We don't pay a lot of attention to our coordination costs: working with a team we've never worked with before, with people we've never worked with before, in the midst of an incident. And it can help you understand your bottlenecks, not just in your technical system, but in your people system.
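One rough proxy for that coordination cost can be sketched in a few lines. This is a hypothetical example, not a method from the talk: given each incident's set of responders, it counts how many pairs of people are working an incident together for the first time, since first-time pairings tend to be the most expensive to coordinate.

```python
from itertools import combinations

def coordination_costs(incidents):
    """For a list of incidents (each a set of responder names), return,
    per incident, how many pairs of responders had never worked an
    incident together before, a rough proxy for coordination cost."""
    seen_pairs = set()
    first_time_counts = []
    for responders in incidents:
        pairs = {frozenset(p) for p in combinations(sorted(responders), 2)}
        first_time_counts.append(len(pairs - seen_pairs))
        seen_pairs |= pairs
    return first_time_counts
```

A run of high numbers late in the list would suggest incidents keep pulling in people who have never responded together, exactly the kind of bottleneck in the people system described above.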
Nora Jones has been on the front lines as a software engineer and as a manager, and now runs her own organization, Jeli. In 2017, she keynoted at AWS re:Invent to an audience of around 50,000 people about the benefits of chaos engineering, purposefully injecting failure in production, and her experiences implementing it at Jet.com (now part of Walmart) and Netflix. Most recently, she started her own company, Jeli, based on a need she saw: the importance and value to the whole business of a good post-incident review, as well as the barrier to entry she saw in getting folks to work on it. She also started an online community called Learning From Incidents in Software, full of over 300 people in the software industry sharing their experiences with incidents and incident reviews.