
February 26, 2014

Kanbans and DevOps: Resource Guide for "The Phoenix Project" (Part 2)

By Gene Kim

This blog article continues the description of the “body of knowledge” that underpins “The Phoenix Project,” which started in Part 1.


In short, kanban boards are awesome. They appear prominently in “The Phoenix Project” as the mechanism to visualize work, limit WIP and ensure fast flow, especially where work crosses organizational boundaries (e.g., between Dev and Ops) or bottlenecks (e.g., Brent, Brent, Brent).

In this post, I will describe the problem statement, which is framed in my favorite (and only) graph in “The Phoenix Project.” I’ll then talk about two books that I’d recommend to anyone on how kanban boards can manage the flow of work — the first book is Personal Kanban: Mapping Work | Navigating Life by Jim Benson and Tonianne DeMaria Barry, and the second book is Kanban: Successful Evolutionary Change for Your Technology Business by David J. Anderson.

(In particular, the case studies that David J. Anderson presents, which I summarize in this post, are amazing. What’s interesting is that the dramatic improvements in lead time came not from automation, but from controlling work in process and modifying the policies of how work is performed. It’s great stuff that every DevOps practitioner should be familiar with.)

I’ll also describe Dominica DeGrandis’ Kanban for DevOps workshop, which I’d recommend to anyone who wants to implement kanbans in their organization, and briefly describe how I use LeanKit kanban boards, my favorite tool, for my own daily work.

In short, I’ve been using kanban boards since 2010, and holy cow, I wish I had started using them 10 years earlier than that! I will never go back to life without one — it’s just too dangerous!

Why Do We Need To Visualize IT Work And Control WIP?

My favorite (and only) graph in “The Phoenix Project” shows wait time as a function of how busy a resource at a work center is. Erik used this to show why Brent’s “simple 30 minute changes” were taking weeks to get completed. The reason, of course, is that as the bottleneck of all work, Brent is constantly at or above 100% utilization, and therefore, anytime we required work from him, the work just languished in queue, never worked on without expediting/escalating.

Here’s what the graph shows: on the x-axis is the % busy for a given resource at a work center, and on the y-axis is the approximate wait time (or maybe more precisely stated, the queue length). What the shape of the line shows is that, as resource utilization goes past 80%, wait time goes through the roof.

In “The Phoenix Project,” here’s how Bill and team realized the devastating consequences of this property on lead times for the commitments they were making to the project management office.

I tell them about what Erik told me at MRP-8, about how wait times depend upon resource utilization. “The wait time is the ‘percentage of time busy’ divided by the ‘percentage of time idle.’ In other words, if a resource is fifty percent busy, then it’s fifty percent idle. The wait time is fifty percent divided by fifty percent, so one unit of time. Let’s call it one hour.

So, on average, our task would wait in the queue for one hour before it gets worked.

“On the other hand, if a resource is ninety percent busy, the wait time is “ninety percent divided by ten percent”, or nine hours. In other words, our task would wait in queue nine times longer than if the resource were fifty percent idle.”

I conclude, “So… For the Phoenix task, assuming we have seven handoffs, and that each of those resources is busy ninety percent of the time, the tasks would spend in queue a total of nine hours times the seven steps…”

“What? Sixty-three hours, just in queue time?” Wes says, incredulously. “That’s impossible!”

Patty says with a smirk, “Oh, of course. Because it’s only thirty seconds of typing, right?”

Bill and team realize that their “simple 30 minute task” actually required seven handoffs (e.g., server team, networking team, database team, virtualization team, and of course, Brent, Brent, Brent). Assuming that all work centers were 90% busy, the graph shows us that the average wait time at each work center is nine hours. Because the work had to go through seven work centers, the total wait time is seven times that: 63 hours.
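The arithmetic in this passage can be sketched in a few lines of Python. This is only an illustration of the rule as Erik states it (busy divided by idle); the function names `wait_units` and `total_queue_time` are made up for this example, and the one-hour unit is the same simplifying assumption Bill makes in the quote.

```python
def wait_units(busy_pct: float) -> float:
    """Queue wait in arbitrary time units: % of time busy divided by % of time idle."""
    if not 0 <= busy_pct < 100:
        raise ValueError("busy_pct must be at least 0 and below 100")
    return busy_pct / (100 - busy_pct)


def total_queue_time(busy_pct: float, handoffs: int, unit_hours: float = 1.0) -> float:
    """Total queue time when every handoff waits at an equally busy work center."""
    return handoffs * wait_units(busy_pct) * unit_hours


if __name__ == "__main__":
    # The shape of the curve: wait grows slowly at first, then explodes.
    for pct in (50, 80, 90, 95, 99):
        print(f"{pct}% busy -> {wait_units(pct):.0f} unit(s) of wait")

    # The Phoenix task: seven handoffs, each work center 90% busy.
    print(total_queue_time(90, handoffs=7), "hours in queue")
```

Note how the last few percent of utilization are the expensive ones: under this rule, going from 90% to 95% busy roughly doubles the wait.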

In other words, the total “% of value added time” (sometimes known as “touch time”) was only about 0.8% of the total lead time (30 minutes divided by 63 hours). That means for more than 99% of our total lead time, the work was simply sitting in queue, waiting to be worked on (e.g., in a ticketing system, in an email, etc.).

My co-author, George Spafford, and I were first introduced to this graph, which so brilliantly shows the destructive nature of long queue times caused by high resource utilization, when we both took the EM526 Constraints Management course at Washington State University from Dr. James Holt (described in more detail in Part 1).

Unfortunately, I don’t know the precise derivation of this graph. Some believe, like I do, that this graph is a simplified case of Little’s Law, where we assume a single work center, a uniform work queue (i.e., all tasks require the same time to complete), no delay between jobs, etc.

In the graph, I believe “wait time” is actually a proxy for “queue length.” In other words, because it’s not time elapsed, it has no time units (i.e., it’s not minutes, hours, or days).
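One possible reading, offered as an assumption and not as the derivation behind the graph: if the work center is modeled as a textbook M/M/1 queue (one worker, tasks arriving at random at rate λ, and task times varying at random with average 1/μ, which differs from the uniform-task assumption mentioned above), the standard results line up with the “busy divided by idle” rule. The symbols below are from queueing theory, not from the book.

```latex
\rho = \frac{\lambda}{\mu}
  \quad \text{(fraction of time busy)}
\\[4pt]
L = \frac{\rho}{1-\rho}
  \quad \text{(average number of tasks at the work center, no units)}
\\[4pt]
W = \frac{L}{\lambda} = \frac{1}{\mu - \lambda}
  \quad \text{(Little's Law: average time a task spends at the work center)}
\\[4pt]
W_q = W - \frac{1}{\mu} = \frac{\rho}{1-\rho}\cdot\frac{1}{\mu}
  \quad \text{(average time waiting in queue)}
```

Under these assumptions, “busy divided by idle” is both the average number of tasks at the work center, which has no time units, and the average queue wait measured in multiples of the average task time. At 90% busy, that gives 9 in either reading.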

The best discussion on the derivation (and validity/invalidity) can be found on the “The Phoenix Project” LinkedIn Group. The discussion, although sometimes a bit acerbic, is intellectually top-notch.

My opinion? The goal of science is to explain the largest number of observed phenomena with the fewest principles, and reveal surprising insights. I think the graph serves that purpose, and is the most effective way of communicating the catastrophic consequences of overloaded IT workers, and the fallacies of using typical project management techniques for IT Operations.

Two Great Books On Kanbans

Hopefully by now, I’ve convinced you that the problems associated with too much WIP in IT are truly devastating. Many practitioners believe that kanbans are one of the most effective countermeasures, as well as the simplest.

I have two favorite books on kanbans that I’d recommend to anyone who is even remotely interested in kanbans.

The first book is Personal Kanban: Mapping Work | Navigating Life by Jim Benson and Tonianne DeMaria Barry. This book is more about personal productivity than about complex value streams. In fact, I’d call this book the modern version of David Allen’s famous book Getting Things Done: The Art of Stress-Free Productivity.

Where Allen discussed the nature of work, the importance of calendars for keeping commitments, the theory of filing and contextual TODO lists, Benson and DeMaria Barry discuss the need to visualize all our work and control the amount of work in process (WIP). They advocate that everyone should start their own kanban boards with three simple lanes: Ready, Doing and Done.

Although I remain a devout David Allen GTD fan, after reading Personal Kanban, I quickly retired the contextual TODO lists that I had been maintaining for nearly a decade in favor of a kanban board. In many ways, I’ve found that it solves one of the most challenging aspects of the GTD methodology: the weekly executive review, where we’re supposed to re-prioritize our commitments, prune our TODO lists, etc. There were years in which I never did this supposed weekly activity.

On the other hand, with kanban boards, all my work is visible, and there are WIP limits in place that keep the number of cards in progress from going above a fixed limit. I’ve seen on Jim Benson’s kanban board in his office that his Doing WIP limit is 4 (i.e., no more than 4 cards are allowed in the Doing lane).
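To make the mechanics concrete, here is a minimal sketch of a three-lane board with a WIP limit on the Doing lane. It is a toy model written for this post, not how any particular kanban tool works; the class and method names (`KanbanBoard`, `start`, `finish`, `WipLimitExceeded`) are invented for illustration, and the default limit of 4 simply mirrors the limit described above.

```python
class WipLimitExceeded(Exception):
    """Raised when starting a card would push Doing past its WIP limit."""


class KanbanBoard:
    LANES = ("Ready", "Doing", "Done")

    def __init__(self, doing_limit: int = 4):
        self.doing_limit = doing_limit
        self.lanes = {lane: [] for lane in self.LANES}

    def add(self, card: str) -> None:
        """New work always enters the Ready lane."""
        self.lanes["Ready"].append(card)

    def start(self, card: str) -> None:
        """Move a card from Ready to Doing, refusing if the WIP limit is reached."""
        if len(self.lanes["Doing"]) >= self.doing_limit:
            raise WipLimitExceeded(
                f"Doing already holds {self.doing_limit} cards; finish something first"
            )
        self.lanes["Ready"].remove(card)
        self.lanes["Doing"].append(card)

    def finish(self, card: str) -> None:
        """Move a card from Doing to Done, freeing a WIP slot."""
        self.lanes["Doing"].remove(card)
        self.lanes["Done"].append(card)


if __name__ == "__main__":
    board = KanbanBoard(doing_limit=4)
    for n in range(5):
        board.add(f"task {n}")
    for n in range(4):
        board.start(f"task {n}")
    try:
        board.start("task 4")  # a fifth card in Doing is refused
    except WipLimitExceeded as err:
        print(err)
```

The point of the limit is that the only way to start the fifth card is to finish (or put back) one of the first four.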

The second book I’d recommend is David J. Anderson’s Kanban: Successful Evolutionary Change for Your Technology Business, which is more specific to the use of kanban boards in organizations.

For me, reading this book was unexpectedly delightful, as it further chronicled a Microsoft IT case study that I had chosen to study as part of my EM526 Constraints Management course. It was a 2005 paper called “From Worst to Best in 9 Months: Implementing a Drum-Buffer-Rope Solution in Microsoft’s IT Department” (PDF) by David J. Anderson and Dragos Dumitriu. Small world, isn’t it?

(I apologize for my highlights showing up in the PDF. I can’t find the original paper online anywhere, and the only version I had was the one I annotated. You can tell how excited I was to read it back in 2007. 🙂)

Both Anderson and Dumitriu were at Microsoft at the time, and they described an abysmal previous state that is likely familiar to most IT practitioners:

  • too long to finish work requested from the business: average lead time was 155 days
  • dissatisfaction with lateness and long lead times forced IT management to do “more work estimation,” forcing managers to spend all their time building PowerPoints (because the business’s conclusion was that they didn’t do a good job estimating), instead of doing real work
  • whenever the business asked for anything, the response was “it’ll take 5 months”
  • every task was estimated at 20 days, but no one knew where the other 135 days went

Dumitriu created a new field in their ticketing system (actually, it was the Microsoft defect tracking system) called “Waiting For Dragos”, to capture when any work was blocked. He quickly concluded that 70% of all the team’s time was blocked on other people — in other words, 70% of the time, the work was in queue.

Dumitriu concluded that his team was only completing three work items per month, and that at that rate, it would require 3 years to complete all that work.

Here were the countermeasures that he put into place and the amazing results:

  • They stopped estimating their work, and instead used actual times based on historical data — they had 80 person-years of work in their ticketing system, so they used that. This resulted in an immediate 30% boost to Dev and Test productivity.
  • They stopped using cost accounting, using instead a simple “ROI based on budget contribution” — this time savings resulted in an immediate 20% boost to PM capacity.
  • Realizing that their constraint was Dev, PM took over many of the Dev tasks, increasing Dev capacity by 20%. It also led to happier Developers, because they were coding, instead of doing task estimation.
  • They brought in a usability expert to modify the change request forms (he quipped, “in order to get a glass of water, we had to fill out a 4 page form; we replaced it with a 2 page form, with lots of free-form fields, with the goal of making sure the person doing the work had all the information they needed”).
  • They then reduced the WIP allowed in the system: initially, they had, on average, 40 to 60 open items. They reduced this down to 5.
  • They then created a buffer of work, so that any blocked Dev or Test person could work on something in the buffer.
  • Their lead time went from 155 days to 22 days. Lead times were so good that they created a new SLA guarantee of 25 days (Wow!).
  • Their next surge in productivity came from increasing the number of developers, because for every 2 days of Dev work, it required 1 day of Test work. They promoted testers who wanted to move into Dev, and increased the Dev:Test ratio from 1:1 to 2:1.
  • The result of all of this? They completed their entire 3 years of backlog in 9 months; demand for their services went up, and they continued to deliver everything that was asked for each month; no one got fired, and instead people got promoted.
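The numbers quoted in this case study can be cross-checked with some back-of-the-envelope arithmetic. The sketch below uses only figures stated above; the derived values (the implied backlog size and the required throughput) are my own inferences, and they ignore the new demand that kept arriving during those nine months.

```python
# Figures quoted in the case study above.
LEAD_TIME_BEFORE_DAYS = 155
LEAD_TIME_AFTER_DAYS = 22
ESTIMATED_WORK_DAYS = 20          # "every task was estimated at 20 days"
ITEMS_PER_MONTH_BEFORE = 3        # "three work items per month"
BACKLOG_MONTHS_AT_OLD_RATE = 36   # "3 years to complete all that work"
MONTHS_ACTUALLY_TAKEN = 9

# Derived values (inferences, not figures from the paper).
unexplained_days = LEAD_TIME_BEFORE_DAYS - ESTIMATED_WORK_DAYS
touch_share = ESTIMATED_WORK_DAYS / LEAD_TIME_BEFORE_DAYS
lead_time_speedup = LEAD_TIME_BEFORE_DAYS / LEAD_TIME_AFTER_DAYS
implied_backlog_items = ITEMS_PER_MONTH_BEFORE * BACKLOG_MONTHS_AT_OLD_RATE
items_per_month_needed = implied_backlog_items / MONTHS_ACTUALLY_TAKEN
throughput_multiple = items_per_month_needed / ITEMS_PER_MONTH_BEFORE

if __name__ == "__main__":
    print(f"Days unaccounted for per item: {unexplained_days}")
    print(f"Estimated work as share of lead time: {touch_share:.0%}")
    print(f"Lead time improvement: {lead_time_speedup:.1f}x")
    print(f"Implied backlog: {implied_backlog_items} items")
    print(f"Throughput needed to clear it in 9 months: "
          f"{items_per_month_needed:.0f} items/month ({throughput_multiple:.0f}x)")
```

In other words, clearing three years of backlog in nine months implies at least a fourfold increase in throughput, on top of the roughly sevenfold drop in lead time.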

As Dumitriu said, “We succeeded because we focused relentlessly on reducing lead time, as opposed to Dev & Test optimizing for themselves.”

This story is just one of many astounding transformations that are described in brilliant detail. Amazingly, the transformations are not primarily based on automation. Instead, the incredible improvements come from modifying policies around the system of work and the policies that control work in process, ensuring that there are effective cross-functional teams, subordinating everything to the constraint, and managing handoffs well.

Incidentally, Anderson chronicles in his book the changes in his thinking about how to control work in IT value streams. It’s clear that he was a devout follower of Dr. Goldratt’s work (e.g., Theory of Constraints, Drum-Buffer-Rope, etc.), but concluded that using kanban boards can achieve most of the benefits through emergent properties.

I highly recommend reading this book, as it chronicles the real-life improvements that he’s made at organizations such as Sprint, Motorola, Microsoft, and Corbis.

A Workshop On Kanbans

Hopefully by now, I’ve convinced you about how transformative using kanban boards can be. But, there’s no doubt in my mind that I learned the most about running effective kanbans by attending Dominica DeGrandis’ two-day workshop in 2012. It was fantastic and life-changing. DeGrandis is the most respected authority on kanbans that span the Dev and IT Ops value stream.

I highly recommend her workshop to anyone who wants to increase the flow of planned work, reduce WIP and starve unplanned work. (Tip: Bring your teammates so you can start changing the way your entire team works!)

Her workshops include a combination of:

  • lectures
  • team exercises
  • group scrutiny


LeanKit Kanbans

Since 2011, I’ve been using LeanKit kanban boards for my own daily work. For me, LeanKit has replaced Toodledo (and before that, Outlook Tasks, back in my Microsoft Exchange days) for task management, and David Allen’s Getting Things Done methodology.

My assistant and I use it to do weekly sprint planning, moving cards from the Backlog into Ready and Doing as we commit to tasks, and then dragging them to Done.

When I was visiting Jim Benson’s office, I noticed that on his board, he has another lane called Learnings, which I think is clever — I have interpreted this lane to be cards that are worth discussing in a retrospective.

One of the primary values to me is that it becomes very obvious when tasks are languishing on the board, or when WIP starts piling up. Ideally, this should trigger a countermeasure: delegating tasks, putting cards back in the backlog, or spending time just getting these tasks done. (I love the quote “Stop starting; start finishing.”)
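The same check can be expressed in code. The sketch below is hypothetical: it does not use LeanKit’s API or data model, the `Card` fields and the `languishing` function are invented for illustration, and the seven-day threshold is an arbitrary assumption to tune to your own cadence.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Card:
    title: str
    lane: str            # e.g. "Backlog", "Ready", "Doing", "Done"
    entered_lane: date   # the day the card moved into its current lane


def languishing(cards: list[Card], today: date, max_days_in_doing: int = 7) -> list[Card]:
    """Return cards that have sat in Doing for more than max_days_in_doing days."""
    return [
        card
        for card in cards
        if card.lane == "Doing"
        and (today - card.entered_lane).days > max_days_in_doing
    ]


if __name__ == "__main__":
    cards = [
        Card("Write resource guide", "Doing", date(2014, 2, 1)),
        Card("Reply to reviewer", "Doing", date(2014, 2, 24)),
        Card("Plan workshop trip", "Ready", date(2014, 1, 5)),
    ]
    for card in languishing(cards, today=date(2014, 2, 26)):
        print(f"Languishing: {card.title}")
```

Each card this flags is a candidate for one of the countermeasures above: delegate it, move it back to the backlog, or block out time to finish it.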

(By the way, any heavy user of LeanKit may also be interested in Zapier, which is an amazing automation framework that glues tools together. I use Zapier to allow myself to email new cards to my board, convert cards into new rows in Google Spreadsheets, and other neat stuff. My thanks to Jon Terry at LeanKit for showing this to me; it’s changed my life!)

Up Next

Now that this post is up, I’ll be working on the resource guide for two other tools used in “The Phoenix Project”: the GAIT and GAIT-R methodologies (Institute of Internal Auditors) and the Risk-Adjusted Value Management (Gartner), which both explain John’s seemingly miraculous transformation as a CISO.

