A couple of weeks ago I had the pleasure of meeting Jim Stoneham, currently the CEO of Opsmatic. In 2009, he was the general manager of the Yahoo Communities business unit, of which Flickr became a part.
Yes, 2009 was also the year of the famous John Allspaw and Paul Hammond Velocity “10 Deploys A Day” presentation about the amazing work they had done at Flickr.
Jim’s story gave me an entirely new perspective on the amazing Allspaw/Hammond work, especially around the experiential differences it enabled from the point of view of someone trying to achieve business goals. He was able to compare and contrast Flickr with all the other properties he was responsible for, including Yahoo Answers, which at the time was doing only one deployment every six weeks.
My conversation with Jim gave me a bunch of “aha” moments, especially when he talked about a whole set of business value and organizational learning that the Flickr transformation created, which were only alluded to by Allspaw and Hammond.
I asked Jim if I could interview him to clarify my own understanding – some of my big takeaways are listed below:
Fast feedback loops are critical to employee engagement and sparking learning
How flow is impacted by dedicated vs. centralized ops
The impedance mismatches that can be created when fast flow meets slow flow
Creative energy needs to be spent on helping the organization win, and that can’t happen when every deployment is a challenge
How corporate-wide initiatives are adopted across the enterprise, spanning the spectrum of fast vs. slow release cycles
(I feel so fortunate to have talked with Jim as it helped me round out my understanding of organizational learning, which is timely as we try to get the first draft of The DevOps Cookbook manuscript completed this year! To subscribe to updates about the DevOps Cookbook release and receive an advance draft of the manuscript, sign up here.)
Q: How were you involved in the famous 2009 Allspaw/Hammond story?
In 2009, I was running the Yahoo! Communities business (which included Answers, Groups, and managing our partnerships with Facebook and Twitter). I wasn’t directly involved with Flickr until it joined the Communities business unit in late 2009.
Of course, the Flickr “10 deploys per day” story was already well known inside of Yahoo – it was being shared long before the famous 2009 Velocity talk. When I was asked to lead Flickr, it was a time of transition; John Allspaw had left shortly before, and Paul Hammond announced that he was leaving to join Typekit. Cal Henderson (who wrote a significant chunk of Flickr) had already left months prior, so there was a lot of institutional knowledge that had to be transitioned.
Flickr was a codebase that evolved to operate at high scale over 7 years – and continuing to scale while adding and refining features was no small challenge. During this transition, it was a huge advantage that there was such an integrated dev and ops team; so many of the developers worked side-by-side with ops, making everything work. I met my Opsmatic co-founder, Mikhail Panchenko, at that time, and he was one of the many smart people keeping things moving.
For my entire career, I’ve always been a believer in Agile processes and shipping code quickly, so being able to work closely with the Flickr team was inspiring. There were constant challenges to Flickr’s culture of course – Yahoo was creating a centralized IT Operations group, which spawned some additional process and politics and a demand to have all ops people in a central space. But our ops people refused to change seating locations, and kept coming to our core meetings – this was enabled by lots of air cover by me and other senior managers.
(There were benefits in creating the centralized IT operations group, because it helped create platform excellence and standardized processes that were uneven around Yahoo. At times, I had to deal with four different processes to scale up infrastructure, which was a headache. Certain things did become slower as a result, like capacity planning, infrastructure changeover, data center moves, and so forth.)
Compare and contrast what was happening at Flickr to Yahoo! Answers and Yahoo! Groups
At Flickr, any engineer could commit code into production at the push of a button. By comparison, in the rest of Yahoo, there were far more formalized and traditional build and release processes, where developers didn’t have keys to servers, and were separated from the IT Operations group.
As you’d expect, at Flickr there was a very high level of trust between Dev and Ops.
On the other hand, in the rest of Yahoo it varied. There was often less trust, and less awareness inside of development about what it took to run their code in production. And of course, the reciprocal was true: there was less ops awareness of how the services were built.
This meant longer release cycles and more work bouncing back and forth (code push, try, roll back, try again), which in turn meant lower velocity in general and taking longer to get features out the door.
Surprisingly, availability was about the same. How? It was done by saying “no” more frequently. You’d often hear things like, “No, that code isn’t ready to go out yet. I don’t understand it well enough.”
So what? What made you care about that?
I found that the slower the release cycles, the more opportunities we missed in the marketplace. The worst part was we weren’t learning fast enough.
This was especially frustrating inside of Yahoo Answers. Players like Quora, Aardvark and other Q&A services were launching with features that we had planned on building and launching. But our slower velocity meant that they got to market first.
Sometimes this was due to our feature velocity. Sometimes it was because we would require some sort of experimental infrastructure in order to build and test it, but IT Operations wouldn’t or couldn’t support it. (We were built on top of Oracle RAC, and that created lots of constraints, too.)
Yahoo Answers was and continues to be one of the biggest social games on the Internet; tens of millions of people are actively trying to “level up” by providing quality answers to questions faster than the next member of the community. There were many opportunities to tweak the game mechanic, viral loops, and other community interactions. When you’re dealing with these human behaviors, you’ve got to be able to do quick iterations and testing to see what clicks with people.
These are the things that Twitter, Facebook and Zynga did so well. Those organizations were doing experiments at least twice per week – they were even reviewing the changes they made before their deployments, to make sure they were still on track.
So here I am, running the largest Q&A site in the market, wanting to do rapid iterative feature testing, but we can’t release any faster than once every 4 weeks. In contrast, the other people in the market had a feedback loop 10x faster than us.
You said that the worst part about this was that you weren’t learning enough? Tell me more.
Yes, the absolute worst part was that we were learning less than other people in the market.
It was a feeling of missed opportunity – it’s not like we were being killed by competitors; we were still the highest-traffic site, but we could have done so much better.
It was a feeling that we could have grown more, that we could have learned more about our customers, that our feature development efforts could have been better informed, that we could have tried more new things.
Tell me about the Yahoo Answers transformation.
Initially, Yahoo Answers was on a quarterly planning cycle, with releases happening every 4-6 weeks. Over time, we were able to get to weekly releases, and later, even more frequently than that. It took a lot of work, and a lot of heroes to make that happen. It also required some profound changes in our product architecture and our ops processes and tooling.
But it was amazing. Suddenly, we were able to try things out and experiment in ways we hadn’t been able to do before. As a result, we were able to double the number of visitors and more deeply engage the active answerers. We launched a mobile website that got tens of millions of monthly visitors shortly after launch.
It was huge, because we could figure out the important features that drove the right customer interactions. Our team became very much in tune with the numbers – we would look at them as a team on a daily and weekly basis, and use that to inform feature conversations and plans.
This was exactly the learning that we needed to win in the marketplace – and it changed more than our feature velocity.
We transformed from a team of employees to a team of owners. When you move at that speed, and are looking at the numbers and the results daily, your investment level radically changes. This just can’t happen in teams that release quarterly, and it’s difficult even with monthly cycles.
Tell me about the other group you had, which had another view on release cycles.
(Laughing.) That other group was the social integration team that was responsible for wiring up things like Facebook and Twitter into all of Yahoo. No easy task!
They had to deal with an entire spectrum of teams inside Yahoo. On one end of the spectrum, you had Flickr, who could make changes relatively quickly. At the other end was the membership team (think single sign-on), where it was a deliberate, quarterly process because it was so fraught with risk – if you screwed up something here, you could cause a global failure across every Yahoo property.
This created a tangible impedance mismatch with Facebook and Twitter – who were used to working at lightning speed. It took a lot of work to thread the needle of managing expectations with the partners while also pushing our internal teams to get things done.
There were other internal factors, of course. Flickr was one of those groups that tried to keep corporate initiatives at arm’s length – but as it was part of my team, it was one of the first to implement the integrations. Some groups did amazing things and really exploited having access to all the social data. From other groups, you’d often hear, “We’ll get to it in a month – our roadmap is already baked,” month after month.
In hindsight, for a horizontal integration effort like this, we should have pushed harder for a clearer top-down executive mandate. Every group had their own business goals and constraints, which we’d be competing with – but when they had long release cycles, that was exacerbated.
Often, to get a commitment from the business units, we had to get the CEO involved, which was the worst way to get things done, given that we were trying to forge lasting relationships with these teams so they would keep investing in and iterating on their integrations.
How has all this informed the work you’re doing at Opsmatic?
Opsmatic wouldn’t exist without these experiences – it’s why we started the company. After living with the contrasts at Yahoo between Flickr and Answers, it was clear to me that we needed a major leap in tooling and visibility for technical teams so they could release and learn much faster.
Our goal is to give teams absolute certainty about their configuration state across their systems, so new releases can flow and any infrastructure issues can be quickly resolved. We store our customers’ data forever, and integrate with many automation, monitoring, and alerting services to provide richer context – so we can create more opportunities for learning. It’s been gratifying to see the use of our service spread beyond ops to the entire technical team within many of our customers. Having that shared view pulls everyone together, collaboration improves, and the velocity increases.
Also, since we’re supporting ops teams, we actively put ourselves in harm’s way – taking servers out of rotation, tearing down and rebuilding our infrastructure, and helping our customers do the same. By the time a real crisis arrives we hope we’re ready for anything. And of course we can release any time we need to, and often do it 10 times per day (or more). 😉
About Jim Stoneham
Jim is the co-founder and CEO of Opsmatic, a configuration monitoring company working on increasing the MTBLS (mean time between loss of sleep) for people managing operations (details at opsmatic.com).
He has led and grown both startups and large-scale businesses for over 25 years, and has spent the past decade driving agile culture and processes at Flickr, Yahoo, Adobe, Payvment (acquired by Intuit), and several other startups.