Excerpted from the guidance paper DevOps Case Studies. Download the full paper here.
Written by: Jim Stoneham, Paula Thrasher, Terri Potts, Heather Mickman, Carmen DeArdo, Thomas A. Limoncelli, and Kate Sage
Technology Practices Journey
Yahoo Answers was created in 2006 as a place to share knowledge on the web and bring more knowledge to the Internet. It’s basically a big game: people compete to answer visitor questions, and the most approved answers help them work their way up to higher levels.
In 2009, growth was flat at around 140 million monthly visits. User engagement was declining, revenue was flat, and the team was contentious. They used waterfall development with four- to six-week cycles because there were quality issues in Operations and Development, and people were obstructing releases. Fourteen months later, Yahoo Answers was getting over 240 million monthly visits and over 20 million people answering questions.[FN15] It was also available globally in twenty languages. It was a very large-scale property and a significant part of Yahoo traffic. They had grown traffic by 72%, user engagement was up 3x, and revenue was up 2x. They had daily releases and better site performance, and they had moved from a team of employees to a kick-ass team of owners.
So, what is the backstory of the transition? Yahoo Answers had an amazing team of four to five leaders across Engineering, Product, Design, and Operations who helped transform the business. Everyone sat down together and concluded that they could no longer run the business the way they had been. Together they developed a plan. The first step was to get everybody closer together. When Jim Stoneham arrived in 2009 as VP of Communities, they had people in London and France, while Jim was in the United States. The odd triangle created slow movement and arguments, and people were constantly missing each other because they were in different locations and time zones. Because tools for remote teams (Slack, etc.) were not widely available in 2009, it was essential to come together geographically. To that end, they consolidated the functional teams in London and France by bringing them all to London.
Once they were all in the same place, they decided it was necessary to focus on a few key metrics. Their old dashboard tracked every single metric, meaning, of course, that nobody paid attention to anything. So, they simplified. They asked customers what mattered. The responses revealed that customers were primarily concerned with five things: time to first answer, time to best answer, upvotes per answer, answers per week per person, and second search rate (negatively correlated, so the goal was for it to trend down). Neither revenue nor page views was a key metric. Those would follow if the other metrics were doing well.
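To make two of these metrics concrete, here is a minimal sketch of how time to first answer and upvotes per answer might be computed. The record layout, field names, and sample values are assumptions made for illustration; the source does not describe how Yahoo Answers stored or calculated its metrics.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical record layout: each question has an "asked_at" timestamp and a
# list of answers, each with an "at" timestamp and an "upvotes" count.

def first_answer_delay_seconds(question):
    """Seconds from the question to its earliest answer, or None if unanswered."""
    if not question["answers"]:
        return None
    first = min(a["at"] for a in question["answers"])
    return (first - question["asked_at"]).total_seconds()

def median_time_to_first_answer(questions):
    """Median delay across answered questions; unanswered ones are skipped."""
    delays = [first_answer_delay_seconds(q) for q in questions]
    answered = [d for d in delays if d is not None]
    return median(answered) if answered else None

def upvotes_per_answer(questions):
    """Average upvotes over every answer in the data set."""
    answers = [a for q in questions for a in q["answers"]]
    if not answers:
        return 0.0
    return sum(a["upvotes"] for a in answers) / len(answers)

# Made-up sample data, not real Yahoo Answers figures.
t0 = datetime(2010, 1, 1, 12, 0)
questions = [
    {"asked_at": t0, "answers": [
        {"at": t0 + timedelta(minutes=10), "upvotes": 3},
        {"at": t0 + timedelta(minutes=4), "upvotes": 1},
    ]},
    {"asked_at": t0, "answers": [
        {"at": t0 + timedelta(minutes=8), "upvotes": 2},
    ]},
    {"asked_at": t0, "answers": []},  # unanswered question
]

print(median_time_to_first_answer(questions))
print(upvotes_per_answer(questions))
```

A small set of functions like these is easy to review every day, which fits the point of the simplified dashboard: a handful of numbers that everyone actually reads.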
Next, they knew they needed to re-architect to enable deployment velocity and independence. In 2008, Answers was built on top of an Oracle RAC database and five-year-old legacy code. They re-architected in place. They couldn’t shut down the business, and they didn’t want to build a new system next door and migrate, so they built a MySQL-based read cache to take stress off the RAC systems in back, built a data access layer for reads and writes to the core database, and refactored one page at a time. There was a lot of interacting going on. It was essential to start with less-used pages. The last page refactored was the actual question/answer page, which carried most of the traffic. They broke down a monolithic app into a service-oriented architecture, which paid dividends with their Agile process, all while serving billions of pages each month. Altogether, the transition took four months, with sixty days of planning before they began actually writing code.
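The combination of a read cache and a data access layer can be sketched roughly as follows. This is only an illustration: plain dictionaries stand in for the MySQL cache and the Oracle RAC database, the class and method names are hypothetical, and the read-through and invalidate-on-write policies are assumptions, since the source does not say how the cache was populated or kept fresh.

```python
class DataAccessLayer:
    """Single entry point for reads and writes, so pages never touch the stores directly."""

    def __init__(self, core_db, read_cache):
        self.core_db = core_db        # stand-in for the core (Oracle RAC) database
        self.read_cache = read_cache  # stand-in for the MySQL-based read cache
        self.core_reads = 0           # counts how often a read reaches the core database

    def get_question(self, qid):
        # Serve from the cache when possible to keep load off the core database.
        if qid in self.read_cache:
            return self.read_cache[qid]
        self.core_reads += 1
        row = self.core_db.get(qid)
        if row is not None:
            self.read_cache[qid] = row  # assumed read-through population
        return row

    def save_question(self, qid, row):
        # Writes always go to the core database; the cached copy is dropped
        # so the next read fetches the fresh row (assumed invalidation policy).
        self.core_db[qid] = row
        self.read_cache.pop(qid, None)


dal = DataAccessLayer(core_db={1: "How do I learn Python?"}, read_cache={})
dal.get_question(1)    # first read reaches the core database
dal.get_question(1)    # second read is served from the cache
print(dal.core_reads)  # 1
```

Because every page goes through one layer, each page can be refactored on its own schedule without changing how the others reach the data, which is what makes a page-at-a-time migration workable.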
The next step was to break the work into small units, each focused on a key metric, and to keep making those units smaller and smaller. Yahoo had been working waterfall with four- to six-week releases. The Operations team at the time viewed Agile as a whole bunch of people throwing stuff at them. They were very resistant to trying it out. To make matters more difficult, the Operations team was in a different functional organization than the business leadership, which meant that leadership had no actual organizational power over Operations.
To overcome this, they got everyone into a room and came up with a process that would work for all of them as a team and would quickly drive experiments so that people would own the quality. This involved all stakeholders. Their new product process included weekly sprints; daily deploys (except Fridays); reviewing metrics at least daily, which was key to moving from a team of employees to a team of owners and helped create a cultural transformation; and weekly iteration planning. This weekly planning kept up a cadence of looking forward and backward by a week. The new process also included monthly business reviews (all hands) during which they would take the five core metrics and revenue and look at the information together. The all-hands group was made up of extended Operations people and community managers, which turned out to be a group of eighty people or so. It took them 60 to 90 days to get the process working well.
Another big step in this type of transformation is to allow people to screw up, and to roll forward or back. If you want to take risks and swing for the fences, you have to give people permission to make mistakes, and you need to make it really easy to recover from those mistakes. In 2009, rollbacks looked awful. It was like putting out a major fire. In 2010, they implemented changes that allowed rollbacks to happen really quickly. By using Hudson, they could roll back a set of scripts. Ninety percent of the time they would roll forward because it was easy to deploy from trunk. It was essential to give people an environment where they could roll back or forward quickly. They also rewarded people for taking risks, failing, realizing failure, and killing things when they knew those things had failed. This all encouraged experimentation.
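The difference between rolling back and rolling forward can be illustrated with a small sketch. This is not Hudson's API or the team's actual scripts, which the source does not describe; the class, method names, and version labels are hypothetical and only model the idea of keeping a release history so that either direction is cheap.

```python
class ReleaseHistory:
    """Tracks deployed builds so recovery is either one step back or one step forward."""

    def __init__(self):
        self._deployed = []  # versions in deploy order; the last one is live

    @property
    def live(self):
        return self._deployed[-1] if self._deployed else None

    def deploy(self, version):
        """Roll forward: ship a new build, for example a fix cut from trunk."""
        self._deployed.append(version)
        return self.live

    def roll_back(self):
        """Return to the previous build; refuse if there is nothing to return to."""
        if len(self._deployed) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._deployed.pop()
        return self.live


history = ReleaseHistory()
history.deploy("r100")
history.deploy("r101")                  # suppose r101 turns out to be bad
after_rollback = history.roll_back()    # option 1: go back to r100
after_fix = history.deploy("r102")      # option 2: ship a fix forward
print(after_rollback, after_fix)
```

When both operations are this routine, a bad release stops being a fire and becomes an ordinary step, which is what makes it safe to encourage experimentation.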
Of course, no plan is exact, and anytime you lay one out, it never ends up being as simple as a five-point plan. Other things always come into play. For the Yahoo Answers team, some of those things were coaching managers on “soft skills,” exiting people who weren’t on board, utilizing an A/B testing framework, reporting a few more metrics upward, and tooling for change monitoring.