At Velocity London 2012, I saw one of the top five presentations I’ve ever seen in my life. In their talk “Continuously Deploying Culture,” Michael Rembetsy (@mrembetsy, LinkedIn) and Patrick McDonnell (@mcdonnps, LinkedIn) described the story of their amazing IT transformation that started in 2008.
Etsy is often spoken of in the same breath as companies like Netflix, Facebook, Twitter, Amazon, and Joyent. These are companies that live and breathe DevOps and are showing the rest of the world what performance outcomes are possible by leaving old mental models behind: high deploy rates, amazing stability, reliability and security, and most importantly, a culture that the rest of the world admires.
Community and culture, Rembetsy asserts in the talk, are the foundation of any company. And how does one go about fostering community and encouraging positive culture? You begin by eliminating barriers, getting rid of silos, and encouraging collaboration across the entire company.
For Etsy, it wasn’t always this way. How Etsy got to where it is today is what this presentation is all about. At the time this was presented, Etsy had 350 employees, including 125 engineers, of whom 12 were in IT Operations.
Their presentation is structured as a year-by-year retrospective, describing the pains that the organization felt, how they attacked the problem, the outcomes they achieved, and the TODO list that they dragged into the following year.
2008: The Year Of Pain And Living In The Sea Of Our Engineering Filth
Rembetsy and McDonnell’s story begins in 2008, back when Etsy had only 30-35 employees, about half of whom were engineers. The company was doing approximately $87MM in revenue, with 250 servers in two data centers. “This was our Year Of Pain. Deploys took hours. Code didn’t work. There was little or no communication between developers writing the code and the IT operations team responsible for getting it out. Pushes routinely failed, restarting and rolling back changes was a challenge, resulting in HTTP 500 errors across the entire site.”
Deployments would routinely fail, causing “HTTP 500 errors” across the entire site, with no easy way to restart the services or roll back changes. “After a deployment day,” Rembetsy said, “I’d be completely spent and frustrated.”
They were stuck in a very siloed culture, most evident between IT Operations and Development. One of the most egregious examples of this was a tool called “sprouter,” which was specifically designed to prevent engineers from making production database changes directly. In hindsight, “sprouter” was almost guaranteed to create the wrong culture and outcomes by creating an artificial wall between Dev and IT Operations.
One of the first things they did was open up communications between Etsy developers and the customers. They created the http://fix.etsy.com blog to encourage communication and add transparency during outages, not only for the company but also for their customers. They started communicating outage updates including reasons, duration, progress, and expected resolution. This allowed customers to know why the site was down and for how long, and most importantly, that “we’re doing this for you.”
On the day before Cyber Monday (the highest volume day for e-commerce sites), Rembetsy kept thinking, “WTF did I get myself into?!” Looking at their project backlog, Rembetsy said, “We realized that we had to fix our technical debt, and that we couldn’t keep living in a sea of engineering filth.” This required focusing on high-improvement projects that mattered, instead of projects with low or no value.
Allow me to interject for a moment by rephrasing what Rembetsy said, using the terms we’re using in the upcoming book, “The Phoenix Project,” because their thinking mirrors ours in terms of the necessary prerequisite steps to make meaningful improvements. We believe that the very first thing that an organization must do when embarking on this journey is the following:
- Create slack time for important improvement projects
- Keep batch sizes small and the planning horizon short (e.g., weeks, not months)
- Keep prioritizing “the system of work” higher than “doing work”
Back to the talk: They left 2008 with the following promises: 1) Gain support from the top and bottom to change culture, 2) Increase transparency both within the organization and to the public, and 3) Pay back technical debt as soon as possible.
2009: The Year Of Sea Change and Deployinator
They started the year by looking at issues caused by the offices. To encourage collaboration, Etsy put everyone under the same roof in a new office in DUMBO, a Brooklyn neighborhood that was closer and more attractive to the people they were hiring. “The place where you work must fit with your culture. You can’t have lean, creative, agile in a plain, dull office.”
They also created a “DevTools” team, starting an effort called “Deployinator” to automate their continuous deployment process, which they’ve open sourced. Deployinator enabled quick and safe deployments, by allowing them to create “the smallest number of steps, with the smallest number of people and the smallest amount of ceremony required to get new code running on your servers.”
This was also the year when genuine collaboration started to happen between “people in management and people in the trenches,” ending the era when management would just say, “Go do this” or “Go do that.” People were happy to come to work, and contribute in ways beyond their job description. And interestingly, this was also the year where they eliminated scheduled downtime, because deployments were going so much more smoothly.
They summed up 2009 as the beginning of a “DevOps” culture, where the Berlin Wall between Development and IT Operations fell.
They left 2009 with the following action items: 1) Find the parts of your own organization that are causing you the most pain and try to stabilize them; 2) Hire staff that will make a difference; 3) Pick the projects that will make impact; 4) Get it done, just ship it.
(Again, when they say, “get it done, just ship it,” this is what we’re calling “keep batch sizes small and planning horizons short.”)
They left 2009 with $177MM in gross merchandise sales (up 103% vs. 2008), 320 million visitors (up 96%), and 9.45 billion page views.
2010: The Year Of Standardizing
This was the year that Etsy brought on Kellan Elliott-McCrea as VP of Engineering and John Allspaw as SVP of IT Operations, who also had worked with CTO Chad Dickerson (now CEO of Etsy). This is when they created the Code As Craft blog, one of the premier software engineering blogs, to share their practices and lessons learned.
This is also when they created the continuous integration and delivery team, creating a fully automated test program, which enabled developers to have the confidence to do high rates of deploys.
They also started to standardize and very deliberately reduce the supported infrastructure and configurations. One decision was to switch everything to PHP and MySQL. This was a philosophical decision, not a technology one: they wanted both Dev and Ops to be able to understand the stack, so that everyone could contribute if they wanted to, and so that everyone could read, rewrite and fix someone else’s code.
They also adopted the mantra, “If it moves, graph it.” Etsy installed screens around the office which showed real-time graphs of what was up and running and what was down, shown in priority order. This way there were rarely questions about what one should be working on or which fire needed to be fought first.
Here’s one of their famous Graphite graphs, aggregating metrics in a single timeline that also shows when all the deployments occur, to enable constant situational awareness of activity inside the organization (e.g., web pushes, search pushes, etc.) and the health of all the services.
In the picture below, note all the vertical lines in the monitoring graphs — each of those vertical lines is a code push, color coded by application/service. This is the famous “vertical line technology that we helped pioneer.” (Haha.)
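As a rough illustration of “If it moves, graph it,” here is a minimal sketch of how an application might emit metrics and deploy events in the StatsD line format (`name:value|type`) over UDP. The metric names, the `STATSD_ADDR` address, and the helper functions are all hypothetical; this is not Etsy’s actual instrumentation, and how a deploy counter gets drawn as a vertical line depends on the graphing setup.

```python
import socket

# Assumption: a StatsD daemon listening locally on its default UDP port.
STATSD_ADDR = ("127.0.0.1", 8125)


def counter(name: str, value: int = 1) -> str:
    """Format a StatsD counter line, e.g. 'deploys.web:1|c'."""
    return f"{name}:{value}|c"


def timer(name: str, millis: float) -> str:
    """Format a StatsD timing line in milliseconds."""
    return f"{name}:{millis:g}|ms"


def send(line: str, addr=STATSD_ADDR) -> None:
    """Fire-and-forget over UDP, so instrumentation never blocks a request."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), addr)


if __name__ == "__main__":
    # Hypothetical metric names: one deploy event and one page timing.
    print(counter("deploys.web"))
    print(timer("checkout.render", 42.5))
```

Emitting a counter at every push is what makes it possible to overlay deployments on the same timeline as service health metrics.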
They currently have over 7000 checks for 700 hosts. “We used to have lots more, but we got rid of unimportant checks that were resulting in unimportant 3am wakeup calls. Stop being woken up at 3am for broken things!”
They created and implemented the “Developer on Call” program, to address the problem of IT Operations asking, “why should I be the only person waking up at 3am?” To create more developer responsibility and accountability, and to ensure that IT Operations had the necessary resources on hand during deployments, each developer rotated to be on call for one week. With the company’s current size, this translates to one week every three years that a developer would have to be on call 24/7.
This furthered the cultural norm that developers take responsibility for rollbacks and fixing forward when deployments go wrong.
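To make the rotation arithmetic concrete, here is a small sketch of a weekly round-robin on-call schedule. The roster names, the start date, and both helper functions are hypothetical and are not taken from the talk.

```python
from datetime import date


def on_call(roster: list[str], start: date, today: date) -> str:
    """Return who is on call, rotating one person per week since `start`."""
    weeks_elapsed = (today - start).days // 7
    return roster[weeks_elapsed % len(roster)]


def years_between_turns(roster_size: int) -> float:
    """With one-week shifts, a full cycle takes roster_size weeks."""
    return roster_size / 52


if __name__ == "__main__":
    roster = ["ana", "ben", "chen"]   # hypothetical developers
    start = date(2012, 1, 2)          # a Monday, chosen arbitrarily
    print(on_call(roster, start, date(2012, 1, 9)))
    print(years_between_turns(156))
```

Under this scheme, a turn that comes around once every three years corresponds to a roster of about 156 developers (52 weeks × 3).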
They also started continual A/B testing, prototypes (users who opt in to beta test features will be part of the A/B population that gets served new features), feature flags and ramp-ups, and “Schema Change Thursdays.” This last one is fascinating. Rembetsy says, “We stopped changing schemas whenever we feel like it. Now we batch them all up and do them only once per week.”
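Feature flags with percentage ramp-ups can be sketched roughly as follows. This is one common way to implement the idea, not a description of Etsy’s own system: the flag name, the user IDs, and the `bucket` and `is_enabled` helpers are hypothetical. Each user is hashed into a stable bucket from 0 to 99, so raising the percentage only ever adds users, and opted-in beta testers always see the feature.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, percent: int,
               opted_in: frozenset = frozenset()) -> bool:
    """Serve the feature to opted-in users and to `percent` of everyone else."""
    if user_id in opted_in:
        return True
    return bucket(flag, user_id) < percent


if __name__ == "__main__":
    users = [f"user{i}" for i in range(1000)]
    for percent in (1, 10, 50, 100):
        served = sum(is_enabled("new_search", u, percent) for u in users)
        print(f"{percent}% ramp-up serves {served} of {len(users)} users")
```

Including the flag name in the hash means different features ramp up across different slices of users, rather than always exposing the same people first.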
The management ideals that started getting formed included:
- Accept failures but don’t lower standards. Failures happen, and it’s best if they’re visible, understood, and used as springboards to greatness.
- Trust but verify.
- Blameless post-mortems
- Welcome one-on-ones
- Career planning
- Happy company = happy community
Their lessons for the year included:
- Don’t guess at what’s wrong with your infrastructure–graph it.
- Empower developers with responsibility: let them deploy, have them on call, no passwords, etc. (“Make sure that developers stick around to ensure that their deploy worked.”)
- Clear documented standards and processes are a must, but they aren’t set in stone. They can and should change as your business grows.
- Management should continually check in to make sure employees are happy and satisfied in their work. Rembetsy calls this a critical part of “human management.”
2010 Etsy business stats: Gross merchandise sales: $307MM (up 73%), 534MM visitors (up 66%), 147MM unique visitors (up 58%), and 9.3 billion page views (up 43%).
2011: The Year Of The Reaping: The Death Of Non-Standard Technologies
This is the year where they eradicated non-standard technologies that the entire company couldn’t get entirely behind. This is when Mongo, Scala, CoffeeScript, Python and many more great technologies were taken out of production. (They recommend checking out Ross Snyder’s Surge 2011 talk, “Scaling Etsy: What Went Wrong and What Went Right”).
Incidentally, this is also the year when the toxic tool Sprouter was killed. Rembetsy showed a graph of the number of Sprouter calls over time: “we were able to drive it to zero by opportunistically removing it bit by bit, and finally it died, too.”
This is also the year when many core Etsy technologies were open sourced, joining Deployinator (now being used by Rackspace): statsd, logster and many more. The company decided that all engineers should be contributing back in one of three ways each year:
- Writing blog posts for fix.etsy.com
- Speaking at conferences
- Open sourcing something
(And by the way, this really shows. I’ve been going to many conferences this year, and my observation is that there are always fantastic talks being given by folks from Etsy and Netflix.)
Some of their current initiatives and achievements:
- “IT Operations switched their configuration management from SVN to git,” enabling them to have a combined repository with Development. “This has significantly increased our ability by increasing conformity, allowing us to grow the team.”
- “We did the svn to git transition in one weekend, where everybody worked on this to make sure it was successful.” (Nice. Another recurring theme: singular focus among the entire team to get important things done, like Facebook Hack Days)
- Increase the signal-to-noise ratio
- Writing Schemanator to automate schema changes and reduce risk
- Focus on information security: Nick Galbreath helped create an information security and compliance program.
- They conducted “game days” to test failures before they happen
- More dashboards, with the framework on github
- Improved weekly financial reporting
Here are some of the 2011 accomplishments:
I noted with interest that despite their use of continuous deployment, and developers doing routine deployments, they still achieved PCI DSS compliance in six weeks. Rembetsy said, “We haven’t allowed PCI DSS to change the culture of the company. All the separation of duty requirements can still be fulfilled.”
They closed 2011 with $526MM in gross merchandise sales (up 71%), 895MM visitors (up 67%) and 12 billion page views (up 40%).
Their action items list for the next year:
- “Senior management at any technology company should be technology focused”
- “You need to do configuration management, even if you only have two servers”
- “Don’t let PCI compliance change the culture of the company”
2012: Current Challenges
Explosive growth in hiring and in the company has led to some novel and wonderful cultural hacks. They have a game called “Guess That Admin,” to address the fact that with so many new faces, people don’t get a chance to meet everyone. This is an internal game where the person who recognizes the most new people wins. (Guess That Admin was also created during one of their Hack Weeks. The page below uses pictures pulled from their LDAP servers.)
They have events like “Meetsy” (suggested lunch groups to meet people you may not work directly with) and “Eatsy” (where the entire company eats together). They now invite local NYC companies, such as Tumblr, to have mini-conferences onsite at Etsy, creating a community of practice between practitioners.
And because of new engineering challenges, they’re looking at using non-standard technologies, such as Redis.
Current works in progress include:
- Curbing developer boredom by allowing transfers between teams, including between divisions (e.g., engineering to product)
- Developers can now read data from production databases for development work, removing one of the last anomalies between Dev and Production
- Creating a Front End Performance team to minimize site load times
Their list of action items coming out of the year includes:
- Know when not to try things
- Focus on performance early (and by planning in advance that tools will be open sourced, it forces developers to sanitize their code early)
- Allow dynamic allocation of resources
- Never allow size to dictate culture
Rebutting Future Predictions Of Doom
Rembetsy then took some time to analyze and rebut some of the things he’s heard, mostly predictions that “when Etsy hits 500 people, that’s when everything will fall apart.” He argued articulately that they’ve created a culture that can perpetuate itself, and I certainly agree.
I wish I had had some of these insights while I was at Tripwire, where I was CTO for 13 years. What a wonderful difference it would have made!
Way to go, guys. Amazing talk.
I mentioned that this was one of the best presentations I’ve seen on transforming IT. For the record, the other presentation that I’d put in this league was given by Kevin Behr in 2003, describing the transformation he helped lead at IP Services. This was actually the basis of what became the Visible Ops Handbook!
I include it here for posterity.