November 13, 2012
At Velocity London 2012, I saw one of the top five presentations I’ve ever seen in my life. In their talk “Continuously Deploying Culture,” Michael Rembetsy (@mrembetsy, LinkedIn) and Patrick McDonnell (@mcdonnps, LinkedIn) described the story of their amazing IT transformation that started in 2008.
Etsy is often spoken of in the same breath as companies like Netflix, Facebook, Twitter, Amazon, and Joyent. These are companies that live and breathe DevOps and are showing the rest of the world what performance outcomes are possible by leaving old mental models behind: high deploy rates, amazing stability, reliability, and security, and, most importantly, a culture that the rest of the world admires.
Community and culture, Rembetsy asserts in the talk, are the foundation of any company. And how does one go about fostering community and encouraging positive culture? You begin by eliminating barriers, getting rid of silos, and encouraging collaboration across the entire company.
For Etsy, it wasn’t always this way. How Etsy got to where it is today is what this presentation is all about. At the time this was presented, Etsy had 350 employees, including 125 engineers, 12 of whom were in IT Operations.
Their presentation is structured as a year-by-year retrospective, describing the pains that the organization felt, how they attacked the problem, the outcomes they achieved, and the TODO list that they dragged into the following year.
My notes are below, but you can find the full video, recorded by Damon Edwards, here and their Slideshare link here.
Rembetsy and McDonnell’s story begins in 2008, back when Etsy had only 30-35 employees, half of whom were engineers. The company was doing approximately $87MM in revenue, with 250 servers in two data centers. “This was our Year Of Pain. Deploys took hours. Code didn’t work. There was little or no communication between developers writing the code and the IT operations team responsible for getting it out. Pushes routinely failed, restarting and rolling back changes was a challenge, resulting in HTTP 500 errors across the entire site.”
Deployments would routinely fail, causing “HTTP 500 errors” across the entire site, with no easy way to restart the services or roll back changes. “After a deployment day,” Rembetsy said, “I’d be completely spent and frustrated.”
They were stuck in a very siloed culture, most evident between IT Operations and Development. One of the most egregious examples of this was a tool called “sprouter,” which was specifically designed to prevent engineers from making production database changes directly. In hindsight, “sprouter” was almost guaranteed to create the wrong culture and outcomes by creating an artificial wall between Dev and IT Operations.
One of the first things they did was open up communications between Etsy developers and the customers. They created the https://fix.etsy.com blog to encourage communication and add transparency during outages, not only for the company but also for their customers. They started communicating outage updates including reasons, duration, progress, and expected resolution. This let customers know why the site was down and for how long, and most importantly, that “we’re doing this for you.”
On the day before Cyber Monday (the highest volume day for e-commerce sites), Rembetsy kept thinking, “WTF did I get myself into?!” Looking at their project backlog, Rembetsy observed, “We realized that we had to fix our technical debt, and that we couldn’t keep living in a sea of engineering filth.” This required focusing on high-improvement projects that mattered, instead of projects with low or no value.
Allow me to interject for a moment by rephrasing what Rembetsy said in the terms we’re using in the upcoming book, The Phoenix Project, because their thinking mirrors ours in terms of the necessary prerequisite steps to make meaningful improvements. We believe that the very first thing that an organization must do when embarking on this journey is to do the following:
Back to the talk: They left 2008 with the following promises: 1) Gain support from the top and bottom to change culture, 2) Increase transparency both within the organization and to the public, and 3) Pay back technical debt as soon as possible.
They started the year by looking at issues caused by the offices. To encourage collaboration, Etsy put everyone under the same roof in a new office in DUMBO, a Brooklyn neighborhood that was closer and more attractive to the people they were hiring. “The place where you work must fit with your culture. You can’t have lean, creative, agile in a plain, dull office.”
They also created a “DevTools” team, starting an effort called “Deployinator” to automate their continuous deployment process, which they’ve open sourced. Deployinator enabled quick and safe deployments, by allowing them to create “the smallest number of steps, with the smallest number of people and the smallest amount of ceremony required to get new code running on your servers.”
This was also the year when genuine collaboration started to happen between “people in management and people in the trenches,” ending the era when management would just say, “Go do this” or “Go do that.” People were happy to come to work and to contribute in ways beyond their job description. And interestingly, this was also the year when they eliminated scheduled downtime, because deployments were going so much more smoothly.
They summed up 2009 as the beginning of a “DevOps” culture, where the Berlin Wall between Development and IT Operations fell.
They left 2009 with the following action items: 1) Find the parts of your own organization that are causing you the most pain and try to stabilize them; 2) Hire staff that will make a difference; 3) Pick the projects that will make impact; 4) Get it done, just ship it.
(Again, when they say, “get it done, just ship it,” this is what we’re calling “keep batch sizes small and planning horizons short.”)
They left 2009 with $177MM in gross merchandise sales (up 103% vs. 2008), 320 million visitors (up 96%), and 9.45 billion page views.
This was the year that Etsy brought on Kellan Elliott-McCrea as VP of Engineering and John Allspaw as SVP of IT Operations, who also had worked with CTO Chad Dickerson (now CEO of Etsy). This is when they created the Code As Craft blog, one of the premier software engineering blogs, to share their practices and lessons learned.
This is also when they created the continuous integration and delivery team, creating a fully automated test program, which enabled developers to have the confidence to do high rates of deploys.
They also started to standardize and very deliberately reduce the supported infrastructure and configurations. One decision was to switch everything to PHP and MySQL. This was a philosophical decision, not a technology one: they wanted both Dev and Ops to be able to understand the stack, so that everyone could contribute if they wanted to, and so that everyone could read, rewrite, and fix someone else’s code.
They also adopted the mantra, “If it moves, graph it.” Etsy installed screens around the office that showed real-time graphs of what was up and running and what was down, in priority order. This way there were rarely questions about what one should be working on or which fire needed to be fought first.
Here’s one of their famous Graphite graphs, which aggregates metrics on a single timeline that also shows when every deployment occurs, enabling constant situational awareness of activity inside the organization (e.g., web pushes, search pushes, etc.) and the health of all the services.
In the picture below, note all the vertical lines in the monitoring graphs — each of those vertical lines is a code push, color coded by application/service. This is the famous “vertical line technology that we helped pioneer.” (Haha.)
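To make “If it moves, graph it” concrete, here is a minimal sketch of how a deploy script might record a push as a metric, assuming a StatsD daemon listening on the default UDP port 8125. The metric name `deploys.<app>`, the function names, and the host are hypothetical and not taken from the talk. Graphite can draw a series like this as vertical lines using its `drawAsInfinite` render function, which is one way to get the deploy markers described above.

```python
import socket


def statsd_counter(name: str, value: int = 1) -> bytes:
    """Encode a counter increment in the plain-text StatsD line format."""
    return f"{name}:{value}|c".encode("ascii")


def record_deploy(app: str, host: str = "127.0.0.1", port: int = 8125) -> bytes:
    """Send one 'a deploy happened' counter for the given application.

    The metric name 'deploys.<app>' is an illustrative assumption.
    UDP is fire-and-forget, so a missing StatsD daemon does not block
    or fail the deploy itself.
    """
    payload = statsd_counter(f"deploys.{app}")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    # Example: call this at the end of a web push.
    record_deploy("web")
```

Because each push becomes one data point per application, overlaying these series on an error-rate or latency graph shows at a glance whether a change in behavior lines up with a deploy.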
They currently have over 7000 checks for 700 hosts. “We used to have lots more, but we got rid of unimportant checks that were resulting in unimportant 3am wakeup calls. Stop being woken up at 3am for broken things!”
They created and implemented the “Developer on Call” program, to address the problem of IT Operations asking, “why should I be the only person waking up at 3am?” To create more developer responsibility and accountability, and to ensure that IT Operations had the necessary resources on hand during deployments, each developer rotated to be on call for one week. With the company’s current size, this translates to one week every three years that a developer would have to be on call 24/7.
This furthered the cultural norm that developers take responsibility for rollbacks and fixing forward when deployments go wrong.
They also started continual A/B testing, prototypes (users who opt in to beta test features become part of the A/B population that gets served new features), feature flags and ramp-ups, and “Schema Change Thursdays.” This last one is fascinating. Rembetsy says, “We stopped changing schemas whenever we feel like it. Now we batch them all up and do them only once per week.”
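As an illustration of how feature flags with ramp-ups and opt-in prototypes can fit together, here is a minimal sketch. It is not Etsy’s implementation: the function names, the flag name, and the choice of SHA-256 bucketing are all assumptions made for this example. Each user is hashed into a stable bucket from 0 to 99, and a feature is on for a user when their bucket falls below the current ramp-up percentage or when they have opted in.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, percent: int, opted_in: bool = False) -> bool:
    """Decide whether a user sees a feature.

    percent  -- current ramp-up level, 0 (off) to 100 (everyone)
    opted_in -- hypothetical 'prototype' opt-in that always gets the feature
    """
    if opted_in:
        return True
    return bucket(flag, user_id) < percent


if __name__ == "__main__":
    # Ramp a hypothetical flag from 10% to 50% and count who is included.
    users = [f"user-{n}" for n in range(1000)]
    for level in (10, 50):
        enabled = sum(is_enabled("new_checkout", u, level) for u in users)
        print(f"{level}% ramp: {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the flag name and the user ID, a user who sees the feature at 10% keeps seeing it as the ramp-up widens, and code can ship dark at 0% long before anyone is exposed to it.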
The management ideals that started getting formed included:
Their lessons for the year included:
2010 Etsy business stats: Gross merchandise sales: $307MM (up 73%), 534MM visitors (up 66%), 147MM unique visitors (up 58%), and 9.3 billion page views (up 43%).
This is the year where they eradicated non-standard technologies that the entire company couldn’t get entirely behind. This is when Mongo, Scala, CoffeeScript, Python and many more great technologies were taken out of production. (They recommend checking out Ross Snyder’s Surge 2011 talk, “Scaling Etsy: What Went Wrong and What Went Right”).
Incidentally, this is also the year when the toxic tool Sprouter was killed. Rembetsy showed a graph of the number of Sprouter calls over time: “we were able to drive it to zero by opportunistically removing it bit by bit, and finally it died, too.”
This is also the year when many core Etsy technologies were open sourced, joining Deployinator (now being used by Rackspace): statsd, logster and many more. The company decided that all engineers should be contributing back in one of three ways each year:
(And by the way, this really shows. I’ve been going to many conferences this year, and my observation is that there are always fantastic talks being given by folks from Etsy and Netflix.)
Some of their current initiatives and achievements:
Here are some of the 2011 accomplishments:
I noted with interest that despite their use of continuous deployment, and developers doing routine deployments, they still achieved PCI DSS compliance in six weeks. Rembetsy said, “We haven’t allowed PCI DSS to change the culture of the company. All the separation of duty requirements can still be fulfilled.”
They closed 2011 with $526MM in gross merchandise sales (up 71%), 895MM visitors (up 67%) and 12 billion page views (up 40%).
Their action items list for the next year:
Explosive growth in hiring and in the company has led to some novel and wonderful cultural hacks. They have a game they called “Guess That Admin,” to address the fact that with so many new faces, people don’t get a chance to meet everyone. This is an internal game where the person who recognizes the most new people wins. (Guess That Admin was also created during one of their Hack Weeks. The page below uses pictures pulled from their LDAP servers.)
They have events like “Meetsy” (suggested lunch groups to meet people you may not work directly with) and “Eatsy” (where the entire company eats together). They now invite local NYC companies, such as Tumblr, to have mini-conferences onsite at Etsy, creating a community of practice between practitioners.
And because of new engineering challenges, they’re looking at using non-standard technologies, such as Redis.
Current works in progress include:
Their list of action items coming out of the year includes:
Rembetsy then took some time to analyze and rebut some of the things he’s heard, mostly predictions that “when Etsy hits 500 people, that’s when everything will fall apart.” He argued articulately that they’ve created a culture that can perpetuate itself, and I certainly agree.
I wish I had some of these insights while I was at Tripwire, where I was CTO for 13 years. What a wonderful difference it would have made!
Way to go, guys. Amazing talk.
I mentioned that this talk was one of the best presentations I’ve seen on transforming IT. For the record, the other presentation that I’d put in the same league was given by Kevin Behr in 2003, describing the transformation he helped lead at IP Services. This was actually the basis of what became the Visible Ops Handbook!
I include it here for posterity.
Gene Kim has been studying high-performing technology organizations since 1999. He was the founder and CTO of Tripwire, Inc., an enterprise security software company, where he served for 13 years. His books have sold over 1 million copies—he is the WSJ bestselling author of Wiring the Winning Organization, The Unicorn Project, and co-author of The Phoenix Project, The DevOps Handbook, and the Shingo Publication Award-winning Accelerate. Since 2014, he has been the organizer of DevOps Enterprise Summit (now Enterprise Technology Leadership Summit), studying the technology transformations of large, complex organizations.