The DevOps Handbook is now available. This is one of over 40 case studies you will find in the book.
How we organize our teams affects how we perform our work.
Dr. Melvin Conway illustrated this with a famous experiment he performed in 1968 with a contract research organization whose eight people were commissioned to produce a COBOL compiler and an ALGOL compiler.
He observed, “After some initial estimates of difficulty and time, five people were assigned to the COBOL job and three to the ALGOL job. The resulting COBOL compiler ran in five phases, the ALGOL compiler ran in three.”
These observations led to what is now known as Conway’s Law, which states that:
“Organizations which design systems…are constrained to produce designs which are copies of the communication structures of these organizations…. The larger an organization is, the less flexibility it has and the more pronounced the phenomenon.”
Eric S. Raymond, author of the book The Cathedral and the Bazaar: Musings on Linux and Open Source by an Accidental Revolutionary, crafted a simplified (and now, more famous) version of Conway’s Law in his Jargon File:
“The organization of the software and the organization of the software team will be congruent; commonly stated as ‘if you have four groups working on a compiler, you’ll get a 4-pass compiler.’”
In other words, how we organize our teams has a powerful effect on the software we produce, as well as our resulting architectural and production outcomes.
To get fast flow of work from Development into Operations, with high quality and great customer outcomes, we must organize our teams and our work so that Conway’s Law works to our advantage. If we organize them poorly, Conway’s Law will prevent teams from working safely and independently; instead, they will be tightly coupled, all waiting on each other for work to be done, with even small changes creating potentially global, catastrophic consequences.
An example of how Conway’s Law can either impede or reinforce our goals can be seen in a technology that was developed at Etsy called Sprouter.
Etsy began its DevOps journey in 2009 and is one of the most admired DevOps organizations, with 2014 revenue of nearly $200 million and a successful IPO in 2015.
Originally developed in 2007, Sprouter connected people, processes, and technology in ways that created many undesired outcomes.
Sprouter, shorthand for “stored procedure router,” was originally designed to help make life easier for the developers and database teams. As Ross Snyder, a senior engineer at Etsy, said during his presentation at Surge 2011:
“Sprouter was designed to allow the Dev teams to write PHP code in the application, the DBAs to write SQL inside Postgres, with Sprouter helping them meet in the middle.”
Sprouter resided between their front-end PHP application and the Postgres database, centralizing access to the database and hiding the database implementation from the application layer.
The problem was that adding any changes to business logic resulted in significant friction between developers and the database teams.
As Snyder observed:
“For nearly any new site functionality, Sprouter required that the DBAs write a new stored procedure. As a result, every time developers wanted to add new functionality, they would need something from the DBAs, which often required them to wade through a ton of bureaucracy.”
In other words, developers creating new functionality depended on the DBA team, and that dependency had to be prioritized, communicated, and coordinated, resulting in work sitting in queues, meetings, longer lead times, and so forth.
This is because Sprouter created a tight coupling between the development and database teams, preventing developers from being able to independently develop, test, and deploy their code into production.
Also, the database stored procedures were tightly coupled to Sprouter—any time a stored procedure was changed, it required changes to Sprouter too.
The result was that Sprouter became an ever-larger single point of failure. Snyder explained that everything was so tightly coupled, and required such a high level of synchronization, that almost every deployment caused a mini-outage.
Both the problems associated with Sprouter and their eventual solution can be explained by Conway’s Law.
Etsy initially had two teams, the developers and the DBAs, each responsible for one of the service’s two layers: the application logic layer and the stored procedure layer. Two teams working on two layers, as Conway’s Law predicts. Sprouter was intended to make life easier for both teams, but it didn’t work as expected: when business rules changed, instead of changing only two layers, they now needed to change three (the application, the stored procedures, and now Sprouter).
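To make the three-layer coupling concrete, here is a minimal sketch of a router that sits between application code and stored procedures. It is an illustration only, not Sprouter's actual design: the names `STORED_PROCEDURES`, `ROUTER_ROUTES`, `router_call`, `show_listing`, and `layers_touched_by` are all hypothetical, and an in-memory dictionary stands in for the database.

```python
# Hypothetical three-layer setup; names and structure are illustrative only
# and are not taken from Sprouter itself.

# Layer 1 (owned by the DBAs): stored procedures living in the database.
STORED_PROCEDURES = {
    "get_listing_title": lambda db, listing_id: db[listing_id]["title"],
}

# Layer 2 (the router): an explicit set of procedures the application may call.
ROUTER_ROUTES = {"get_listing_title"}


def router_call(db, procedure, *args):
    """Forward an application request to a stored procedure, if it is routed."""
    if procedure not in ROUTER_ROUTES:
        raise LookupError(f"router has no route for {procedure!r}")
    if procedure not in STORED_PROCEDURES:
        raise LookupError(f"database has no procedure {procedure!r}")
    return STORED_PROCEDURES[procedure](db, *args)


# Layer 3 (owned by the developers): application code that uses the router.
def show_listing(db, listing_id):
    return router_call(db, "get_listing_title", listing_id)


def layers_touched_by(new_procedure):
    """List the layers that must change before the app can use a procedure."""
    touched = ["application"]  # the new feature itself
    if new_procedure not in STORED_PROCEDURES:
        touched.append("stored procedures")
    if new_procedure not in ROUTER_ROUTES:
        touched.append("router")
    return touched


db = {1: {"title": "Hand-knit scarf", "price_cents": 4500}}
print(show_listing(db, 1))
print(layers_touched_by("get_listing_price"))
```

In this sketch, a feature that needs a new query cannot ship from application code alone: it also waits on a new stored procedure and a new route, which is the kind of cross-team dependency the case study describes.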
The resulting challenges of coordinating and prioritizing work across three teams significantly increased lead times and caused reliability problems. And then, in the spring of 2009, as part of what Snyder called “the great Etsy cultural transformation,” Chad Dickerson joined as their new CTO.
Dickerson set many things in motion, including a massive investment in site stability, having developers perform their own deployments into production, and a two-year journey to eliminate Sprouter.
To do this, the team decided to move all the business logic from the database layer into the application layer, removing the need for Sprouter. They created a small team that wrote a PHP Object Relational Mapping (ORM) layer, enabling the front-end developers to make calls directly to the database and reducing the number of teams required to change business logic from three teams down to one team.
As Snyder described:
“We started using the ORM for any new areas of the site and migrated small parts of our site from Sprouter to the ORM over time. It took us two years to migrate the entire site off of Sprouter. And even though we all grumbled about Sprouter the entire time, it remained in production throughout.”
By eliminating Sprouter, they also eliminated the problems associated with multiple teams needing to coordinate for business logic changes, decreased the number of handoffs, and significantly increased the speed and success of production deployments, improving site stability.
Furthermore, because small teams could independently develop and deploy their code without requiring another team to make changes in other areas of the system, developer productivity increased.
Among other things, an ORM abstracts a database, enabling developers to query and manipulate data as if it were merely another object in the programming language. Popular ORMs include Hibernate for Java, SQLAlchemy for Python, and ActiveRecord for Ruby on Rails.
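As a rough illustration of that idea, the sketch below hand-rolls a tiny mapper on top of Python's standard `sqlite3` module. It is not Etsy's PHP ORM and not the API of any of the libraries named above; the `listings` table, the `Listing` class, and the `ListingMapper` class are hypothetical names chosen for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Listing:
    """A row of the hypothetical `listings` table, seen as a plain object."""
    id: int
    title: str
    price_cents: int


class ListingMapper:
    """Tiny hand-rolled mapper; a real ORM provides this plumbing for you."""

    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS listings ("
            "id INTEGER PRIMARY KEY, "
            "title TEXT NOT NULL, "
            "price_cents INTEGER NOT NULL)"
        )

    def add(self, title, price_cents):
        """Insert a row and return it as a Listing object."""
        cur = self.conn.execute(
            "INSERT INTO listings (title, price_cents) VALUES (?, ?)",
            (title, price_cents),
        )
        return Listing(cur.lastrowid, title, price_cents)

    def get(self, listing_id):
        """Load one row by primary key, or return None if it does not exist."""
        row = self.conn.execute(
            "SELECT id, title, price_cents FROM listings WHERE id = ?",
            (listing_id,),
        ).fetchone()
        return Listing(*row) if row else None

    def save(self, listing):
        """Write the object's current field values back to its row."""
        self.conn.execute(
            "UPDATE listings SET title = ?, price_cents = ? WHERE id = ?",
            (listing.title, listing.price_cents, listing.id),
        )


conn = sqlite3.connect(":memory:")
listings = ListingMapper(conn)
scarf = listings.add("Hand-knit scarf", 4500)
scarf.price_cents = 4000  # the business rule lives in application code
listings.save(scarf)
print(listings.get(scarf.id))
```

The point of the sketch is where the logic lives: the price change is ordinary application code operating on an object, so one team can make and deploy it without asking another team for a new stored procedure.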
Sprouter was finally removed from production and Etsy’s version control repositories in early 2011.
As Snyder and Etsy experienced, how we design our organization dictates how work is performed, and, therefore, the outcomes we achieve.
Endnotes and citations:
- Ross Snyder, “Scaling Etsy: What Went Wrong, What Went Right,” SlideShare, posted by Ross Snyder, October 5, 2011, http://www.slideshare.net/beamrider9/scaling-etsy-what-went-wrong-what-went-right
- Ross Snyder, “Surge 2011—Scaling Etsy: What Went Wrong, What Went Right,” YouTube video, posted by Surge Conference, December 23, 2011, https://www.youtube.com/watch?v=eenrfm50mXw
- Patrick McDonnell and Michael Rembetsy, “Continuously Deploying Culture: Scaling Culture at Etsy, Velocity Europe 2012,” SlideShare, posted by Patrick McDonnell, October 4, 2012, http://www.slideshare.net/mcdonnps/continuously-deploying-culture-scaling-culture-at-etsy-14588485