Over the past two posts in this series, we have discussed the necessary steps to start your DevOps transformation.
- We covered the three key components to consider in choosing a starting place in this post: Selecting Which Value Stream to Start With
- We covered how value is delivered to the customer and how to improve flow in this post: Understand the Work in Our Value Stream and Improving Flow
This week, based on the newly updated and expanded second edition of The DevOps Handbook, we are learning how and why to design with Conway’s law in mind.
Our goals in this post will be:
- Understanding Conway’s Law and its impact on the performance of our value stream
- Evaluating our organizational archetypes
- Developing the habits and capabilities in people and the workforce as a means of facilitating these structures
Conway’s Law has a tremendous impact on the performance of our value stream.
To illustrate this, let me share a story. In 1968, Dr. Conway performed a famous experiment.
He worked with a contract research organization of eight people who were commissioned to produce a COBOL and an ALGOL compiler. During the experiment, he observed, “After some initial estimates of difficulty and time, five people were assigned to the COBOL job and three to the ALGOL job. The resulting COBOL compiler ran in five phases, the ALGOL compiler ran in three.”
These observations led to what is now known as Conway’s Law, which states:
“Organizations which design systems…are constrained to produce designs which are copies of the communication structures of these organizations… The larger an organization is, the less flexibility it has and the more pronounced the phenomenon.”
In other words, how we organize our teams has a powerful effect on the software we produce, as well as our resulting architectural and production outcomes.
In order to get fast flow of work from Development into Operations, with high quality and great customer outcomes, we must organize our teams so that Conway’s Law works to our advantage.
We begin this process by evaluating the organizational archetypes.
In the field of decision sciences, there are three primary types of organizational structures that inform how we design our DevOps value streams with Conway’s Law in mind: functional, matrix, and market.
They are defined by Dr. Roberto Fernandez as follows:
- Functional-oriented organizations optimize for expertise, division of labor, or reducing cost. These organizations centralize expertise, which helps enable career growth and skill development, and often have tall hierarchical organizational structures. This has been the prevailing method of organization for Operations (i.e., server admins, network admins, database admins, and so forth are all organized into separate groups).
- Market-oriented organizations optimize for responding quickly to customer needs. These organizations tend to be flat, composed of multiple, cross-functional disciplines (e.g., marketing, engineering, etc.), which often lead to potential redundancies across the organization. This is how many prominent organizations adopting DevOps operate—in extreme examples, such as at Amazon or Netflix, each service team is simultaneously responsible for feature delivery and service support.
- Matrix-oriented organizations attempt to combine functional and market orientation. However, as many who work in or manage matrix organizations observe, matrix organizations often result in complicated organizational structures, such as individual contributors reporting to two managers or more, and sometimes achieving neither of the goals of functional or market orientation.
In traditional IT Operations organizations, we often use functional orientation to organize our teams by their specialties.
However, several problems can arise from overly functional orientation (“optimizing for cost”).
For example, when we put the database administrators in one group, the network administrators in another, the server administrators in a third, and so forth, one of the most visible consequences is long lead times. This is especially true for complex activities like large deployments, where we must open tickets with multiple groups and coordinate work handoffs, resulting in our work waiting in long queues at every step.
In addition to these long queues and long lead times, this situation results in poor handoffs, large amounts of re-work, quality issues, bottlenecks, and delays.
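To make the cost of those queues concrete, here is a minimal sketch that adds up waiting time and hands-on time across a chain of handoffs. The team names echo the example above, but every number is a made-up illustration rather than measured data, and the `Step` class and both helper functions are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One handoff in a deployment. All figures used below are hypothetical."""
    team: str
    queue_hours: float  # time the ticket waits before anyone picks it up
    work_hours: float   # hands-on time once the work starts


def lead_time(steps):
    """Total elapsed hours: waiting plus working, summed over every handoff."""
    return sum(s.queue_hours + s.work_hours for s in steps)


def hands_on_share(steps):
    """Fraction of the lead time spent on actual work rather than waiting."""
    total = lead_time(steps)
    if total == 0:
        return 0.0
    return sum(s.work_hours for s in steps) / total


# Illustrative numbers only: three functional groups, each with its own queue.
deployment = [
    Step("database admins", queue_hours=40.0, work_hours=2.0),
    Step("network admins", queue_hours=24.0, work_hours=1.0),
    Step("server admins", queue_hours=32.0, work_hours=3.0),
]

print(f"Lead time: {lead_time(deployment):.0f} hours")
print(f"Hands-on share: {hands_on_share(deployment):.1%}")
```

With these assumed figures, six hours of hands-on work take more than a hundred hours to get through the organization, because almost all of the elapsed time is spent waiting in queues between groups.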
This gridlock impedes the achievement of important organizational goals, which often far outweigh the desire to reduce costs.
Similarly, functional orientation can also be found with centralized QA and Infosec functions, which may have worked fine (or at least, well enough) when performing less frequent software releases.
However, as we increase the number of Development teams and their deployment and release frequencies, most functionally-oriented organizations will have difficulty keeping up and delivering satisfactory outcomes, especially when their work is being performed manually.
Therefore, broadly speaking, to achieve DevOps outcomes we need to reduce the effects of functional orientation (“optimizing for cost”) and enable market orientation (“optimizing for speed”).
This means having many small teams working safely and independently, quickly delivering value to the customer.
Taken to the extreme, market-oriented teams are responsible not only for feature development, but also for testing, securing, deploying, and supporting their service in production, from idea conception to retirement.
These teams are designed to be cross-functional and independent—able to design and run user experiments, build and deliver new features, deploy and run their service in production, and fix any defects without manual dependencies on other teams, thus enabling them to move faster.
This model has been adopted by Amazon and Netflix and is touted by Amazon as one of the primary reasons behind their ability to move fast even as they grow.
To achieve market orientation, we won’t do a large, top-down reorganization, which often creates large amounts of disruption, fear, and paralysis. Instead, we will embed the functional engineers and skills (e.g., Ops, QA, Infosec) into each service team, or provide their capabilities to teams through automated self-service platforms that provide production-like environments, initiate automated tests, or perform deployments.
This enables each service team to independently deliver value to the customer without having to open tickets with other groups, such as IT Operations, QA, or Infosec.
However, having just recommended market-oriented teams, it is worth pointing out that it is possible to create effective, high-velocity organizations with functional orientation.
Cross-functional and market-oriented teams are one way to achieve fast flow and reliability, but they are not the only path. We can also achieve our desired DevOps outcomes through functional orientation, as long as everyone in the value stream views customer and organizational outcomes as a shared goal, regardless of where they reside in the organization.
In fact, many of the most admired DevOps organizations retain functional orientation of Operations, including Etsy, Google, and GitHub.
What these organizations have in common is a high-trust culture that enables all departments to work together effectively, where all work is transparently prioritized and there is sufficient slack in the system to allow high-priority work to be completed quickly.
Now that we’ve evaluated the archetypes of your organization, we will look at developing the habits and capabilities in people and the workforce as a means of facilitating these structures.
Developing the Right Habits and Capabilities in Your Team
To employ these structures correctly, testing, operations, and security need to be, first and foremost, everyone’s job, every day.
In high-performing organizations, everyone within the team shares a common goal—quality, availability, and security aren’t the responsibility of individual departments, but are a part of everyone’s job, every day.
This means that the most urgent problem of the day may be working on or deploying a customer feature or fixing a Severity 1 production incident.
Alternatively, the day may require reviewing a fellow engineer’s change, applying emergency security patches to production servers, or making improvements so that fellow engineers are more productive.
Secondly, we need to enable every team member to be a generalist.
In extreme cases of a functionally-oriented Operations organization, we have departments of specialists, such as network administrators, storage administrators, and so forth.
When departments over-specialize, it causes siloization, which Dr. Spear describes as departments operating more like “sovereign states.”
Any complex operational activity then requires multiple handoffs and queues between the different areas of the infrastructure, leading to longer lead times (e.g., because every network change must be made by someone in the networking department).
Because we rely upon an ever increasing number of technologies, we must have engineers who have specialized and achieved mastery in the technology areas we need. However, we don’t want to create specialists who are “frozen in time,” only understanding and able to contribute to that one area of the value stream.
One countermeasure is to enable and encourage every team member to be a generalist.
We do this by providing opportunities for engineers to learn all the skills necessary to build and run the systems they are responsible for, and regularly rotating people through different roles.
The term full stack engineer is now commonly used (sometimes as a rich source of parody) to describe generalists who are familiar with, or at least have a general understanding of, the entire application stack (e.g., application code, databases, operating systems, networking, cloud).
When we value people merely for their existing skills or performance in their current role rather than for their ability to acquire and deploy new skills, we (often inadvertently) reinforce what Dr. Carol Dweck describes as the fixed mindset, where people view their intelligence and abilities as static “givens” that can’t be changed in meaningful ways.
Instead, we want to encourage learning, help people overcome learning anxiety, help ensure that people have relevant skills and a defined career road map, and so forth. By doing this, we help foster a growth mindset in our engineers—after all, a learning organization requires people who are willing to learn.
Next, we’ll look at how the way we fund our teams also affects our outcomes.
How Funding and Team Size Affect Outcomes
One way to enable high-performing outcomes is to create stable service teams with ongoing funding to execute their own strategy and roadmap of initiatives. These teams have the dedicated engineers needed to deliver on concrete commitments made to internal and external customers, such as features, stories, and tasks.
Contrast this to the more traditional model where Development and Test teams are assigned to a “project” and then reassigned to another project as soon as the project is completed and funding runs out.
This leads to all sorts of undesired outcomes, including developers being unable to see the long-term consequences of decisions they make (a form of feedback) and a funding model that only values and pays for the earliest stages of the software life cycle, which, tragically, is also the least expensive part for successful products or services.
Our goal with a product-based funding model is to value the achievement of organizational and customer outcomes, such as revenue, customer lifetime value, or customer adoption rate, ideally with the minimum of output (e.g., amount of effort or time, lines of code).
Contrast this to how projects are typically measured, such as whether they were completed within the promised budget, time, and scope.
Finally, by creating loosely-coupled architectures and designing team boundaries to enable developer productivity and safety, we can improve deployment outcomes.
When we have a tightly coupled architecture, small changes can result in large scale failures.
As a result, anyone working in one part of the system must constantly coordinate with anyone else working in another part of the system they may affect, including navigating complex and bureaucratic change management processes.
In contrast, having architecture that is loosely coupled means that services can update in production independently, without having to update other services.
Randy Shoup, former Engineering Director for Google App Engine, observed that “organizations with these types of service-oriented architectures, such as Google and Amazon, have incredible flexibility and scalability. These organizations have tens of thousands of developers where small teams can still be incredibly productive.”
One way to keep team sizes small is to design our team boundaries in accordance with Conway’s Law.
As organizations grow, one of the largest challenges is maintaining effective communication and coordination between people and teams.
All too often, when people and teams reside on a different floor, in a different building, or in a different time zone, creating and maintaining a shared understanding and mutual trust becomes more difficult, impeding effective collaboration. Collaboration is also impeded when the primary communication mechanisms are work tickets and change requests, or worse, when teams are separated by contractual boundaries, such as when work is performed by an outsourced team.
Conway’s Law helps us design our team boundaries in the context of desired communication patterns, but it also encourages us to keep our team sizes small, reducing the amount of inter-team communication and encouraging us to keep the scope of each team’s domain small and bounded.
As part of its transformation initiative away from a monolithic code base in 2002, Amazon used the two-pizza rule to keep team sizes small—a team only as large as can be fed with two pizzas—usually about five to ten people.
This limit on size has four important effects:
- It ensures the team has a clear, shared understanding of the system they are working on. As teams get larger, the amount of communication required for everybody to know what’s going on scales in a combinatorial fashion.
- It limits the growth rate of the product or service being worked on. By limiting the size of the team, we limit the rate at which their system can evolve. This also helps to ensure the team maintains a shared understanding of the system.
- It decentralizes power and enables autonomy. Each two-pizza team (2PT) is as autonomous as possible. The team’s lead, working with the executive team, decides on the key business metric that the team is responsible for, known as the fitness function, which becomes the overall evaluation criteria for the team’s experiments. The team is then able to act autonomously to maximize that metric.
- Leading a 2PT is a way for employees to gain some leadership experience in an environment where failure does not have catastrophic consequences.
An essential element of Amazon’s strategy was the link between the organizational structure of a 2PT and the architectural approach of a service-oriented architecture.
Amazon CTO Werner Vogels explained the advantages of this structure to Larry Dignan of Baseline in 2005. Dignan writes:
“Small teams are fast…and don’t get bogged down in so-called administrivia….Each group assigned to a particular business is completely responsible for it….The team scopes the fix, designs it, builds it, implements it and monitors its ongoing use. This way, technology programmers and architects get direct feedback from the business people who use their code or applications—in regular meetings and informal conversations.”
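The first effect listed above notes that the communication needed for everyone to know what is going on scales in a combinatorial fashion. One common way to illustrate this (an assumption on our part, not a formula from the source) is to count the distinct person-to-person pairs in a team, which is n(n - 1)/2 for n people. The function name `communication_paths` is hypothetical.

```python
def communication_paths(team_size: int) -> int:
    """Count the distinct person-to-person pairs in a team of `team_size` people."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    # Each of the n people can pair with n - 1 others; halve to avoid double counting.
    return team_size * (team_size - 1) // 2


# Compare a two-pizza-sized team with progressively larger groups.
for size in (5, 10, 20, 50):
    print(f"{size:>3} people -> {communication_paths(size):>5} possible pairs")
```

Under this simple model, a team of five has 10 possible pairs and a team of ten has 45, while a group of fifty has 1,225, which is one way to see why keeping teams within the two-pizza range keeps a shared understanding achievable.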
With these pieces in place, we can see how architecture and organizational design can dramatically improve our outcomes.
Done incorrectly, Conway’s Law will ensure that the organization creates poor outcomes, preventing safety and agility.
Done well, the organization enables developers to safely and independently develop, test, and deploy value to the customer.