The “monoliths vs microservices” debate often focuses on technological aspects, ignoring strategy and team dynamics.
Instead of technology, smart-thinking organizations are beginning with team cognitive load as the guiding principle for modern software. In this presentation, Matthew Skelton & Manuel Pais explain how and why, illustrated by real case studies.
My name’s Matthew Skelton, and this is Manuel Pais, and together, we are the co-authors of a new book called Team Topologies. We’re here today to share with you some insights, advice, experiences on how to size software services, with a focus on team cognitive load. Today,
- We have a section where we’re looking at monoliths and microservices, different kinds of sizes of software.
- We’ll then look at what we mean by team cognitive load.
- Manuel will then take us through some case studies. Organizations that have used team cognitive load as a way of helping them to evolve their software systems.
- And then, right at the end, we’ll look at a few tips for getting started with this approach.
In the past few years, many organizations have started to adopt microservices as a way of being able to deploy their software systems more rapidly with a greater focus on specific areas of the system.
But there’s often a lot of debate around what size a microservice should be. Should it be ten lines of code? Should it be a hundred lines of code?
And it starts to look a little bit like Mortal Kombat.
In the blue corner, we’ve got Tammer Saleh, who says, “Start with monolith and extract microservices.” And then, over on the other side of the arena, we’ve got Stefan Tilkov saying, “Don’t start with a monolith when your goal is a microservices architecture.” And then the wise words of Simon Brown, who says, “If you can’t build a monolith, what makes you think microservices are the answer?”
Where should we focus? I think that Daniel Terhorst-North has it right, in talking about software that fits in your head. And there’s an awful lot of experience and awareness behind that recommendation or that phrase. If we’re thinking about building software within the context of teams, teams owning and running software, we might rephrase this to be software that fits in our heads as a team, but the intent is the same.
If you’ve yet to buy or read a copy of Accelerate, you need to get yourself a copy. It’s very straightforward.
These are the four key metrics from Accelerate, based on five years’ worth of the State of DevOps Reports and assessments from many thousands of companies around the world. These four key metrics are strongly indicative of high organizational performance:
Lead time, deployment frequency, mean time to restore, and change fail percentage.
The problem is if the software we’re working with does not fit in our heads, these things are going to be very difficult to improve upon.
For example, if the lead time is the time from version control to production, and the software is too big, we’re likely to distrust the tests and to want more time to find out what’s going on, so the lead time is going to extend.
Same with deployment frequency. If we don’t understand the software well enough, are we going to have the confidence to deploy more and more frequently? Probably not. We’re probably going to want to restrict how many times we deploy.
Or, if the software we’re working with is too complex and too complicated, and fails in awkward ways in production, it’s going to be difficult for us to restore that service quickly. So, again our MTTR will extend.
If we want to start to move towards improving these types of metrics, as recommended by the Accelerate book, then we need to start thinking about the size of software that we’re expecting teams to work with.
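To make the four metrics concrete, here is a minimal Python sketch of how a team might compute them from its own deployment history. The `Deployment` record, its field names, and the `four_key_metrics` function are hypothetical and not taken from Accelerate. The sketch assumes lead time runs from version control to production, as described above, and that time to restore is measured from the failing deployment to the moment service is restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                  # change landed in version control
    deployed_at: datetime                   # change reached production
    failed: bool = False                    # change caused a failure in production
    restored_at: Optional[datetime] = None  # service restored, if it failed


def four_key_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Summarize the four metrics for a non-empty list of deployments."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")

    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: restore time starts at the failing deployment.
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]

    return {
        "median_lead_time": median(lead_times),
        "deployments_per_day": len(deployments) / period_days,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times else None
        ),
        "change_fail_percentage": 100 * len(failures) / len(deployments),
    }
```

Tracking these numbers per team over time is one way to notice when software is outgrowing the team: lead time and time to restore creeping up while deployment frequency falls.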
Software that is too big for our heads works against organizational agility.
And this is a different starting point compared to how many organizations and many people have thought about software, architecture, and so on. In the past, we’ve started with bits of technology. We’ve started with a database, etc. But if we start with the team and the cognitive load for that team, we get some different results.
Let’s have a little look at what we mean by team cognitive load.
In 1988, psychologist John Sweller defined the term cognitive load to mean the total amount of mental effort being used in the working memory. And there are three kinds of cognitive load that John Sweller identified: intrinsic, extraneous, and germane.
In the context of software development, we can think of them in these three ways.
We can think of intrinsic load as something like ‘how are classes defined in Java?’ We don’t have that front and center all the time. Once we’ve spent six months or a year doing Java development, it comes naturally and becomes an intrinsic part of how we work.
Extraneous is something that works against what we’re doing. Something like a distraction. So, ‘how do I deploy this app again? I can’t remember. It’s really awkward. I’ve got to set this config property, etc.’ This is extraneous cognitive load and it’s effectively valueless. We don’t want to have this type of cognitive load on our teams.
Germane cognitive load is a type of load that we have to deal with because this is the part of the business problem that we’re trying to solve. If we are building an app for online banking, then part of the germane cognitive load of the software developer or tester is building the application to do bank transfers. Because you need to have that type of load in your head as you’re building the software.
You can see these in a software context: intrinsic is the skills that we bring to the table, extraneous is stuff to do with the mechanisms of how we do things in a software world, and germane is the domain focus. It’s a bit more involved than that, but that’s how you could think of it.
What we’re trying to do is maximize or give the most space to the germane cognitive load. The intrinsic we have to deal with, we can’t get rid of it. We’re working with software, we’re working with computers, so these are things we just need to know. We’re trying to minimize and squeeze the extraneous cognitive load, and get rid of that as much as possible. If possible, just get rid of it entirely, so that we have the most space available for the germane cognitive load, the business focus of the problem we’re trying to deal with.
If you want to know more by the way about this in some detail, there’s a great presentation by Jo Pearce called ‘Hacking your Head’ with lots of slides, lots of videos etc. There’s some good material there.
This is the implication of what we’ve just been talking about. We should be thinking about limiting the size of software, services, and products to the cognitive load that the team can handle. So, we’re starting to take a socio-technical approach to building our software systems here. We don’t just pretend that we can throw any kind of software architecture or design or technology at a team and they’ll have to deal with it. We’re using the constraints, or properties, of the human systems that we have in our organizations, and working with them to produce a more effective delivery and software systems.
This, again, is software that fits in our heads. This is quite a different approach to thinking about software boundaries. This may feel very unfamiliar to many people, but not to everyone, and there are organizations already doing this as we’ll see very shortly.
But it does feel a bit unusual. When we’re talking about teams, we’re talking about a group of probably fewer than about nine people. There are evolutionary reasons for this. Some organizations have found patterns where you’re able to bring two of these kinds of teams together in close harmony.
Think about a rugby team. You effectively have two teams operating closely together: the forwards and the backs. I don’t play rugby, but I’ve spoken to people who do, and they say it feels a little bit like there are two separate teams working closely together. Some organizations have found ways in which they can do that, but generally speaking, we’re talking about a cohesive, long-lived group of around nine people in total who work together on the same set of business problems for an extended period.
We hear a lot about ownership of software services and how important that is. It needs to get to the point where every service is fully owned by a team with sufficient cognitive capacity to build and operate it. In the words of Andy Burgin from Sky Betting and Gaming, “you build it, you run it, you fix it, you support it, you diagnose it.” That’s what we’re talking about here. There are no services and no products which do not have an owner.
There are techniques to help us do that.
- Mobbing: We have techniques like mobbing, which apply to the whole team and help the team to own that service.
- DDD: We have techniques like domain-driven design (DDD) to help us choose domain boundaries in an effective way that really works for the business context.
- Developer Experience: We’ve heard many people talk about the importance of developer experience. Particularly when building a platform, making sure that platform is compelling and very easy and natural for product teams, development teams to use. So, we’re making sure we’re explicitly addressing developer experience particularly when we’re building a platform, sure — but to be honest, when we’re doing anything inside our organization where other people need to use our software.
- Operator Experience: But we also need to think about the operator experience. What about the people who need to run this stuff? People who are on call? How easy is it to diagnose these systems and so on. If we’ve built a system that’s fine for our team, but when we’ve handed it over to another team it’s terrible, the cognitive load is way too high, and we’re in a bad place. We need to focus on operability to make our stuff work.
- TVP: Finally, there’s another technique that we have in the book called ‘Thinnest Viable Platform.’ This is an approach where we are super explicit about what our platform is and what it looks like. It’s also important to make sure that it’s not bigger than necessary, hence thinnest viable.
If you’re a start-up, and you’re quite small and there’s only maybe ten, fifteen people in your organization, the underlying platform is going to be something like AWS or Azure or Google Cloud, etc. But you might decide to build an extra platform layer on top of that. Your platform might simply be a Wiki page listing the five services that you are going to use from AWS. And if you don’t need to build anything more, don’t build anything more. That’s enough. That is your thinnest viable platform, just a wiki page with the list of five services.
We’re not trying to build a huge great thing. We need to make sure that whatever we build is compelling to use and has a strong developer experience, and that we’re treating product teams as customers.
I’ve talked about a few different team types. In the book, we’ve identified four different kinds of teams which, as far as we can see, are the only types of teams needed in the context of building modern software systems.
The first team type is the most fundamental and this is the Stream-aligned team.
The team that is aligned to part of the value stream for the business and they have end-to-end responsibility for building, deploying, running, supporting, and eventually retiring that slice of the business domain or that slice of service. The other types of teams listed below are effectively there to reduce the cognitive load of the Stream-aligned team. That’s how we can see it.
If we’ve chosen our domain boundaries well, the Stream-aligned team should have everything they need to deploy changes for that part of the business system. But they can’t do everything, they need some supporting services from a platform, for example. We need some support from the platform so we don’t have to think about ‘how do we spin a Kubernetes cluster?’ Because that would increase the cognitive load compared to deploying something more business-focused.
Likewise for a complicated subsystem team. If there’s a part of the system where, in the case of media streaming, we need to write a specialized video transcoding component, we’ll probably hire some people with PhDs in math and get them to work on that complicated subsystem. We’re taking that cognitive load off the Stream-aligned team so it can focus on the end-to-end customer experience.
Enabling teams help to up-skill the Stream-aligned teams, typically on a temporary basis. And also to detect if there are any gaps in the platform or gaps in what the Stream-aligned teams are expected to do.
Here’s an example:
This is an organization where we’ve got three Stream-aligned teams. We’ve got a platform underneath. We’ve got a complicated subsystem on the left in red. And towards the right-hand side, we’ve got one of the enabling teams facilitating two of the Stream-aligned teams. Perhaps they’re moving from one container platform to another, something like that, and they’re just trying to get up to speed.
Another key idea in the book that we’ve identified is the need to be much more explicit about the ways in which teams interact. Because what we can see from our experience, and what we hear from other people talking about their experiences is that in many organizations, teams don’t understand why or how they should interact with other teams.
What we’ve defined are three interaction modes. Part of the purpose of these three interaction modes is to help reduce confusion and effectively reduce the irrelevant cognitive load, so that it’s easier for teams to understand how they should be operating effectively. If the complicated subsystem team, for our transcoding component let’s say, is busy building, then we can set up the expectation that they’re simply providing that component as a service to these two teams at the bottom. Then all three teams involved in that interaction have a clear understanding about how they’re supposed to interact, i.e. how they’re supposed to provide something or consume something, and we’ve minimized the cognitive load around how we should operate as a team.
Similarly, if the Stream-aligned team at the bottom is currently collaborating with the platform to discover something about, let’s say, logging or a better way of doing Kubernetes, they know that for a period of time, their cognitive load is going to be higher because they’re working together closely with another team. But perhaps after three months, we finish that discovery and go back to consuming the container platform as a service. So, there are mechanisms here: if we can define clearer ways of working with other teams, we can address cognitive load and minimize it in different parts of the organization.
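One way to be explicit about team types and interaction modes is simply to write them down as data. The following Python sketch is illustrative only: the class names, the team names, and the `teams_with_elevated_load` helper are hypothetical, and the only rule it encodes is the one described above, that teams in a collaboration should expect a temporarily higher cognitive load.

```python
from dataclasses import dataclass
from enum import Enum


class TeamType(Enum):
    STREAM_ALIGNED = "stream-aligned"
    PLATFORM = "platform"
    COMPLICATED_SUBSYSTEM = "complicated subsystem"
    ENABLING = "enabling"


class InteractionMode(Enum):
    COLLABORATION = "collaboration"
    X_AS_A_SERVICE = "x-as-a-service"
    FACILITATING = "facilitating"


@dataclass(frozen=True)
class Interaction:
    """A declared, current interaction between two named teams."""
    team_a: str
    team_b: str
    mode: InteractionMode


def teams_with_elevated_load(interactions: list[Interaction]) -> list[str]:
    """Names of teams currently in a collaboration, sorted alphabetically."""
    names = set()
    for interaction in interactions:
        if interaction.mode is InteractionMode.COLLABORATION:
            names.update((interaction.team_a, interaction.team_b))
    return sorted(names)


# Hypothetical organization, loosely following the example in the talk.
teams = {
    "checkout": TeamType.STREAM_ALIGNED,
    "search": TeamType.STREAM_ALIGNED,
    "container-platform": TeamType.PLATFORM,
    "transcoding": TeamType.COMPLICATED_SUBSYSTEM,
    "delivery-coaching": TeamType.ENABLING,
}
current_interactions = [
    Interaction("checkout", "container-platform", InteractionMode.COLLABORATION),
    Interaction("transcoding", "search", InteractionMode.X_AS_A_SERVICE),
    Interaction("delivery-coaching", "search", InteractionMode.FACILITATING),
]
```

Keeping a list like this up to date makes it visible when a discovery period ends and an interaction should move from collaboration back to X-as-a-Service.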
Now, we’re going to look at some case studies from organizations.
I’m going to share a few of the case studies from the book.
The first one is a large worldwide retailer, and they’re still growing into new markets.
Back in 2016, they decided they wanted a new mobile site for one of these new markets. They put a team together from scratch. It was a cross-functional team with business people directly involved in the team. They had all the technical skills to have this end-to-end ownership that Matthew was talking about. They had good DevOps practices. Everything was in the cloud. The typical success story that you would include in a presentation like this. And so given that success, they were able to quickly release working versions of the mobile website and then iterate frequently.
After a while, they were asked to do the same for a new market, with a new mobile site. They wanted this to be rather independent, though, so that they could evolve the different sites for different markets more or less independently. In the backend, they started to have a need for a little bit more complexity. They needed a content management system so they could upload content to different sites, but overall, this was still working quite well.
And of course, over time, they were asked to do even more markets, and more sites. And the backend started to get a little bit more complicated. They needed a system to handle product management, product catalog. Different markets were going to have different sets of products and versions available and pricing, etc. which they needed to manage as well.
They also started this framework which was a collection of common services to all the sites. Things like searching for a product or uploading static files to a CDN, things that all the sites would need, but you wouldn’t want to repeat it for every code base.
I think you can tell probably what’s happening here.
As the system is growing, the team is growing along with it. By now, they have far more people than they did in the beginning. And so, it’s becoming a little bit of a monolith. And some of the people on the team started to realize that they had different workstreams going through the team. So, you could have feature requests for one of the market’s sites, and other feature requests for other markets. You also might have changes that need to be done in the CMS for the content editors and so on.
The fact that the system was a little bit monolithic by now meant that these workstreams were also impeding each other. There were dependencies that were slowing down the pace of delivery. The thing that had made them so successful in the beginning was now harder to achieve.
People also had to start specializing in certain parts of the system. While before it was pretty fluid, you’d get a change request or a feature and it would go, you would know exactly which parts of the system to change and get it out, now people were starting to specialize in specific parts.
There were two people in particular on this team who were in a senior architect role who started to realize this and even though the team worked quite well together, and were a high-performing team, they began to notice these dependencies.
Those two senior architects proposed to split the team in two, and they got a lot of pushback because the team members felt that they were working well together. But eventually, they did split the team up, and they got into the pattern Matthew mentioned: a paired team.
After doing some refactoring of the system and re-architecting a bit, they were able to split into two teams. One team that is more focused on the customer-facing applications and markets. And the other team focusing more on the CMS and this framework.
This worked quite well for them, and now these two teams were able to deliver more independently. There was still some correlation between the roadmaps for these two teams. And they had this communication going on on a regular basis, but they were much more independent at this point.
They realized at this point that there was too much cognitive load. The system was too large to handle as efficiently as before. And from what we’ve heard, they’ve gone on to break down the teams even further. I believe they now have smaller teams aligned to markets on the customer-facing side, and they have split the CMS and the framework, which is now a platform team with common services, as Matthew was mentioning. Overall, this worked quite well for them.
The key point here was as they grew and they were successful, the system became larger and the team became larger, and things were starting to not work as well. So, their flow of work was getting blocked or at least significantly delayed.
The critical thing is that some people in the team were listening to the signals that something was not as efficient as it was before. The software was getting too large in this monolithic architecture. And some people were overspecialized. If you’ve read The Phoenix Project, it’s the ‘Brent syndrome,’ where only this person or these couple of people know how to change that part of the system. This means that you’re introducing a dependency, even inside one team: only when those people are available are you able to get a change out the door, which overall increases the need to coordinate releases and introduces delays in delivery.
But it’s not always just about the size of the software that teams are responsible for; there are other types of responsibilities.
Take the case of OutSystems, who are one of the leading low-code platform vendors in the world.
A few years ago, they started an engineering productivity team. In the beginning, this team worked to help other teams with build, continuous integration, and test automation.
That’s what they started with. Their goal was to reduce cognitive load for the other engineering teams who were their ‘customers,’ if you like. They were helping them adopt good practices in these areas, setting up tooling in a good way, and just overall helping the engineering teams increase their maturity in these areas.
And again, they were quite successful.
What happened was that they took on more domains, particularly infrastructure automation and continuous delivery enablement. The team grew to cope with that. And the interesting fact here was that as this was happening, the other engineering teams were getting more mature, more advanced in the way they used test automation, CI/CD, etc. And so they were coming back to the productivity team with requests for help that were much more domain-specific.
What this productivity team now faced was a large number of requests across different domains coming in from different teams with specific needs. They were barely able to keep afloat and respond to these requests in a timely enough manner.
Inside the team, it became very difficult for any team member to understand all these different domains. People were in the practice of working on only one, or perhaps two, domains, and motivation went down significantly. Some of the people felt like they didn’t have enough effort available to master the domains that they were supposed to support and understand them in detail. And at the same time, they were spending a lot of time in planning meetings and standup meetings where most of the things being discussed were not directly related to the work that they were doing.
At this point, and this is quite recent, so late 2018, they made a bold decision to split into smaller teams, almost micro-teams, where any one team was only responsible for one of these domains and the early results were quite positive.
Motivation went up, and people felt like they had more autonomy to decide what the priorities were for their domain of responsibility. They were also able to interact much more closely with the other engineering teams, and they began to understand what problems those teams had and what the best solutions were. They also had a little bit of breathing space to master their domain, understand good practices, perhaps come to conferences like this, and get to know what other people are doing. So, naturally, the motivation went up, and there was a feeling of shared purpose inside each of these teams.
Obviously, there were still issues and requests that cut across some of these domains, as they are closely related, but it turns out that those are the exception. When that happens, people from different teams will come together. If needed, they will create a temporary team to work on that specific problem, and then go back to their original teams. In fact, before the split they were optimizing for this situation, even though it’s the exception.
This has worked quite well for them for now. There’s still communication going on between the different teams, but the bandwidth required is much lower. The key is that it’s not always about software size, but about aligning the number and complexity of the domains that the team is responsible for with their cognitive capacity.
If you aim for this type of pattern with smaller teams with high cohesion internally, high communication internally, and shared purpose — then you need some synchronization with other teams, but that can have much lower bandwidth. You don’t need to be communicating across all teams all the time. That can work quite well.
Finally, they were listening to the signal — what worked for them in the past or in the beginning, is now becoming a problem. Some people were not invested. Some people may be almost burned out because they were trying to keep up with all the different domains. They’d have to put in a lot of extra time to understand all of this. And definitely frequent context switching inside the team.
The last example is not from the book; it’s from a recent talk, again from Sky Betting and Gaming. Is it always a good pattern to split into smaller teams? Well, not necessarily. In this case, they decided to keep a large team of twelve people because they had different applications: some older applications that were making the money today, and new applications for more experimentation and trying new markets.
What happened was that within the same business domain, the demand for working on one part, older applications or newer, would change over time. In one quarter, maybe they would need to increase the resilience of the older systems, and they would spend most of the time on that. Then the next quarter, maybe they would want to push out new applications and try new things. So, it made sense to keep the same team, but within the team, there were clear workstreams. People knew who was focusing on which part, whether that be the older systems or the newer systems.
How do we get started with this type of approach?
A few ideas here.
- Simply ask the team members. Do a survey of the members of a given team on how well they understand the software they’re working on, and have them give it a score of 1-5. That gives you a very rough idea of which teams are currently struggling with the cognitive load of the systems they’re being asked to own and develop.
- Could there be something that’s a candidate for pushing into a platform? Don’t rush ahead and do it, but come up with a candidate list to start and have some conversations.
- Look for skills or capabilities that are missing within the team. It could be that the organization as a whole is missing those skills.
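The survey in the first tip can be summarized with a few lines of Python. This is a minimal sketch under stated assumptions: the `struggling_teams` function, the team names in the usage note, and the cut-off of 3.0 are all hypothetical, and the talk only suggests a rough 1-5 score, not any particular threshold.

```python
from statistics import mean


def struggling_teams(
    responses: dict[str, list[int]], threshold: float = 3.0
) -> list[str]:
    """Teams whose average 1-5 understanding score is below threshold.

    The result is ordered from the lowest average upwards. The default
    threshold of 3.0 is an illustrative assumption, not a recommendation.
    """
    averages = {}
    for team, scores in responses.items():
        if not scores:
            continue  # no answers yet, so nothing to say about this team
        if any(score < 1 or score > 5 for score in scores):
            raise ValueError(f"scores for {team!r} must be between 1 and 5")
        averages[team] = mean(scores)

    flagged = [team for team, average in averages.items() if average < threshold]
    return sorted(flagged, key=lambda team: averages[team])
```

For example, calling `struggling_teams({"payments": [2, 3, 2], "search": [4, 5, 4]})` would flag only the hypothetical `payments` team. The output is a conversation starter rather than a measurement; the point is to find the teams worth talking to first.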
What would happen if we adopted these three team interaction patterns that we saw earlier on:
- Close collaboration, so we know our cognitive load is going to be higher.
- X-as-a-Service, where we know we’re just supposed to consume something.
- Facilitating, we’re helping or being helped.
How would your teams react and behave in these contexts? Because you need to sense your organizational situation. The maturity or the dynamics within your organization as to where to start to apply some of these sorts of practices. Don’t rush in and do it.
Is your platform well defined? If not, go ahead and define it, and do so quite carefully. You’ll probably be surprised to find far more services than you expected being run by a small group of nearly burned-out platform engineers, and so, it’s time to do something about that. What is the thinnest platform that could work in your context? It doesn’t have to be thin in absolute terms, just the thinnest that works and no more.
This was an excerpt from a presentation by Matthew Skelton and Manuel Pais, authors of Team Topologies: Organizing Business and Technology Teams for Fast Flow.