Architecture as the Organizing Logic for Components, and the Means for their Construction
With Michael Nygard
On This Episode
In the latest episode of The Idealcast, Gene Kim is joined by Michael Nygard, a senior vice president at Sabre and author of the bestselling Release It! Nygard has helped businesses and technology leaders in their transformation journeys over his long career and was even one of the inspirations behind The Unicorn Project’s protagonist, Maxine.
In their discussion, Kim and Nygard explore how we can enable thousands or even tens of thousands of engineers to work together toward common objectives, including the structure and dynamics required to achieve it. They also examine what truly great architecture looks like and the continuing importance and relevance of Conway’s Law.
About the Guest
Michael Nygard strives to raise the bar and ease the pain for developers around the world. He shares his passion and energy for improvement with everyone he meets, sometimes even with their permission. Living with systems in production taught Michael about the importance of operations and writing production-ready software. Highly-available, highly-scalable commerce systems are his forte.
Michael has written and co-authored several books, including 97 Things Every Software Architect Should Know and the bestseller Release It!, a book about building software that survives the real world. He is a highly sought speaker who addresses developers, architects, and technology leaders around the world.
Michael is currently Senior Vice President, Travel Solutions Platform Development Enterprise Architecture, for Sabre, the company reimagining the business of travel.
You’ll Learn About
- How to build great architecture for large teams.
- The real implications of Conway’s Law.
- Architecture as an organizing logic and means of software construction.
- Real-life stories of technology leaders’ transformation journeys.
- Decentralized economic decision making.
- The fear cycle and predictability.
- The after effects of the Yegge memo.
- A great definition of what great architecture is.
- Leadership and the relationship between the business’ architecture and the technology architecture of the business.
Resources
- Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers) by Michael T. Nygard
- Clojure programming language
- Transaction Processing Facility (TPF) operating system
- Totality Corporation
- The Principles of Product Development Flow: Second Generation Lean Product Development by Donald G. Reinertsen
- MCDP1: Warfighting
- Conway’s law
- Team of Teams: New Rules of Engagement for a Complex World by General Stanley McChrystal with Tantum Collins, David Silverman and Chris Fussell
- The Fear Cycle by Michael T. Nygard
- State of DevOps Report
- DevOps Enterprise Summit 2020
- Coherence Penalty for Humans by Michael T. Nygard
- Michael Nygard on Cognicast podcast
Gene Kim (00:00:00): This episode is brought to you by IT Revolution, whose mission is to help technology leaders succeed through publishing and events. You're listening to the Idealcast with Gene Kim, brought to you by IT Revolution. In this episode of the Idealcast, I'm so delighted that I have on Mike Nygard, currently VP of Enterprise Architecture and Platform Development at Sabre, the software and technology company at the heart of the business of travel. Incidentally, it was Sabre in the 1970s that pioneered the entire category and capability of travel reservations and booking. It is difficult to overstate how much of my own work was deeply influenced and inspired by Mike Nygard.
I remember reading his seminal book, Release It! It was on a plane and I remember feeling not only sick reading it, but also angry. I wanted to throw it across the cabin because Mike was describing so many problems I had seen in my career with such incredible clarity. And he was describing the incredibly elegant solutions required to mitigate so many of those problems. His work has pioneered patterns that we're all familiar with now, such as circuit breakers, the bulkhead pattern, chaos engineering, and so many more.
And here's one particularly memorable interaction I had with Mike. It was at the Velocity 2013 Conference in Santa Clara. Among many other things, Mike showed me Clojure, the programming language that reintroduced the joy of programming back into my life. It is now why I self-identify not as an ops person, which I've done for 20 years, but as a developer. In this episode, I learned so much from Mike about so many things, including how we can enable thousands or even tens of thousands of engineers to work together towards common objectives. What are the structure and dynamics required to do that? What are the real implications of Conway's law? What does great architecture look like?
We also cover his definition of architecture, which includes the notion of the organizing logic and means of software construction, and specific examples of how he's helped business and technology leaders in their transformation journeys over the years. Mike, I am so glad that you're with me on this podcast. So I've described you in my words. Could you describe yourself in your own words and tell us what you've been working on these days?
Michael Nygard (00:02:22): It's definitely easier to describe what I've been working on, but I'll try to describe myself a little bit. Almost more than anything, I think of myself as someone who tries to understand how things work: the systems that we work on as information systems and technology systems, but also the systems of culture and business and finance and physics that really define the world around us. So I often feel like too much of a generalist to really put myself in any one category. If you look at my experience, the work experience is primarily as an architect and developer with some time spent in operations and sales and management along the way.
The thing that I'm working on right now is a major technology transformation effort at Sabre, where we are trying to recreate the business of travel and recreate ourselves and our systems, dealing with a considerable legacy of what we variously call heritage technology, classic technology, legacy technology, and so on. At any given time we'll have a GraphQL query coming in via HTTP at the front end, and at the back end we have a teletype message going out to an airline or a baggage claim.
Gene Kim (00:03:44): At the heart of it is a famous TPF system that goes back, I had imagined, 40-plus years. Is that right?
Michael Nygard (00:03:50): Absolutely. In fact, it appears that TPF was in some way factored out of the original Sabre system back when Sabre was not the name of a company, but was actually just the name of the machine. It was the Semi-Automated Business Research Environment, which was that specific mainframe that was being used for the reservation systems for American Airlines.
Gene Kim (00:04:12): And TPF is the operating system that runs on the mainframe, that is not z/OS, that has some magnificent properties. Is that right?
Michael Nygard (00:04:19): Yeah. Sometimes I think of it as the land that time forgot. It's built like a real-time operating system. It loves interrupts. It loves very quick transactions. It hates batch jobs. That's kind of the opposite of z/OS and MVS.
Gene Kim (00:04:36): Gene, here. Before we leave the topic of TPF, I feel like I need to explain the marvel that it is. It's a real-time operating system that runs on IBM mainframes. Wikipedia describes it like this: TPF delivers fast, high volume, high throughput transaction processing, handling large continuous loads of essentially simple transactions across large geographically dispersed networks. I was talking with Rosalind Radcliffe, a distinguished engineer at IBM who the DevOps Enterprise community knows so well, whose area of specialty includes mainframes. She told me this. Google is famous for its scale, processing something like 70,000 searches every second.
Around the world, the z/OS mainframe platform collectively does 2 billion transactions every second. But she added, if you include TPF, that number would be around 200 billion transactions every second, which is collectively seven orders of magnitude more transactions than what Google handles each second. This should give you an idea of the scale that TPF operates at. All right, back to my interview with Mike.
Great. So you've mentioned the strange path you've taken in your career. And I mentioned the Release It! book. I think it's such a marvelous book. And with some hindsight, it seems now obvious to me that this is a book that could only have been written by a true boundary spanner: someone who has walked in the shoes of a developer, an operations person, and an architect.
And by the way, when I was writing The Unicorn Project, the protagonist Maxine was in many ways modeled after you.
Michael Nygard (00:06:10): Oh, wow.
Gene Kim (00:06:10): It's so amazing to me that when people read the book, their reaction is kind of like, "Holy cow, she seems like a superhero."
Michael Nygard (00:06:17): That must be why I liked her so much.
Gene Kim (00:06:22): Can you tell us the story of how you ended up gaining this unique type of perspective that allowed you to write the first edition of Release It!?
Michael Nygard (00:06:28): Pretty early on in my career, I had the experience of being an operator for networks of Linux machines. Now it happened that these were all virtual machines running on a Unisys mainframe as part of a federal government contract. And I was the only UNIX person in this entire Unisys facility of mainframe programmers. So I was the only one who cared about the Linux machines, which also meant I could break them and recreate them. And they were all virtual machines, so I could create as many of them as I liked. Each one had its own file system. So I got to do a lot of administrative work in the safest possible environment. It was [crosstalk 00:07:11]-
Gene Kim (00:07:11): By the way what year was this?
Michael Nygard (00:07:12): ... mainframe. This would have been '92. I think '91, '92, somewhere around there. All of that got delivered and we did some crazy stuff. As part of a POSIX compliance effort, I wrote a version of vi that worked on a full-screen synchronous terminal. So imagine you edit your whole screen full of text, you go down to the command line, you hit submit. And that sends the whole page of text to the mainframe, which then turns it into vi commands and all of that. So that experience sort of set me up, understanding running the software as well as building the software. And I took that with me into all of my early jobs. At that time, there really wasn't as strong a divide between development and operations as we started to see in the late '90s, as the web and large-scale systems really spun up.
It was around 2003, I think 2002 or 2003, when I was looking for a contract. I had come out of a consulting partnership and was looking for something to kind of fill the time. Let me relax a little bit, bring in some money, but also re-energize and heal a little. It didn't end well. So I took a contract that was in operations, supporting applications in what turned into a very large commerce system launch, and my kick-back-and-relax contract turned into 70 hours a week for four or five months at a time, including some really dire situations.
So, we had a lot of software that was getting rushed out. It wasn't all fully tested. Some of it scaled very badly. The first load test we did, we were expecting to get to 25,000 sessions, but we got to 400 and everything locked up across every node in the cluster.
Gene Kim (00:09:20): Right. And then went down to zero.
Michael Nygard (00:09:22): Yeah. It went to zero. Yeah. So what's interesting is how much you learn when things are broken. If you're faced with a piece of technology that just works, you learn almost nothing about how it's constructed. But when it's broken and you have to fix it repeatedly, and it keeps breaking in different ways, you learn a tremendous amount about it. Because I was someone in operations who had come from development, I was able to go in when all of our threads were locked up and pull stack traces and just find out why it was locking up. So I would feed that back to the dev team with line numbers in the source code. And they said, "But you don't have access to the source code." Like, "Come on, really." You give me a jar file, and five minutes later I've got a version of the source code.
So, that was this interesting experience where I was in operations. And I thought that most of our break fix work was going to be around dead CPU or burnt out fans or failing hard drives. And absolutely not. The overwhelming majority of our problems were software created. So I looked at this and I'm like, "This is embarrassing. This is costing us a lot of money. Why would we put this software out here when an hour of downtime is going to cost us $10 million?" We saved a couple of weeks of development, but we're losing $10 million every other week or so when this stuff crashes.
Gene Kim (00:11:00): Yeah. In fact, you tell the story in your amazing 2018 DevOps Enterprise talk on tempo, maneuverability, and initiative, which we'll get back to later. But there was a part of that story that really struck me, which was that immediate ability to have a high-bandwidth communication with your development counterpart. Because you were talking in the language of... Was it ports, sockets and something? So that you could actually quickly work together to solve the problem. Do I understand correctly that you were actually part of an outsourced operations organization? You were able to span a very complex boundary?
Michael Nygard (00:11:33): Yeah, that's true. This was a startup called Totality, which was eventually acquired by Verizon and became a function called remote application management. So our premise was that operations at the time was very expensive. Certainly staffing an operations center 24/7 is very expensive, in equipment and in personnel. So our value proposition was that you can amortize the cost of doing that by sharing the expense. So you're renting a portion of our operations capability. And yeah, that's another place where I was one of the software people in operations. So I was initially chief engineer for my commerce client and eventually for the whole central region of the US. And the ability to talk in a language that met the developers really helped.
And this is something that I've done a lot. You mentioned boundary spanning. And I think when you are encountering a boundary between disciplines, you don't need to learn everything about the other discipline, but it does help to learn enough of the language to be able to communicate with the denizens of that domain. So when you're talking to database people, you probably talk about tables and columns and data types. Maybe if you're an experienced developer, you can talk about whole tablespaces or, increasingly these days, shards and segments and spawn. But you're probably not talking about cylinder layouts or LUNs or the SAN behind the HBAs and that sort of thing.
Likewise, when development and ops talk, it's really best to meet at those concrete artifacts. Like you mentioned: machine names, ports, files, locations, that sort of thing. If you come in and you say, "My visitor pattern is overtaxing the garbage collector," the ops person is likely to kind of go, "Well, okay, that's interesting. I don't know what you want me to do about that."
Gene Kim (00:13:45): I think that's really one of the marks of a great boundary spanner: the ability to speak in the language of the denizens of the domain that you are speaking to. One of the things that I started to pick up in your areas of interest was the notion of how you get large groups of teams to work towards a common objective. And in the very last moments of your talk, you start talking about how you get everyone moving in one direction, accomplishing things on the easy path, rather than screwing up and having to re-corral things and move back into a realm of centralized control. Is that a problem, really? And if so, what makes it so hard?
Michael Nygard (00:14:20): So it is definitely a problem. And it's something I think of as a day two problem, or maybe a year two problem. And it's one of these interesting transitions that you need to go through. Many organizations in need of transformation are too much in the command and control space. And a lot of what they need to do is decentralize and liberate their people, create some autonomy. Once you have done that, then you're faced with the question of, "Okay, now we've got autonomy, but we're faced with an existential challenge in the marketplace. How do we realign around that?" And I can use metaphors. I think of it as creating a magnetic field, so all the iron filings line up in the right direction. Within Sabre, I've been talking about how, okay, we've got a supertanker, which is very efficient, but not very flexible.
It can only go certain places. It takes half of Texas to turn around. What we want is something that's much cheaper to replace or that we can replace individually that can go more places. We want a speed boat. Actually, we want a lot of speed boats, but we want them all going in the same direction because otherwise they'll crash into each other and sink. Once we get past the metaphors, it gets a lot more challenging. If you think about the tools available within an organization, one of them is vision setting. So communicating a vision that people can feel connected to and see the connection to their actual work. And this is vision of a different type than saying we're going to be, I don't know, the greatest company to work for in the US. That doesn't really say anything.
We're going to be customer focused. Okay. That's great. But I can undertake almost any initiative I want and tie it to the notion of being customer focused. So you have to create a vision that is audacious, consistent, and that people can tell if they're proceeding towards it or not. So again, with the metaphors, some people will refer to that as a North Star, but imagine you set out a vision that says a customer should be able to get up and running within our system in 10 minutes and deactivate just as quickly.
Well, that's going to require action by a large number of different parts of your organization. So you've got some UI work. You've got some backend work, whatever your billing system is, has to be aware of this. You may need to change some of your customer care procedures, instead of having a recovery script and a special group that exists just to annoy them until they hang up. You need the ability to hit a button so that they're deactivated. So it's still a vision, but it's a vision of a different sort. It's somehow more attainable or more actionable. Another thing that really helps, maybe one of the greatest accelerators of all is trust in your comrades and in the other teams. And that's trust, laterally and vertically within the organization. I've read a lot on how that trust gets created.
And it turns out there are different ways of doing it, but they don't mix all that well. So one approach is you have a group of people who've all been through the fire together. This is common in the hyper-growth stage of startups. You're kind of going through this expansionary phase where you take the dozen or so people that formed the nucleus of the company before and now each of them has a group of 100 or 200 that they're trying to align. Well, that dozen has a great deal of trust for a while and that can help. If you have a common training method and doctrine, as you would see in a military organization, that can create trust as well. So there are different mechanisms to achieve it, but trust is also pretty essential. And then Don Reinertsen talks about another method for creating distributed action with autonomous teams, all heading in one direction.
And that's what he calls decentralized economic decision making. You might have read this in his fabulous book, The Principles of Product Development Flow: an example from the development of the Triple Seven aircraft, where an engineer was authorized to add to the purchase cost of the plane if he or she could reduce the weight of the plane. They had done the analysis of the economics and said, a pound of weight has to be carried on every trip, back and forth, that the plane makes. So customers will pay an extra, I don't know, $300 for one pound less of aircraft weight. If you can figure out what your sensitivity is and publish that kind of a metric, then it's fabulous for getting people aligned in the right direction. Sadly, in a lot of fields, it's not quite as clear-cut to define as with the Triple Seven.
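The published trade-off described here can be read as a simple local decision rule. The following is a minimal sketch of that idea in Python, not Boeing's or Reinertsen's actual method: the $300 figure is the off-the-cuff number from the conversation, and the names `DesignChange` and `approve_locally` and the two sample changes are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass

# Assumption: the "I don't know, $300" per pound from the conversation,
# used only as an illustrative threshold, not a real program figure.
VALUE_PER_POUND_SAVED = 300.0  # dollars of added unit cost acceptable per pound removed


@dataclass
class DesignChange:
    """A hypothetical proposed change, described by its cost and weight impact."""
    name: str
    added_unit_cost: float  # dollars added to the purchase cost of one aircraft
    weight_saved_lb: float  # pounds removed from the aircraft


def approve_locally(change: DesignChange,
                    value_per_pound: float = VALUE_PER_POUND_SAVED) -> bool:
    """Return True when the engineer can accept the change without escalating."""
    if change.weight_saved_lb <= 0:
        return False  # no weight saved, so the published trade-off does not apply
    return change.added_unit_cost <= value_per_pound * change.weight_saved_lb


# Made-up example changes to exercise the rule.
changes = [
    DesignChange("lighter bracket", added_unit_cost=500.0, weight_saved_lb=2.0),
    DesignChange("exotic alloy panel", added_unit_cost=5000.0, weight_saved_lb=10.0),
]
decisions = {c.name: approve_locally(c) for c in changes}
print(decisions)
```

The point of the sketch is that the economics are worked out once, centrally, and published as a single number; each decision after that needs no approval chain, only a comparison against that number.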
Gene Kim (00:19:48): Gene, here. Mike mentioned the amazing works of Don Reinertsen and his incredibly brilliant and seminal book, The Principles of Product Development Flow, which framed and codified how so many of the lean principles could be applied to product development. I love that story of the decentralized decision making in the Triple Seven engineering teams. And I had the privilege of seeing Don Reinertsen present at the Scaled Agile SAFe Summit in 2017. It was incredible because he described his evolving thinking on how to enable decentralized decision making, which he feels is so important in the modern age. So much of this was informed by his experience in the US Navy as a submariner. He spent six years in active duty and 16 years in the reserves, retiring as a captain. So given that, it was a little funny to hear how influenced he was by US Marine Corps doctrine, which he held up as the gold standard of what effective decentralized decision making looks like.
Specifically, he talked about MCDP1, that's Marine Corps Doctrinal Publication 1. I'll put a link to it in the show notes. What caught my eye was how he described what high trust looks like in the US Marines. They place a great emphasis on trusting superiors, subordinates and peers, relying on four things. Dependability: the certainty of proper performance of duty. Integrity: nothing less than complete honesty in all dealings with subordinates, peers and superiors is acceptable. Unselfishness: looking out for the needs of subordinates before your own is the essence of leadership. And loyalty: faithfulness to country, Corps, unit, seniors, subordinates, and peers. Semper Fi. And then he went on to talk about mission-type tactics, or Auftragstaktik. Wikipedia defines it as a form of military tactics where the emphasis is on the outcome of a mission rather than a specific means of achieving it.
Don Reinertsen talked about the need to tell people why they're doing things not what to do and how to do it. What goal larger than the battle itself are we trying to achieve? And I love this phrase, understand the commander's intent, at least two levels up in the organization. And he says, "If you know why you are doing things, you can adapt better to changing conditions." It was an amazing breathtaking presentation. And I'll just mention one more slide that really caught my attention. He talked about boldness and initiative, specifically about cultivating the ability to take calculated risks. That errors of over boldness are dealt with leniently and he cites MCDP1. And I quote, "Initiative, the willingness to act on one's own judgment is a prerequisite for boldness. There must be no zero defects mentality. Abolishing zero defects means that we do not stifle boldness or initiative through threat of punishment."
I thought this was absolutely fantastic. And I haven't read all of MCDP1, but it's going back to the top of my pile of books to read. I will put a link to all of these resources in the show notes. All right, let's go back to the interview.
I love this. And I feel so privileged to be able to see how your thoughts on this have evolved over time, including both that DevOps Enterprise talk, as well as your writings on... I think you called it maneuverability in one phase. In the last podcast I interviewed my mentor, Dr. Steve Spear. And one of the things that I'm trying to do is learn about what he calls structure and dynamics, with the goal of explaining, as parsimoniously as possible, how organizations work and how organizations behave.
One of the things that I had shared with you two weeks ago was positing this notion that structure is the way the teams are structured as well as the architecture that we work within, which results in a whole bunch of dynamics that are almost a function in totality of the structure. So there's something that I found so magnificent, re-listening to your 2018 presentation. I would call it the tale of two outages. In one organization, in order to resolve an issue, they had to escalate five levels up and then down three. And it might take hours or even days for the right two people to talk to each other to solve the problem. And then you paint this other extreme, where those two engineers are immediately able to contact each other and solve the problem together much more quickly.
Gene Kim (00:24:00): But you said this one phrase. You showed the org chart, and you said something like, "If you look at this diagram, regardless of what it shows, the comm structure, the service architecture, the software architecture, they're probably isomorphic." And I remember laughing when I heard this because it matches one of my own 'aha' moments, which is really that you need an org chart and a software architecture that are congruent. And I love your phrase 'isomorphic' because you're making an even stronger claim about the relationship between the org chart, and how teams are organized, and the architecture they work within. Mike, can you defend that claim?
Michael Nygard (00:24:42): Well, first I'm going to appeal to authority because I'm certainly not the first person to make that claim. I'm going to point to Melvin Conway and Conway's Law, where he said that the organization is constrained to build a system that recapitulates the communication structure of the organization. I think that there are some really interesting parts to that because, for a while, people quoted Conway's Law as a cynical joke. It was in the same category as Murphy's Law. And then, around the same time as we started talking about microservices, people started to appreciate the deep truth under Conway's Law, that it's not just a cynical observation.
Gene Kim (00:25:26): The joke is, to paraphrase, I think this comes from the Devil's Dictionary, "If you have five teams working on the compiler, you get a 5-pass compiler."
Michael Nygard (00:25:33): Yeah, exactly.
Gene Kim (00:25:34): It's based on a famous series of experiments he did in 1968. He actually did a controlled experiment: three teams working on a compiler and four teams working on a compiler. The three-team group generated a 3-pass compiler and the four-team group generated a 4-pass compiler. So I think that was a very profound observation.
Michael Nygard (00:25:52): It was. And when you look at the mechanics that caused that to be true, what happens is: two groups that don't talk all the time have to coordinate on a deliverable, and that causes them to negotiate an interface. That interface is then a boundary in the structure of the software. So, the boundary between the teams becomes a boundary in the software. Now, in terms of the org chart, this is where I think there's something really profound in what Conway said, which is that it is the communication structure of the org that creates that software architecture. The communication structure doesn't always follow the formal structure of the org.
In fact, a very common pattern I've seen is a company goes through a reorganization, for whatever reason or whatever purpose, but they focus entirely on the static structure, who reports to who, but they don't talk about how they want the behaviors to change or who your new collaborators are meant to be. Or, more bluntly, how do you still get your job done when you're located in a different part of the organization? So, lacking that, people still need to get their job done. They will use the tools they already know. They'll talk to the people they already know. So, you get the formal structure and a shadow network or shadow structure. It's the shadow structure that will determine the true architecture of your software.
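One way to make this mechanism concrete is a toy model, sketched here in Python under loud assumptions: people who "talk all the time" are merged into groups, and any pair that must coordinate across groups is treated as having to negotiate an interface. The people (`ana`, `bo`, `cy`, `dee`), the link lists, and the function names are all hypothetical, and real organizations are far messier than connected components.

```python
from collections import defaultdict


def groups_from_links(people, frequent_links):
    """Merge people who talk all the time into groups (connected components)."""
    parent = {p: p for p in people}

    def find(p):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in frequent_links:
        parent[find(a)] = find(b)

    members = defaultdict(set)
    for p in people:
        members[find(p)].add(p)
    return sorted(sorted(g) for g in members.values())


def boundaries(groups, coordination_needs):
    """Pairs of groups that must coordinate without constant contact: negotiated interfaces."""
    owner = {p: tuple(g) for g in groups for p in g}
    found = set()
    for a, b in coordination_needs:
        if owner[a] != owner[b]:
            found.add(tuple(sorted((owner[a], owner[b]))))
    return sorted(found)


people = ["ana", "bo", "cy", "dee"]
needs = [("bo", "cy"), ("cy", "dee")]  # who has to coordinate on deliverables

# Formal org chart after a reorg: ana+bo on one team, cy+dee on another.
formal = groups_from_links(people, [("ana", "bo"), ("cy", "dee")])
# Shadow structure: bo still talks to cy every day; dee is left out of the loop.
shadow = groups_from_links(people, [("ana", "bo"), ("bo", "cy")])

formal_boundaries = boundaries(formal, needs)
shadow_boundaries = boundaries(shadow, needs)
print(formal_boundaries)
print(shadow_boundaries)
```

In this toy example the formal chart predicts a seam between the two teams, while the shadow network puts the seam around `dee` instead. That is the sense in which the shadow structure, not the org chart, would set the software boundary.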
Gene Kim (00:27:31): I wrote down a couple of things that we'll follow up on. What distinguishes in the tale of two outages? How is the structure different in the one that you have to go up five, down three, versus the one where two engineers can just work together to solve the problem?
Michael Nygard (00:27:50): Well, that's the thing: the structure can be exactly the same. The formal structure can be exactly the same, but the incentives applied will create the communication structure. Or I should say the disincentives applied. If you communicate directly with somebody and then get called on the carpet for it, or your manager hears from their manager that you were taking up time without providing a cost center number to charge the time to, et cetera, you're not likely to do that again. And it doesn't take very many instances of that before everybody just perceives and models and emulates the behavior, which is that you do things by going up, over and down.
Gene Kim (00:28:31): That's interesting. So there's a couple of variations on that. I can't get a feature in unless there's a project code associated with it, or I can't talk to the ops person, most of it is a... There's some authorized service, something that will get logged against and charged to. Those are the type of interfaces that are defined to make sure that you stay in your own lane, for whatever purpose.
Michael Nygard (00:28:52): And what's fascinating is: with the best of intentions, you can change the org structure, but the old structure may be baked into tools that become the operating system of the organization. So one that's on my mind: we use ServiceNow and we have a service catalog of things that you can request, including a playground or a sandbox environment in one of our platforms. The form to request that included a field asking for your cost center, even though the playground tier didn't do any chargeback. And so just the existence of that field meant a dev who wanted to try that out had to go find out what their cost center was. That means asking your manager or your director, or whatever, and most people just don't want to do that.
Gene Kim (00:29:51): Right.
Michael Nygard (00:29:51): By the way, we did get rid of that field, but it's really an example of something, I think, Churchill said: "We shape our buildings and thereafter they shape us." You can say the same thing about the tools. You create the tools to facilitate processes for a certain org structure. Then you change the org structure but you keep the tools. So you're stuck with the old processes.
Gene Kim (00:30:15): And the tools are where we formalize processes and enforce them, where we hold people accountable, where we define what the communication paths are. So maybe just to help with my own education, in the case where the two engineers are able to talk together, it is a part of structure. Even though the org chart might look the same, there is actually an explicit interface that says these people are allowed to talk to each other to achieve a common goal, and that is not contradictory to the goals of each one of those silos. There's actually a common mission tying them both. So that is an element of structure.
Micheal Nygard (00:30:51): I'll buy that. Sure. Absolutely.
Gene Kim (00:30:54): It just reminds me of the book Team of Teams, where the silos were Army Rangers, Navy seals, et cetera. But in the before case, you had the Army Rangers going out in rage, putting intelligence and evidence in duffle bags. Then you have the Navy seals doing the same thing and you ended up with profoundly bad outcomes and was actually this famous story in the book, how they were embedded into common mission teams. So the Army Rangers kept on reporting through the department of the army. Navy Seals reported to the secretary of the Navy. But when it came to short and medium term objectives, they were tied towards a very common mission. And now citing to capture, went from months, if ever, to, in one case, 45 minutes.
Michael Nygard (00:31:37): Wow. That's phenomenal.
Gene Kim (00:31:38): Oh yeah. Great. So there was a structural change made. Even though the org chart is probably the same, the way they interface with each other resulted in profoundly different outcomes.
Michael Nygard (00:31:48): I might liken that to creating product teams that each have their own reporting hierarchy, but together are bound around the construct of the product.
Gene Kim (00:31:58): Right. And the product goals. So let's go back to architecture again. I had this startling 'aha' moment when I really heard and studied the story of the birth and death of SPRouter at Etsy. The short story goes: back in 2008 and the bad, dark days of Etsy, they had this problem where in order for the company to ship a feature, two teams would have to work together. It would be the devs in the front end, kind of a PHP in their case, and then the DBAs would have to work inside of the stored procedures inside of Postgres on the backend. And so, that meant that features required two teams to communicate, coordinate, synchronize, marshal and so forth. And so their countermeasure was to create something called SPRouter. That stood for a Stored Procedure Router. And the goal was to be able to allow the devs and DBAs to work independently.
They would just meet in the middle, inside of SPRouter, and as one of the senior engineers said, "This required a degree of synchronization and coordination that was rarely achieved, to the point where almost every deployment became a mini outage." And if you're doing 30 to 50 deployments a day, this was a real problem. There were actually many attempts to replace SPRouter, but then they decided to kill SPRouter entirely. The goal was to fully enable developers on the front end to make all the required changes and to be able to obviate the need for any backend changes because it used an ORM for that. And they found that in whatever part of the property they eliminated SPRouter, suddenly deployment lead times went way down, the deployment outcomes and fess outcomes went way up. And so what I found so marvelous about that was I thought this was such an amazing example of Conway's law.
You go from two teams to three teams. They have to communicate and coordinate and prioritize, and the quality outcomes went down. And when you can get to a point where every team can make all the changes themselves without any dependencies on anyone else, then you get all these great outcomes because then you don't need to communicate and coordinate with anybody outside of the team. And so that really shaped this 'aha' moment, that it is not enough to move those teams around on an org chart, that the role of architecture is as important as the way we organize teams. And incidentally, one of the biggest surprises for me, in the State of DevOps research, was that architecture is actually one of the top predictors of performance.
And the marks of great architecture in the study were these: to what extent can teams make large scale changes to their parts of the system without permission from anyone else outside the team; to what extent can they do their work without a lot of fine-grained communication and coordination with people outside the team; can they release their service or product on demand, independent of other services they may depend upon; can they do their own testing on demand without the use of an integrated test environment, a point of incredible coupling, if that exists.
And if all those things are true, then suddenly you can do deployments during normal business hours with negligible downtime. It was such an 'aha' moment to understand that these are the characteristics that allow for developer productivity, for better safety, for reliability. Mike, does that resonate with you, and what are the other attributes of good architecture?
Michael Nygard (00:35:10): First of all, it definitely resonates. I had not heard the story of Etsy SPRouter before, but it doesn't particularly surprise me that those outcomes resulted. In terms of the DORA research and the architecture qualities, I think all of those qualities that you described are the results of architecture, the observable characteristics of architecture that provides safety. In other words, if you lack safety and you try to make a change in an environment that doesn't protect the rest of the environment, then your actions have outsized impact on everyone, or even the organization as a whole. Therefore, you will not be allowed to do a deployment without scrutiny, without fine-grained communication and so on.
Gene Kim (00:36:10): Just to confirm, this is the phenomenon where I make a small change to my part of the system, and then some of the system blows up, something terrible happens. Right?
Michael Nygard (00:36:21): Mm-hmm (affirmative).
Gene Kim (00:36:22): That's a mark of poor safety?
Michael Nygard (00:36:24): Yes. In fact, I describe safety as the characteristic that the change you make within your team can only harm your team. And to the extent that the architecture prevents you from even harming yourself, you'll be even more courageous and able to make faster, more fine-grained change because you get that immediate rapid feedback. So, there are definitely levels of safety that I would say enable these characteristics that are described here as the observable predictors of success. And as we've often heard, process is the scar tissue of past incidents. So it doesn't take very many incidents before those review processes and the scrutiny ramp back up, and the fewer fault isolation barriers you have, or the larger your failure domains are, the faster those review processes are going to ramp back up. So I think these findings don't particularly surprise me and they do definitely describe characteristics of good architecture.
There's something I don't see here though, which is an attribute of predictability. In other words, am I able to predict the ramifications of the change that I'm making? Or, if I make this change, is something far distant going to be affected? Or, am I going to be harmed in the future in ways that I don't yet understand? Inability to change, higher cost of maintaining the old stuff, that sort of thing. And so I'll give a couple of examples to try to make that concrete. I worked with an e-commerce as a service company some time ago. They had a very rich templating language in their front end. If you were a storefront owner, you could put quite a lot of logic into your templates. They also had a lot of customization in the backend, and many of the templates would invoke third party services that would rewrite parts of the pages.
So the effect was that the backend developers never quite knew if any change they were making was going to be safe because they had 2.8 million storefronts with every possible combination of templates. The templates were also stored purely as data in the production database and relied on proprietary item data, pricing data and customer data. So you couldn't even bring that data into lower environments to test out all the permutations. This created tremendous fear of change, which led to an increase in cruft and legacy buildup. So kludges cluster. I wrote a blog post about this that I called: The Fear Cycle. If you fear to change your technology, or you fear your technology, you will soon have reason to fear your technology. So, that aspect of predictability is one.
Gene Kim (00:39:35): And if I could concretize that even further, that's the notion... Is that of isolation? In other words, to what degree can you actually reason about and run your code in isolation without the other components? Is that-
Michael Nygard (00:39:47): That sounds ideal.
Gene Kim (00:39:48): Right.
Michael Nygard (00:39:50): Actually, it is. That's a great insight, Gene. The changes people made in the back end had no isolation because it was possible to break the storefront of a small business owner in Saskatchewan and not even know it.
Gene Kim (00:40:06): Because, you can't test in isolation, you can't run in isolation. That's what seems to drive up the fear. It's easier to build things when you have that characteristic. It's easier to test it. And I think the assertion would also be that you can actually be more competitive when you have those properties in your architecture.
Michael Nygard (00:40:23): Absolutely. And don't forget the value of being able to delete things or shut things down. I mean, it's a little bit tongue in cheek, but we focus a lot on building, and I'm in a situation now where I can see what five decades of building with very little deletion creates.
Gene Kim (00:40:45): Awesome. I interrupted you. Please keep going: other marks of great architecture.
Michael Nygard (00:40:49): I would say that another aspect of it is that you don't need global knowledge. So I guess this is also isolation, but it's the flip side of the predictability. It's saying: no human can grok the full breadth of the environments. And so, in some ways, the environment, the architecture either protects me from needing to have that global knowledge, or it makes that information visible at the point in time when I need it, so that it's an evident architecture.
Gene Kim (00:41:26): I love that portion of the interview. I'll put a link to Mike's post, called: The Fear Cycle, in the show notes. Here's the pattern that he writes about, and I quote: "Small changes have unpredictable, scary or costly results. And so therefore, we begin to fear making changes. And then we try to make every change as small and local as possible. And then the code base accumulates warts, knobs and special cases, and fear intensifies." Who hasn't lived with a code base where we fear to change it in exactly the way that Mike talks about?
But he goes on to talk about what the real implications of this are, and again I quote: "Add a few of these terrible events into company lore, and you'll find that developers and project managers become loath to touch anything outside of their narrow scope. They seek local safety. The trouble with local safety is that it requires kludges. The code base will inevitably deteriorate as pressure for larger changes and broader refactoring builds without release. The vicious cycle is completed when one of those local kludges is responsible for someone else's 'What? I didn't know that!' moment. At this point, the fear cycle is self-sustaining. The cost of even small changes will continue to increase without limit. The time needed to get changes released will increase as well."
Then he describes what is, horribly, the inevitable outcome. "One of several things will happen," he writes. The first is a big bang rewrite, usually with a different team, and the focus will be "This time we'll do it right." The second possibility is large scale outsourcing. And the third possibility is selling off the damaged assets to another company. As humorous and terribly true as that is, he writes about something that is even more unsettling in the next paragraph. He says, "The worst response to one of these negative events is a tribunal. Sadly, the difference between a technical SWAT team and a tribunal is mostly in how individuals in that group approach the issue. Wise leadership is required to avoid the fear cycle. Look to people with experience in operations or technical management."
I love what Mike wrote and it reminds me of one of the most delightful and amazing findings in the State of DevOps Report. At least for me. And it's this: that there's one question you can ask that has a surprising ability to predict every one of the metrics that we talked about, whether it's deployment frequency, deployment lead time, change success rate, mean time to repair, or organizational performance metrics, like employee net promoter score, or the absence or presence of technical practices, cultural norms or architectural practices. And that one question is simply this: on a scale of one to seven, to what degree do we fear doing deployments?
One is, "We have no fear at all; we just did one." Seven is, "We have existential fear of doing deployments, which is why in our ideal, the next deployment we would do is never. We're done deploying this code." And so I think that really shows that so much of what Mike talks about you can actually see in the State of DevOps research. So I love this post. I actually studied this post this week, which fortified me to actually tackle refactoring a code base that has been increasingly difficult and scary to change. And I'm so pleased with the outcomes even after only a couple hours of work. So I encourage all of you to read this post. Back to the interview.
Awesome. There's another example that comes to mind of a phenomenal before and after picture, and that's the famous Amazon re-platforming. By some accounts, Amazon spent a billion dollars re-platforming the Obidos platform in 2004. Werner Vogels wrote in the ACM that, "It complected all of the display logic, all of the business logic, the recommendations tools. Everyone had to touch the same things in order to get things done. It became untenable, unsafe and a huge productivity hit to implement anything anymore."
And as reported by Steve Yegge, the big mandate went something along these lines, "1. All teams will henceforth expose their data and functionality through service interfaces. 2. Teams must communicate with each other through these interfaces. 3. There will be no other form of interprocess communication allowed: no direct linking, no direct reads of another team's data store, no shared memory, no backdoors whatsoever. 4. It doesn't matter what technology you use, HTTP, CORBA, pub/sub, Bezos doesn't care. 5. Service interfaces without exception must be designed from the ground up to be externalizable, that is to say: the team has planned and designed to be able to expose their interfaces to developers in the outside world, no exceptions. 6. Anybody who doesn't do this will be fired. 7. Thank you, have a nice day." And then Steve Yegge writes, "Obviously number seven is made up because Bezos doesn't care whether you have a good day or not." But the impact of this is so obvious now, looking 15 years in the future.
Gene here. I realized I could have been clearer in that explanation. So Werner Vogels is the CTO of Amazon Web Services. I mentioned a famous ACM paper that he wrote that described the re-platforming of the Obidos system. I'll include a link to that in the show notes because it's great and I cite it heavily in The DevOps Handbook. The problem in 2004 was that every team needed to touch the Obidos system, whether it was changing the display logic, the business logic, the recommendations engine, everything. And the consequence of that is that they were only able to do dozens of deployments each year. And so this was a huge problem. He wrote in a 2009 article that every one of these deployments required a long, unwieldy process, requiring complicated coordination, and that limited their ability to innovate fast and at scale.
Steve Yegge is a famous engineer who spent many years in the early days of Amazon, who then went on to work at Google. The Steve Yegge memo that I read from is a famous memo he released to Google engineering, describing some of the key components of that transformation. I think that this Amazon story is one of the most breathtaking examples of how an organization was able to change their architecture to enable developers and development teams to become orders of magnitude more productive.
Gene Kim (00:48:00): So in 2004, they were doing dozens of deployments per year; by 2011, they were doing 11,000 deployments per day; by 2015, they were doing 136,000 deployments a day. That's stunning. So Werner Vogels writes, "To remain competitive, companies must increase their agility so that they can continually uncover new opportunities and create better products." Okay, so with that explanation of what they did and why they did it, let's go back into the interview. So does that resonate with your own experiences? It is, to a great extent, architecture that enabled Amazon to achieve their goals.
Michael Nygard (00:48:42): I love the clarity, precision and succinctness of that memo. I'm always a little cautious about survivorship bias when we're looking at case studies. I remember reading a book called Sunburst, which was all about the absolute genius of Sun Microsystems and Scott McNealy in particular. And I picked it up on the remainder shelf at the bookstore around the time that Oracle was buying the remains of Sun. And there are plenty of examples of books, like Good to Great, where 90% of the companies profiled are gone or would no longer make the list.
Gene Kim (00:49:28): I noticed that.
Michael Nygard (00:49:30): Yeah. So I'm cautious to say, "This memo was responsible for a tremendous technological change that succeeded." I don't know that it would produce the same success in other companies by just emulating the behaviors in this memo. I think there's a lot of Amazon's DNA that's enormously hard to replicate. There are famous stories about Toyota giving tours to managers from General Motors and explaining all of just-in-time production and lean manufacturing to them. And when asked, "Why are you teaching us? We are your competitors," being told, "If you get better, we all get better. But we also think that you're probably not going to be able to do this. You'll copy the behaviors that are obvious and easy to copy, but you won't have the mindset out of which those behaviors were produced."
Gene Kim (00:50:33): So what are the takeaways that you think are universal?
Michael Nygard (00:50:38): So paying attention to the structure is definitely universal. Creating and maintaining boundaries is universal. In fact, I go on at some length about the flaws in the typical layered architecture pattern for applications: the boundaries erode very quickly, and they all turn into a big ball of mud. In every organization where we've seen services and microservices succeed, they've paid tremendous attention to the nature of those interfaces and the boundaries that they represent. And this is where a lot of enterprises get service APIs exactly backwards. They take the existing system, figure out what they can expose from that existing system, and they turn it into an API, right?
What you should do instead is say, "What does the caller want to achieve? That's my API." Stripe does this extremely well. I love Stripe's APIs. I use them as an example all the time. They are perfectly consumer focused. So that focus on the interfaces is very, very important. I absolutely believe that. I think the other piece of their DNA, that's not expressed in this memo, but grew out of the transformation that this memo created, is the complete willingness to tear down and destroy things that you just got done building. In other words, that creative destruction process or continuing renewal process is really important. And that's another thing that most companies don't do.
Say you have two projects at most companies. One of them delivered under budget, is performing well, and has three or four developers, because they're using a language like Clojure and they have superpowers, naturally. The other team has 15 people and is behind schedule and over budget. Which one does the company put its funding into? Almost always it's the over budget, behind schedule team. Most companies will starve the winners and force feed the losers. Amazon doesn't do that. So if you talk to the people working on the services, if your service is not getting traffic, if it's not getting customers, it's going to get deleted and you get reassigned. If you're getting customers, then you get to add onto your service. It's just the opposite of the way most companies work.
Gene Kim (00:54:46): That's brilliant. And you've talked about that in your DevOps Enterprise talk, which I absolutely love. So what I find so dazzling about this is that, as we're analyzing the Amazon "thou shalt use APIs" story, it seems to reinforce this notion that one of the responsibilities of leaders is not just to design the organization, but also to make sure that there is an architecture that is suitable for achieving that mission. So I have two questions. One is, you talked about reorgs: who reports to whom, how do we want their behaviors to change, who are your collaborators. There are reorganizations done well, and ones that are done poorly. Can you give an example of the best reorgs you've seen? In other words, independent of the reason that led to it, what are the characteristics of reorganizations and structural changes that you would give high marks to versus the ones that you would give very low marks to?
Michael Nygard (00:55:42): Okay. Well, I'll start with the lowest of low marks. There was an individual I worked with that would probably fall onto the chaotic neutral alignment in D&D. And he enjoyed drawing block diagrams that made no sense, were half-truths and included a bunch of weird parts that nobody had heard of, and leaving them on whiteboards in conference rooms. He once almost accomplished a reorg by printing an org chart on a remote printer, on a different floor of the building, and just leaving it there, to see if people would start behaving that way. And for context, he was doing this as an experiment because sometimes reorgs were communicated in that kind of a haphazard way. I would sort of hear [crosstalk 00:56:29].
Gene Kim (00:56:31): This is where the rumor mill triggers certain responses, whether it's fear or ambition or greed. Is that right?
Michael Nygard (00:56:39): Yeah. You would hear, over lunch, somebody says, "Oh, yeah. You're reporting to Bob." And you're, "Oh, okay." So you've got to talk to Bob and you're like, "Hey Bob, am I on your team now?" And Bob's like, "I guess." It's a funny story because it's sort of an extension to the extreme of patterns that we've actually seen, right? Managers that don't understand why the reorgs are being undertaken, or what's being moved, or who. Failing to change the incentives to match the new structure. Failing to explain any of the rationale about why you're doing this or what you expect people to do differently. That's what puts it on the poor end. The reverse end has all of those things, and it improves feedback loops, closes broken feedback loops where they have them, aligns people's incentives with their positions, and creates clear areas of authority, where they can, at recursive levels of the organization. And you know what your parameters are, what's expected of you, what you're allowed to do, what your constraints are. And you're able to link that to that big vision that I talked about, the actionable vision. That's a good reorg.
Gene Kim (00:58:01): I remember hearing a great story from a friend at HPE. It was a group in Israel that took over the management of the Mercury TestDirector Tools. And he said he inherited this project, and to develop a feature, you had the front end engineers working on one thing, then the backend engineers working on another thing, and then the middleware team working on another thing. For any given piece of work, you had to line up three different teams with their own different sets of priorities. It really was extraordinarily difficult to even do small things. And the notion was then slicing up the product into different slices and moving all of the relevant members of the front end, backend and middleware teams into one spot, so that decision making could be localized and they could do the work without any dependencies. I think in that before case, his name was really [Vice Bach 00:58:57]. He said, one of his engineers said in a very Israeli way, "Really, I don't think you are a very good manager because this is a terrible way of working."
And within a year after they put this in, they turned everything around and ended up with this incredible way of implementing features that was not only more productive, but had better outcomes and happier engineers. So that would be an example of a good reorganization. Does that resonate with you?
Michael Nygard (00:59:29): Oh, it totally does. I'm curious: in that situation, do you know if they split the code bases out to align with the new team structure, or were they still working in three different chunks of code, but now had all the people and talent necessary in each feature team?
Gene Kim (00:59:50): I don't know, but I will find out. Like it threw me sufficiently motivated. It was a DevOps from Perth, London. Yeah, I'll-
Michael Nygard (00:59:58): My friend and colleague Stefan Tilkov talks about Self-Contained Systems, SCSs. This is one of several alternatives to microservices that he's put forward. So each SCS goes all the way from GUI to backend to reporting. So we might consider them macroservices or meso-scale services. But again, it's the same idea that you can do everything you need to do from front end all the way through.
Michael Nygard (01:01:01): Angular, perhaps, or Backbone.
That challenge of trading off autonomy and ability to make independent change versus efficiency pops up over and over again. Shared libraries, shared services, shared technology platforms. I don't think there's a static resolution to that. I think you kind of always end up going too far in one direction and pulling it back and going the other direction.
Gene Kim (01:02:35): So another thing... Before I get to the question about leadership and architecture, I would be remiss if I didn't ask: Mike, what is your definition of architecture? We talked about some of the properties or outcomes when you have a good or bad architecture. What definition do you use in your thinking, writing and doing?
Michael Nygard (01:02:54): So to me, architecture is the organizing logic of the components and the materials used to construct them.
Gene Kim (01:03:04): Holy cow. What does that mean? It sounds so good.
Michael Nygard (01:03:08): Yeah. It requires a little unpacking. So organizing logic is one of the first pieces that I really look for. When you look at the patterns of interaction or the patterns of construction, do you see a rationale behind it? Is there a coherence to the problems being solved? So for example, if you tell me that you've got microservices, you've told me almost nothing about your architecture, right? I know something about the construction of the components, but I know nothing about how they're allowed to interact. You could have a death star diagram, right? The dependency topology where everything calls everything else. You could have everything communicating in CQRS style through a common message bus, right? You could have a body of the butterfly where you've got front end services that call an API layer that calls backend services. Those are architectures. Each one has a different organizing logic.
If you told me you were combining all of them and putting a butterfly in CQRS stuck to a big ball of mud, I would say you have no organizing logic. But the other aspect is about the orderly construction of the things that you're creating. So architecture is really about time. If we could create any system as fast as we needed, we would never worry about the architecture. You would just always be coding, always be coding, but we can't. It takes time to build something. And because of that, once it's live, we need it to run for a certain period of time to pay itself back, to have a positive value on the thing we built. And so there will be a period of time where you are under construction with everything you make, and different pieces will be under construction at different times and at different stages of construction or deconstruction or replacement. And so we need that organizing logic to extend not just across the space of the system as it is in any given moment; we need it to extend across time, so the system makes sense at every point in time, now and into the future.
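As a rough illustration of the point about organizing logic, the difference between a layered "butterfly" topology and a "death star" can be checked mechanically against a call graph. The sketch below is only that, a sketch: the service names, the layer assignments, and the two helper functions are invented for this example and do not come from the interview.

```python
from itertools import chain

# Hypothetical services and layer numbers, invented for illustration:
# 0 = front end, 1 = API layer, 2 = backend.
LAYERS = {"web": 0, "mobile": 0, "api": 1, "orders": 2, "pricing": 2}


def layering_violations(calls, layers):
    """Return call edges that do not go from one layer to the next one down."""
    return sorted(
        (caller, callee)
        for caller, callees in calls.items()
        for callee in callees
        if layers[callee] != layers[caller] + 1
    )


def density(calls):
    """Fraction of all possible directed edges that are actually present."""
    nodes = set(calls) | set(chain.from_iterable(calls.values()))
    n = len(nodes)
    edges = sum(len(callees) for callees in calls.values())
    return edges / (n * (n - 1)) if n > 1 else 0.0


# Front ends call the API layer, which calls backend services.
layered = {"web": {"api"}, "mobile": {"api"}, "api": {"orders", "pricing"}}

# Everything calls everything else.
death_star = {service: set(LAYERS) - {service} for service in LAYERS}

print(layering_violations(layered, LAYERS))          # no rule is broken
print(len(layering_violations(death_star, LAYERS)))  # most edges break the rule
print(density(layered), density(death_star))
```

Both graphs are "microservices"; only the rule about who may call whom distinguishes them, which is the sense in which knowing the construction of the components says little about the architecture.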
Gene Kim (01:05:15): And part of that is that, if all these components are being built in parallel, it sure helps to be able to test them and integrate them at places, not just at the end. It seems like one attribute or a requirement of what you just said in terms of the orderly construction.
But what it means is, I don't have any compile time checking on my large scale architecture, right? So everything is deferred to runtime and I find all my errors once I try to actually run things. Well, so that's exactly like you're in the dynamic language camp that says, "The speed of building things dynamically exceeds the cost of the runtime errors and fixing those." But there's definitely an alternate path you could go down and say, "I'm going to have strongly typed interfaces and those types are then how I communicate between the parallel streams. I synchronize my thinking and yours by defining a type that represents our shared understanding." And that would be along the lines of a protobuf or gRPC-based architecture.
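To make the "type as shared understanding" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `PriceQuote` type, the `PricingService` interface, and the two teams are hypothetical, and Python type hints are only enforced by an external static checker such as mypy, not by the interpreter. In a protobuf or gRPC setup, the shared types would instead be generated from a schema file.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class PriceQuote:
    """Shared type: the agreed shape of the data crossing the team boundary."""
    sku: str
    amount_cents: int
    currency: str


class PricingService(Protocol):
    """Interface the consuming team codes against (hypothetical)."""

    def quote(self, sku: str) -> PriceQuote: ...


class FixedPricing:
    """Producing team's implementation; it matches PricingService structurally."""

    def quote(self, sku: str) -> PriceQuote:
        return PriceQuote(sku=sku, amount_cents=1999, currency="USD")


def format_price(service: PricingService, sku: str) -> str:
    """Consuming team's code; it depends only on the shared types."""
    q = service.quote(sku)
    return f"{q.sku}: {q.amount_cents / 100:.2f} {q.currency}"


print(format_price(FixedPricing(), "SKU-1"))
```

If the producing team renamed `amount_cents` or changed the return type of `quote`, a static checker run over the consuming team's code would flag the mismatch before anything is deployed, which is the "compile time checking" on the boundary that the dynamic approach defers to runtime.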
Gene Kim (01:07:13): And you had an amazing article about that, in terms of scaling: the goal is to enable teams to work independently without a lot of communication and coordination with other teams, not because we don't like talking to other people, but because that does slow down the ability for teams to get things done. Your assertion was that static typing, as a form of broadcasting, is a phenomenal time-saver, let alone the correctness advantages that you can get out of it. It's a great way to signal and broadcast and enable all the teams to get their work done. Did I characterize that correctly?
Michael Nygard (01:07:51): You totally did. And I'm tickled pink that you read that. You're one of like only 50 people who read that post.
Gene Kim (01:08:00): My reaction was that, wouldn't it be interesting if it was actually that advantage that drove us to static typing, right? As opposed to the more mathematically oriented assertions that you make around static typing.
Gene here again. All right, that blog post that Mike wrote was titled Cost of Coherence, and it is such an interesting post. So he talks about the cost of maintaining coherence across teams. And so he writes about what that cost might look like. He writes, for example, for a half dozen people, you might just put them in a single room. That penalty might be very small, just a whiteboard session once a week or so. But for a large team spread across multiple time zones, it could be large and formal: documents and walkthroughs, presentations to the team and so on.
And he gives specific guidance on how to reduce the cost of coherence. He says, "Take a look at your architecture, language, tools, and team. See where you spend time reestablishing coherence when people make changes to the system's model of the world." He says, "Look for splits. Split the system with internal boundaries, split the team. Use your environment to communicate the changes." So recohering can be a broadcast effort rather than a one-to-one conversation. And look at your team communications. How much of your time and process is devoted to coherence? Maybe you can make small changes to reduce the need for it. So I think these problems are so important. How do you enable productivity at scale? Not within a team, which is pretty easy to do and, I think, well understood at this point. But instead, how do you enable hundreds, or even thousands, or even tens of thousands of developers to be productive working together towards common goals? This is at the scale of not teams, but teams of teams within an enterprise.
Okay, back to the interview. So this gets to my question that I've been leading up towards. Let me frame this first with my own prejudices. So there's certainly been times in my career where I've been very disenchanted about reorganizations. I think that's sometimes for reasons that you've stated, let's call it the kind of malicious dynamics that can exist in some senior leaders. When I hear about a reorg, often my body language is like, "Oh no. We were doing so well. Why are they doing that?" And yet, through the lens of structure and dynamics, right? The notion that so much of the dynamics are a function of the structure we create, it does seem that org changes are a very powerful tool in the arsenal that leaders use.
And then when you combine that with the notion that as you get higher and higher up in the organization, you have fewer and fewer knobs to turn. The notion that the more senior you get, the more power you have and you can make everything happen, turns out really not to be the case. Really, your primary knobs are about structure, architecture, and of course then modeling the dynamics that we want to create, as you mentioned, right? These things that engender trust, whether it's through vulnerability, authenticity, clarity of mission, a succinct message so that everyone can see how their daily work is or is not connected to the grand goals. So it seems to me that most leaders are far more comfortable with how the org chart looks than with the architecture. And if the assertion is that the org and the architecture are isomorphic, right? Meaning there's a one-to-one relationship between the two, then there's a deficit in the expertise to make good architectural decisions. So can you comment on that? Maybe it comes down to the question of, to what extent do leaders need to be responsible for the architecture, and how do you make them sufficiently informed to enable the right types of architecture?
Michael Nygard (01:12:00): Oh, wow. There's a lot to unpack there. First of all, I somewhat agree that you have fewer knobs to turn at higher levels in the hierarchy. A lesson that I've learned, sometimes painfully, is that indirect influence can be just as powerful as direct control. And that's been brought home to me, especially in organizations that are very sensitive to the hierarchy, and to titles, and levels, because some people are inclined to snarky remarks, or cynicism, or being facetious or flippant. But especially in large companies or global companies, the nuance of that being a facetious comment, or "I'm not serious, I was just kidding about that," that nuance can be lost. And especially when it goes from somebody who was there with you having a beer together, when you said, "I don't know why Apache needs Kafka and Pulsar. One or the other should do, right?"
And then you hear back a week later, "Oh, Nygard hates Pulsar." I'm making that one up. That's a fictitious example of a pattern that actually has occurred. And you're like, "No, look, that was just beer talk." But the thing is, when you are in a leadership position, there's no such thing as beer talk. So you have to be very careful with your words, but it also means that you have ways to communicate your intentions, and that people will hear those and resonate with them and communicate them and so on. And they're going to do that whether they agree with your intentions or not. In fact, it may happen faster if they disagree with your intentions. What that means is sometimes the changes you introduce don't have to be org changes. They don't have to be the blunt instrument of changing someone's annual goals, right?
It can be as much as saying, "I think we need to use messaging more and rely less on synchronous communication, because we're moving from one location to multiple, and we can't always wait on the round trip time," right? And you say that 10 or 12 times in different ways and the message gets across. So there's that. But the question you asked was about whether leaders needed to know more about architecture and how that interplay between org and architecture could maybe go both ways. And I think that there's this fun ambiguity in the term enterprise architecture. And I know that it's a term that sometimes is denigrated, and sometimes rightly denigrated. There are a ton of pitfalls. We've all seen the ivory tower EA group. We've all seen the outdated, yet mandated enterprise framework that everyone is supposed to use.
So I get it. But you can interpret the term enterprise architecture in two different ways. One is defining the IT architecture at enterprise scale and visibility. And the other is, it's the architecture of the enterprise itself. And there are some books and some professional orgs that try to make that distinction even clearer to say that enterprise architecture is a discipline entirely separate from enterprise IT architecture. They may have lost that battle, but it is absolutely the case that in my current role, I'm trying to work on both sides of the fence. Advising on the architecture of the organization, the static structure, and the dynamic structure. Also advising on the architecture of the software and trying to make them work together to be congruent.
Gene Kim (01:16:18): Right. Right. In fact, one can even say that the software really needs to be subordinate to the business architecture.
Michael Nygard (01:16:27): Certainly.
Gene Kim (01:16:28): One constructs the software architecture to best serve the architecture of the business. I'm not sure if I'm splitting hairs here, but that seems like a very profound observation.
Michael Nygard (01:16:38): I think trying to ascribe cause and effect or say that one precedes the other is pretty difficult, because the truth is they're both going to co-evolve. What's possible to do in the software architecture is constrained by the current org structure, and what's easy or feasible to do in the org structure is constrained by the software architecture as well.
Gene Kim (01:17:01): But yeah, in any case, right? The goal is to make them mutually reinforcing, congruent, isomorphic.
Michael Nygard (01:17:07): Yes, absolutely.
Gene Kim (01:17:09): As opposed to not, right? Totally. So, therefore, if that's true, then to what extent are leaders responsible for being aware of, creating, and truing up the architecture as it serves the goals of the business? Is there something... Do you delegate to the VP of enterprise architecture, or what do you need from them?
Michael Nygard (01:17:34): Well, that's a great question. I feel like I need to consult with Mark Schwartz before I can really answer that question well but-
Gene Kim (01:17:43): Oh, actually I know how to ask this more directly. So, can you talk a little bit about what it is like being responsible for, quote, "enterprise architecture," and what are the best attributes among the counterparts that you deal with? Whether it's technology leadership, business leadership, yeah. What are the most helpful characteristics in those working relationships?
Michael Nygard (01:18:02): That's a question I can absolutely answer. Thank you. So one of the things is, it's a lot like trying to create a large-scale change in your software through a sequence of an uncountable number of small refactors. So, you extract a method here, you inline a variable there, but you have a picture in mind of where you're trying to go and you want to get there, but everything has to keep working at each step along the way, right? If you're doing refactoring well, there's no point at which your tests all completely fail. And if you're doing your org change well, there's no point at which all your revenue stops, barring global pandemics.
Gene Kim (01:18:53): Right. I'm laughing, but it's not funny.
Michael Nygard (01:18:56): Yeah. I mean, sometimes gallows humor is all we've got, but what I mean is, we're trying to engender continuous change in the org, but without blowing it all up at any point along the way. It has to continue working. And so in doing that, I'm continually working with the other leaders in the company, talking with them all the time. And one of the things I've found is that most of them would actually like to know more about why things are the way they are. So I won't say that they're all happy with IT, or happy with tech, or product development. There's plenty of frustration to go around. But in my experience, the frustration doesn't just lead them to pull the lever that opens the trap door that drops you into the shark tank. What it means is they would like to know why. Why is it like this? And I don't know, you can probably tell that one of the things I like to do is teach and share what I've learned and what I know.
So there are people in my company that I spend an hour with a week just talking to them about what the nature is of our systems and of development in general. And very often after a few of these, they start to go, "Wait a minute, when you're talking about these big batches going into development and causing these problems, that's my BRD that I wrote. That was hard work. We wrote 300 pages in the BRD." I'm like, "Yeah, yeah. Imagine if you could have written one page and just had a conversation about that. Wouldn't that be easier? And imagine then that like two weeks after that conversation, you could see that bit running. Even if it doesn't go all the way to your customer, because they have to take a long time to adapt to the change or adopt the change, but you can see it running. Wouldn't that be better?" It isn't necessarily as fast a process as we would always like, but it's building that Rebel Alliance, except it's a Rebel Alliance of people who are VPs and above.
Gene Kim (01:21:11): Right. And so that story seemed to speak to the process, the nature of the way we build software. I have to imagine there are equally big aha moments that deal with the architectural processes. So it's not about the methods, but the orderly construction and patterns of action, the organizing logic. Can you share one like that?
Michael Nygard (01:21:36): So this one gets a little tougher and I'll take an anonymized example. Let's talk about a company that operates in many geographies with many customers who take different forms of payment. So one way to deal with that is to maintain a table internally that says, "I know these customers take these forms of payment. And so I'm only going to present the options in the GUI that I know that they take." Another way to approach it would be to say, "I'm going to let them tell me what forms of payment they take, either at request time or whenever they feel like it. They can just send me an update to what forms of payment they want to take." The first way requires us to do something when they change, right? Either it's baked into the code, or it's in a database table or a config file, depending on how rapidly we expect it to change.
The second one puts the control on the customer side. There can be valid business reasons to go one direction or the other. So deciding that as an architecture decision alone, doesn't actually solve it because sometimes we have to have a contract in place to accept a form of payment on their behalf. Sometimes there are government restrictions that we may not know about ahead of time to take a form of payment on their behalf.
Gene Kim (01:23:05): And so these would cause one to constrain the choices so that we don't get into a position where we're accepting or processing something we shouldn't, right? That's the direction that takes us-
Michael Nygard (01:23:17): Right.
Gene Kim (01:23:17): Yep. Okay. This is great.
Michael Nygard (01:23:18): Now I'm going to add a wrinkle. By having that connection with the leaders on the business side as well, you can present them another alternative, which says, "I'm going to let the transaction processing software call out to ask if this form of payment is accepted for this carrier." But that's going to be another service of ours. So it's a middle ground where we are maintaining the information, but the service that's applying the logic isn't the one that is maintaining the database of information. And so now at this point, it's a little like the scene in Amadeus where the guy looks at the score and says, "I don't like it. Too many notes." Because now we're talking about more parts-
Gene Kim (01:24:12): Was that about Salieri or Mozart?
Michael Nygard (01:24:14): It was one of the court musicians about Mozart, I think. So it looks like more parts. It looks like more complexity, which makes people think it's going to take longer to build. But you and I know from our work in Clojure that sometimes, when you separate things into different concerns that are each extremely simple to build, the net time for building it is less. And the net complexity is less. And so that's a place where we would need to talk to the product leader about the architecture and explain that doing it this way looks like a net increase, but it's not. And you're creating benefits for the next iteration down the road and the iteration after that.
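The "middle ground" alternative can be sketched in a few lines of Python: one component maintains which forms of payment each carrier accepts, and the transaction processor only asks. This is a hedged, in-process illustration; in the arrangement described, the lookup would be a separate service reached over the network. The names `AcceptanceLookup`, `AcceptanceService`, `TransactionProcessor`, and the sample carrier and payment values are all hypothetical.

```python
from typing import Dict, Iterable, List, Protocol, Set


class AcceptanceLookup(Protocol):
    """Hypothetical question the transaction processor is allowed to ask."""

    def accepts(self, carrier: str, form_of_payment: str) -> bool: ...


class AcceptanceService:
    """Owns the data about which forms of payment each carrier accepts."""

    def __init__(self) -> None:
        self._accepted: Dict[str, Set[str]] = {}

    def update(self, carrier: str, forms: Iterable[str]) -> None:
        # Who calls this (us, or the customer) is the business decision;
        # the transaction processor does not need to know either way.
        self._accepted[carrier] = set(forms)

    def accepts(self, carrier: str, form_of_payment: str) -> bool:
        return form_of_payment in self._accepted.get(carrier, set())


class TransactionProcessor:
    """Applies the logic but does not maintain the acceptance data."""

    def __init__(self, lookup: AcceptanceLookup) -> None:
        self._lookup = lookup

    def payment_options(self, carrier: str, candidates: List[str]) -> List[str]:
        return [f for f in candidates if self._lookup.accepts(carrier, f)]


service = AcceptanceService()
service.update("carrier-a", ["card", "invoice"])
processor = TransactionProcessor(service)
options = processor.payment_options("carrier-a", ["card", "cash", "invoice"])
```

There are more parts here than a single hard-coded table, but each part is small, and a change in who maintains the data touches only `AcceptanceService`.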
Gene Kim (01:25:04): That's so great. Yeah, on my face, if you could see me, I have this giant grin as Michael is explaining to me that it is not a bad thing that there are too many notes in the score, but that there actually could be a great business advantage, and simplification, and maintainability, and so forth. So how does it feel when you can successfully persuade someone to go towards a simpler design?
Michael Nygard (01:25:32): Well, those are the days that I feel like I might actually be qualified for this job, as opposed to the other days where you get blindsided by something and you're like, "Man, what? I don't even know what I'm doing here."
Gene Kim (01:25:48): Gene here again. Oh my gosh. I love that last bit because it shows the sophistication and understanding of the problem domain and the sensibilities of what makes great architecture, how doing things in a way that looks more complex actually makes things simpler and easier to build upon in the long term. Listening to him, I actually had a flickering flash of insight and understanding, meaning it was exhilarating to see what he saw, but I'll admit I didn't actually fully understand it. In other words, I certainly couldn't explain it to someone else at that moment in time. And if you didn't quite get it, fear not. I did a follow-up interview with him where I got to ask him to explain this concept in more detail. And it was fascinating. I think it's something that every senior leader also needs to understand. I'll be airing that follow-up interview in the weeks to come, just as we did for the Steven Spear interview series.
Okay, back to the interview. You had this magnificent quote when you were interviewed by Carin Meier on the Cognicast podcast about how engineers have to be able to talk in the language of business. We need to know something about how business people work: budgeting, capital versus expense, balance sheets, time value of money, getting benefits now versus getting benefits later. And you had this great concrete example of, yes, we need to hit some date in July, but we then need to take two weeks after that before we do anything with the single sign-on issue, presumably to pay down technical debt. Mike, can you talk about that quote and why it's important to you, and any advice you would have for technologists about how better to communicate with business leadership?
Michael Nygard (01:27:26): I'm glad that you enjoyed that discussion because I think it's hugely important. And since that interview on the Cognicast, I've given even more thought to the way that engineers communicate. And so I want to go a little further. When engineers talk to other engineers, they don't want to be seen as condescending or belittling the skills of the other engineers. So what they'll tend to do is describe the situation and leave it to the other person to infer the consequences, or assume that the other person understands the consequences. And so I might say something to you like, "Our Oracle RAC cluster is supposed to be four nodes, but three of them are running five minutes behind on replication." And as someone with an engineering background, you go, "Holy crap, five minutes of backlog. That's terrible. We've got to fix this right now." But someone else might hear it and go, "Okay. And Oracle RAC is what? That's where we hang coats. It's in a bakery. What rack is this?"
Sometimes as engineers, when we're communicating to people outside the engineering discipline, we need to remember that the person we're talking to may not have the experience, or background, or baseline knowledge to draw the conclusions we want them to draw. And so at the risk of seeming pedantic, you have to develop a personal style that lets you do this where they understand you're trying to help, not condescend. You go, "So if we have a problem with our database right now, we're going to lose the last five minutes of order data, and we might not be able to recover the last hour's worth of orders." And then they go, "Oh, that sounds bad." And then you translate it to dollars.
And you're like, "Yeah, the last hour we took $97,000 worth of orders." And now you're beginning to speak the language that they speak. Time, and money, and schedules, and resources. Gregor Hohpe talks about the architect's elevator. That, especially as an architect, you have to be able to ride the elevator from the boiler room where you're talking to the people who have their hands on the real work, they keep things running and you ride the elevator up to the boardroom where you're talking to the people who, to them, the boiler room is an abstraction. And as long as the heat is on, they're not too worried about how it happens. And so you have to be able to speak both those languages. So yeah, I do talk a lot about engineers translating into the language of the leaders they're talking to.
There's one more point, though, that I'd like to make, which is something engineers don't realize: most communications in business are negotiations of some kind. And so when you are simply stating facts or answering a question, you may not realize you're actually engaged in a negotiation. So someone will ask an engineer a question that appears to be a technical question. "Could you do it in eight weeks instead of 20?" You're like, "Well, we could, if we do X, Y, Z, and W," where the consequences of X, Y, Z, and W should be horrifying. But the person hearing it doesn't have the baseline knowledge to draw those conclusions, right? [inaudible 01:31:05] "Great, do it. You've got six weeks." "Wait, what do you mean six weeks? I thought it was eight." "I'm sure you'll have it all done in four weeks, and I'll talk to you then." "What?"
Gene Kim (01:31:15): Exactly.
Michael Nygard (01:31:17): And so, one tactic is the engineer needs to recognize that as a negotiation and counteroffer. "Could you do it in eight weeks instead of 20?" "Absolutely not. No way. In fact, we really need 24, or it's going to be screaming hellfire around here." Or counteroffer with a request for the other things you need. Say, "We can do it in eight weeks, and then an extra 20 weeks to deal with issues X, Y, Z, and W, because each of them is costing us 10,000 hours a year in maintenance," right? So again, these seem like simple strategies, but there are definitely patterns that I've seen over and over again.
Gene Kim (01:32:01): I love it. Mike, I got to tell you, I have a big grin on my face and I'll have a smile on my face for the rest of the day, because you've explored in such detail things that I so much care about. And just to be able to hear your experiences and learnings over the years, it's just an honor and delight beyond words. So thank you so much for that. Mike, tell everyone how they can reach you.
Michael Nygard (01:32:23): Well, I'm on Twitter. I'm pretty active there as M-T-N-Y-G-A-R-D. And you can use [email protected] to reach me by email.
Gene Kim (01:32:34): Awesome. Thank you, Mike. Thank you so much for listening to today's episode. Up next will be Dr. Steven Spear's DevOps Enterprise Summit presentations, both from 2019 and 2020, where he talks about the need to create a rapid learning dynamic, as well as how to create one. The 2019 presentation talks about many of the case studies we talked about today, but in more detail. And in 2020, he talks about one of the most remarkable and historic examples of creating a dynamic learning organization at scale, which was in the US Navy at the end of the 19th century, at the confluence of two unprecedented changes. One was in the underlying technologies, which you found in ships, and the other was in the strategic mission that they were in service of. As usual, I'll add my reflections and reactions to those presentations. If you enjoyed today's interview of Steve, I know you'll enjoy both of those presentations as well.