THE IDEALCAST WITH GENE KIM

(Dispatch from the Scenius) Fabulous Fortunes, Fewer Failures, and Faster Fixes from Functional Fundamentals

Episode 22
|
Scott Havens
|
Director of Engineering, Wayfair
|
38m

- Intro

Fabulous Fortunes, Fewer Failures, and Faster Fixes from Functional Fundamentals: Scott Havens's 2019 DevOps Enterprise Summit Talk with Commentary from Gene Kim

In this episode of The Idealcast, Gene Kim shares and gives commentary on Scott Havens’ talk from the 2019 DevOps Enterprise Summit Las Vegas. Havens is a Director of Engineering at Wayfair, where he leads Engineering for the Wayfair Fulfillment Network. He is a leading proponent of applying functional programming principles to technical and organizational design. Previously, Scott was the architect for Walmart's global omnichannel inventory system, unifying availability and replenishment for the largest company in the world by revenue.

In his 2019 DevOps Enterprise Summit talk, Havens explains what functional programming is and how e-commerce systems work. He also talks about what he did to massively simplify those systems while making them more testable, reliable, cheaper to operate, and easier to change. Finally, he discusses the implications of using functional programming to change how systems, and systems of systems, are designed at larger scale.


- About The Guests
Scott Havens


Director of Engineering, Wayfair

Scott Havens is a Director of Engineering at Wayfair, where he leads Engineering for the Wayfair Fulfillment Network. Scott cares deeply about scalable data-intensive software systems; he is a leading proponent of applying functional programming principles to technical and organizational design. Previously, Scott was a Director of Engineering at Jet.com and was the architect for Walmart's global omnichannel inventory system, unifying availability and replenishment for the largest company in the world by revenue. In his home life, Scott enjoys good food, good wine, bad movies, and asking his daughter to stop "redecorating" his Minecraft castles, pretty please.


- You'll learn about
  • What functional programming is.
  • How e-commerce systems work.
  • What Havens did to massively simplify those systems while also making them more testable, reliable, cheaper to operate, and easier to change.
  • The implications of using functional programming to change how to design systems and systems of systems on a larger scale.

- Resources

- Transcript

Gene Kim (00:00): Welcome back to The Idealcast. Our sponsor today is ServiceNow, and I'm grateful that they have also been a long-time supporter of the DevOps Enterprise Summit, the conference for technology leaders of large, complex organizations. Join us there and visit the ServiceNow booth to see all the ways they're supporting the DevOps enterprise community. If you need to eliminate friction between dev and ops, head to servicenow.com/devops to find out more. Gene Kim (00:27): You're listening to The Idealcast with Gene Kim, brought to you by IT Revolution. I'm so excited that my next guest is Scott Havens, who is currently director and head of Wayfair Fulfillment Network engineering. Scott has taught me so many things about great architectures and functional programming, both at the scale of small systems, such as one program running in isolation, and at large scales, such as the inventory management systems that power Walmart, the world's largest company. Gene Kim (01:03): So two episodes ago, I got to talk to Dr. Steven Spear about just-in-time supply chains, both in support of manufacturing and for supply chains in general, and how much they rely on modularity and information hiding. And then I got to talk to Dr. Gail Murphy about modularity and information hiding in software, and how it has enabled, in open source, a vast set of interdependent libraries that are able to co-evolve independently. So in the next two episodes, I get to talk with Scott Havens about how he rewrote the inventory management systems that power Walmart at mammoth scale, and what it is like to be part of a vast physical supply chain from the perspective of a retailer. Gene Kim (01:44): To kick this off, in this episode, we're going to hear Scott's amazing 30-minute talk that he gave at the 2019 DevOps Enterprise Summit Las Vegas conference. This is one of my favorite presentations that I've ever heard, because I learned so much about how e-commerce systems work and what he did to massively simplify those systems while also making them more testable, reliable, cheaper to operate, and easier to change. And in the next podcast episode, you'll hear me interview Scott as we dive deeper into almost every element of his presentation. We'll learn more about his views on architecture, more gruesome details on what happens when you need 23 deeply nested synchronous remote procedure calls to present information to the customer, how one actually implements event sourcing patterns at that scale, and the vast challenges of managing inventory at Walmart, which is a vast supply chain in its own right. Okay, let's go to his talk, which I hope you enjoy as much as I do. Gene Kim (02:55): All right, to motivate the next talk, I want to tell you just a little bit about something that influenced me a lot. About three years ago, I learned a language called Clojure and it changed my life. It was probably one of the most difficult things I've learned professionally, but it's also been one of the most rewarding. It brought the joy of programming back into my life. Gene Kim (03:13): For the first time in my career, as I'm nearing 50 years old, I'm finally able to write programs that do what I want them to do, and I'm able to build upon them for years without them falling over like a house of cards, which had been my experience for nearly 30 years. The famous French anthropologist Claude Lévi-Strauss would ask of certain tools, "Is it good to think with?"
And for reasons that I will try to explain in the next five minutes, I believe functional programming and things like immutability are truly better tools to think with, and they have taught me how to prevent myself from constantly sabotaging my code, which I had been doing for decades. I'm going to make the astonishing claim that these things have eliminated 90% of the errors I used to make. So I'm going to try to motivate why. Gene Kim (03:56): About a year ago, I found this amazing graphic on Twitter that describes the difference between passing variables by value versus passing variables by reference. When I was in graduate school in 1993, most mainstream languages supported only passing things by value, which meant that if you passed a variable to a function and changed it within the function, you would only change your local copy. Often, this meant that you had to return the new state, and if it was a structure or a large object, that meant a lot of copying. This is tedious, error-prone, and very time-consuming, and I often found myself complaining about it, wishing there were a better way. It turns out you could eliminate this by using pointers, but pointers are now considered so dangerous that few languages besides C++ and assembly even let you use them. In 1995, I was introduced to a huge innovation in programming languages called passing values by reference. This showed up in C++, Java, and Modula-3. It allowed you to change the value that was passed to you as a parameter, and it would change the value for the caller who passed it in. This seemed really great. I loved it, because it was such a time-saver and it let you write less code. But three years ago, I changed my mind. Gene Kim (05:09): So Clojure is in a category of languages called functional programming languages. Haskell and F# are part of the same category and have the same sensibility. They don't let you change variables. Functions need to be pure: a function always returns the same output given the same inputs, and there are never any side effects. You're not allowed to change the world around you. You're not allowed to write to disk, and even reading from disk is not allowed, because what you read back is not always the same. And this was one of the biggest aha moments of programming for me, because it taught me how terrifying passing variables by reference should be. Because when you see this, what you really should be seeing is this. It's like, why is my coffee cup changing? Who is messing with my coffee cup? And how do I make them stop? The point here is that it's very difficult to understand your code and to reason about what is happening when anyone can change your internal state. Gene Kim (06:00): You may have heard of heisenbugs, where even the mere act of observation changes the result. These are the hallmarks of multithreading errors, which are considered to be among the most difficult problems in distributed systems. It's like I'm staring at my coffee cup and I can't figure out how to get it to fill up again, right? I feel like I need to replicate the problem. So in the real world, uncontrolled mutation makes things extraordinarily difficult to reason about, because other people can put anything they want in your coffee cup.
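[Editor's note: to make the contrast concrete, here is a tiny F# sketch of the immutable style Gene is describing. It is our illustration, not code from the episode; the CoffeeCup type and all names are invented.]

```fsharp
// F# records are immutable by default: "updating" one returns a new value,
// so nobody can change your coffee cup behind your back.
type CoffeeCup = { Owner: string; Milliliters: int }

// A pure function: same input, same output, no side effects.
// The original cup is untouched; we return a new cup instead.
let drink (amount: int) (cup: CoffeeCup) : CoffeeCup =
    { cup with Milliliters = max 0 (cup.Milliliters - amount) }

let mine = { Owner = "Gene"; Milliliters = 250 }
let afterSip = drink 50 mine

printfn "before: %d ml, after: %d ml" mine.Milliliters afterSip.Milliliters
// prints "before: 250 ml, after: 200 ml"; `mine` never changed
```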
John Carmack, who wrote Wolfenstein 3D, Doom, and Quake, gave this amazing keynote at the QuakeCon conference in 2013, saying, "A large fraction of the flaws in software development are due to programmers not fully understanding all the possible states their code may execute in. In a multithreaded environment, the lack of understanding and the resulting problems are greatly amplified, to the point of panic if you're actually paying attention." Gene Kim (06:47): So the point here is that in the real world, it's not just your coffee cup. You're operating in a universe of coffee cups. And if you zoom out, there are many, many more coffee cups around that. If anyone can change your state because they have a reference to it, it becomes almost impossible to reason about. Under these conditions, it's almost impossible to understand what is actually happening and how to make things truly deterministic. This is one of the beliefs that functional programming truly taught me: that uncontrolled state mutation is at the very limits of what humans can reasonably understand, test, and run in production. The programming languages that pioneered functional programming techniques include Haskell, OCaml, Clojure, Scala, Erlang, and Elm, and [inaudible 00:07:32] is becoming increasingly popular. Gene Kim (07:35): What I find so exciting is that these concepts are now showing up in infrastructure as well. Docker is immutable, right? You can't change containers. If you really want to make a change that persists, you have to make a new container. Kubernetes uses this concept not in the small, but in the large, for systems of systems. If you see Apache Kafka, chances are it's being used for an immutable data model that says you're not allowed to rewrite the past. It turns out version control is immutable, right? You get yelled at if you actually rewrite history. So I'm going to introduce the next speaker, Scott Havens. As we were talking about this slide, he said, "Everyone knows now, as Dr. Dijkstra said, that goto statements are considered harmful to program flow." He said, "It is without a doubt that uncontrolled state mutation will surely, within our generation, be considered the next goto." So one is for code, one is for data. Gene Kim (08:25): So the next speaker is Scott Havens. Until very recently, he was director of software engineering at jet.com and Walmart Labs. His remit was to rebuild the entire inventory management system of Walmart, the world's largest company. He earned this right through the amazing work he did building the incredible systems that power jet.com, a company that Walmart then acquired. That work powered inventory management, order management, transportation, available to promise, available to ship, and tons of other critical processes that must all operate correctly to compete effectively as an online retailer. Gene Kim (08:57): He's now senior director, head of supply chain technologies at Moda Operandi, an upscale fashion retailer. And I hope what he presents will blow your mind as it blew my mind, showing that functional programming principles apply not just in the small, in a program, but can be applied at the most vast scales, such as the Walmart enterprise. With that, Scott Havens. Scott Havens (09:16): Good morning, DevOps Enterprise Summit. I'm really excited to be here today and talk about something that's really near and dear to my heart. My name is Scott Havens. I'm a senior director and head of supply chain tech at Moda Operandi.
It's a fashion e-commerce company that was founded in 2010. Our mission is to make it easy for fashion designers to grow their business and for consumers to recognize their personal style. I joined Moda because fashion supply chains are notoriously challenging, and I'm excited about how we can use technology to improve time to market, lower costs, and even help designers predict next season's fashion trends before the season starts. However, I just joined two weeks ago, so I'm going to supplement a lot of my discussion today with my experiences prior to Moda. Scott Havens (10:03): Before Moda Operandi, I was an architect at Walmart, the largest company in the world by revenue, over half a trillion dollars a year, and by number of employees, 2.3 million to be precise. I was responsible for designing and building supply chain systems like inventory management for Walmart, including the 4,500 stores in the US, e-commerce properties like walmart.com and owned brands like jet.com, and international markets. I joined Walmart via the acquisition of jet.com three years ago, for $3.3 billion, at the time the largest e-commerce acquisition to date. Scott Havens (10:44): One of the reasons that Walmart bought Jet is that the Jet tech stack looked transformative. It was cloud native, microservice-based, event sourced, and fundamentally it was based on functional programming principles. It looked cool. But not everyone is convinced by just cool. They didn't know if Jet's techniques were just the latest buzzwords or if they provided real-world benefits. Well, it wasn't long before we were fortunate enough to get the chance to demonstrate those benefits. And when I say fortunate, what I really mean is that disaster struck. Scott Havens (11:21): About three years ago, in the middle of the night, I got paged for a system alert. I woke up, hopped onto the phone bridge and our PagerDuty Slack channel, and started looking into it. Almost immediately, I was joined by coworkers from several other teams. It turned out our production Kafka cluster was down. If you're not familiar with Kafka, it's a very scalable pub-sub messaging system. We used it as the primary method of communication among all of our backend services. Scott Havens (11:48): Before too long, we realized that the cluster wasn't just down, it was dead. It was an ex-Kafka cluster. Every single message in flight was gone. Customer orders, replenishment requests, catalog changes, inventory updates, warehouse replenishment notifications, pricing updates, every single one, just gone. We were going to have to rebuild the cluster from the ground up. Now, this could have been catastrophic. This could have been the end of the grand Jet experiment, enough to convince our new Walmart compatriots that Jet's technical tenets sound good on paper but don't work in a real enterprise compared to tried-and-true systems. So what happened? Well, first we rebuilt the cluster. New brokers deployed in minutes via Ansible scripts. While this was happening, we coordinated with all the teams who manage the edge systems, the systems that are exposed to the outside world, like merchant API inputs and customer order input. These edge systems, like all the others, are event sourced. Each of these teams reset the checkpoints in their event streams to a point in time just prior to the outage. All of the events after that point were re-emitted to all the downstream consumers. And when these checkpoints were set back, there was some overlap with messages that had already been sent and processed downstream. But the downstream systems were all designed, and had been fully tested, to handle duplicates and act idempotently, even when messages arrived out of order.
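[Editor's note: the replay recovery Scott describes only works because consumers tolerate duplicates and reordering. Here is a minimal F# sketch of one way to do that; the event shape and all names are our invention, not Jet's actual schema.]

```fsharp
open System

type InventoryEvent =
    { EventId: Guid        // unique per event, so duplicates are detectable
      Sku: string
      QuantityDelta: int } // deltas commute, so arrival order doesn't matter

type SkuState = { OnHand: int; Seen: Set<Guid> }

// Applying an event that has already been seen is a no-op, so resetting a
// checkpoint and re-consuming an overlapping range of the log is safe.
let apply (state: SkuState) (evt: InventoryEvent) : SkuState =
    if state.Seen.Contains evt.EventId then
        state
    else
        { OnHand = state.OnHand + evt.QuantityDelta
          Seen = state.Seen.Add evt.EventId }

// Replaying any window of the stream, in any order, converges to the
// same state.
let replay = List.fold apply { OnHand = 0; Seen = Set.empty }
```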
These downstream systems were hit by a flood of messages, but we were able to just scale them out in seconds, some automatically, some manually, to handle the throughput and stay entirely within our SLA. Scott Havens (13:26): In the end, this potentially catastrophic event was little more than a minor annoyance. No data was lost, and not a single customer order was delayed. Walmart was happy that their $3 billion wasn't wasted on worthless tech, and it afforded us, coming from the Jet side, the opportunity to examine the rest of the Walmart technology ecosystem and see where we could provide value. Scott Havens (13:50): Now, Jet, as a startup, had the advantage of being completely greenfield and focused in its business. Walmart, on the other hand, had built its incredibly successful and wide-ranging business over many decades, requiring a number of different stacks and technologies. What we found was an organization and an architecture of enormous complexity and cost. Now, I'm not going to attempt to capture the entire mammoth business that is Walmart, or even any other e-commerce company, here. Instead, I'm going to dig into just one common small piece of e-commerce website functionality. Scott Havens (14:27): Our customer Jane wants to buy a cocktail dress for an upcoming party. She wants to know if it's available in her size. It doesn't have to be in a store or a warehouse nearby. It can be anywhere, as long as it can be shipped to her. This item availability is served via an API. When she checks her favorite e-commerce site, it can't be down or take too long to load, because competitors' websites are only a click away. Scott Havens (14:52): So our item availability API has an SLA of, let's say, 99.98% uptime. That's just shy of two hours a year of permissible downtime, at, say, 300 milliseconds latency. What factors go into this item availability? The first ones that may come to mind are the inventory in the warehouse and any reservations that may exist from existing orders. But there's a lot more to it than that. In addition to the warehouse inventory, you might have the store inventory on the floor, or the inventory in the back room of the stores. If you are a marketplace, you might have the inventory of many third parties, could be thousands of different third parties. You have to look at the item and see if you're even eligible to sell it on this site. Just because it's sitting in a warehouse doesn't mean that you're permitted to sell it. Scott Havens (15:40): There is warehouse eligibility. Perhaps a certain warehouse that has your item isn't permitted to ship to a certain area, or is not allowed to sell on a particular website. There are sales caps, where you might have limits on how many you're permitted to sell in a particular timeframe, like maybe a cap of a thousand during some kind of discounted special. And then there are back orders, for all the orders that already exist that weren't able to be filled originally, if the customers still want them. And for every single one of these factors, at a large enough organization, there are going to be legacy systems that have duplicates of all this information, which you need to consider as well.
Scott Havens (16:18): So how do we add all of these things together to give Jane her answer? A common model is service-oriented architecture, or SOA, in which we decompose each of these factors into a service. You call each of these services on demand, in real time, to get the information you need. What does that look like here? Scott Havens (16:38): Well, now I have the pleasure of showing you one of the ugliest diagrams I have ever made. And don't worry, I don't expect you to memorize this or even be able to read it. The complexity is the point. You can still see, at the top, the website calling the item availability API. Each of the item availability factors that I listed is represented somewhere on here by a service, which may depend on other services. And to give you a sense of scope, each one of these boxes is a whole system or multiple systems, each maintained by one or more whole teams. So let's walk through what happens when Jane looks for her dress. Gene Kim (17:14): Gene here. I just want to verbally describe the slide. At the top you have walmart.com, which needs item availability, which calls other services, which call other services, which call other services. It looks like five or six levels deep. So that's the before state. And I'll be back at the end of the presentation to verbally describe the after state. Okay. Back to the presentation. Scott Havens (17:38): At the top, highlighted in red, the customer-facing website calls the item availability API. That general API calls the global item availability API, which checks its cache, doesn't find it, and falls back to other services, which call other services, and more, until we can finally compute the answer for Jane. Scott Havens (17:58): So let me save you some time on the math. To get the dress availability in under 300 milliseconds 99.98% of the time requires 23 service calls, each of which has five nines of uptime and a 50 millisecond marginal service level objective. Without every single one of these services working correctly, it is impossible to know if an item is available. You're better off not even guessing with partial information; that's better than risking telling the customer the wrong answer. Scott Havens (18:32): To be blunt, an outage in any one of these services takes down the entire availability API. Because each of these systems has business logic that is so tightly coupled to so many other systems, it's extremely difficult to properly test them. Unit testing covers a tiny fraction of the space of potential errors. And relying on integration tests to fully vet something this complex is costly and absurdly ineffective. Further, each of these systems was fundamentally designed internally in a traditional manner. As changes happen, the current state, usually stored in a relational database, is mutated in place. And there's an expectation, correct or not, that servers are reliable and will only be shut down or restarted with permission. And we all know how well that works.
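[Editor's note: Scott's numbers are easy to verify. Here is the composite-availability arithmetic worked through in a quick F# calculation, our addition, not part of the talk.]

```fsharp
// 23 chained calls, each with five nines of uptime:
let perCall = 0.99999
let composite = perCall ** 23.0
// composite ≈ 0.99977, i.e. roughly the 99.98% SLA of the overall API

let downtimeHoursPerYear = (1.0 - composite) * 365.25 * 24.0
printfn "uptime %.5f, ~%.1f hours/year down" composite downtimeHoursPerYear
// prints "uptime 0.99977, ~2.0 hours/year down": just shy of two hours a year
```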
Scott Havens (19:19): So how do we go about tackling these problems? Can we take what we learned at Jet and extract lessons? Further, these problems probably aren't unique. Can we ensure that these lessons are broadly useful to anyone, or any company, that might suffer similar problems? The jet.com way of approaching these problems was to look at them through the lens of functional programming. So let's walk through these principles and learn what the implications are for system design. Scott Havens (19:45): There are many principles, but I'm going to focus on just a handful today. Let's start with immutability, the idea that the inputs don't change. The functions that take these inputs produce outputs that are also immutable. State is not directly mutated. Scott Havens (20:03): We embrace purity. We avoid writing functions that produce side effects, no writing to disk or network until the last possible moment, and we strictly control those side effects when we do. This makes it easier to reason about the code and test the code. Scott Havens (20:17): The external world outside the function can't affect the results, and the function won't affect the external world. This makes the function very predictable and repeatable. Given an input, the output will be the same every time. And that repeatability unlocks a principle called the duality of code and data. It's a fancy way of saying that code and data are interchangeable. Scott Havens (20:41): A function that accepts parameter A and computes output B could be replaced with a lookup table with a key of A and a value of B. Conversely, a really big lookup table that takes up gigabytes of space mapping A to B could be compressed into a function that computes B from A. You can go back and forth between the two. Scott Havens (21:02): Now, Gene did a great job introducing some of these principles and showing how they work in the small, when you're writing code. We took these same principles and applied them in the large, changing how we design systems and systems of systems. Let's walk through some of these results. Scott Havens (21:18): Starting with immutability, you get message-based, log-driven communication. The first part of this, the message-based part, is pretty ubiquitous. Systems communicate with each other via messages, synchronously over HTTP or asynchronously over some kind of queue. Scott Havens (21:34): Log-based pub-sub systems like Kafka, AWS Kinesis, and Azure Event Hubs take this a step further. Not only do the messages themselves not change, but they are ordered and retained for an extended period of time, even after you've consumed them. The consuming services keep track of their own progress via checkpoints into the log. So what does this mean? Scott Havens (21:57): Imagine you suffer an outage that causes you to lose the last day's worth of transactions, or even worse, you've introduced a bug in your code that corrupts data. You can deploy your fix and reset the checkpoint to the point in time before the bug was introduced. This will force your consumer to replay all of the subsequent messages, re-consuming them with corrected code and fixing your corrupt data. This approach drastically improves your mean time to recovery on an entire category of production [error 00:22:27]. Scott Havens (22:27): Now, at Walmart, we replaced HTTP calls, queues, and even enterprise service buses with Kafka. And at Moda, we're using AWS Kinesis to the same end. Scott Havens (22:37): With immutability, you also get event sourcing. Events are facts about something that happened in the world. Once an event occurs, it always will have occurred. It doesn't change because, by definition, it's already happened. Scott Havens (22:52): In an event-sourced system, events are first-class citizens. The canonical data store consists of ordered streams of events. The current state is secondary, a consequence of the events. You use the stream of events to build the current state by aggregating over all of them. Bank accounts are an obvious example of this approach. Your account balance, your current state, is the result of summing over every deposit and withdrawal that has ever happened.
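[Editor's note: here is what that aggregation looks like as a minimal F# sketch of event sourcing. This is our illustration rather than code from the talk: current state is just a fold over the immutable event stream.]

```fsharp
// The canonical store is the ordered event stream; the balance is derived.
type BankEvent =
    | Deposited of decimal
    | Withdrew of decimal

let applyEvent (balance: decimal) (evt: BankEvent) =
    match evt with
    | Deposited amount -> balance + amount
    | Withdrew amount -> balance - amount

let balanceOf (events: BankEvent list) = List.fold applyEvent 0m events

let history = [ Deposited 100m; Withdrew 30m; Deposited 5m ]
printfn "balance: %M" (balanceOf history) // 75
```

Because the fold is deterministic, the state after event N never changes, which is what makes the snapshot optimization Scott describes later in the talk safe.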
Scott Havens (23:22): Event sourcing, storing the events this way, is extremely powerful. It effectively gives you a time machine. You can see the state for any point in time, and you can walk step by step through everything that's ever happened. This is fantastic for troubleshooting. You can validate behaviors that people are observing. Scott Havens (23:42): People may report that they saw a problem at a specific time. It could be days, weeks, even months after the fact. And we can go back in time, re-observe it, and perform a root cause analysis. Further, event sourcing unlocks entire new areas of analytics. We've found that our marketing teams love having this kind of data about everything that's happened over time. And our operations and audit people love knowing exactly everything that's ever happened. Scott Havens (24:08): Our goal of purity means that we isolate computations from the real world. We write all business logic as stateless functions with zero external dependencies. That means zero I/O. Instead, collect all the state you need upfront and pass it into your business logic as parameters. That statelessness, that isolation of the computation, gives it predictability and atomicity. There are no random outcomes, no so-called heisenbugs. And there are no partial results. Scott Havens (24:40): Real-world failures, and in the cloud you are constantly dealing with real-world failures, may keep your code from running, but they will never affect correctness or consistency. Scott Havens (24:51): Now, because the business logic doesn't have side effects, because it is pure, 100% of the domain logic is unit testable. You can provably identify every single path through the business code and write unit tests for it. Not just write unit tests, but create executable specifications. You can define invariants from your specification explicitly as properties. Scott Havens (25:15): For example, we say that inventory counts should never be negative. That is an invariant. These properties can be checked automatically with large numbers of randomized inputs extremely quickly. Spec-based and property-based testing frameworks that do this are available for most languages, but to work well, they depend on your code being stateless. And if you do this well, integration tests are only needed for establishing basic connectivity between services. You can test much more thoroughly in less time and for less cost.
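[Editor's note: FsCheck is a widely used property-based testing framework for F#, the kind of tool Scott is referring to. The toy inventory model and property below are our invention, a sketch of an "executable specification," not Panther's actual code.]

```fsharp
open FsCheck

// A toy inventory domain. Quantities are clamped so the model is total
// for any generated int.
type InventoryOp =
    | Receive of int
    | Reserve of int

let applyOp (onHand: int) (op: InventoryOp) =
    match op with
    | Receive n -> onHand + max 0 n          // ignore nonsensical negatives
    | Reserve n -> max 0 (onHand - max 0 n)  // business rule: never go negative

// The invariant as an executable specification: for ANY sequence of
// operations, on-hand inventory is never negative.
let inventoryNeverNegative (ops: InventoryOp list) =
    List.fold applyOp 0 ops >= 0

// FsCheck generates randomized operation sequences and checks the property.
Check.Quick inventoryNeverNegative
```

Note that this only works because `applyOp` is pure: no database, no network, just parameters in and a value out.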
Scott Havens (25:48): Now, you can't remain pure forever. Once your business logic is complete and you have a result, you have to do something with it. But don't make any more changes than you absolutely have to in this process. Write it to one, and only one, place. You may be tempted, and this happens all the time, in the same process, to write to a database and then notify a downstream consumer about that change, maybe via a queue. Don't. This is called a dual write. Scott Havens (26:20): In a distributed environment like the cloud, failures can and will happen at any point. As soon as one of those writes succeeds and the other fails, your system is in an inconsistent state. Dual writes take all the hard work you did to get guaranteed outcomes and toss it out the window. Instead, the safe way to accomplish this is via change data capture. The result is that an event is published downstream if, and only if, it's been committed to the database. This ensures eventual consistency. Scott Havens (26:53): In failure scenarios, you may fall behind publishing, but you'll never lose events. You'll never lose events in your data store, and you'll never fail to tell your downstream consumer. Different databases support this in different ways. Walmart now uses the Azure Cosmos DB change feed for this. And at Moda, we use Kinesis Streams.
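[Editor's note: a minimal, in-memory F# sketch of the "publish if and only if committed" idea. In a real system the store would be a database and the tail would be its change feed, such as the Azure Cosmos DB change feed or Kinesis that Scott mentions; the queue and all names here are stand-ins we invented.]

```fsharp
open System.Collections.Concurrent

type OrderEvent = { OrderId: string; Status: string }

// The ONE write: append to the canonical store. There is no second,
// racing write to a message queue, so there is no dual write to fail.
let store = ConcurrentQueue<OrderEvent>()
let save (evt: OrderEvent) = store.Enqueue evt

// A separate publisher tails the committed log from its own checkpoint,
// so an event reaches downstream consumers if and only if it was durably
// recorded. If publishing falls behind, events are delayed, never lost.
let mutable checkpoint = 0
let publishCommitted (publish: OrderEvent -> unit) =
    let committed = store.ToArray()
    for i in checkpoint .. committed.Length - 1 do
        publish committed.[i]
    checkpoint <- committed.Length
```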
Scott Havens (27:13): So by applying these principles, we've established a pattern for designing systems that looks like this: we receive immutable messages over Kafka that are consumed by a microservice running stateless domain logic, which emits immutable events into a data stream. The events are then published downstream to any consumers, over Kafka again. Scott Havens (27:36): But we're not done yet. When we're employing immutability and purity, we can take advantage of the third principle and replace real-time compute with data lookup wherever feasible. When you know the set of possible inputs in advance, or you've seen specific inputs before, you can replace the often expensive runtime computation with a pre-computed cache of the result. Scott Havens (28:01): For instance, in event sourcing, if you try summing over the first 1,000 events in a stream more than once, you'll get the same result every time. Particularly for long-running streams that are millions of events long, it makes sense to save a snapshot and use that as your starting point next time, instead of retrieving and summing the entire stream. This costs you a very small amount of storage for the snapshot. And congratulations, you've just exchanged a computation for data. And that gives us a final pattern for system design that looks like this. Scott Havens (28:33): What has changed from the previous diagram is that we've added a service that consumes the events from the Kafka feed, builds updated stream snapshots, and then updates the cache. Further, we're publishing all of those snapshots via change feed to Kafka as well. Downstream consumers have a choice: they can consume all of the events as they happen, or, if they only care about the latest state, they can consume that feed instead. Scott Havens (28:58): One of the first teams to use this pattern at Walmart was called Panther. Panther is an inventory tracking and reservation management system. On the supply side, it aggregates and tracks all sources of inventory. That includes the Walmart- and Jet-owned warehouses and all partner merchants and their warehouses. And on the demand side, it acts as the source of truth for reservations against the available inventory at those warehouses. Scott Havens (29:23): When a customer is checking out, the contents of their cart are reserved to make sure that no one else will order them. If there is only one left, that's pretty important. If the inventory is not available at that point, the reservation fails, and the items must either be resourced from a different location or different items must be selected. Scott Havens (29:41): The primary goals of Panther were to maximize onsite availability while minimizing order reject rates due to lack of inventory. There were a lot of secondary goals as well, like improving the customer experience by reserving inventory early in the order pipeline. Scott Havens (30:03): We wanted to enhance insights for the marketing and operations teams by providing more historical data and better analytics, and we wanted to unify inventory management responsibilities typically spread across multiple systems. Of course, along with these business goals, our solution had a lot of non-functional goals, like high availability, geo-redundancy, and fast performance backed by SLAs. We found a lot of success with this architecture. The entire team, started by a single engineer in July 2016, had only three team members when Panther went into production by Black Friday that same year. That's only five months later. After one year, the team still only needed five engineers. Once in production, we found it very easy to add features. With inventory tracking, staleness of data is an issue. Simply put, if a merchant last told us their inventory months ago, there is no way we would trust that. So we wanted to implement a feature that expires the merchant updates after a certain amount of time, just zeroes them out. Scott Havens (30:57): The results were immediate. We cut our third-party reject rate in half, from 0.8% to 0.4%. And what's great about this is that it was done by a single engineer who was new to the company, with light F# training and no cloud microservice background, who went from design to production in three weeks. So with the success of Panther, we started rebuilding a number of our supply chain systems following the same principles and patterns. But we didn't stop there. I want to revisit my ugly diagram from earlier. This is the one with all the nested synchronous API calls. We looked at this mess of dependent services. There may be dozens of teams and a lot of deployments on heterogeneous stacks on thousands of servers. But as far as the front-end shopping site is concerned, looking up the availability of Jane's dress may as well be a function that calls other functions that call other functions and eventually returns a single end result. Scott Havens (31:57): This is a call graph. It's code. It's distributed, unreliable, stupidly expensive code, but it's code. And if we remember the duality of code and data, there's a way to exchange that code for data. Maybe that data will be more reliable and less expensive than this monstrosity. It turns out it is. The systems modeled after the Panther architecture all stream events and state changes as messages over Kafka. We can use these message streams to invert the dependencies. Instead of the dependent service pulling its needed inputs in real time, the source system can push the data changes. Scott Havens (32:37): The dependent service consumes these changes as they happen, updates its own state accordingly, and pushes its own changes downstream. We can convert it from a primarily synchronous service-oriented architecture to a primarily event-driven architecture. All of the same item availability factors are represented in this diagram, but now almost all of them are hooked up asynchronously. Messages are flowing in this diagram from left to right. We're trading the real-time computations, the real-time calls, for pre-computed data throughout the supply chain systems. How does that affect the hot path, the moment that Jane looks for her dress? Well, a moment ago, I showed you the SOA model. I highlighted that hot path in red. Let's look carefully to see what it looks like here. That's it. To achieve this same SLA, we need only two service calls, not 23, both of which need only four nines of uptime, not five, and only a 150 millisecond SLO. All of the event-driven systems still need uptime and processing-time SLOs, but they're no longer in any customer hot paths. They are completely asynchronous; three nines of uptime, and end-to-end processing in seconds or even minutes, is sufficient.
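[Editor's note: a sketch of the inversion being described, ours rather than Walmart's code: upstream systems push availability changes, the service keeps its own pre-computed state, and Jane's query becomes a single local lookup. The event shape and names are invented.]

```fsharp
open System.Collections.Generic

// Hypothetical change event pushed by an upstream system.
type AvailabilityChanged = { Sku: string; Available: int }

// Local, pre-computed state, built entirely from the pushed event stream.
let availability = Dictionary<string, int>()

// Asynchronous path: consume changes as they flow in and update state.
// This path needs only modest uptime; it is not in the customer hot path.
let onChange (evt: AvailabilityChanged) =
    availability.[evt.Sku] <- evt.Available

// Hot path: answering Jane is one local lookup, not 23 nested service calls.
let isAvailable (sku: string) =
    match availability.TryGetValue sku with
    | true, qty -> qty > 0
    | _ -> false
```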
Gene Kim (34:01): Gene here. I love the applause at the end. I just want to describe the slide that Scott showed. Basically, it goes from walmart.com to the item availability API to the availability data store. That's two calls. And what's interesting is that the arrows are reversed: it is not the item availability API calling the availability data store. Instead, it is the other way around. We will explore more about that in the next episode. Okay. Back to the presentation. Scott Havens (34:31): So how does all of this affect cost? This is going to vary among organizations, but we can ballpark it. First, an event-driven system with three nines of uptime and latency measured in minutes is about as cheap to operate as any system you're likely to see. If we increase our uptime by 10x to four nines and drop our latency by 400x to 150 milliseconds, for a lot of orgs you're looking at an order of magnitude higher cost. To push your uptime to five nines while tightening the latency even more, for most organizations that is an obscene amount of money. How do the total operational costs compare once you've replaced all of these things with the functional, event-driven approach? Now, I'm not allowed to give you precise numbers, but I can tell you that for walmart.com, the difference is millions of dollars per year. You may have a lot of objections to this. Scott Havens (35:27): You may be thinking, wow, this sounds really great, but there's no way we can do that. Well, let's talk through some of the more common reasons. My dev team isn't skilled enough? Well, I've trained up not just senior devs, but junior, mid-, and senior-level engineers from all kinds of backgrounds, Java, C#, JavaScript, Ruby, Python, and they've all succeeded at this. We don't have the technology to do this? Well, you can follow these principles in any language. And if you're talking about the infrastructure you need, every cloud provider has some kind of infrastructure available for messaging. It'll make your app too complex? Well, that might be true for some systems, if you're only talking about the most basic ones. Gene talked about a system that he built that was really just a toy system, but he wanted to see what you could do to simplify it. Scott Havens (36:18): And we're running a little low on time, so I'm going to walk through this really quickly. He found that it turned out to be a lot more practical than he expected. And the last objections are: it'll cost too much, it'll take too long, it's too dangerous, and it's just too much. I have this enormous, creaky, baling-wire-and-duct-tape spaghetti-code monstrosity. It grew uncontrolled over years, if not decades, and has dozens, hundreds, thousands of people trying to keep it working. Well, I recognize this is a pretty big shift in mindset, but there's an old joke about this. How do you eat an elephant? The answer is: one bite at a time. There are small steps you can take right now to apply these principles, regardless of what your systems look like today. You can identify just one dual write somewhere in all of your systems and figure out a way to eliminate it.
Scott Havens (37:07): Consider using change data capture to do so. You can encourage property-based testing in just one system. Most of your devs won't find it that different from regular unit testing. And you can switch one web service to also publish events. You don't have to fully commit to event sourcing; just publish your changes as they happen. Then switch one consumer to read the events rather than make HTTP calls at runtime. This is a very easy way to bite off a small piece and ensure the safety of the system while you do it. So if you have an architecture that looks like this, and you don't have someone, an architect, who is talking about how to move to something like this, you're doing your organization a grave injustice. Scott Havens (37:49): My mission in life is to reduce the amount of entropy in the universe, or at least our little corner of it. So if you want to help me in this journey, if you want to replicate what we've done, or you have new ideas, here's how to reach me. I'm [email protected] or Scott Havens on Twitter. And I'd be remiss if I didn't say that we are hiring. Thank you very much. Have a great day. Gene Kim (38:15): Okay. I still so much love that talk. And I'm so excited that I got to ask Scott Havens so many questions about specific aspects of what he did, which you will hear about in the next episode. So see you then. Our sponsor today is ServiceNow, and I'm grateful that they have also been a long-time supporter of the DevOps Enterprise Summit, the conference for technology leaders of large, complex organizations. Join us there and visit the ServiceNow booth to see all the ways they're supporting the DevOps enterprise community. If you need to eliminate friction between dev and ops, head to servicenow.com/devops to find out more.



Gene Kim

Gene Kim is a Wall Street Journal bestselling author, researcher, and multiple award-winning CTO. He has been studying high-performing technology organizations since 1999 and was the founder and CTO of Tripwire for 13 years. He is the author of six books, including The Unicorn Project (2019), and co-author of the Shingo Publication Award-winning Accelerate (2018), The DevOps Handbook (2016), and The Phoenix Project (2013). Since 2014, he has been the founder and organizer of DevOps Enterprise Summit, studying the technology transformations of large, complex organizations.
