Gene Kim (00:00:00):
Welcome back to the Ideal Cast. I'm your host, Gene Kim. Today we have on Scott Havens, currently director, head of Wayfair Fulfillment Network Engineering. This episode is made possible with support from ServiceNow. Take the risk out of going fast. If you need to eliminate friction between dev and ops, head to servicenow.com/devops to find out more. You're listening to the Ideal Cast with Gene Kim, brought to you by IT Revolution. I am so excited that my guest is Scott Havens, who is currently director, head of Wayfair Fulfillment Network Engineering. So in the previous episode, you got to hear him give one of my favorite presentations of all time. It's a presentation he did at the 2019 DevOps Enterprise Summit in Las Vegas, where he talked about how e-commerce systems work and what he did to massively simplify the systems that powered Walmart while also making them more testable, reliable, cheaper to operate, and easier to change. He described the work that he did as director of software engineering at jet.com and later Walmart Labs. His remit was to rebuild the entire inventory management systems for Walmart, the world's largest company. He earned this right by the amazing work he did rebuilding the incredible systems that power jet.com. These systems powered inventory management, order management, transportation, available to promise, available to ship, and tons of other critical processes that all must go right at an online retailer.
Gene Kim (00:01:35):
In this episode, we will learn more about his views on what makes great architecture great, the gruesome details on what happens when an API call requires 23 other deeply nested synchronous remote procedure calls to return a correct answer, how one actually implements event sourcing patterns at Walmart scale, and the functional programming principles that it depends upon, the challenges of managing inventory at Walmart, which is a vast supply chain in its own right. And just how much category theory do you actually need to know to do functional programming? This was such a fun interview for me, because as much as I learned from his DevOps Enterprise talk, I learned a ton more from his amazing explanations. Okay, Scott, I've described you in my words. Can you describe yourself in your own words and describe what you've been working on these days?
Scott Havens (00:02:32):
My name is Scott Havens. Thank you very much for that introduction. I am a huge fan of functional programming principles and how you can use them to build enormous systems operating at scales that surpass anything that you might actually need to see in the real world. This often means a switch to asynchronous thinking, which is a big change for a lot of people from a synchronous mindset. And it's a switch from an object-oriented mindset to an immutability-first mindset, a purity-first mindset, and being able to think about the duality of code and data in your systems. These have huge practical benefits. They're not just theoretical. As you're building these systems, you actually see the practical advantages in terms of latency, in terms of observability in systems, and being able to prove that your systems are doing what you want them to do.
Gene Kim (00:03:30):
It almost seems that you're minimizing your achievements just by pigeonholing it into just functional programming. You have a drastically different view of the world in terms of where the architecture should reside as well, as evidenced through the talk that you did about your work at Walmart and your later modus operandi. Can you extend what you just said and incorporate how you view architecture fitting in as well?
Scott Havens (00:03:53):
The point of any architecture is to build a collection of systems that you understand well, and that solve the business domain problem at hand. I like to think about what end goal you're actually trying to reach. Think about if you're in retail, what is the customer actually trying to accomplish? Work on making it as easy as possible for the customer to accomplish that and pull out of the hot path all of the work that needs to be done to get there in the first place. That means making as much processing as possible asynchronous, done in an event-driven manner where you can map out all of the domain events over a period of time without making the customer wait for you to solve all these problems in real time.
Gene Kim (00:04:46):
Wow. But yeah, so let's make this concrete and maybe make sure that everyone fully appreciates the awe in which I view your work. I recently watched your DevOps Enterprise Summit talk from Vegas 2019. And as I mentioned to you before, it is just as breathtaking now as it was listening to it a year ago. And if I took notes correctly, there was one particular example that I found utterly riveting. You specifically talked about a specific capability that most e-commerce sites need to provide, which is, for a given product, is the item available for me to order?
Gene Kim (00:05:21):
And you described how, in the Walmart context, rendering that information required 23 separate service API calls, and each one of these 23 service API calls would have to respond in 50 milliseconds. They would have to have five nines of availability in order for the customer to get a response within a third of a second. And any one of these services going down would potentially take down the ability to tell the customer anything at all. Can you tell me about why that's so bad and maybe how it offends you both in the realm of functional programming principles, as well as the architectural principles that you hold so dear?
Scott Havens (00:05:58):
Sure. Customer-facing systems need to have an enormous uptime. And especially in a retail environment, the next retailer is just a click away. So you need to have your systems up and running all the time with minimum latency. Every click the customer does needs to return results as quickly as possible without any failures. That means in the real world, you're talking four nines of uptime, more or less, with really low latencies. With anything over 300 milliseconds, the longer the customer has to wait, the more likely they are just to give up and switch to a competitor. It's really expensive to have systems running with that kind of uptime and that kind of latency. You need to have staff on hand all the time, dealing with all of these redundant systems, making sure that they don't break, or if they do, they get fixed as quickly as possible, 24 hours around the clock.
Scott Havens (00:06:55):
When these systems then have to call other systems and those systems then call other systems and call other systems all in real time, that means that every single one of these supporting systems, every one of these supporting services needs to have even better uptime because the effects of downtime are multiplicative. If any one of those systems is down, then the whole system is down. And that ends up being really expensive. If you have call after call after call, every single one of these is running at five nines uptime, it's incredibly expensive. If you can get all of those systems out of the hot path, then you don't have to be supporting all of these additional systems at such an expensive level. If it's okay that they go down once in a while, you don't have to staff them 24/7. You don't have to design them in a way that they can survive a nuclear blast on one side of the country, and they're still operating out of the other.
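As a rough illustration of the multiplicative effect Scott describes, the arithmetic can be sketched in Python. This is a minimal sketch that assumes the services fail independently and that every one of them must be up for a request to succeed; the function names are hypothetical and the figures are the ones quoted in the conversation (23 services, five nines each).

```python
# Hypothetical illustration, assuming independent failures and that every
# service in the synchronous chain must be up for the request to succeed.

MINUTES_PER_YEAR = 365 * 24 * 60


def chain_availability(per_service: float, n_services: int) -> float:
    """Availability of a request that needs all n services to be up."""
    return per_service ** n_services


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes per year during which requests fail."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = 0.99999
combined = chain_availability(five_nines, 23)

print(f"single service downtime: {downtime_minutes_per_year(five_nines):.1f} min/year")
print(f"23-service chain availability: {combined:.5f}")
print(f"23-service chain downtime: {downtime_minutes_per_year(combined):.1f} min/year")
```

Under these assumptions, 23 services that each deliver five nines combine to a little under four nines for the customer, which is why each dependency has to be run at a much higher (and more expensive) level than the customer-facing target.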
Scott Havens (00:08:02):
The key to doing that is switching these from synchronous service calls that are RPC type requests to event driven systems, where as they're processing their own changes, they will push via messages any of the relevant changes to the customer facing system. If some of those updates happen a few seconds late or a few minutes late, it doesn't matter because the customer facing site will still continue running that whole time. So you're taking all of the complicated computations out of the hot path and letting them run at a much lower service level for much cheaper.
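The shift from synchronous calls to pushed events can be sketched as follows. This is a minimal sketch, not the system Scott built: the event type, class, and field names are all hypothetical, and a plain method call stands in for a real message consumer. The point it shows is that the customer-facing lookup reads only local state, so it keeps answering even if the upstream systems are slow or down.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class AvailabilityChanged:
    """Hypothetical event pushed by an upstream system after it processes a change."""
    sku: str
    warehouse: str
    available_qty: int


class AvailabilityView:
    """Customer-facing read model. Queries touch local state only."""

    def __init__(self) -> None:
        self._qty: Dict[Tuple[str, str], int] = {}

    def apply(self, event: AvailabilityChanged) -> None:
        # Called whenever a message arrives, possibly seconds or minutes late.
        self._qty[(event.sku, event.warehouse)] = event.available_qty

    def is_available(self, sku: str, warehouse: str) -> bool:
        # Hot path: a single key-value lookup, no remote calls.
        return self._qty.get((sku, warehouse), 0) > 0


view = AvailabilityView()
view.apply(AvailabilityChanged(sku="SKU-1", warehouse="W1", available_qty=3))
print(view.is_available("SKU-1", "W1"))
print(view.is_available("SKU-2", "W1"))
```

If an update arrives late, the view simply serves slightly stale data in the meantime, which is the trade-off described above.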
Gene Kim (00:08:44):
And so if I'm a customer going to a product page, I'll see all the product details, I'll see the add to shopping cart button, but then I'll also see X number in stock. So when you talk about item availability, it's that, right? In other words, it's available in stock and not only is it available in stock, but it's available to ship to me. Can you give us a sense of what are those 23 API calls required to render that simple piece of information?
Scott Havens (00:09:10):
The obvious ones are that you need to know that an item is in stock in a warehouse, and you need to know how many reservations from other orders are against that already. But then you need to know, is that warehouse open? Is that warehouse eligible to ship to your particular location? Maybe it's located in a different country. Maybe that warehouse is going to be down for some kind of physical maintenance. Maybe the item isn't eligible to be sold on a certain site because of some kind of agreement with the manufacturer.
Gene Kim (00:09:44):
Yeah, brother, I'm getting stressed out just even thinking about it. I've never worked in a retail system. So the one that I found particularly stressful sounding is the reservation one. If I place an order in my shopping cart, I might not ever hit ship. So that reservation is no longer valid, am I understanding correct?
Scott Havens (00:09:59):
There are different kinds of reservations you can do in the retail world. We've looked at whether you want to reserve as soon as you are purchasing the item, or if you want to reserve earlier, maybe on add to cart. It turns out that in the real world, we don't actually want to reserve on add to cart very often because it will just get stuck in a cart. People wouldn't necessarily buy it and we don't care if that customer buys it versus a different customer buying it. It's going to be the same money for us one way or the other. What we do care about is the customer experience and making sure the customer is never told that, yes, you definitely are going to be able to buy this, and then, after they've paid for it, they get told later, oh, turns out we actually sold it to someone else. Or, oh, no, turns out that, for whatever reason, we thought we had it to give to you, and we don't actually.
Gene Kim (00:10:51):
So the reservation is that period of time between when I submit the order and when it is on the way to me, is that right?
Scott Havens (00:10:57):
Gene Kim (00:10:59):
So the windows are even smaller.
Scott Havens (00:11:01):
It becomes a daunting technical problem when you look at days like Black Friday, where you might have a special on a Nintendo Switch, you have perhaps 100,000 of them in stock, and you'll have 500,000 people all hitting the site at the exact same time at midnight, when that deal comes on, all trying to add to cart and check out at the exact same time within a minute. So you have to guarantee that those first 100,000 customers will get that item and be told that they get it. But the 100,001st customer is told, nope, sorry, you can't do that.
Gene Kim (00:11:40):
Interesting. Sorry. It strikes me that this is a particularly hard part of the ordering domain.
Scott Havens (00:11:47):
It is. This is one of the only areas of anything in e-commerce that needs to be strictly serialized. It means that every single customer could be interfering with every other single customer simultaneously. Every single request needs to have a strict order behind it. You need to know that this request came before this one and comes after the previous one. And that ties in really closely with event-sourced systems, where you want to have all of the events in a strict order where you know that this is the thousandth event, this is the thousand-and-first, et cetera.
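The strict ordering Scott describes can be sketched as a single sequential pass over reservation requests. This is a minimal sketch under stated assumptions: one item, one unit per customer, and a single in-process loop standing in for whatever actually serializes requests in production. The event type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReservationEvent:
    """Hypothetical event recording the outcome of one reservation request."""
    sequence: int      # strict position in the event log: 1, 2, 3, ...
    customer_id: str
    accepted: bool


def process_reservations(stock: int, customer_ids: List[str]) -> List[ReservationEvent]:
    """Handle requests strictly one at a time, in arrival order."""
    events: List[ReservationEvent] = []
    remaining = stock
    for sequence, customer_id in enumerate(customer_ids, start=1):
        accepted = remaining > 0
        if accepted:
            remaining -= 1
        events.append(ReservationEvent(sequence, customer_id, accepted))
    return events


# Three units in stock, five customers arriving in this order.
log = process_reservations(3, ["c1", "c2", "c3", "c4", "c5"])
for event in log:
    print(event.sequence, event.customer_id, "reserved" if event.accepted else "sold out")
```

Because every request gets a unique sequence number, the log itself states which customer came before which, and replaying it always yields the same outcome.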
Gene Kim (00:12:25):
So can we explore what it might feel like to be part of one of those 23 teams? Some of the things I heard was you have to over provision capacity, you have to be on call 24 hours a day just because you are in the hot path. Specifically, it's difficult to test things, but I have to imagine it's also difficult to implement things because of the interdependent nature of the systems. Can you talk about that?
Scott Havens (00:12:47):
When you have one of these systems that's providing one dimension of availability, whether your system is responsible for saying whether the warehouse is open on certain days, or your system is deciding what kinds of carriers and carrier methods are possible for transporting the goods, if you're building a system that is designed to be called synchronously from some availability API, it means that your own system has to be up all of the time, running in production with no downtime, no errors. And that makes it very hard to build new features behind an existing API, to roll out new features, and to test these new features in production. One of the things that has become very clear to me, dealing with these massive distributed systems, is that you can never fully test something before it gets into production. The scale is just not going to be there in whatever kind of staging environment, and the right kinds of inputs are never going to be there in your staging and testing environment.
Scott Havens (00:13:56):
So when you have a synchronous system that has to respond to all of these requests in real time, adding a new feature and deploying a new feature is... There's a lot of overhead to it. You need to deploy in parallel. You need to do a little bit of canary where maybe 5% of traffic gets rolled over. You need to be very much on the lookout for any errors and be ready to roll back the instant there are any errors. The rollout process ends up having just a lot of overhead, or to get rid of that overhead takes a lot of automation upfront to get that deploy process.
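One common way to send a small, fixed share of traffic to a new version is to bucket requests by a hash of a stable identifier. The sketch below is an assumption about how such routing could look, not a description of the tooling used at Walmart or jet.com; the function names and the choice of SHA-256 are illustrative.

```python
import hashlib


def canary_bucket(request_id: str) -> int:
    """Map a request (or customer) id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(request_id: str, canary_percent: int = 5) -> str:
    """Send roughly canary_percent of ids to the new version, the rest to stable."""
    return "canary" if canary_bucket(request_id) < canary_percent else "stable"


# Rolling back is just setting the percentage to zero.
sample = [f"req-{i}" for i in range(10_000)]
canary_share = sum(route(r) == "canary" for r in sample) / len(sample)
print(f"share routed to canary: {canary_share:.1%}")
```

Hashing keeps the decision deterministic, so the same id always lands on the same version while the canary is being watched for errors.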
Gene Kim (00:14:36):
Just so I understand it correctly. So this is now speaking to just because you're handling live traffic all the time. There's just a lot of infrastructure required to make sure that you can do controlled deployments so that if there is something bad happening, you don't take out 100% of your traffic or introduce errors into all of the incoming requests.
Scott Havens (00:14:55):
Exactly. Safety is very difficult to guarantee when you're dealing with live synchronous systems. You have very strict performance requirements as well, performance and scalability requirements that have to be met upfront. Every feature that you are writing, you have to make sure that adding on that feature, it doesn't bump your service level outside of your SLA. If it adds 50 milliseconds on, the feature may work, but you're now out of your SLA. And it's hard to test that in advance as well.
Gene Kim (00:15:31):
So can you talk about maybe the correctness aspects? I mean, is it more difficult to actually establish the correctness of any new code?
Scott Havens (00:15:40):
It's not necessarily difficult within a given system to test the correctness. However, with a live system, you don't know what the dependent system is going to do with your response. For better or for worse, contracts between systems nowadays are usually very loose. There's some kind of JSON-based RPC, or even if it's something that's a little bit more strongly typed, like Protobuf, you're still going to have some categories of errors that aren't being caught by your type system, so you never know what your downstream system is going to do with your new feature. So correctness within your own domain, great. But correctness further down, you can't really predict that, and you may blow up a live system because of that.
Gene Kim (00:16:27):
So if we don't want to do that, then now we have to start coordinating with other teams and the dependent systems to make sure that any changes I make don't accidentally blow up a nearby system or a distant part of the system.
Scott Havens (00:16:42):
Yeah. That ends up being a massive level of coordination because you aren't just coordinating with your single downstream system, or upstream, depending on your perspective. It ends up being the entire chain of synchronous calls all the way up to the customer-facing API that all have to be tested in concert to make sure that the customer is still seeing the right thing when you roll out your change, five calls down.
Gene Kim (00:17:09):
And so I'm kind of assuming that this means in order to do this, now you have to test it in the presence of all the other systems in the... I'm assuming it'd be some sort of huge integrated test environment where all the components are stood up so that you can somehow test the system as a whole?
Scott Havens (00:17:26):
That is one way to approach it. But in a distributed system, like I said earlier and alluded to, you're still not going to get the right kind of scale, the right distribution of inputs. It ends up being a testing in production that is coordinated across all of these teams and just crossing fingers, hoping that that ends up working right.
Gene Kim (00:17:47):
I have to imagine if you're going to roll out something new in production, there's going to be a lot of coordination to make sure that everyone else knows what we're doing at the same time. Is that right?
Scott Havens (00:17:57):
You ended up seeing that. You have your schedules where team A is allowed to test on this day. You have a two hour window that everyone else has to be aware of and work around. And team B is allowed to test in the next two hour window. And it really slows down your release cadence because of the amount of coordination overhead across all of these teams. You're not able to act as independent teams like you'd want. It is all really one big team that has these coordination problems.
Gene Kim (00:18:26):
Our sponsor today is ServiceNow. And I'm grateful that they have also been a long time supporter of the DevOps Enterprise Summit, the conference for technology leaders of large complex organizations. Join us there and visit the ServiceNow booth to see all the ways they're supporting the DevOps enterprise community. ServiceNow connects the work going on in DevOps pipelines to the operational work already being managed in ServiceNow to create a seamless governance layer for delivering apps using DevOps at high velocity. For organizations that need the assurance of change control, but want to avoid slowing down developers, ServiceNow can completely automate the gathering of information from the pipeline, the creation of change requests, and the approval of those changes. This eliminates admin work and keeps developers in the tools they know and love while improving the quality and simplifying any future audit activities.
Gene Kim (00:19:18):
ServiceNow also uses this connection to add value stream management to DevOps, joining teams and information from ideas to value realization, so that they can gain insights into what's happening and understand how value is being delivered. In fact, ServiceNow was named a leader in the Forrester Wave for value stream management solutions. If you need to eliminate friction between dev and ops, head to servicenow.com/devops to find out more.
Gene Kim (00:19:45):
So what were the effects of all this inadvertent coupling? Was there finger pointing and blame? Was it fear? Was it inability to make changes as quickly as you want? I mean, if you were to sort of project, what were the behaviors that you would point to as very unhelpful?
Scott Havens (00:20:03):
One of the results was that the time from writing a feature to the time that it was not just deployed in production, but active and affecting the customer experience, was far higher than it needed to be. We had to wait for every single team to be able to test a given feature simultaneously. If you're doing that in order, you were having to coordinate all of these teams on every release to do all the testing through the entire service call graph before you're able to move on to releasing the next feature, even if it had been done weeks or months ago. So that was one effect.
Scott Havens (00:20:42):
Two is kind of the corollary to that. And because people didn't want to wait as long, they would shove more and more features into each release that became much larger releases instead of every few hours or however often they wanted to, it became weekly releases or semi-weekly releases. And that introduced more risk into the system that every change was that much more difficult to test and much more risky and likely to cause a problem. It required rollbacks.
Scott Havens (00:21:16):
All of this added together to mean that we ended up having to do windows where once every few weeks to every couple of months, depending on the set of systems, in the middle of the night, you would do your deploy, hope that nothing broke, run through all of your tests. And if everything's okay, then you make that deploy active. Otherwise you have to roll everything back and maybe try again next week. It's the exact opposite of why you want it or one of the reasons you wanted to have services in the first place is to decouple all of that testing, all of that deployment. It really only got you part of the way in that you're able to develop the features independently, but having code sitting in main doesn't help the customer. It only helps the customer once it's actually in production and working.
Scott Havens (00:22:07):
It meant that when anything went wrong in production, there would be a single call that multiple representatives of every single team would be on. No one really knows what the problem is. And it's really difficult to get anything done when you have a hundred people on the same call trying to solve what the problem might be.
Gene Kim (00:22:28):
So I love the word sort. Okay. Brilliant. Despite all of the decomposition that was done. Okay, awesome. So actually, I listened to your talk twice before it really sunk in what you did. I certainly understood, I think on the first listening when I watched your presentation at Microsoft Build, that you were able to, in a breathtaking way, reduce the number of calls from 23 service API calls down to two. But what I didn't fully appreciate was that you were actually then precomputing all of those results in advance, so that those two service API calls were actually lookups, key value lookups. So the simplest type of database access that you can come up with. Am I understanding correctly that this was really you precomputing for every item, warehouse, zip code, and so forth, right? Is it available? Did I capture that correctly?
Scott Havens (00:23:27):
Exactly. One of the principles of functional programming that I've taken from the low level code and applied to systems as a whole is the duality of code and data. All of these calls are some kind of computation. If you know the domain of all the possible inputs in advance, you can precompute for all of those dimensions, all possible outputs and store those in your massive key value store. And because you're treating it as an actual function, the entire system, you get all of the nice benefits of what you could see in low level functions. Things like being able to curry your systems. When you look at all of your different inputs, maybe some of them are fairly simple and you have a small range of output possibilities, but maybe some the cardinality is just too high to be able to precompute everything.
Scott Havens (00:24:29):
For example, if you're trying to compute whether you're able to ship something and how much it's going to cost, you might include as an example, your source warehouse, maybe 12 of them, your destination zip code, 44,000, a handful of carriers, handful of carrier methods. All of these are minimal in terms of their cardinality, but then you look at your weight as an input. And if you're talking like say a 10th of an ounce to 1,000 pounds, that's 160,000 different possibilities for that input.
Scott Havens (00:25:04):
But because you've decided to treat the entire system as a function, you can use the concept of currying, being that you can take a function of multiple inputs and break it down into multiple functions of just one input or a handful of inputs. So this whole system, you could say, I want to precompute all of the inputs for the warehouse, the zip code, the carrier methods. And that'll give you maybe 10 million, which is pretty reasonable for your key value store and a separate function or a separate system that then takes one of those key value look-ups and only uses that and the weight to produce your final value.
Scott Havens (00:25:51):
In functional programming, if your functions are actually pure, it means that for any arbitrary set of input parameters for your function, you can split your function up using any combination of those input parameters and then compose the resulting functions to get the exact same result as you would have before.
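The split Scott describes, precomputing over the low-cardinality inputs and leaving weight to a second step, can be sketched like this. It is a minimal sketch under loud assumptions: the warehouses, zip codes, rates, and the pricing formula are all made up for illustration, and the real computation is certainly more involved. What it shows is that, because the pricing function is pure, a table lookup followed by a small function of weight gives exactly the same answer as the original four-input function.

```python
from itertools import product
from typing import Dict, Tuple

# Hypothetical low-cardinality inputs (the real sets are much larger).
WAREHOUSES = ["W1", "W2"]
ZIP_CODES = ["07030", "98101"]
CARRIER_METHODS = ["ground", "air"]

# Made-up rates, purely for illustration.
METHOD_RATE = {"ground": 0.05, "air": 0.20}
ZONE_SURCHARGE = {
    ("W1", "07030"): 0.00, ("W1", "98101"): 0.03,
    ("W2", "07030"): 0.03, ("W2", "98101"): 0.00,
}

Key = Tuple[str, str, str]


def rate_per_ounce(warehouse: str, zip_code: str, method: str) -> float:
    """Pure function of the low-cardinality inputs only."""
    return METHOD_RATE[method] + ZONE_SURCHARGE[(warehouse, zip_code)]


def shipping_cost(warehouse: str, zip_code: str, method: str, weight_oz: float) -> float:
    """The original function of all four inputs."""
    return rate_per_ounce(warehouse, zip_code, method) * weight_oz


# Step 1: precompute every combination of the low-cardinality inputs.
RATE_TABLE: Dict[Key, float] = {
    key: rate_per_ounce(*key)
    for key in product(WAREHOUSES, ZIP_CODES, CARRIER_METHODS)
}


def cost_from_table(key: Key, weight_oz: float) -> float:
    """Step 2: one key-value lookup, then a small function of weight."""
    return RATE_TABLE[key] * weight_oz


key = ("W1", "98101", "air")
print(len(RATE_TABLE))
print(cost_from_table(key, 32.0) == shipping_cost(*key, 32.0))
```

The table only grows with the product of the small input sets, while weight, the high-cardinality input, never has to be enumerated.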
Gene Kim (00:26:13):
Gene here. I want to mention just how incredible it is for me to be able to finally more fully understand how e-commerce actually works, whether it is the front end ordering processes or the backend inventory management and reservation processes. These are things I've never fully understood, even though I am a pretty heavy consumer of e-commerce properties and even more so since the global pandemic has reduced the ability for physical commerce. I think this is such a great talk to follow Dr. Steven Spear on the physical supply chains, as well as Dr. Gail Murphy on software architecture and software supply chains.
Gene Kim (00:26:48):
By the way, what I do know, I am so grateful to both Ron Forrester and Courtney Kissler back when they were both at Nike, who let me visit them during their Air Jordan 11 shoe launches. These are the high heat e-commerce launches, where they sold tens of thousands of shoes in minutes. The Black Friday launches in The Unicorn Project were very much modeled after what I got to see there.
Gene Kim (00:27:11):
Okay. My first elaboration is about currying. I thought it was so cool that Scott brought this up because it was so surprising and illuminating. So to put this into context, currying is often talked about in functional programming alongside the notion of function composition. The name comes from Dr. Haskell Curry, after whom the Haskell programming language is named. He was an American mathematician and logician born in 1900, who got his PhD in mathematics. So let's talk about function composition first. It basically states that if you have a function F that takes an A and returns B and a function G which takes that B and returns C, then you can substitute functions F and G with function H, which takes that A and returns C. So you might recognize this as the property of associativity. You can replace that with what is often called F.G.
Gene Kim (00:28:06):
So here is a concrete example from the Wikipedia entry on function composition. Suppose you have a function that returns an airplane's altitude at time T. Call that function A that accepts T. And you have a second function that returns the air pressure at that given altitude. Call that function P that accepts X. Then you can return one function P.A that takes a time, and returns the pressure on the plane at that time T. It's kind of hard to describe these very succinct equations in audio.
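Since the equations are hard to convey in audio, here is the airplane example written out. This is a minimal sketch: the climb rate and the pressure formula (a simplified exponential model with an assumed scale height of 8,500 meters) are illustrative stand-ins, and the function names are hypothetical.

```python
import math
from typing import Callable, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")


def compose(g: Callable[[B], C], f: Callable[[A], B]) -> Callable[[A], C]:
    """Return the function x -> g(f(x))."""
    return lambda x: g(f(x))


def altitude_at(t_seconds: float) -> float:
    """a(t): altitude in meters, assuming a steady 5 m/s climb (illustrative)."""
    return 5.0 * t_seconds


def pressure_at(altitude_m: float) -> float:
    """p(x): pressure in pascals, simplified exponential model (illustrative)."""
    return 101_325.0 * math.exp(-altitude_m / 8_500.0)


# p composed with a: takes a time, returns the pressure around the plane.
pressure_at_time = compose(pressure_at, altitude_at)

print(pressure_at_time(600.0) == pressure_at(altitude_at(600.0)))
```

The composed function hides the intermediate altitude entirely, which is the same move as replacing a chain of service calls with one function from the original input to the final answer.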
Gene Kim (00:28:39):
So this is very close to the concept of currying. According to the Wikipedia page for currying, suppose you have a function that takes three arguments, A, B, and C, that returns X. You can curry that function into three unary functions, X, Y, and Z, which can all be composed together to return that same value as that function F. So to make this very concrete, you've heard Scott talk about how it initially required 23 deeply nested API calls to determine item availability.
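Currying a three-argument function can be written in a few lines. This is a minimal sketch with a hypothetical example function (`volume`); the helper `curry3` is defined here for illustration and is not a standard library function.

```python
from typing import Callable


def volume(length: float, width: float, height: float) -> float:
    """An ordinary function of three arguments (illustrative)."""
    return length * width * height


def curry3(f: Callable[[float, float, float], float]):
    """Turn f(a, b, c) into a chain of one-argument functions: f(a)(b)(c)."""
    return lambda a: lambda b: lambda c: f(a, b, c)


curried_volume = curry3(volume)

# Fixing the first two arguments yields a new one-argument function,
# which can be stored and reused later.
base_2_by_3 = curried_volume(2.0)(3.0)

print(base_2_by_3(4.0) == volume(2.0, 3.0, 4.0))
```

The intermediate function `base_2_by_3` plays the same role as a precomputed table entry: the early arguments have already been applied, and only the last one is still needed.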
Gene Kim (00:29:09):
He's saying that there's a way that you can treat most of these API calls as a pure function, which allows you to curry them, which allows him to precompute the values of the entire function. So to talk about how he exactly did that, let's go to number two, which is the concept of cardinality. So if I understand correctly, we can loosely interpret the term cardinality as the number of elements in a domain or the size of that set. So Scott is saying that the item availability function takes as input product SKUs, warehouse, carrier, zip code, and weight, and returns how much does it cost to ship it? And how long will it take to get there?
Gene Kim (00:29:52):
So the cardinalities of the inputs are as follows. The number of product SKUs is say 75 million. The cardinality of warehouses is about 12. The cardinality of zip codes is about 43,000. The number of carriers is about a dozen. The cardinality of weight is about 160,000. So imagine creating a curried version of the function for each argument except for weight and the product SKUs. Those are the inputs where the cardinality is too high, because it would cause the number of entries in that lookup table to be too large. The size of the lookup table if you included product SKUs and weight would be 7.4 times 10 to the 19th power. If you omitted weight, but left in product SKUs, you would get 464 trillion. And if you omit both the product SKU and the weight, you have what seems to be a very manageable 6 million entries. I think this is so amazing. Scott has described how you can go from 23 synchronous API calls, where all the computation is being done in line, to two function calls where almost all the computation has already been performed and you merely need-
Gene Kim (00:31:03):
... to look up the value in a table with about 6 million entries in it. This is such an amazing and practical way to solve a real business problem using both function composition and currying, super cool. Number three, Scott talks about the coupling that occurs when teams are not truly able to test independently, even though the services behind each API may be decoupled from each other from a code architecture perspective. I think a wonderful definition of a loosely coupled architecture is one where teams are able to independently develop, test and deploy value to their customers. Scott is describing the conditions when you are able to develop independently, however, are not able to test or deploy independently. And he describes the horrible things that happen as a result. You have larger batch sizes where you are able to deploy only on fixed cadences. And because the deployments may not work every time, you then end up with more deployed the next time around. Scott described brilliantly what happens when you have a high coordination cost. That means everyone has to be on the lookout for problems and only a certain number, or maybe only even one team can deploy in a given period. I love that if you miss that window, you have to jam more features into the test, which reduces the likelihood of everything working, leading to even worse outcomes. And because those outcomes are so problematic, you now have to do deployments in the middle of the night. Okay, back to the interview. So, let's talk about those 23 service teams. How did their world change because of this new architecture?
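The lookup-table sizes quoted in the aside above can be reproduced with a few lines of arithmetic. This is a minimal sketch that uses only the approximate cardinalities Gene states (75 million SKUs, 12 warehouses, 43,000 zip codes, a dozen carriers, 160,000 weight steps); the dictionary keys and the helper name are hypothetical.

```python
from math import prod

# Approximate cardinalities as quoted in the conversation.
CARDINALITY = {
    "sku": 75_000_000,
    "warehouse": 12,
    "zip_code": 43_000,
    "carrier": 12,
    # A tenth of an ounce up to 1,000 pounds: 1,000 lb * 16 oz * 10 steps per oz.
    "weight": 1_000 * 16 * 10,
}


def table_size(*excluded: str) -> int:
    """Number of precomputed entries when the named inputs are left out of the key."""
    return prod(n for name, n in CARDINALITY.items() if name not in excluded)


print(f"all inputs:          {table_size():.2e}")
print(f"without weight:      {table_size('weight'):,}")
print(f"without sku, weight: {table_size('sku', 'weight'):,}")
```

Leaving the two high-cardinality inputs out of the key is what brings the table down from an impossible size to a few million rows.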
Scott Havens (00:32:41):
By switching from the synchronous calls to totally asynchronous, they were able to focus more on making sure that they were just doing the right kind of computations with scalability that they could handle on their own without being beholden to a customer-facing API, and roll out changes as they're ready to go out without coordinating testing with everyone else. Instead of having a huge redundant cluster, it is the minimal amount of services that you need just to process what's happening on a regular basis. And if it starts to fall behind, you can scale at that point, but there's always some kind of buffer in place from the messages that they're reading from Kafka or whatever their other message queue is, and it allows for a lot more flexibility in their performance. If you make a mistake deploying, if it breaks something, that's okay. You roll it back and try again or identify what the problem was.
Scott Havens (00:33:43):
If something rolls out and it breaks everything, again that's okay. It's not going to cause problems until a large amount of time has passed and things get too delayed. If you're down for five minutes, 10 minutes, 20 minutes, usually that's not going to be a problem. So, you're able to do a lot more deploys because you're comfortable that if there is any problem, no big deal. Roll back and no one will even notice. Third, you're able to add more features and adjust your contract as you want. Because all of this is happening asynchronously, you can put your outputs into a new stream. The downstream systems that care about that data will be able to look at these new events, these new messages, whenever they get to it, when they're ready to switch over to your new output format that is providing them more data or whatever changes you want to make. You're not dependent on the downstream team switching over.
Scott Havens (00:34:45):
You can just go ahead and make those changes or those new features and they can take advantage of it when they're ready. This is an example of testing in production in a way that is safe. You're always deploying your services into production and they're running in parallel with your existing ones. Any data that you're outputting can go into a parallel stream. Again, it's happening very safely because no one's consuming it yet. And you can check your data at that point, make sure that everything that's happening is hitting all of your SLAs, that all of the data being output is correct as far as all of your testing would be concerned, but it's in the production environment. It's getting the full set of all of the possible inputs and running at the scale that is going to be necessary for it to work correctly. It's just that the downstream teams will be able to switch over to the new stream at their leisure.
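Illustrative sketch (not from the episode): the parallel-stream idea Scott describes could look roughly like the Python below. The quote functions, field names, and prices are all hypothetical, and plain lists stand in for the live and parallel streams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShippingQuote:
    warehouse: str
    zip_code: str
    cost_cents: int


def quote_v1(event: dict) -> ShippingQuote:
    """Current production logic (hypothetical): a flat rate per shipment."""
    return ShippingQuote(event["warehouse"], event["zip"], 500)


def quote_v2(event: dict) -> ShippingQuote:
    """Candidate logic (hypothetical): flat rate plus a per-pound surcharge."""
    return ShippingQuote(
        event["warehouse"], event["zip"], 500 + 25 * event["weight_lb"]
    )


def run_in_parallel(events, live_stream, shadow_stream):
    """Feed the same production inputs to both versions.

    Only live_stream has consumers; shadow_stream is read by the owning team.
    """
    for event in events:
        live_stream.append(quote_v1(event))
        shadow_stream.append(quote_v2(event))


def shadow_violations(shadow_stream):
    """A check the team might run before inviting consumers to switch over."""
    return [q for q in shadow_stream if q.cost_cents < 0]


events = [
    {"warehouse": "W1", "zip": "07030", "weight_lb": 2},
    {"warehouse": "W2", "zip": "10001", "weight_lb": 0},
]
live, shadow = [], []
run_in_parallel(events, live, shadow)
```

Because nothing reads the shadow stream yet, a bad `quote_v2` only shows up in the team's own checks, which is the sense in which this kind of production testing is safe.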
Gene Kim (00:35:39):
[inaudible 00:35:39], can you talk about the inputs? So, I'm part of this service team. Before, I had this REST API where a request comes in and I put out a response; those are the inputs and the outputs. What does it look like in this new world?
Scott Havens (00:35:49):
The inputs are fewer than they were before. If you are managing your carrier methods, for example, you're still getting your updates on the methods from the UPSes, FedExes, and DHLs of the world that change your system configuration and your possible results. And you're still processing those as they happen. But the big change to your inputs is that you no longer have your REST API or your RPC API that is exposed to the customer. That goes away. The only way that they get that data is by consuming your event stream and caching those results, or doing their own computations in advance just like you're doing, so you don't have to worry about any kind of production-level RPC interface that you're supporting anymore. So, with your transportation example, you still have your inputs that change what warehouses are available. Zip codes probably are not going to change; that's a configuration. Your carriers and carrier methods, you still have those as inputs, and you still have your outputs of all of the different combinations of these that you're emitting over Kafka or whatever your queuing or pub/sub messaging system is.
Scott Havens (00:37:09):
In the service-based example, you also have to maintain a REST-based or RPC API that is going to be operating at production level, customer facing, exposed to the world, and that may get hit by any number of requests as time goes by. In the event streaming example, you don't have to support that anymore.
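Illustrative sketch (not from the episode): a consumer that caches an event stream into a local lookup table, instead of calling the producer's API, might look like this. The event types (`RateComputed`, `RateWithdrawn`) and field names are assumptions made up for the example, and a list stands in for the Kafka topic.

```python
def apply_event(table: dict, event: dict) -> None:
    """Fold one event from the producer's stream into the local lookup table."""
    key = (event["warehouse"], event["zip"], event["carrier_method"])
    if event["type"] == "RateComputed":
        table[key] = event["cost_cents"]  # later events overwrite earlier ones
    elif event["type"] == "RateWithdrawn":
        table.pop(key, None)


def lookup(table: dict, warehouse: str, zip_code: str, carrier_method: str):
    """The customer-facing read: a local key lookup, no call to the producer."""
    return table.get((warehouse, zip_code, carrier_method))


stream = [
    {"type": "RateComputed", "warehouse": "W1", "zip": "07030",
     "carrier_method": "ground", "cost_cents": 650},
    {"type": "RateComputed", "warehouse": "W1", "zip": "07030",
     "carrier_method": "ground", "cost_cents": 700},
    {"type": "RateComputed", "warehouse": "W2", "zip": "07030",
     "carrier_method": "ground", "cost_cents": 900},
    {"type": "RateWithdrawn", "warehouse": "W2", "zip": "07030",
     "carrier_method": "ground"},
]

rates = {}
for event in stream:
    apply_event(rates, event)
```

The producing team only emits events; whether the consumer keeps the table in memory or in a key-value store is the consumer's own decision.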
Gene Kim (00:37:34):
And now a message from this episode's sponsor, ServiceNow. ServiceNow connects the work going on in DevOps pipelines to the operational work already being managed in ServiceNow to create a seamless governance layer for delivering apps using DevOps at high velocity. For organizations that need the assurance of change control but want to avoid slowing down developers, ServiceNow can completely automate the gathering of information from the pipeline, the creation of change requests and the approval of those changes. This eliminates admin work and keeps developers in the tools they know and love while improving quality and simplifying any future audit activities. ServiceNow also uses this connection to add value stream management to DevOps, joining teams and information from ideas to value realization so that they can gain insights into what's happening and understand how value is being delivered.
Gene Kim (00:38:26):
In fact, ServiceNow was named a leader in the Forrester Wave for value stream management solutions. If you need to eliminate friction between dev and ops, head to servicenow.com/devops to find out more. Someone else is responsible for enumerating the universe of possibilities, and my responsibility is to generate the right computation, and then that'll go live somewhere else, eventually living in the key value store that is customer facing.
Scott Havens (00:38:51):
Gene Kim (00:38:52):
Okay. Awesome. What else changes for this particular team in this new world?
Scott Havens (00:38:57):
When the team has switched to the point that they're taking all of these messages as inputs and emitting these messages as outputs, suddenly the system as a whole looks very much like a function, where it's not needing to do anything in the middle except its stateless computation. And that's where you get a lot of advantages in correctness: you know your entire domain of inputs and your entire range of outputs. You can use something like spec-based, property-based testing to make sure that all of the invariants that you expect will hold: that you'll never have a cost that goes negative for your transportation system, or your inventory count will never go negative if it's an inventory system. You just know that that will never happen, all based on tests that are happening in code at the unit test level without ever needing to hit a database to do that.
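Illustrative sketch (not from the episode): a hand-rolled property-style test of the "inventory never goes negative" invariant, using only the Python standard library. The `reserve` function is a hypothetical stand-in for the pure business logic; in practice a dedicated property-based testing library such as Hypothesis would generate and shrink the inputs.

```python
import random


def reserve(on_hand: int, requested: int) -> tuple[int, int]:
    """Pure function: returns (new_on_hand, reserved).

    Never reserves more than what is on hand.
    """
    reserved = min(on_hand, requested)
    return on_hand - reserved, reserved


def check_invariants(trials: int = 10_000, seed: int = 0) -> int:
    """Throw many random inputs at reserve() and assert the invariants hold."""
    rng = random.Random(seed)
    for _ in range(trials):
        on_hand = rng.randint(0, 1_000)
        requested = rng.randint(0, 2_000)
        new_on_hand, reserved = reserve(on_hand, requested)
        assert new_on_hand >= 0                  # count never goes negative
        assert 0 <= reserved <= requested        # never over-reserve
        assert new_on_hand + reserved == on_hand  # no units created or lost
    return trials
```

No database or message queue is involved: because the logic is a pure function of its inputs, the whole check runs at unit-test speed.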
Gene Kim (00:39:58):
This is kind of exposing. One other thing that just shows how loosely I grasped this: within the service, if I have an external dependency, if I need information from another service, how does that change for me before and after?
Scott Havens (00:40:08):
You may make a call upfront, but ideally that service has also switched to being an event driven system that is pushing its own inputs into you, rather than you making a call to it.
Gene Kim (00:40:21):
Oh, my goodness. So, in the ideal, it's not something I'm responsible fetching, it is actually given to me. So, it's like a pure function?
Scott Havens (00:40:31):
Exactly. If you're able to push this architecture all the way through, only the customer-facing API is going to be doing a live call. Everything else that contributes to that is going to have its own inputs that are streamed to it, do whatever computation it needs, and push its outputs to the next system.
Gene Kim (00:40:56):
So, in your presentation, you talked about just the value that was created by this. Orders of magnitude cheaper to run, easier to test, easier to implement. Looking back, are there other sort of dimensions of value that made people so happy about the work that you did?
Scott Havens (00:41:12):
One of the things that we get from systems like this that are event driven is that you see so much more data that you didn't have before. With synchronous systems, with all of these calls that are coming from the consumer and hitting a service that's calling another service, calling another service, if you want to store that information for analytics purposes later, you have to go out of your way to build something that's going to capture this, that's going to siphon off all of that data and store it somewhere to get that observability. A lot of that comes for free in these event-driven systems. You are capturing it by virtue of needing to send it over a queue or pub-sub to the downstream system. It's pretty trivial to just hook up another listener to that pub-sub stream and save it off to a data lake somewhere. The data's already being put into a format that is conducive to that.
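Illustrative sketch (not from the episode): hooking up "another listener" can be shown with a toy in-memory pub/sub. The class, topic name, and message fields are hypothetical; with Kafka, a separate consumer group reading the same topic would play the role of the second subscriber.

```python
from collections import defaultdict


class InMemoryPubSub:
    """Toy stand-in for a message bus: every subscriber sees every message."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)


bus = InMemoryPubSub()

downstream_inbox = []  # the consumer the stream was built for
data_lake = []         # an analytics listener added later

bus.subscribe("inventory-updates", downstream_inbox.append)
bus.subscribe("inventory-updates", data_lake.append)  # producer code unchanged

bus.publish("inventory-updates", {"sku": "A1", "on_hand": 4})
```

The producer publishes exactly as before; the analytics copy is a side benefit of the data already flowing through the stream.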
Gene Kim (00:42:12):
So, it seems that this as a whole has just incredibly magnificent properties; it seems to be a magnificent architectural story. And so, when I think about the hallmarks of good architecture, some of the things that come to mind are the ability to be able to test your portion of the system in isolation from everyone else, the ability to do your work within your system without needing to communicate and coordinate with people outside of that service. And maybe even making large scale changes to your parts of the service without permission from anyone else. And then performing deployment [inaudible 00:42:49], the ability to perform deployments during normal business hours with [inaudible 00:42:53] downtime. So, those are all markers that came from the 2017 State of DevOps Report. So, what are your hallmarks of great architecture?
Scott Havens (00:43:03):
I think all the ones you listed are fantastic. It's really important to me for architecture that people are able to pick it up quickly and understand the scope of changes that they're making, what the potential blast radius is going to be, understand hopefully that it's minimal, and not be afraid of making changes. In other words, there's a pit of success: as long as they understand a basic minimum about it, they can get up and running very quickly with just doing the things that they need to do to create business value while making sure that the system is going to be maintainable for the people after them. For me, a great architecture is one that allows you to make changes to your business domain, to add new features that the business domain needs and continue to be able to keep adding those features over time in a way that doesn't slow down because of all of the debt that you have accreted over the last months, years, possibly decades depending on your business.
Scott Havens (00:44:13):
If your architecture allows you to make these changes in a way that the person or team making the changes can feel confident that they're not going to be breaking anything else, then that is a hallmark of great architecture, something that stands the test of time.
Gene Kim (00:44:29):
So, you said two things, and I think it's just so beautiful the way you put it. If I heard you correctly, it was like: can you make the change you need to make in the present, and will you be able to continue to do that in the future? Did I oversimplify what you just said?
Scott Havens (00:44:43):
No, I think that's great. It's having the right levels of abstraction that you can easily identify the boundaries of what you're trying to change versus what you don't want to change. If any developer coming into that code base can say, "Oh, for me to make this change, here is where I need to make it and I don't need to worry about anything else, or don't have to worry that that's going to break. Either break their code, break their deployment, break their tests, any of these pieces. I only have to look at this one area." That makes it a great architecture.
Gene Kim (00:45:23):
I love it so that you're essentially encapsulating or creating these boundaries so that local failures cause only local problems. And you don't really need to have too many concerns or worry about things outside of your system.
Scott Havens (00:45:36):
Yeah. Another way of looking at that is the measure of innate complexity versus accidental complexity. You want to be able to make whatever changes you need on the innate complexity side. It's the accidental complexity, where these things are coupled in ways that you wouldn't expect and that shouldn't be the case when you think about it in retrospect, that's the kind of complexity you don't want to have. You shouldn't need to worry about those connections.
Gene Kim (00:46:08):
When I think of coupling, the ability for teams to be able to build a feature, test and deploy independent of each other, usually I think of that coupling as, "Oh gosh, I'm reliant on someone else's functionality there and I need them to make a change over there, right?" And, that's what kind of shackles teams down. But what you said was so different than what I was expecting to hear. You said, no, actually, the lines in the system were drawn well, the systems were decomposed, features were able to get done within the dozens of teams independently. The problem is that they couldn't test independently or deploy independently, which you said actually shackled them, so they couldn't act as 23 independent teams. They were basically acting as one giant team. Could you validate that? Being shackled by testing and deployment is as devastating as being shackled by inability to get the feature work done.
Scott Havens (00:47:06):
Well, addressing the last part first, I wouldn't say it is as devastating. You still have some of the coupling broken up. The actual feature development is properly decoupled. But going back to what I said earlier about distributed systems, you can't really test them except in production. You need to be prepared to deploy to production and test in a real world environment in a way that won't affect your users. Even though the new features for a given system were developed in a decoupled way, when you're working with synchronous systems that have a very nested service call graph, when you deploy your test version of the service, whoever is calling that service will, in order to test it, need their own parallel test environment or test deployment, and whoever's calling that service will need their own test environment, and so on all the way up to the customer.
Scott Havens (00:48:11):
Because the only way to test that your service and your change of the service aren't going to break the customer is by traversing that entire call graph in a single synchronous call. Whereas the asynchronous event driven services, you don't have to traverse the entire call graph in order to test that something is going to work. All you need to make sure is that whatever changes you've deployed, that your direct consumer is correctly going to consume those and they don't have to worry about their own changes to their own contract that are also being pushed out asynchronously.
Gene Kim (00:48:49):
And so, do I understand correctly that to add new functionality, I actually create a new topic for my consumers so that they can test on that new topic?
Scott Havens (00:48:59):
It depends on what your specific contract is with your consuming service. Maybe they know that you may be sending test or not quite production ready messages through a particular topic. And they're only picking out the ones that they identify and know about and ignore any V2 type messages or test messages. Or maybe you set up a separate topic because they're not ready for that and they can fully evaluate it on their own time. It's whichever works better on a case by case basis. I would say that the rule of thumb would be that a breaking change in schema would be something that you'd want in a new topic and a non-breaking change should stay in the same topic. And since it's a rule of thumb, you don't have to stick with that. The nice thing about event driven systems like this, where you're working entirely asynchronously, is that even if you do mess up on your definition of what is breaking and what's not, it's not that big of a deal because you're already expecting to have some kind of delay in processing.
Scott Havens (00:50:04):
It is going to be eventually consistent. If you need to do a rollback on your code because you made a mistake on your estimation of where to put it, it's usually not that big a deal, as long as you know what your SLAs are and you're able to stay within those.
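Illustrative sketch (not from the episode): the rule of thumb above (breaking schema change goes to a new topic, non-breaking change stays put) and a consumer that ignores messages it does not know about could be expressed as follows. The topic naming scheme, the `schema_major` field, and the known-field set are all assumptions invented for the example.

```python
def topic_for(base_topic: str, schema_major: int) -> str:
    """Breaking (major) schema changes get their own topic; v1 keeps the base name."""
    return base_topic if schema_major == 1 else f"{base_topic}.v{schema_major}"


KNOWN_FIELDS = {"sku", "on_hand"}


def consume(message: dict):
    """A tolerant reader: skip versions it doesn't know, ignore extra fields.

    Returns None for skipped messages so they never reach business logic.
    """
    if message.get("schema_major", 1) != 1:
        return None
    return {k: v for k, v in message.items() if k in KNOWN_FIELDS}


# A non-breaking change (an added field) stays on the same topic and is harmless.
additive = consume({"sku": "A1", "on_hand": 3, "eta_days": 2})

# A v2 message that leaked onto the shared topic is ignored rather than crashing.
skipped = consume({"schema_major": 2, "item": {"id": "A1"}, "count": 3})
```

Either convention works; the point of the sketch is only that the consumer decides when to adopt the new format.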
Gene Kim (00:50:18):
Got it. So, in that case of a failure where a giant mistake was introduced, everything after the point of the deployment would be recomputed?
Scott Havens (00:50:25):
Yeah. You could do that with a recompute. If your system is set up to process [inaudible 00:50:30], it's not a big deal that you're handling all of these messages the same way. Most of the time, if you're talking about a breaking change, it's either going to process correctly or not. In most of the cases where you see these failures, it's just going to be a message that blows up the code on the receiving end. And you don't really need to worry about replaying, you just need to worry about fixing that one message or that particular schema.
Gene Kim (00:50:56):
Got it. So, you don't actually have to do anything global. If you actually introduce a calculation error, then all you have to do is replay yours and then that will trigger all the downstream recomputations?
Scott Havens (00:51:07):
Gene Kim (00:51:08):
Wow, that's so great. So, clearly you did something that people appreciated through your work at jet.com through Panther and so forth. And it sounds like the architecture you created replaced the dominant architecture, and is now the dominant architecture at Walmart. What did you do to bring that into being? Was it just passively you did nothing and everyone just noticed that it was a better way and then they were the ones eager to drive it, or was it the other extreme where you had to bludgeon everybody into submission and everyone went kicking and screaming? I mean, I guess I'm kind of imagining what the two extremes are. I imagine it's not one of those. Can you describe, well, what did you do that earned you the right to make such a huge change to the dominant architecture and eventually replace it?
Scott Havens (00:52:01):
With the merger of jet.com into Walmart, we had the opportunity to show two systems that were doing approximately the same thing but were built in completely different ways. You don't often get a chance in the real world to do such a great A/B test of two completely different ways of doing something and what the results would be. We went back to first principles, starting with just merging the teams in the first place: what are we trying to solve from a business perspective? What are the business problems and what are the non-functional goals, in terms of resiliency, scalability, et cetera, that we'll need to be able to meet in order to solve these problems?
Scott Havens (00:52:52):
We were able to take the two existing systems from Jet and from Walmart, push them to their limits in the testing that we needed to do and see where they start to fall down. And since everyone had already agreed that these are the principles that we have and these are the problems that we're trying to solve, it helped to get everyone on the same page that, "Hey, we can see that following this approach will produce better results in our one year plan and in our five year plan than this other approach."
Scott Havens (00:53:27):
And so, that worked really well to get the teams on the same page in the first place. Then with all that evidence we collected for ourselves in making the decision of which way we wanted to go, we were able to go to executives all the way up the chain to say, "Hey, here's the evidence that we found. Here's what we're going to do and why. We want you to know that this is the approach we're going to take, because we want to come back to you once we've done this with the evidence that we've collected since then, and you can use it as you will to decide, 'Hey, should we spread this approach to other teams as well?'" So, we had a lot of good evidence from the A/B testing that we were able to do with the concepts that we were developing within our own two teams that were merging together.
Gene Kim (00:54:13):
And specifically the two teams were, was it the inventory management systems?
Scott Havens (00:54:16):
Yes. The inventory management for jet.com and walmart.com specifically at the time. And since that was successful, we were given free rein to spread that to the stores, to spread it to international, and to use the same concepts in the Walmart fulfillment services where external parties are using Walmart's supply chain for fulfilling their own orders.
Gene Kim (00:54:44):
Was there a strategic mandate or a strategic desire to reduce the number of inventory management systems? Because I think I heard at one point Walmart had three inventory management systems, one for eCommerce, one for physical stores, and then plus jet.com, that's three. Sounds like there are some downsides of having three of those instead of one.
Scott Havens (00:55:01):
Well, in reality the number is much higher when you look at all of the international markets and the other non-Jet stores that were also under the Walmart umbrella. It's difficult to make inventory systems work by linking two existing ones together. For a lot of other systems or other business domains, that's not as much of a problem. You can maintain the two parallel systems that are both working at smaller scale with their own teams and that'll work okay. But with inventory management, by definition, you can only maintain inventory in one system at a time because you're trying to take reservations against it. You want to make sure that someone from Jet and someone from Walmart, if they're both trying to buy the same item in the same warehouse at the same time, only one of them should get it. You can't have two competing inventory management systems both do a reservation, do so successfully, and make the customers happy. So, inventory management has to be unified: for a given warehouse or a given item, there has to be a single system.
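Illustrative sketch (not from the episode): why two systems taking reservations against the same physical stock can double-sell. The class and SKU are hypothetical, and the sketch is single-threaded; a real system would also need concurrency control around the reservation.

```python
class InventorySystem:
    """Minimal reservation ledger: one instance is one source of truth."""

    def __init__(self, on_hand: dict):
        self.on_hand = dict(on_hand)

    def reserve(self, sku: str, qty: int = 1) -> bool:
        """Reserve qty units if available; otherwise refuse."""
        if self.on_hand.get(sku, 0) >= qty:
            self.on_hand[sku] -= qty
            return True
        return False


# Two separate systems that each believe they own the one physical lamp.
jet = InventorySystem({"lamp": 1})
walmart = InventorySystem({"lamp": 1})
double_sold = jet.reserve("lamp") and walmart.reserve("lamp")

# One unified system for the same warehouse and item.
unified = InventorySystem({"lamp": 1})
results = [unified.reserve("lamp"), unified.reserve("lamp")]
```

With two ledgers both reservations succeed and one customer is let down; with one ledger the second reservation is refused up front.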
Gene Kim (00:56:09):
Interesting. And when you don't, you either have inventory that can't be purchased by one. So, you end up with too much inventory that can't be sold to certain customers is it?
Scott Havens (00:56:21):
Or the exact opposite, in that you only have one item sitting in a warehouse, but if you look at it from different perspectives, you'd be able to buy it from multiple sites all at the same time and some customer is going to be unhappy.
Gene Kim (00:56:36):
Scott Havens (00:56:36):
Having two inventory management systems is lying to your customers. It's in many ways worse than having no inventory system at all, because at least then you're not making that promise.
Gene Kim (00:56:46):
So, let me share with you a story that I heard a couple weeks ago, that is probably one of the most startling things that I've heard in the last couple of years. So, I was talking to Dr. Steven [inaudible 00:56:57] and he tells me about a plant tour that he took in Japan with his mentor, Dr. Ken Bowen, and a VP of manufacturing from a big three auto manufacturer. And during this tour, the plant manager at a Toyota plant talked about how they were doing 60 line side store changes per day. So, I didn't know what that was, but apparently, in a physical manufacturing plant, it is where you get all the inputs in a given work center. And so, they were making 60 of those changes per day. And this VP of manufacturing from the American big three auto manufacturer says, "That's crap. We tried to do six and then shut down the plant for three days." And it took me almost two, three hours to really understand why that was so interesting. But to me it was shocking because it sounds so much like people's reaction, my own included, when John [inaudible 00:57:49] and Paul Hammond talked about how in 2009, they were doing 10 deploys a day. In fact, Mike Nygard said, when he first heard it, his reaction was, "They must be using some sort of slippery definition of deployment that doesn't match my own." Because we do 10 in a year and that doesn't feel so good. But to me, this is shocking. It sounds so familiar in the fact that clearly they're trying to do something by making these line side store changes, and some are able to do it very quickly and easily and some are not. Can you tell me about your impression of me telling the story in a totally different context, right? Does that speak to you in any way?
Scott Havens (00:58:29):
That feels like it's the exact same problem. When you have too much coupling, your deploy is dependent on everyone else and you have to slow down your deploys because of that. When you are able to minimize your blast radius and know the exact limits of what every deploy could possibly affect, you become very comfortable in deploying much more frequently. And when you have systems that are event driven, you know that the maximum possible blast radius is the single consumer that is going to be reading from your event stream. It's not going to go beyond that, and especially when you're using a lot of the functional techniques that we talked about multiple other times, you know that you're not going to be affecting the other features that you have in your own system. So, both in terms of the systems you're impacting and the number of features within a system you're impacting, you know what the blast radius is going to be and you feel comfortable that, "Hey, worst case, I'll roll this back and that is all that will have been affected at that time."
Gene Kim (00:59:38):
So, the expression on my face is actually one of surprise. So, let's go into what those configurations are, with some speculation that we have. So, in the Toyota plant, as Steve explained to me, the primary mode of synchronization is a kanban card, which basically is like an envelope with three fields on it: I need these parts, from this person, to be delivered here. And so, if you and I both have a work center and we want to totally switch roles, all we would need to do is swap the kanban cards, right? And no one needs to know except for the loading dock. Whereas apparently, in the system where six line side store changes result in a shutdown for three days, everything goes through a centralized planning system, like this MRP system, that apparently is not entirely accurate.
Gene Kim (01:00:37):
When you do these changes, you miss something and [inaudible 01:00:41] are incomplete, and you suddenly have parts going to where they aren't needed, and parts that are needed don't ever arrive. Does that surprise you?
Scott Havens (01:00:52):
No. The kanban-style system reminds me a lot of a very queue-heavy microservice architecture, where each person is analogous to a microservice in the software world, doing one thing with its input queue, from which it gets its items, and its output queue, and focused on just that one thing. That one service will know if its one queue for input is broken, or if it's not able to store whatever the output is (in this case, some kind of material item) because there's no room there. It will know that, and each person is able to operate independently because of that decoupling. Whereas a larger-scale, centrally planned system is very similar to the monolith in the software architecture world, where everything is very coupled together. They do seem very analogous.
Scott Havens (01:02:03):
You could draw the analogy to a central architecture board that's trying to approve every single change that someone is making. You can't deploy that change from any small individual system until everyone has agreed on it and it has passed all the internal checks. Only then do you deploy the new, complete architecture that has this single approved change. That can really slow down agility. You're no longer working on contracts in between systems, but you are dictating that all the changes should go through a central configuration point where they can be approved.
Scott Havens (01:02:44):
The comparison would be that each of these microservices has just a very simple input contract and output contract, and each of the people or teams responsible for the services understands they just have to meet the contract, and they can figure out the best way for that service to make sure that those contracts, those SLAs, everything else are maintained. There doesn't need to be a central architecture board that's going to approve it sometime down the line and decide, okay, now that we approve, you can do it. As long as you give that responsibility to the team, say, "Hey, here are your SLAs. Just make sure you meet them and make sure that your business level contract is maintained. You figure out the rest." It allows the teams to work much more quickly and effectively independently.
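Illustrative sketch (not from the episode): the kanban-like worker with one input queue and one bounded output queue, which can tell locally whether it is starved or blocked, might be modeled with Python's standard `queue` module. The `work_once` helper, the status strings, and the part names are hypothetical, and the sketch assumes a single thread.

```python
import queue


def work_once(inbox: queue.Queue, outbox: queue.Queue, transform) -> str:
    """Process at most one item; report local status without asking anyone else.

    Single-threaded sketch: checks the outbox first so no item is ever
    taken from the inbox unless there is room to put the result.
    """
    if outbox.full():
        return "blocked"   # no room downstream, like a full parts shelf
    try:
        item = inbox.get_nowait()
    except queue.Empty:
        return "starved"   # nothing arrived upstream
    outbox.put_nowait(transform(item))
    return "ok"


inbox = queue.Queue()
outbox = queue.Queue(maxsize=1)  # bounded, like limited space at the line side
inbox.put("door")
inbox.put("hood")

statuses = [work_once(inbox, outbox, str.upper) for _ in range(2)]
```

The worker's whole contract is visible in its two queues; nothing outside it has to be consulted to know whether it can proceed.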
Gene Kim (01:03:32):
Independently. Right, right, right. Got it. And then also, you're hinting at the other side, which is that with the architecture review board, you go through all that process, and yet sometimes they can miss something, and then everything comes crashing down. Just like as we've explored in the past. Ah, this is so great. So in your talk, you didn't have time to go into this, but you had given me some advice. I had this system that's been working for four and a half years. And I had described that I was having so many problems with it failing in all sorts of strange ways. I'm going to just read to you this testimonial on how your advice to me worked.
Gene Kim (01:04:12):
So I asked Scott Havens for his advice on how I decouple a bunch of components that write to a database through a terrible Ruby ActiveRecord app I wrote that interfaced with Python code that someone else wrote. And he told me to move it to event sourcing with Pub/Sub. At first, I couldn't quite believe this was practical. Pub/Sub for such a simple app? But then I took your advice and I was blown away by the results. Suddenly, for two components that used to be tightly coupled together, all sorts of weird failure modes where things couldn't be written out correctly just disappeared. I was stunned that, in my very simple case, I managed to reduce the code size by 90% and delete entire portions written in Python and Ruby, and all that remains is a very small portion written in Clojure. So I guess even reading this, I have a little bit of disbelief. So how can even simple systems suffer from problems like this? To me, it just seems so unexpected. Is there an easy answer to that, Scott?
Scott Havens (01:05:08):
I think the key is to realize that any system that you're trying to develop that's solving some kind of business problem is always event sourced or event driven in the business domain. All businesses and all problems really have a series of events that happen. The question is, do you want to make that explicit and understand those events? Or do you want the series of events to just be implicit in the code? When it's implicit in the code, it makes it very hard to debug because it's still there and that's what you're inherently trying to solve from the business side. But you're not going to be able to get to the events themselves and understand the flow of events through the system until you make those explicit. So really every system is event sourced. The question is are you choosing to have that in code and store them, or are you choosing to make it hidden in the business logic and throw the events away so you can never recreate them? So that is one difference that you see.
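Illustrative sketch (not from the episode): making the events explicit means the current state is just a fold over the stored event log, so it can be recreated at any point. The event types (`Received`, `Shipped`) and fields are hypothetical.

```python
from functools import reduce


def evolve(state: dict, event: dict) -> dict:
    """Pure function: previous state + one event -> next state."""
    sku, qty = event["sku"], event["qty"]
    current = state.get(sku, 0)
    if event["type"] == "Received":
        return {**state, sku: current + qty}
    if event["type"] == "Shipped":
        return {**state, sku: current - qty}
    return state  # unknown event types leave the state untouched


def replay(events, initial=None) -> dict:
    """Rebuild state from scratch by folding over the event log."""
    return reduce(evolve, events, initial or {})


log = [
    {"type": "Received", "sku": "A1", "qty": 10},
    {"type": "Shipped", "sku": "A1", "qty": 3},
    {"type": "Received", "sku": "B2", "qty": 5},
]

current_state = replay(log)
state_after_two_events = replay(log[:2])
```

Because the log is kept rather than thrown away, any past state can be reconstructed for debugging by replaying a prefix of it.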
Gene Kim (01:06:15):
I think, in hindsight, what really helped the problem was that side effects were treated very differently. I was processing events in a very side effect-y way that was very prone to failure. And I saw modes of failure that I could never have dreamed of when I wrote the code. Is that part of what you're saying?
Scott Havens (01:06:37):
It's not the same part, but that is another aspect that is just as important. When you have a system that mixes your side effects with your business processing, it means that the correctness of your business logic is dependent on all of your IO always being exactly correct. And when you have them intermixed, any failures on the IO side, or any mistakes you've made in your IO portion, are going to affect the correctness of your business logic. When you design your system, and again, I'm going to call this a functional manner, where all of the business logic is happening where you have your command that is completely immutable, you have a pure function that takes the command and current state and computes a new output state or a new set of output events. And without any interaction with the outside world, you're able to prove that that business logic is completely correct, at least in terms of the code, of what it's trying to do, of the business problem it's trying to solve.
Scott Havens (01:07:43):
Then when you have finally, at the last stage, your output, whether that is a state update or your set of events, and only then do you store it to your data store, whatever that is, or put it in a queue or whatever you choose to do with it, then you only have to worry about whether the IO is correctly saving it or not. And that's much easier to solve than is the business logic reading the right thing, and completely able to recover from any IO failure with the right data and have the right state, and be applying the right change to the right model at the right point in time. It just becomes way too complex to be able to handle every single error case compared to here's my business logic, and then here is the IO. Once you separate those problems, it makes each of them significantly easier to solve because they're no longer so interconnected that it multiplies out all of the possible problems.
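The shape Scott describes, an immutable command plus current state going into a pure function that returns events, with storage as the very last step, might be sketched as follows. All names here (`ReserveStock`, `decide`, `InMemoryEventStore`, `handle`) are hypothetical, and the in-memory store is only a stand-in for whatever data store or queue a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReserveStock:
    """Immutable command (hypothetical)."""
    sku: str
    quantity: int


@dataclass(frozen=True)
class StockReserved:
    sku: str
    quantity: int


@dataclass(frozen=True)
class ReservationRejected:
    sku: str
    requested: int
    available: int


def decide(command: ReserveStock, available: int) -> list:
    """Pure business logic: command + current state -> output events. No IO."""
    if command.quantity <= 0 or command.quantity > available:
        return [ReservationRejected(command.sku, command.quantity, available)]
    return [StockReserved(command.sku, command.quantity)]


class InMemoryEventStore:
    """Stand-in for a real data store or queue; the only place IO would live."""

    def __init__(self):
        self.events = []

    def append(self, events) -> None:
        self.events.extend(events)


def handle(command: ReserveStock, available: int, store: InMemoryEventStore) -> list:
    """Edge of the system: run the pure logic, then persist as the last step."""
    events = decide(command, available)
    store.append(events)
    return events


store = InMemoryEventStore()
handle(ReserveStock("widget", 2), available=5, store=store)
print(store.events)
```

With this split, `decide` can be tested with plain values and no mocks, and the only remaining question for the edge is whether `append` saved what it was given.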
Gene Kim (01:08:45):
Right, right. And by the way, it was that experience that really shaped my understanding of functional programming: push the side effects to the edges, and reduce that surface area of testing the side effects, right? It's like, you don't really have to test that the database will write in general. You can almost depend upon that. What you really want to do is isolate the testing to the logic, which was such an aha moment.
Gene Kim (01:09:11):
Gene here. Okay. A couple of quick things. One, I love how eloquently Scott spoke about the value of being able to make local changes and be able to test them locally without having to worry about how the rest of the system works. This is so consistent with the themes of encapsulation from Dr. Gail Murphy. And this is of course the heart of the first ideal of locality and simplicity that shows up in The Unicorn Project.
Gene Kim (01:09:38):
Number two, one of the things that Scott really helped me understand in this interview is how he thinks the way he does. I had mentioned in my introduction to him in his DevOps Enterprise presentation the quote from the French philosopher Claude Levi-Strauss about whether a tool was good to think with. Scott has explained to me some pretty amazing tools to think with. We talked about currying and function composition in the first break in, and Scott mentioned another one, idempotence. This is the property that an operation can be executed multiple times without changing the result. I was familiar with this term in the context of distributed computing, as in an operation is called idempotent if it can be executed more than once safely. This can be a very important property for side effecting operations like databases or file systems or network calls.
Gene Kim (01:10:31):
So under a certain condition, you may want to make sure that certain operations such as adding or deleting an element can be performed in a way that is idempotent so that you can survive network issues without corrupting data. But while researching this, I discovered that this term also has a more fundamental mathematical connotation. For instance, multiplication is idempotent only for the operations of multiplying by zero or one. And addition is idempotent for only the value zero. I have a bit more about using these in another break in.
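Both senses of idempotence mentioned here can be checked with a small sketch. The helper `is_idempotent` is a hypothetical spot check over sample values, not a proof, and `add_member` and `append_item` are illustrative names for a retry-safe and a retry-unsafe operation.

```python
def is_idempotent(operation, samples) -> bool:
    """Spot check that applying the operation twice equals applying it once."""
    return all(operation(operation(x)) == operation(x) for x in samples)


numbers = [-2, 0, 1, 3, 10]


def times_zero(x): return x * 0
def times_one(x): return x * 1
def times_two(x): return x * 2
def plus_zero(x): return x + 0
def plus_one(x): return x + 1


# Distributed-systems sense: a retried "add" should not change the result.
def add_member(members: frozenset, item) -> frozenset:
    """Idempotent: adding the same element again leaves the set unchanged."""
    return members | {item}


def append_item(items: tuple, item) -> tuple:
    """Not idempotent: a retry produces a duplicate entry."""
    return items + (item,)


once = add_member(frozenset(), "order-1")
retried = add_member(once, "order-1")
print(once == retried)
```

If a network timeout forces a retry of `add_member`, the stored data ends up the same; a retry of `append_item` would record the order twice.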
Gene Kim (01:11:04):
Okay. Number three, Scott talked about how if you have more than one inventory management system, you can't actually take reservations accurately against it. Holy cow. I had no idea that this was actually the case. I know of many organizations that have multiple inventory management systems, and apparently there is a cost to this. I suspect that there's so much value that could be created if organizations bit the bullet and unified their inventory management systems.
Gene Kim (01:11:32):
Number four, Scott, in his DevOps Enterprise presentation and in this interview, referred to a toy application that I had written. This is actually an application that is super important to me because it provides telemetry on how books are selling on certain e-commerce platforms, which is actually a critical capability when executing book launches. Because a book launch, like any activity, is one where it's important to be able to see what you're doing. So this program started off as something that I worked on with Tom Limoncelli, author of the amazing Cloud Administration book. And this was one of my first recurring workloads in the cloud.
Gene Kim (01:12:09):
I eventually took this project over and then proceeded to turn it into a Frankenstein monstrosity. It was made up of Tom's amazing code written in Python, and my unreliable code written in Ruby and ActiveRecord, which would write values out to a database, send emails and Slack messages. The code that I had written was incredibly unreliable. It would fail for so many reasons, not the least of which was being occasionally blacklisted by those e-commerce sites. Over the years, I eventually was able to stabilize it. I eventually rewrote the whole thing in Clojure, which made it much smaller. And I was amazed at how much smaller it got when I rewrote it to run primarily on Google Pub/Sub, which is very much like Apache Kafka, which Scott used. You heard the testimonial I gave about doing that in Scott's DevOps Enterprise talk. If you're interested in how I did this, I gave a fuller experience report of this at the 2019 Clojure/conj Conference. I will put a link to it in the show notes.
Gene Kim (01:13:05):
Okay. Number five, let's talk about the definition of side effect. So according to Wikipedia, a function is said to have a side effect if it modifies some state variable outside of its local environment, that is to say it has an observable effect besides returning a value. In that entry it actually talks about maybe the more useful notion of referential transparency. So the absence of side effects is a necessary, but not sufficient condition for referential transparency, which means an expression, such as a function call, can always be replaced with its value.
Gene Kim (01:13:37):
So this gets to the notion of a pure function. So a pure function has referential transparency. Given a set of inputs, a pure function always returns the same thing. So a non-pure function is when you do a network call or when you write to disk or even read from a disk, because the result may not always be the same. In fact, calling the get system time function is not pure because it is different every time you call it if the granularity is small enough. And so when I say side effects, what I really mean are non-pure functions. Which gets us to number six. That took a lot of words to describe a pretty simple concept.
Gene Kim (01:14:14):
So I think Eric Normand has a much better definition, which he proposed in his fantastic book, Grokking Simplicity. Basically everything can be boiled down to either data, calculations, or actions. So data is inert. So think about a string, a number, a data structure in JSON. Ideally, you can't mutate it. If you want to change it, you have to create a new set of data. Then you have calculations. Calculations operate on data and calculations are always pure functions. So these can be things like adding, subtracting, capitalizing, map, filter, reduce. And then you have actions which are any operations that aren't pure. So this is making a network call, writing or reading from disk to a database, getting the system time, sending an email, posting to Slack. Eric Normand was recently on a book tour and he did so many great interviews on this topic. And I'll point to a couple of them, which I think are incredibly instructive. I love this reframing of functional programming concepts of everything being data, calculations, or an action.
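The data, calculations, and actions split can be sketched in a few lines. The book-rank data and the function names below are hypothetical illustrations, and `publish` simply prints as a stand-in for sending an email or posting to Slack.

```python
from datetime import datetime, timezone

# Data: inert values. A tuple of plain records (hypothetical rank samples).
rank_samples = (
    {"title": "Book A", "rank": 120},
    {"title": "Book B", "rank": 45},
)


# Calculations: pure functions; the same inputs always give the same output.
def best_ranked(samples):
    return min(samples, key=lambda s: s["rank"])


def format_report(sample, as_of: datetime) -> str:
    return f"{as_of:%Y-%m-%d}: {sample['title']} is at rank {sample['rank']}"


# Actions: anything that depends on or changes the outside world.
def current_time() -> datetime:
    return datetime.now(timezone.utc)  # differs on every call, so not pure


def publish(message: str) -> None:
    print(message)  # stand-in for email or Slack


# Actions sit at the edges; the calculations in the middle take the time as an input.
publish(format_report(best_ranked(rank_samples), current_time()))
```

Passing the timestamp into `format_report` rather than reading the clock inside it keeps the calculation testable with a fixed date.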
Gene Kim (01:15:19):
Which gets us to number seven. So many of the problems that I described in my book e-commerce application were due to the terrible way I was handling side effects. Deeply nested calls where I was fetching a result from an API, writing it to a database, posting it to Slack, generating a graph. And so when something went wrong, it's usually deeply inside of a call stack. And when my program blew up, it's just not clear what went wrong or how to better handle that error. So much of fixing that application was pushing side effects to the edges, pulling apart these individual IO operations, and isolating them. And that way you tend to have very simple cases when you're doing IO, and the tests are really about the calculations, not about the actions. And this pattern is often called creating a functional core with an imperative shell. The notion that you want as much written in a functional style based on calculations and data, and you have all your imperative actions at the edges.
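One way to picture a functional core with an imperative shell is to have the core return a plan of actions as plain data and let a thin shell carry it out. This is a sketch under assumptions: `plan_actions` and `run` are hypothetical names, and the fetch, load, save, and notify functions are passed in so that nothing here depends on a real API, database, or Slack.

```python
def plan_actions(previous_rank, new_rank):
    """Functional core: decide what should happen, returned as plain data."""
    actions = [("save_rank", new_rank)]
    if previous_rank is not None and new_rank < previous_rank:
        actions.append(("notify", f"Rank improved from {previous_rank} to {new_rank}"))
    return actions


def run(fetch_rank, load_previous, save_rank, notify):
    """Imperative shell: do the reads, call the core, then perform each action."""
    previous = load_previous()
    new = fetch_rank()
    handlers = {"save_rank": save_rank, "notify": notify}
    for name, payload in plan_actions(previous, new):
        handlers[name](payload)


# Usage with in-memory fakes in place of the real IO.
saved, sent = [], []
run(fetch_rank=lambda: 80, load_previous=lambda: 100,
    save_rank=saved.append, notify=sent.append)
print(saved, sent)
```

The tests then target `plan_actions` with plain values, and each IO function in the shell stays a flat, single-purpose call rather than a link in a deep call stack.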
Gene Kim (01:16:17):
Which gets us to number eight. There is an amazing series in one of my favorite podcasts, which is the Functional Design in Clojure podcast by Christoph Neumann and Nate Jones. I loved all 100 episodes that they did. But one of my favorites was when they designed a Twitter scheduling system. These are episodes number 21 to 26, where they go through the process of separating the logic and the IO, and how you write better tests and the patterns that emerge. It was so dazzling to listen to. And what's so interesting is that what they eventually evolve into is a CQRS or event sourcing system just as Scott is describing. It's just brilliant. Okay. With that, let's go back to the interview.
Gene Kim (01:17:01):
I've heard this a couple of times before that in CQRS it's sometimes difficult to sort of comprehend what the call graph actually looks like. Can you talk about whether this problem is just isolated to me? And I understand that you've been doing some work in this as well.
Scott Havens (01:17:18):
The issues that we have seen in switching to these event driven asynchronous architectures, the synchronous call graph of all of the services that you see, or that we had at Walmart before, and that you see in a lot of places where you have these nested RPC calls. The downside of them, as we've talked about, is that they're all coupled together temporally. You have to have all of these working correctly within the same call with no failures in order to have a successful response to the customer, or whoever's hitting your end API.
Scott Havens (01:17:59):
That is also a positive in terms of being able to understand what happened. In that it's a fairly well understood technical challenge of being able to write down that service A is calling service B, which is calling service C. It creates a tree of calls that you can write down and understand that if there is a problem that happened in this call, it's going to be somewhere in this tree. You're able to understand that it can't have been affected by something that happened yesterday. And it's not going to have an effect on something that happens tomorrow. It is very well temporally contained.
Scott Havens (01:18:45):
So yes, you have to deal with the latency upfront from an operational perspective, but when you're looking for problems, you're able to focus where you're looking to just that single call graph. In an asynchronous system, the big advantage is that you have been able to spread the responsibility temporally, smear it over the last seconds, minutes, hours, so that the customer doesn't need to know how long it took to compute all of these things, and that the testing was able to happen at different stages in advance.
Scott Havens (01:19:20):
That also means that if you're trying to examine a system in production, observe what happened with a particular call, you're not going to be able to reconstruct the causality nearly as easily. The response that the customer gets on a particular call, when it's looking up something in a key value store, the problem probably isn't there. It is in how that value is calculated, which could have been even months ago. And understanding that causality graph becomes a lot more difficult.
Gene Kim (01:19:56):
And you said that this is an area that you've been researching or doing something about?
Scott Havens (01:20:01):
I've been looking into it because it's one of the biggest problems that we've seen in these asynchronous distributed systems. We've seen a lot of intersection between distributed tracing desires in observability and synchronous systems where, because you want to understand the latency at every point in all of these calls and how all of these different services are connected together, distributed tracing has come into existence to help with observability. Where you're able to see the latency at every single hop in the service chain, and be able to dig into maybe this is a database call that was a little bit slower. It really helps with observability concerns.
Scott Havens (01:20:41):
From another perspective, from the business domain perspective, especially coming from supply chain, we wanted to understand the flow of goods through the entire supply chain over time. And this doesn't seem like a technical problem. It seems much more like a business problem. But they really end up being very deeply connected when you want to be able to prove that something arrived at the warehouse several weeks ago, and then was moved to a different part of the warehouse. And then an order was received. And we wanted to ship that item and pack it up. All of these business events that are happening over time.
Scott Havens (01:21:21):
You want to be able to prove that you know where everything is in the entire supply chain at any given point in time. You want to be able to know the downstream effects of any single change that you requested, and you want to be able to trace back the graph of all causes of a particular effect. If you see that this item was moved from warehouse A to warehouse B, you want to be able to say, oh, let's look at that item's history. And see that, oh, it was moved from this warehouse because of this request. And that request was put in by this person at this point in time two weeks ago. And be able to trace back all the way the entire history, not just within a single system, but across systems.
Scott Havens (01:22:08):
All of these business problems of wanting to audit every single change that has happened end up really correlating very strongly to wanting to be able to trace back all of the messages and events at a system level. What we want to do is marry the concerns of distributed tracing that you see in the synchronous systems, where you're seeing the call graph over time, and get the causality graph that will help you in understanding your business domain. Or to build a graph of causality across all of your systems where you're able to prove no message was lost. You're able to identify the time in between each set of events, and make sure that you're hitting all of your SLAs in an event driven system, and be able to derive a business understanding from looking at your message flows. This is an area that I'm not the first person to identify that there is a problem here, but there aren't a whole lot of good solutions yet.
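Tracing an effect back to its causes can be sketched by having every event carry the id of whatever triggered it. This is only an illustration: the `Event` shape, the `caused_by` field, and `trace_back` are hypothetical, and the sketch simplifies the causality graph to a single cause per event.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str
    caused_by: Optional[str]  # id of the event or request that triggered this one


def trace_back(event_id: str, events_by_id: dict) -> list:
    """Walk caused_by links from an effect back to its root cause."""
    chain, seen = [], set()
    current = events_by_id.get(event_id)
    while current is not None and current.event_id not in seen:
        seen.add(current.event_id)  # guard against accidental cycles
        chain.append(current)
        current = events_by_id.get(current.caused_by) if current.caused_by else None
    return chain


# Events that could have come from different systems over several weeks.
log = [
    Event("e1", "TransferRequested", None),
    Event("e2", "ItemPicked", "e1"),
    Event("e3", "ItemMovedToWarehouseB", "e2"),
]
events_by_id = {e.event_id: e for e in log}

history = [e.kind for e in trace_back("e3", events_by_id)]
print(history)
```

A chain that stops at an event whose `caused_by` id is not in the log is one way such a scheme could surface a lost message.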
Gene Kim (01:23:12):
And can you help educate me on is this a big problem, and why? Why is this a problem worth solving?
Scott Havens (01:23:18):
By the time you get to the point that you want to make sure that there are no mistakes in the real world handling of whatever you're doing, making sure that every last item is going to get to the customer within the right amount of time, that you're always going to be fulfilling your promise, it becomes very difficult to make additional improvements. And you're not going to be able to make improvement in your fulfillment rate without knowing where you're failing. We're already at a point where you have two or three nines of correctness on getting the item to the customer within a certain period of time.
Scott Havens (01:24:03):
But if you want to address that next nine of accuracy on availability and fulfillment rate, and you want to figure out where can we cut down the most on time on the processing steps to change it from next day shipping to same day to two hour, one hour shipping. Identifying where are the slow parts in the chain and where are we making mistakes, where are we dropping messages or taking too long to process? You need to be able to trace every single message, not just sample some messages. And you need to be able to audit that every single one is happening with the latencies for each of those to be able to graph the entire thing, and drill into, hey, here's where the problem is. Without that information, you're not going to be able to make any improvements in the system.
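A small sketch of auditing every message rather than sampling: given the timestamps recorded for one order, compute the latency of each hop and list any step that never arrived. The step names and the fixed `EXPECTED_STEPS` list are assumptions made for illustration; a real system might instead derive the steps from the event flow itself.

```python
from datetime import datetime, timedelta

# Hypothetical processing steps for one order, in expected order.
EXPECTED_STEPS = ["OrderReceived", "ItemPicked", "ItemPacked", "ItemShipped"]


def step_latencies(timestamps: dict) -> dict:
    """Time between consecutive expected steps that were both observed."""
    latencies = {}
    for earlier, later in zip(EXPECTED_STEPS, EXPECTED_STEPS[1:]):
        if earlier in timestamps and later in timestamps:
            latencies[f"{earlier}->{later}"] = timestamps[later] - timestamps[earlier]
    return latencies


def missing_steps(timestamps: dict) -> list:
    """Steps with no recorded event: a dropped or still-pending message."""
    return [step for step in EXPECTED_STEPS if step not in timestamps]


def slowest_step(latencies: dict):
    """Name of the hop with the largest latency, or None if nothing was measured."""
    return max(latencies, key=latencies.get) if latencies else None


start = datetime(2020, 1, 1, 9, 0)
order_events = {
    "OrderReceived": start,
    "ItemPicked": start + timedelta(minutes=20),
    "ItemPacked": start + timedelta(minutes=25),
}

latencies = step_latencies(order_events)
print(slowest_step(latencies), missing_steps(order_events))
```

Running this over every order, not a sample, is what would allow drilling into where the slow or lossy part of the chain is.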
Gene Kim (01:24:53):
The Idealcast is produced by IT Revolution where our goal is to help technology leaders succeed and their organizations win through books, events, podcasts, and research.
Gene Kim (01:25:15):
I was listening to a presentation by Art Byrne. He's a lean pioneer and leader who got so good at operations and manufacturing at Wiremold. Part of their acquisition strategy was to actually acquire underperforming plants and create greatness within them. It was an astounding story. And listening to the video, he actually talked about something that I just didn't understand. I've heard Steve Spear talk about the miracle of Toyota, and how people keep thinking it's about manufacturing. He said it's nothing short of miraculous that a Toyota plant can generate 4,000 or 5,000 cars per day, defect free, and just all the choreography it takes to get parts to where they need to go.
Gene Kim (01:26:00):
And I kind of accepted that. I thought I intellectually understood it. But then in this Art Byrne talk, he talked about going to one of the Toyota suppliers where engines are being sent in lots of four a couple of times a day. And him asking the driver, "How many times have you been late delivering an engine to a Toyota plant?" And the driver said, "Once 11 years ago in a snow storm. But we ended up putting it on a helicopter and getting it over there." And that is transiting multiple hours in this convoy truck. It made me realize I did not fully internalize the scale of the miracle occurring every day to support that rate of production. I suspect you've seen something similar in terms of just the volumes of flow and the incredible choreography it takes to actually ship things that you speak so eloquently about so that the customer gets what they ordered. Can you describe to me maybe the scale and your awe of the system? I mean, help me understand the vast scales at which these organizations operate.
Scott Havens (01:27:09):
Oh, I don't think that I can even give a really good picture of what the scale is. I think that is the point of what makes this whole thing or everything about this so difficult is that no one can really put in their head the entire flow at the same time. That goes exactly to the point of why we want to break things up so that we have limited blast radius on changes in the first place. Because no one's going to be able to keep in their head the entire thing.
Scott Havens (01:27:40):
Now that said, when you're dealing with even just something that's just a fraction of Walmart's business in just the supply chain side, you're dealing with, last count, 4,500 stores in the US, each of which has somewhere in the hundreds of thousands of items. You have hundreds of warehouses around the country and hundreds more around the world. You have, I don't even want to count how many different manufacturers of all of these items that you have to understand where they are in their process, and how soon they'll be able to get these things to you. All of the potential possibilities for the transportation networks, from distribution centers to fulfillment centers and stores, of how long it takes to get an item off the shelf in a warehouse, packed into a crate or into a box, put onto a truck, know exactly how long it's going to take that truck to get to the next place where it's going to be put on a shelf in a store, or it's going to be shipped to a customer.
Scott Havens (01:28:44):
And the advances we've made in being able to analyze the entire workflow across all of these different stages, where you're dealing with thousands, if not millions of people, working on these systems. It used to take weeks to be able to get something from A to B across the country and do so on a regular basis. And now we're able to do two hour fulfillment in most markets on any good. And this isn't just Walmart. Amazon, arguably was doing that even earlier. But the scope is amazing. And you only get that from being able to collect all of the data from all of your hopefully independent systems, and see how long it takes to do all of these different steps without necessarily knowing what all the steps are in advance. And the point is you need to be able to collect the data, and without having the full understanding of what you're going to be looking at, be able to derive that understanding just from looking at the event flow across all of these systems.
Gene Kim (01:29:54):
Can you describe a moment at either Jet.com or Walmart that particularly evoked a sense of utter awe in terms of seeing the effects of what you helped create?
Scott Havens (01:30:06):
The first story that comes to mind is when I was asked to do very rigorous scalability and performance testing on what was, at the time, going to be the Panther system, where we needed to make sure that it was going to be able to handle the scalability of what would eventually be Walmart's scale. Nothing was public at the time about the acquisition, but we had a sense that that was the scale that we were going to need with that. And I said, "Okay, this is the budget I'm going to need to be able to do these tests, to be able to scale up for this period of time. Is that going to provide the data that we need?"
Scott Havens (01:30:49):
And I was told, "No. We need to go even bigger. We need to really be able to prove that we're able to do this for a week straight at maximum scalability. And that we're not going to have any problems at all." Just a really in-depth soak test. And the budget that I had figured was going to be, and again, we're working in a startup at that point. It was a well funded startup, but still thinking about money and not wanting to spend too much. I was thinking, okay, this is going to be super expensive. We're going to drop 10 grand on a single performance test. But that's the kind of thing that's important. And to have the answer come back, and say, "No, you need to be thinking an order of magnitude higher than that. We need to spend $100,000 to $200,000 on scalability, just to be able to prove that we're able to do it." That amount of money on a single performance test was beyond anything I had done at that point.
Gene Kim (01:31:51):
And did it find anything interesting?
Scott Havens (01:31:52):
Well, it was interesting to me that everything worked great. I was super thrilled about that, obviously. But it really said a lot to the fundamentals of the architecture and the designs based on that architecture that we were able to handle that kind of scale, even if it was just in not quite a production environment, but using production level data that had been replicated to that level of scale. I was really happy with that.
Gene Kim (01:32:20):
That's awesome. Congratulations. So the main effect was generating a lot of heat. Quarter of a million dollars spent heating up a data center somewhere.
Scott Havens (01:32:28):
But I would still argue it's more useful than using all that money on blockchain computation.
Gene Kim (01:32:35):
Right. That's awesome. All right. And obviously the assurance that that created is orders of magnitude more valuable than that.
Gene Kim (01:32:44):
Okay. Gene here. Two quick things. One, I love how Scott described the dizzying business problem that arises when you don't have perfect inventory information at all points in the supply chain, whether it's in a warehouse, whether it's being picked or packed, whether it's in transport, being unloaded or shelved at another warehouse.
Gene Kim (01:33:02):
I had mentioned that amazing presentation by Art Byrne. I'm going to put a link to his 2013 presentation that he gave at the Lean Summit in the show notes. It's just such a breathtaking example of what mastery of a given domain looks like, and the reference to the dizzying level of choreography required to deliver engines to a Toyota plant will be in another link. Number two, Scott had mentioned the quarter million dollar bill of that huge performance test that may have been critical to the Walmart acquisition. This was definitely a bet that paid off because in 2016, Walmart acquired them in a $3.3 billion acquisition, $3 billion in cash, and $300 million in Walmart shares. Okay, back to the interview.
Gene Kim (01:33:50):
I feel like 10 years from now, all these philosophies and techniques that you talk about will be far more mainstream. I think people will actually point to your presentation as one of the reasons that made that so. One of the things that you talked about that was especially surprising to me was the small teams that actually created this. You just talked about Panther being written by essentially five engineers. You talked about how a new developer with very little experience in F# was able to deliver a feature that halved inventory reject rates in only three weeks.
Gene Kim (01:34:29):
This is stunning to me, right? Because I mean, maybe just maybe exaggerated to make a point, many people would say, "Oh, this is all well and good, but our organization is too stupid to use these principles. People aren't smart enough," and I think you went out of your way to really suggest that that's not the case at all. Right? Can you talk about what it was like to not just be an architect, but essentially, you're leading a team of engineers? What was it like to onboard new engineers and get to the point where they could be productive and generate valuable features?
Scott Havens (01:35:04):
Bringing on new engineers has always been fun for me. So many of them have been trained in object oriented techniques and in more classical architectures, the n-tier architecture with your single database and maybe your set of web servers and load balancers, and haven't really had the opportunity to work in an architecture like this, that is cloud native. Now a lot more people are accustomed to that than they were five years ago when we were starting this project, but even so nowadays, a lot of people still haven't fully internalized the changes that you need to make to how you design your systems for the cloud. It's always been fun bringing on new people and saying, "Hey, the only change that I want you to think about for right now is a complete separation of your domain logic from your IO. Let's start with that. You can do that in any language. Don't worry about any of the cool stuff in this particular language. You don't need to know what computation expressions in F# are, or even know the definition of a monad."
Scott Havens (01:36:23):
Some people always get hung up on things like that in functional programming. No, those aren't the important parts. Just focus on doing your business logic right here, figure out what the result is that you want, and then save it off. Start with that and that comes pretty easily to most people, and that is just the start of the change in mindset into thinking functionally, making sure that you have that purity in how you're developing your code. That is, while it's an easy change, it ends up being super important in that they can then see that same principle applied in larger and larger scales, and then they see why is the system built this way? Oh, it's to have that separation to maintain that purity and everything builds up from there.
Scott Havens (01:37:15):
The system as a whole ends up looking like just a larger version of that much smaller function. So, that ends up being really the first thing that I do with bringing people on, and maybe I've been lucky with the developers, all the developers that I've hired or got to come into the teams, but when they pick that part up first, they've been able to pick up the rest very easily because of it. It's really just that quick switch in the mindset.
Gene Kim (01:37:43):
So suppose I'm this new engineer that you've hired and I'm being handed a portion of the project to halve delivery reject rates, and I've never done F#. I mean, can you describe what my first week might be like, how do you orient me and what does my context look like when I open up my laptop and try to start to prove to you that I can actually do this? Or prove to myself that I can do this?
Scott Havens (01:38:09):
The first thing that I will do is explain the design of the system that exists right now and how it takes in a message off of a queue. It, if it needs to, maps it to the internal domain model. It does its computation. It saves it to a data store as the event that is the result of the computation, and that event will get emitted downstream automatically by the change data capture that's already in place. Explain, this is the model we want for all of our systems. At that point, they usually get that, okay, if I follow this model, what would my next feature look like? Is it an entire new system that runs in parallel, or is it just a modification to the business logic? Usually if they're just starting out, we just want to add one small feature to the business logic, but we're able to point to, this is where that should exist for the systems you're looking at.
Scott Havens (01:39:07):
The structure is similar across all of these systems, so they know to look in the same place and it's following the same pattern, because it's a pattern that happens naturally. When you are separating your pure business logic from your IO, it's always going to be in the same place that you look. Whether it's system A, system B, system C, or if the change that they are going to make is to the IO side of things, they know it's going to be in the IO section of the code. It becomes pretty straightforward where they are going to need to look just based on the scope of the feature that they're going to build. That's the blast radius part that I mentioned earlier. They don't need to understand the entire system upfront before making changes. They don't need to understand how this system interacts with other systems. They're able to be confident that once they've identified the place where they think they need to change, they're correct. That's all they're going to need to change, and it's not going to break anything else.
Gene Kim (01:40:09):
So react to that. I'm smiling because I'm like, oh my gosh. This sounds great, right? I mean, you're probably pointing me into, it's going to be in this function or this module. I don't have to deal with IO and it's apparently going to be a computation. It sounds wonderful, and so I'm also assuming that you'll give me an input dataset and some idea of what the output should look like. Or, here's what the output looks like now and the output probably is going to look like this. I mean, that also helps kind of concretize the problem that needs to be solved.
Scott Havens (01:40:40):
The existing inputs and outputs are all explicit in code because you're not dealing with a random [inaudible 01:40:52] message coming from an API. Here is the part of the code that explicitly lists every single possible input message that the system will process. It does map from some external data source, a queue or the like where the messages are going to come in, but you don't need to worry about those. You only have to worry about your explicit set of input messages, and on the other end, on the output side, your explicit set of output events. You know what those boundaries are. As long as you're dealing with those, the pre-existing set of input and output messages, there is no possible way for you to break any other system as long as your business logic is correct.
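One way to picture "every single possible input message" being listed in code is a closed union of message types with a single boundary function that maps external payloads into it. The sketch below is an assumption-laden illustration in Python rather than the team's F#. The message names, the "type" field, and the decode and decide functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class OrderPlaced:
    sku: str
    quantity: int


@dataclass(frozen=True)
class OrderCancelled:
    sku: str
    quantity: int


# The complete, explicit set of inputs the business logic will ever see.
InputMessage = Union[OrderPlaced, OrderCancelled]


@dataclass(frozen=True)
class Reserved:
    sku: str
    quantity: int


@dataclass(frozen=True)
class Released:
    sku: str
    quantity: int


# The complete, explicit set of events the business logic can produce.
OutputEvent = Union[Reserved, Released]


def decode(raw: dict) -> InputMessage:
    """Boundary mapping: anything outside the known set is rejected here,
    so it never reaches the business logic."""
    kind = raw.get("type")
    if kind == "order_placed":
        return OrderPlaced(raw["sku"], int(raw["quantity"]))
    if kind == "order_cancelled":
        return OrderCancelled(raw["sku"], int(raw["quantity"]))
    raise ValueError(f"unknown message type: {kind!r}")


def decide(msg: InputMessage) -> OutputEvent:
    """Pure logic over the closed input set, returning the closed output set."""
    if isinstance(msg, OrderPlaced):
        return Reserved(msg.sku, msg.quantity)
    if isinstance(msg, OrderCancelled):
        return Released(msg.sku, msg.quantity)
    raise TypeError(f"unhandled input: {msg!r}")


event = decide(decode({"type": "order_placed", "sku": "A1", "quantity": "2"}))
print(event)  # Reserved(sku='A1', quantity=2)
```

A language with real discriminated unions, such as F#, has the compiler check that every case is handled. The isinstance chain here only approximates that.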
Gene Kim (01:41:37):
It's strange. I might be inferring too much here, but that sounds so fun. I mean, I want to think about my worst episodes of programming. I mean, it usually has to do with side-effects. I remember writing a multicast driver for Windows 3.1. I mean, it was all side effects in a treacherously dangerous system, right? Usually blue screens a box, right? You're just trying to understand, how does the underlying system work, right? Windows 3.1, interrupt drivers and registers. Not fun, but when you're dealing with these computations that can be easily tested, and the worst thing that can happen is a failing test, I mean, it just sounds fun. Right? Because you're in the business logic. Am I deluding myself?
Gene Kim (01:42:24):
That that's kind of what the experience feels like?
Scott Havens (01:42:26):
It really is. It makes the business logic part so straightforward that you can have your product person review your code and make sure that, hey, this is exactly what we expect the output to be. This is what we want the result to be. The code looks good here. It's when you've taken out all of the IO, you've taken out all of the bootstrapping code, everything that we think of as engineers that you need to have in order just to make the system run. You've pulled out just the core business logic into its own very pure code that people with minimal programming experience are going to be able to read and understand.
Scott Havens (01:43:15):
The IO, because it's completely separated, you're able to make your changes to the IO as necessary without worrying that you're going to affect your business logic. It's either going to keep correctly saving or emitting the events, or it's going to stop working and you'll know pretty quickly, but you're never going to get straight up inaccurate results. When you have that binary of working or not in the IO layer, it helps you understand that your blast radius is very limited and you'll never have incorrect results.
Gene Kim (01:43:52):
Do you find that people, engineers, are having more fun?
Scott Havens (01:43:55):
Oh, first I'll limit it to me. I feel happiest in a role when I am being productive, I am solving business problems, and I think most engineers feel that way. Yes, more money is nice. Recognition is nice, but if I'm not actually having an impact on the business, I feel like everything I'm doing is just going into a black hole.
Scott Havens (01:44:20):
With systems that are designed like this, where you're able to focus on the business logic changes, it's very easy to have an impact on the business because you're able to measure exactly that, hey, I added this feature to the business logic. I was able to test it fully. I know that I don't have to wait on anyone else in order to get it into production, and you can look at the before and after on whatever metrics I'm collecting and see that I actually had an impact. You don't have to worry about your overhead code, or spend a lot of time on your bootstrapping code, or making sure you get the IO just right. You make your change to business logic. You push it to production. You're good to go. You've already made an impact, and it's very fulfilling for me.
Gene Kim (01:45:11):
Gene here. I want to quickly describe that experience working in the Windows 3.1 multicast networking group. That was in 1995 and was what brought me to Portland, Oregon, which is where I still live. It was a summer internship working with Intel while I was a graduate student at the University of Arizona. It required implementing the network multicast protocol for the TCP/IP network stack in the 16 bit Windows operating system. For my temperament, it was the worst job ever. It probably didn't help that I wasn't very good at it. My memory of that summer was constantly getting the blue screen of death on my dev workstation because of memory address errors. In hindsight, it is obvious that one of the main reasons is that I didn't understand what a segment pointer was.
Gene Kim (01:45:58):
So in the 16 bit Intel x86 architecture, you had near pointers and far pointers. Basically, if you referred to any memory address that wasn't in your current 64k block of memory, you had to use a far pointer, and if you didn't, you'd end up with a segmentation fault and you would get the Windows blue screen of death. I think I spent over two weeks trying to figure out why this was happening to me, and the only tool I had was a hardware debugger. It was something you plugged into your PC and it had a thumb switch that would bring you into a hardware debugger and let you inspect the memory, and the feedback loops were so long. I remember for certain weeks I was working late into the night every night, basically seeing blue screen of death after blue screen of death. By the way, this is what is so amazing about having flat address spaces. You don't have to worry about segment pointers anymore.
Gene Kim (01:46:49):
This was only made possible by 32 bit programming that came in with Windows 95, and Windows NT, Linux, and later macOS. I cannot overstate how much this simplified programming. For a while, I thought I wanted to do operating system work. In hindsight, that's the exact opposite of what I like to do. I just want to work in a nice dev environment with the entire environment created for me, and I just get to work on the business logic where the worst thing that can happen is a failed test. It's the ideal where you get fast feedback on your work, and you get to work safely within your module with very few changes and side effects happening outside of your module. Just like how Scott described. Okay, back to the interview. At Moda Operandi, you had an opportunity to take these same principles into a Ruby on Rails code base. I'd love to hear any reflections on that, and my last question will be, when is CQRS unsuitable? Can you talk about anything about your Moda Operandi experience? What is it like to tackle a Ruby on Rails code base? To what extent is it applicable there?
Scott Havens (01:47:49):
I would say that it's pretty difficult to follow this style on a preexisting Ruby on Rails application. It's not what Rails was designed to do. It's designed very much for storing the values directly to the database. A lot of updates in place. We spent less time on trying to put new features into that code base than on extracting features out into separate services. If there was something that was a basic part of a feature already in place in that code base, we'd write a new microservice that we were able to deploy completely separately and test independently, and with that in place, then make the change to the Ruby monolith so that, instead of making this call in its own code base, it would either emit an event or call out to the separate, already completed service. So we were really following a strangler pattern there, where we would write code following a new model that would replace the existing part, and then cut that off, and over time cut down on the monolith that existed there.
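The strangler approach described here can be pictured as a thin routing layer: each feature keeps calling the monolith's code until a replacement service exists, and then only that one call is redirected. The Python sketch below is an illustration under stated assumptions, not the actual Rails or Scala code. StranglerFacade, the two estimate functions, and the numbers they return are all invented for the example.

```python
from typing import Callable, Dict

Handler = Callable[[dict], int]


def legacy_shipping_estimate(order: dict) -> int:
    """Stand-in for logic that still lives inside the monolith."""
    return 5


def new_shipping_estimate(order: dict) -> int:
    """Stand-in for a call out to a separately deployed and tested service."""
    return 2 if order["warehouse"] == "east" else 4


class StranglerFacade:
    """Routes each feature to the extracted service once it exists,
    and to the legacy code path otherwise."""

    def __init__(self, legacy: Dict[str, Handler]) -> None:
        self._legacy = dict(legacy)
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, feature: str, handler: Handler) -> None:
        # Cut one feature over; everything else stays on the monolith.
        self._migrated[feature] = handler

    def call(self, feature: str, order: dict) -> int:
        handler = self._migrated.get(feature) or self._legacy[feature]
        return handler(order)


facade = StranglerFacade({"shipping_estimate": legacy_shipping_estimate})
order = {"warehouse": "east"}

before = facade.call("shipping_estimate", order)
facade.migrate("shipping_estimate", new_shipping_estimate)
after = facade.call("shipping_estimate", order)
print(before, after)  # 5 2
```

Once every caller goes through the migrated path, the legacy function can be deleted, which is the "cut that off" step in the description above.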
Gene Kim (01:48:56):
What did you write these new services in?
Scott Havens (01:48:59):
I was lucky to get a team there that was already pretty familiar with functional programming, which made me happy. Their background was mostly in Scala, and I don't really care which language we're doing it in as long as we're following certain principles. So we did all of this in Scala.
Gene Kim (01:49:16):
Were there any surprises in that journey or something that you're most proud of?
Scott Havens (01:49:19):
I was really happy that we were able to get the fulfillment rates up quite a bit. The existing code base, because it was doing a lot of updates in place and didn't have as many unit tests as it did integration tests, was not able to fully cover all of the potential business cases. That happens a lot at startups. You get the code in place that is working and you have to move on to your next problem.
Scott Havens (01:49:49):
It's more important to get some of these features in place early on than to prove that if you add more warehouses down the line, if you get to the point that that's a problem that you're dealing with lots of warehouses instead of your single one, then you should be pretty happy that you're able to get there in the first place. Then it did lead to problems where we weren't able to test exhaustively all potential inputs and outputs to the business logic. So five, six, seven years later, we stumbled across this problem because we were doing updates in place, and now we made a change in one part of the system and it ended up having impacts in other parts that we didn't expect. The blast radius was not particularly contained.
Scott Havens (01:50:35):
So as we were able to identify these parts of the code handling supply chain issues and extract them to an exhaustively tested microservice, we were able to eliminate a lot of the code errors that were putting in incorrect inventory values in these corner cases. Or, calculating the transportation times incorrectly. All of these different supply chain potential code bugs, and get our reject rates from, I shouldn't give exact numbers, but much better reject rates than we had been seeing before.
Scott Havens (01:51:12):
When the customer has requested an item, and expects the item to arrive by a certain point in time, and we end up not being able to fulfill that promise, you want that number as low as possible. Or, the inverse of that being the fulfillment rate and the perfect order rate, that we've been able to get that order to the customer on time without making any mistakes in which item we selected, or damage to the item, or anything like that.
Gene Kim (01:51:45):
Does late contribute to a reject rate?
Scott Havens (01:51:48):
For reject rate, no, but reject rate is a portion of what we call the perfect order rate, which is all inclusive. It's really the best descriptor of the customer experience, where they were expecting the right item undamaged by a certain point in time, and making sure that we get that order to them perfectly, fulfilling all those expectations.
Gene Kim (01:52:10):
So one of the things I'm starting to suspect is that it's really important at the highest levels of leadership to understand the importance of good architecture. Often I feel like there's a gap. In other words, many big decisions are being made not properly informed by architecture, or maybe not with the recognition of, that they don't have a good architecture and they need to invest in one. What advice would you give to business leadership? How do you help leaders get that aha moment, right? To care about the things that you care about. What advice would you give to someone who wants to replicate the journey that you've had?
Scott Havens (01:52:46):
I think that is an excellent question, particularly since I don't know if it's reasonable for executives who are focused on making sure that the technology organization as a whole is not just solving some business problems, but solving the right business problems. It is important that they are measuring the right things from their technology organization. Obviously you're very familiar with all of the software delivery metrics: making sure that you're measuring the lead time on changes that are getting to production, that you're measuring the number of successful deploys at a time. I would say that executives need to make sure that they're pushing those metrics down to the frontline managers and that the managers let the teams know that that's what they're measuring. Not just number of commits, or number of bugs that are filed, but that they're measuring these particular metrics.
Scott Havens (01:53:58):
Then the executives need to be able to listen to their engineers, their architects who say that these architectures will be improving these output metrics, these delivery metrics. Maybe in five or 10 years, there are going to be new advances in architecture that we have no idea about now, but you should be able to test these architectures in your own teams by saying, "Hey. If they are going to improve these metrics, these delivery metrics, then let's do that. Let's try it, see how it works and see if it actually makes our teams better in the long run." It's not a matter of going with a particular architecture. It's a matter of being comfortable with new or different architectures that may have real world differences in your productivity. Being able to listen to those people who are saying, "Hey, we're going to see these differences in productivity," but making sure that you are getting your teams to measure that productivity rather than just saying, "This is the next new cool thing."
Gene Kim (01:55:08):
I heard the Accelerate metrics of deployment frequency, [inaudible 01:55:11] component lead time, [inaudible 01:55:12] time to repair. You're saying that there's a certain sufficiency there and I also heard you say if there is a gap, you have to be able to create room for architectural experiments. I think I'm hearing with a promise that could promise orders of magnitude improvement in those and just making room for that experimentation. Did I capture the salient aspects of the answer?
Scott Havens (01:55:34):
Yes. It's the experimentation combined with making sure the promise is fulfilled. Just doing a cool thing because it's new and cool isn't enough. You do the cool new thing because it has a promise that it will have better results, and so you should be able to measure those results. That should be something that is upfront whenever you are pushing an architectural change that could be affecting a bunch of people, you should be able to note, what are the expectations from this? At what point should we see these results? What do you expect to see and why? Make sure that your engineers, architects, are answering those questions before they do the experiment.
Gene Kim (01:56:18):
I'm three years, three and a half, four years into functional programming and I still have a loose grasp of understanding of category theory, monoid, semigroups, functors, applicative functors, monads. Right? I mean, what do people need to understand? It's clearly a precise way to think that I certainly appreciate, but there are times when I wonder, will I ever get there and how much energy should I invest in understanding these concepts? What advice would you give me on that?
Scott Havens (01:56:47):
I would say the vast majority of it is irrelevant. It's useful if you are wanting to debate whether a particular feature should be added to a language in the most effective way, that has the broadest impact for the least amount of changes to the syntax. In general though, I've found that for most developers who are working on line of business applications or regular websites, using all of the terms from category theory that you were listing earlier isn't going to help them build better software.
Scott Havens (01:57:24):
What's going to help them is to see examples of, here's how to do error handling in a way that doesn't involve throwing an exception and catching it somewhere up the stack. By having a result type that has two options, your success result that has whatever that result is, or your failure result that has an error message included in it, and returning that single result type as opposed to throwing an exception. Giving examples like that and showing how they can be useful in the real world ends up being much more important in terms of getting new developers to be productive in a language.
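A result type with two options, as described here, might look like the following. This is a minimal sketch in Python rather than F# (where Result is built in). The Ok and Err classes and the reserve and describe functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Ok:
    """Success case: carries whatever the result is."""
    value: int


@dataclass(frozen=True)
class Err:
    """Failure case: carries an error message instead of raising."""
    message: str


Result = Union[Ok, Err]


def reserve(on_hand: int, requested: int) -> Result:
    """Return the remaining stock, or a description of why it failed."""
    if requested <= 0:
        return Err("requested quantity must be positive")
    if requested > on_hand:
        return Err(f"only {on_hand} on hand, {requested} requested")
    return Ok(on_hand - requested)


def describe(result: Result) -> str:
    """The caller handles both cases explicitly; nothing is thrown."""
    if isinstance(result, Ok):
        return f"reserved; {result.value} left"
    return f"rejected: {result.message}"


print(describe(reserve(5, 2)))  # reserved; 3 left
print(describe(reserve(1, 2)))  # rejected: only 1 on hand, 2 requested
```

Because the failure is part of the return type, the caller cannot forget that it exists the way it can forget an exception raised several frames down.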
Scott Havens (01:58:11):
If they want to go off on their own and learn about monoids, and monads, and applicative functors, and just all of the fun category theory stuff, great. Let them and they'll get a better understanding of why these things were designed the way they were in the first place, but that's not going to necessarily help you build a discriminated union that covers all of your potential inputs and makes sure that you're exhaustively testing all of your business coding.
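To illustrate what exhaustive testing over a discriminated union can mean in practice: when the input set is closed and the logic is pure, a small domain can simply be enumerated. The Python sketch below is an assumption, not Scott's code. The three message types, apply_message, and the invariant being checked are invented for the example.

```python
from dataclasses import dataclass
from itertools import product
from typing import Union


@dataclass(frozen=True)
class Received:
    quantity: int


@dataclass(frozen=True)
class Shipped:
    quantity: int


@dataclass(frozen=True)
class Damaged:
    quantity: int


# Closed set of inputs, standing in for an F#-style discriminated union.
InventoryMessage = Union[Received, Shipped, Damaged]
ALL_CASES = (Received, Shipped, Damaged)


def apply_message(on_hand: int, msg: InventoryMessage) -> int:
    """Pure business rule: stock goes up on receipt, down otherwise,
    and never below zero."""
    if isinstance(msg, Received):
        return on_hand + msg.quantity
    if isinstance(msg, (Shipped, Damaged)):
        return max(0, on_hand - msg.quantity)
    raise TypeError(f"unhandled input: {msg!r}")


# Enumerate every message type against a small range of stock levels
# and quantities, checking the invariant for each combination.
checked = 0
for case, on_hand, quantity in product(ALL_CASES, range(4), range(4)):
    result = apply_message(on_hand, case(quantity))
    assert result >= 0, (case.__name__, on_hand, quantity)
    checked += 1

print(checked)  # 48
```

Real domains are usually too large to enumerate fully, so this brute-force loop stands in for property-based or table-driven tests over the same closed set.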
Gene Kim (01:58:39):
Okay. Gene here. I love what Scott just said, saying that category theory is probably overrated and not critical to learn to be able to write better, simpler programs in a functional programming style. Okay, so number one, it occurs to me that one of my biggest aha moments in functional programming is not my shaky grasp of category theory. Instead, it is the notion of algebraic thinking, which is also described brilliantly by Eric Normand. He has this amazing podcast episode called Examples of Algebraic Thinking. He gives an example of building a video editing program comprised primarily of pure functions that operate on videos. So you can split or trim them, you can concatenate them together, so that for any given time T in a timeline, you know what frame should go there. You don't mutate them in place, but instead you transform them, composing these functions and operations together.
Gene Kim (01:59:31):
It's just a great way to think about solving problems, and he gives two other examples in that podcast episode. Incidentally, when you're doing this, when you're using these tools to think with, you're actually using the properties of monoids, where the order of operations and grouping don't matter. It's pretty awesome. Which gets us to number two. I've also been enjoying Adam Gordon Bell's podcast called CoRecursive. There's an episode where he interviewed Sam Ritchie, who discovers that you can actually do certain types of searches and counting in HyperLogLog time, if you're willing to put up with some inaccuracies, because of a property of another thing that comes from category theory called semigroups. I don't actually claim to understand this, but it does show that there are some pretty amazing things you can do if you understand some of the mathematical properties, which gets us to number three.
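For readers who want the monoid idea in concrete form: a monoid is a combining operation that is associative and has an identity element. The Python sketch below uses clip concatenation as a loose nod to the video editing example. It is not taken from Eric Normand's episode, and the clip representation (a tuple of frame labels) is a simplifying assumption. Note that for concatenation the grouping of the operations is free, while the left-to-right sequence of the clips still matters.

```python
from functools import reduce

# A clip modeled as an immutable tuple of frame labels (hypothetical).
EMPTY_CLIP: tuple = ()


def concat(a: tuple, b: tuple) -> tuple:
    """Combine two clips into a new one without mutating either."""
    return a + b


intro = ("i1", "i2")
body = ("b1",)
outro = ("o1", "o2")

# Associativity: how the concatenations are grouped does not matter.
left_grouped = concat(concat(intro, body), outro)
right_grouped = concat(intro, concat(body, outro))

# Identity: combining with the empty clip changes nothing.
with_identity = concat(EMPTY_CLIP, intro)

# Together these let any list of clips be folded, or split into chunks
# and folded separately, with the same final timeline.
timeline = reduce(concat, [intro, body, outro], EMPTY_CLIP)

print(left_grouped == right_grouped == timeline)  # True
print(timeline[2])  # b1
```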
Gene Kim (02:00:18):
Despite all this, I love what Scott says. You don't actually need to understand category theory, which I don't even want to try to define here because I can't, in order to use functional programming to get some of these amazing benefits that have been mentioned throughout this podcast. To build things in a simpler, faster, safer way, which also makes us happier, which gets to the last point, number four. Maybe to prove this, Dr. Adam Grant, famous for many things including one of my favorite books, Give and Take, tweeted this out earlier this year. He said, "Coding is more about communicating than computing. New data: the best predictor of how quickly people learn to code wasn't math or cognitive ability, but language aptitude."
Gene Kim (02:00:59):
In other words, math skill was almost irrelevant. Coding is a mastering of language, and he points to a Nature article that explored this. Pretty great. Okay, here's Scott to conclude the interview. I love it. Scott, I've got to tell you, I've learned so much from you over the years, and I've had so much fun in these interviews. Thank you so much. I love the fact that in our talk, you said one of your missions in life is to reduce the level of entropy in the world. I think your work has done so much to advance that, not in academia, but in the commercial world where these problems are most acute and are most needed. So Scott, tell us how to reach you.
Scott Havens (02:01:37):
Sure. I can be reached on Twitter at Scott Havens, or via email, [email protected] That's Sierra Papa Havens.com. I would love to talk with anyone who is interested in applying some of these concepts about distributed asynchronous event-driven systems in their own work, where they may have more traditional architectures and are wondering if they'd be able to see any kind of gains by making any changes, or how to go about doing so. Or, if you have already started on that journey and have experience, either good or bad, with making these changes. Maybe you're having a lot of the same concerns about being able to observe the systems as a whole and how they interact, and your thoughts on that and what's working for you, and what's not, I would love to be able to talk with anyone with that experience as well to collaborate on that.
Gene Kim (02:02:40):
Thanks so much, Scott.
Scott Havens (02:02:42):
Thank you, Gene. My pleasure.
Gene Kim (02:02:43):
The IdealCast is produced by IT Revolution, where our goal is to help technology leaders succeed and their organizations win through books, events, podcasts, and research. This episode is made possible with the support from ServiceNow. Take the risk out of going fast. If you need to eliminate friction between dev and ops, head to ServiceNow.com/devops to find out more.