June 22, 2023

Platform as a Product at Walmart

By Bryan Finster

This article is excerpted from the paper “Platform as a Product” that appears in the Spring 2023 DevOps Enterprise Journal. It is authored by Bryan Finster, Distinguished Engineer at Defense Unicorns.

In 2015, we began a pilot of continuous delivery at Walmart. We had a large legacy system supported by hundreds of developers deploying to scores of distribution centers in multiple countries. We pushed three or four releases annually, each requiring planned 24/7 support for a couple of weeks and heroic efforts to stabilize. Our leadership challenged us to find a way to deliver every two weeks. We assembled a tiger team of senior engineers, studied the book Continuous Delivery by Jez Humble and Dave Farley, and decided daily delivery was a better goal. Why daily? Because smaller batches of change fail in smaller ways and nothing is more effective at uncovering waste and delivery pain.

To achieve this, we couldn’t simply tell teams to deliver more frequently. We needed to change everything. Our team structure, organized around feature delivery rather than business capabilities, was continuing to degrade the application architecture. We also had no experience with the continuous integration workflow that is fundamental to continuous delivery (CD). We presented leadership with a new team structure aligned with business domains and a plan to begin rearchitecting for delivery. We created a small platform team to build self-service and opinionated delivery pipelines that aligned with the principles of CD.

While the new product teams solved the challenges of continuous integration (CI) and CD, they provided feedback to the platform team to improve the pipeline defaults. This allowed the teams to focus on their domain capabilities rather than how to deliver them while we embedded the good patterns we learned into the platform to help future teams. Solving all of the problems of delivering nonbreaking changes several times a day rapidly grew engineering skills on those teams. It also had the unexpected benefit of improving morale. Teams love seeing their work used, and those pilot teams could see that several times per week rather than three times a year. We had improved mean time to dopamine.

Continuous Delivery: The Lever, Not the Goal

Our experience showed that focusing on solving the problem of “why can’t we deliver working solutions daily” was the most effective method for driving improvement. We implemented a strategy to use CD as the primary method to scale the same engineering and business improvements across the enterprise. The challenge—and the strength—of this strategy is that CD is more than tooling. It also requires a specific workflow and mindset to deliver optimum results. One way of doing this is to create a large coaching organization and perform training with every team. However, even if enough qualified people could be found to work with teams, that method creates unsustainable overhead for any large organization.

We needed to apply force-multiplying solutions to help teams self-improve. To that end, we created a centralized delivery platform organization to lead that strategy.

Centralized Platform

Standardizing tools is a double-edged sword that can result in poor outcomes if done incorrectly. If we impose too many restrictions or opinions on the platform, we can strangle innovation or force people to work around the platform to get the work done. Done correctly, we can generate economies of scale. Having one set of tools for everyone has obvious benefits. Using fewer tools means lower operations costs and less integration effort. It also means onboarding or changing between teams is easier. Further, it allows us to automate standards, policies, and security into a single platform and incentivize change.

Empathy, Not Mandates

When delivering any solution, the last thing we want to do is alienate our potential users. The most effective way to alienate internal users is to force them to change. Even if our solution is better, forcing them to switch would burn any goodwill we may have received from voluntary adoption. That goodwill is very important when we stumble early on, as every new solution does. Instead, we wanted to build solutions that enticed them.

We knew that a global platform was the enterprise’s goal and that using it would be mandated in the future. However, if we acted that way, we would drive away users with both poor interactions and poor solutions. We wanted adoption to be a pull, not a push. To that end, we focused on behaving with empathy for the problems our users had. That empathy and user-centric focus didn’t end with developers.

An improved developer experience does not mean we optimize only to reduce the time or effort required to deliver code to production. It means we make it harder for teams to make errors and easier for those teams to operate their products. Doing this means working with all of the other disciplines that surround coding and helping to make their jobs easier as well.

We collaborated with the Security and Compliance areas to embed their concerns into the platform so that every change could be validated against their policies automatically. This took disciplined change management on our part. You cannot apply strict automated validation to applications that only used manual security and compliance processes in the past. We’ll cover this more later.

The result of this work was that teams were not required to use our solution. However, they would need to work very hard to meet the organization’s nonnegotiables with another delivery solution. With ours, it just happened.

Scope and Organization

Another problem to consider is how to avoid solving the wrong problems. Internal platforms do not generate income. Our value proposition is lowering costs elsewhere for other value streams.

For a platform with enterprise funding, it’s easy to fall into the trap of “That would be neat. Let’s solve that problem too!” In the process, we can become top-heavy with features that solve phantom problems or only help a small fraction of the organization. “Wouldn’t it be cool if we had a developer portal with drag-and-drop service creation?” It might, but is there a need? If it does not dramatically lower the cost of development for a majority of teams, we have only created an ongoing expense for something that looks good on a conference stage. In this way, we can quickly eliminate our value proposition and spend ourselves out of existence. We need clarity of mission and vision.

We had a clear mission: help drive the organization’s CD strategy. We also had a clear vision: irresistible developer experience. Next, we needed to define our scope and organize for success.

Software delivery enablement (SDE) was part of a larger infrastructure organization. Our scope of responsibility was all of the capabilities from version control to delivery. We would interface with, but not be responsible for, capabilities such as operational observability and workflow management solutions. With a known scope, we leveraged domain-driven design to quickly deliver our goals while remaining flexible about how we implemented them. We defined the discrete capabilities (version control, CI, security, artifact versioning, delivery metrics, etc.) and organized cross-functional teams around each.

Each team’s product owner was responsible for aligning the team’s road map to the overall platform goals. Teams were trusted to deliver the capabilities they were responsible for and held accountable for the outcomes, including operational stability and user experience. By giving this trust, platform leadership established a culture of ownership that fostered innovation and improved user interactions with each product within the delivery platform.

By organizing around capability domains with the goal of being able to replace underlying infrastructure without impacting the user experience, we were not locked into early technology choices. We could, and did, change the tools used to deliver these capabilities.
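One way to picture that separation is a stable interface per capability with interchangeable backends behind it. The following is a minimal sketch only, not the platform's actual design: the ArtifactStore interface, both vendor classes, and the release function are hypothetical names invented for illustration.

```python
from typing import Protocol


class ArtifactStore(Protocol):
    """The capability teams depend on (hypothetical interface)."""

    def publish(self, name: str, version: str) -> str: ...


class VendorAStore:
    """One possible backing tool (hypothetical)."""

    def publish(self, name: str, version: str) -> str:
        return f"vendor-a://{name}/{version}"


class VendorBStore:
    """A replacement tool with a different storage layout (hypothetical)."""

    def publish(self, name: str, version: str) -> str:
        return f"vendor-b://{name}:{version}"


def release(store: ArtifactStore, name: str, version: str) -> dict[str, str]:
    """What a team calls; the result does not depend on the backing tool."""
    store.publish(name, version)  # the backend-specific location stays internal
    return {"artifact": name, "version": version, "status": "published"}
```

Because callers only see the result of release, swapping VendorAStore for VendorBStore changes nothing from the team's point of view.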

An Irresistible Developer Experience

Because one of our goals was to spread knowledge of continuous delivery workflows, the platform made practices like trunk-based development the easiest way to work and provided feedback in the form of scores for how well CD was being executed. It also made other workflows more difficult, either intentionally or by simply not explicitly supporting them. Using practices such as GitFlow, which were incompatible with our goals, resulted in increased toil. However, if we eliminated the ability to use other workflows entirely, it would prevent adoption for any team that was not already on the path to continuous delivery. We took an approach that allowed teams to trade simplicity for flexibility as needed while also ensuring that security and compliance could not be avoided. Our abstraction layer over the tools also allowed us to maintain a consistent user experience while changing tools as our needs changed.

To be effective, we needed to avoid the platform-as-a-service-ticket antipattern, so we prioritized the ability for teams to configure their delivery flows according to the needs of their products. We provided extensible templates that allowed teams to use simple, declarative commands to configure common tasks such as test coverage reporting. The base templates exposed those simple commands while also enforcing data collection, security scans, compliance validation, and business rules. For example, we could create and enforce rules to block unapproved changes during major sales events. However, if a team needed more complex behaviors for their specific application, they still had the ability to create scripts for those behaviors, since not all twenty-year-old code is architected for continuous delivery patterns. We were also responsive to input from teams using common technologies that we did not yet support. As an open-source-first organization, we encouraged inner sourcing. We would happily accept code submissions to improve our support for those technologies as long as they aligned with the overall mission.
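The template idea above can be sketched in a few lines. This is an illustration under stated assumptions, not the real template format: TeamConfig, FreezeWindow, the step names, and the script name in the test are all hypothetical. It shows a team declaring simple options and custom scripts while the base template always adds the enforced steps, plus a rule that blocks unapproved changes inside a freeze window.

```python
from dataclasses import dataclass, field
from datetime import date

# Steps the base template always runs; teams cannot remove them (hypothetical names).
ENFORCED_STEPS = ("collect_metrics", "security_scan", "compliance_check")


@dataclass(frozen=True)
class FreezeWindow:
    """A period, such as a major sales event, when only approved changes ship."""
    name: str
    start: date
    end: date

    def covers(self, day: date) -> bool:
        return self.start <= day <= self.end


@dataclass
class TeamConfig:
    """What a team declares: simple switches plus optional custom scripts."""
    coverage_report: bool = False
    custom_scripts: list[str] = field(default_factory=list)


def build_pipeline(config: TeamConfig) -> list[str]:
    """Expand a team's declaration into the full ordered list of steps."""
    steps = ["build", "test"]
    if config.coverage_report:
        steps.append("coverage_report")
    steps.extend(config.custom_scripts)  # escape hatch for complex applications
    steps.extend(ENFORCED_STEPS)         # always present, whatever the team declares
    steps.append("deploy")
    return steps


def may_deploy(day: date, approved: bool, windows: list[FreezeWindow]) -> bool:
    """Allow a deploy unless the day is in a freeze window and the change is unapproved."""
    return approved or not any(w.covers(day) for w in windows)
```

The design choice worth noting is that enforcement lives in the base template, so a team can add steps but has no way to declare the enforced ones away.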

Speed with Safety Nets

Focusing on a good developer experience does not mean we ignore our responsibilities to keep the enterprise safe. It means we make keeping the enterprise safe as easy as possible by embedding security and policies into the platform. However, if the teams’ previous delivery solutions relied on manual security verification, it’s exceedingly unlikely they met our desired security profile. If we simply blocked the delivery of their current applications until they improved their security, then our platform would never be adopted for anything other than greenfield development. We needed to enable them to keep the business running while adopting our platform, but also increase our security profile and automate compliance.

To make it easy for teams to deliver secure solutions, we needed to make it continuously more difficult to be insecure. We also needed to do it with empathy for the delivery goals of the teams. The easy thing would be to simply implement the final standards and tell the teams to pick up the pieces as most of their pipelines went red. “Why didn’t you follow the standards before? Suck it up.” To keep the enterprise safe, serve the needs of the business, and help the teams, there needs to be a more reasonable approach. Whenever we do anything that may cause disruption or add friction to their delivery flow, we need to overcommunicate. Before implementing a new security or compliance gate, we broadcast on Slack, email, and anywhere else we knew teams might see the information that a new mandatory gate was going to be implemented. After a month or so of broadcasting, we implemented the new pipeline gate as a warning message in their pipelines that included information about when the warning would be switched to an error that would halt their pipelines. Only after that did we block noncompliant builds. Even then, for areas with legacy applications that could not migrate fast enough, we provided an exception while helping them come into compliance. In this way, we continuously increased the security and compliance profile of every system that used our platform.
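The rollout described above (announce, then warn with a cutover date, then block, with temporary exceptions) can be expressed as a small decision function. This is a sketch under assumptions: the function name, parameters, and message wording are invented for illustration and do not come from the platform.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


def evaluate_gate(
    compliant: bool,
    today: date,
    warn_from: date,
    enforce_from: date,
    exempt_until: Optional[date] = None,
) -> tuple[Outcome, str]:
    """Decide what a pipeline shows for one gate on a given day."""
    # Compliant builds pass; so does everything during the announcement period.
    if compliant or today < warn_from:
        return Outcome.PASS, ""
    # Warning period: the build continues, and the message names the cutover date.
    if today < enforce_from:
        return Outcome.WARN, f"This check becomes blocking on {enforce_from.isoformat()}."
    # After enforcement, a temporary exception downgrades the failure to a warning.
    if exempt_until is not None and today <= exempt_until:
        return Outcome.WARN, f"Temporary exception expires on {exempt_until.isoformat()}."
    return Outcome.FAIL, "Blocked: this build does not meet the gate."
```

Keeping the exception as a date rather than a permanent flag means every exempted application returns to normal enforcement without further action.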

Our explicit goal was to transform how the enterprise worked. The CTO pushed for that change by challenging every team to solve the problem of delivering to production daily. This was not about the speed of delivery. It was a challenge to solve the engineering and communication problems that prevented daily delivery, in order to improve the entire organization. However, you cannot improve if you do not know where you are and how to get to where you’re going. Metrics reporting was an important part of our platform strategy and one of the earliest capabilities we delivered.

The delivery metrics views had two important uses. First, they gave teams visibility into the status of their pipelines. A single view of pipeline health is crucial to a team’s productivity because a broken pipeline cannot provide quality feedback, so fixing a broken pipeline is the highest priority for a team. Second, the metrics showed the teams how well they executed CD behaviors relative to what “good” looked like. For example, if you as a developer used trunk-based development and integrated code at least daily on average, then you got five stars for source management. We continuously reviewed the behaviors the scoring caused to guard against perverse incentives. We would show a balanced set of metrics so that when teams gamed the metrics, they gamed them in favor of our goals. When we saw undesired outcomes, we would adjust how we displayed the information or, in some cases, use education to help prevent management from using the scores to compare or blame teams.
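A scoring rule like the source management example can be sketched as follows. Only the five-star condition (trunk-based development with at least daily integration on average) comes from the text above; the function name and the lower tiers are assumptions added to make the sketch complete.

```python
def source_management_stars(trunk_based: bool, integrations: int, working_days: int) -> int:
    """Return a star rating for how closely source management matches CD."""
    if working_days <= 0:
        raise ValueError("working_days must be positive")
    per_day = integrations / working_days
    # From the article: trunk-based development plus at least daily
    # integration on average earns five stars.
    if trunk_based and per_day >= 1.0:
        return 5
    # Assumed tier: integrating at least every few days, in any branching model.
    if per_day >= 0.25:
        return 3
    # Assumed floor for anything less frequent.
    return 1
```

For example, source_management_stars(True, 12, 10) rates a developer who merged to trunk 12 times over 10 working days.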

We would frequently receive requests to adjust the scoring to provide good scores for things like delivering once per month because “our users don’t want changes more frequently than that.” Since this didn’t align with the organization’s improvement goals, we referred them to other resources we offered to help them migrate to CD. More on this later.

Building the “Easy Button”

Our next task was to implement a multi-tenant cloud solution that would let us deploy containerized workloads to the public cloud, our private cloud, and the data centers that reside in every store and distribution center. We wanted a solution where the teams did not need to know or care where their applications were running, and we wanted it to be even easier to use. After all, platform engineering is about delivering products to development teams that allow them to focus on the problems they are trying to solve instead of solving the problems of how to deliver the solution.

In 2019, we released the Walmart Cloud Native Platform (WCNP). Shifting from deploying onto virtual machines to deploying containers in Kubernetes would usually require every team to slow down, learn new technologies, and make costly mistakes along the way. We wanted to make that transition easier. With WCNP, if you wanted to deploy a React application, you simply told WCNP to deliver a React application owned by your team and where to send alerts with ChatOps. WCNP would handle everything else. It configured the container, logging, monitoring, and alerting, and assigned a domain name. From then on, all interaction could be handled with ChatOps, including approving delivery to production, if not configured for continuous deployment.
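To illustrate how little a team had to declare, here is a sketch of a minimal request being expanded into a full deployment description. It is not WCNP's actual interface: AppRequest, BASE_IMAGES, the image name, the domain suffix, and the shape of the output are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from application type to a platform-maintained base image.
BASE_IMAGES = {"react": "platform/react-static:stable"}


@dataclass(frozen=True)
class AppRequest:
    """Everything the team supplies in this sketch."""
    name: str
    app_type: str
    owner_team: str
    alert_channel: str
    continuous_deployment: bool = False


def expand(request: AppRequest, domain_suffix: str = "apps.example.com") -> dict:
    """Fill in the container, logging, monitoring, alerting, and domain defaults."""
    if request.app_type not in BASE_IMAGES:
        raise ValueError(f"unsupported application type: {request.app_type}")
    return {
        "container": {"image": BASE_IMAGES[request.app_type]},
        "logging": {"enabled": True},
        "monitoring": {"enabled": True},
        "alerting": {"channel": request.alert_channel, "owner": request.owner_team},
        "domain": f"{request.name}.{domain_suffix}",
        # Without continuous deployment, production delivery waits for a ChatOps approval.
        "production_approval": "automatic" if request.continuous_deployment else "chatops",
    }
```

The team's input stays at four fields while the platform owns every default, which is what lets the defaults improve for all teams at once.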

Teams did not need to learn how to build efficient containers or anything about Kubernetes to deliver solutions. The platform hid that complexity from them. If they needed more complex pipelines, they had the ability to build their own containers, and WCNP would handle just the delivery and operational monitoring setup. Again, this gave teams the option to handle more complexity if needed. Feedback from product teams was overwhelmingly positive. The developer experience also enabled teams to run quick, disposable experiments with almost no effort. Lowering the cost of change is critical to innovation.

To read more about Walmart’s journey to using platform as a product, download the Spring 2023 DevOps Enterprise Journal.
