July 10, 2023

Governance Engineering – Applying SRE Principles to Regulated Software

By Bill Bensing

DevOps and Site Reliability Engineering have played a pivotal role in breaking down organizational and cultural barriers. By fostering collaboration, tearing down silos, and embracing automation, these practices have revolutionized software development, propelling organizations towards faster, more efficient delivery.

Despite this progress, a persistent challenge remains and it’s one you will immediately recognize if you deliver software in a regulated context. There is still a huge amount of friction between the dynamic nature of DevOps engineering and the necessities of software governance (audit, compliance, security and change management). They’re siloed in much the same way as development and operations were in the bad old days.

This article explores the emerging concept of Governance Engineering, where compliance and innovation coexist harmoniously, where governance and engineering are synthesized instead of siloed.

If we can make governance an enabler of DevOps instead of an obstacle, it has the potential to release incredible value in industries like healthcare, automotive, and banking where security and compliance are essential. Let’s explore.

What is Governance Engineering?

When you ask a software engineer to design a governance team, governance engineering happens.

Does that sound familiar? It’s a riff on the definition of Site Reliability Engineering (SRE) that Ben Treynor Sloss gives in his introduction to the original Site Reliability Engineering book, where “operations team” has been replaced by “governance team.”

SRE is a big inspiration for Governance Engineering due to the strong similarity in the problems being addressed. Before SRE (and DevOps), the work required to manage infrastructure and services was largely manual, repetitive, and done at a tactical level. It meant that the human effort required to handle this work scaled at least linearly with the size of your infrastructure.

The same problem lies at the heart of software governance. Audit, security, and compliance work scales at least linearly with the amount of change you make. And as software teams deliver faster changes to ever more fine-grained and distributed cloud architectures, governance becomes a massive challenge for regulated software teams.

Let’s expand a bit more on why SRE is so useful for conceptualizing Governance Engineering.

What is Site Reliability Engineering? A Quick Overview

Google’s SRE model addresses the undesired side effects of traditional systems administration. The conventional sys admin model is based on system administrators assembling existing software into deployable enterprise services.

The standard model is preferable to lots of people because it’s relatively easy to implement, there are many easy and understandable models of system administration, the talent is broadly available, and numerous tools and technologies are available.

But, the traditional systems administration model drives natural and pathological outcomes. It creates a rigid division between the developers and operations. Why? Developers and Systems Administrators differ significantly in skillsets, backgrounds, and incentives.

Ben describes it best when he writes, “They use different vocabulary to describe situations; they carry different assumptions about both risk and possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect.”

These differences manifest themselves in two broad categories: direct and indirect costs. Direct costs are apparent. They are the costs of a team that grows in order to handle, manually, the issues that come with increased usage of its service: the operations team grows as usage of the service grows. Indirect costs are more subtle, and they are the more expensive of the two. They are hard to capture and are driven by the differences mentioned above.

Governance Engineering’s Shared Pathology with SRE

These differences extend to developers and governance professionals. The same pathological outcomes exist when you swap out “system administrator” with “governance professionals.”

Traditional governance professionals are similar to systems administrators. As with conventional system administration, there is an ample supply of traditional governance talent, commonly used governance models, and existing tools and technology that address standard governance execution. Like traditional systems administrators, governance professionals prefer low volumes of change because validating a change manually requires significant effort.

But business needs and market realities require constant, quality change. The very same direct and indirect cost issues occur within manual governance execution. As the number of systems and changes increases, the cost of directly administering the changes increases proportionally or, in the worst case, exponentially. The subtle indirect costs are the most expensive. How do we identify these indirect costs?

Mapping SRE Concepts to Governance

There are three key concepts in SRE that we can directly apply to governance: Toil, Service Reliability Hierarchy, and Service Level Objectives (SLOs) and Service Level Indicators (SLIs).

Let’s explore them and see how they map to the concept of Governance.

Toil – How To Identify Direct & Indirect Costs

Current implementations of governance models are rife with toil. This section paraphrases the SRE Book Chapter 5 – Eliminating Toil. What is toil? Let’s start with what toil is not.

“Toil is not just ‘work I don’t like to do.’ It’s also not simply equivalent to administrative chores or grungy work. Preferences as to what types of work are satisfying and enjoyable vary from person to person, and some people even enjoy manual, repetitive work.

There are also administrative chores that must be done and should not be categorized as toil: this is overhead. Overhead is often work not directly tied to running a production service and includes tasks like team meetings, setting and grading goals, snippets, and HR paperwork. Grungy work can sometimes have long-term value too, so it’s not always toil either. Cleaning up the entire alerting configuration for your service and removing clutter may be grungy, but it’s not toil.

Toil is the kind of work that’s tied to running a production service and tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Google identified six (6) attributes of toil. Not all toil has every attribute listed here, but toil is typically work that is:

Manual – manually doing work or kicking off a script to shorten the manual effort.
Repetitive – work that is done over and over again. Not first-time work, or work solving a novel problem or creating a new solution.
Automatable – Anything a machine can do just as well as a human. If human judgment is essential, it may not be toil.
Tactical – Work that happens in reaction to something that a strategy-driven and proactive approach can address.
No Enduring Value – The type of work done on something that does not change the state of your system or does not create a permanent improvement in your system.
O(n) with service growth – Work that scales linearly, or worse, as the service size or usage increases.

The same attributes are valid for traditional implementations of governance execution. Many people will generally argue that their governance process requires significant human judgment. I disagree.

Many of the check-box processes in a governance approach, such as verifying risk controls and approving changes for audit and compliance, do not require human review. These are manual and repetitive actions that are tactical and have no enduring value. And these check-box actions grow significantly as more changes occur and new software systems come online.

Governance Toil & Delivery Toil

Toil generated while executing a governance model can be categorized into Governance Toil and Delivery Toil.

Governance toil is the toil that drives direct costs. It’s highly apparent and is characterized by humans turning the cranks of the governance process. It includes, but is not limited to:

Asking others to collect evidence that a process was followed
Manually reviewing evidence to assess whether the process was followed correctly
Human-driven check-boxing that could be done with automation

Delivery toil is the toil that drives indirect costs. This toil is not apparent and has become accepted as the cost of doing business for governance execution. This is toil caused by ambiguity or conflicting interests among the multiple parties within the governance process. This includes, but is not limited to:

Significant effort coordinating the execution of the governance process
Undefined, unclear, or inconsistent processes that are being governed
No common understanding of required accountability, responsibilities, or expectations when executing the governance process

What do we do about this? Who is responsible for reducing this toil and what does that work look like?

Defining the Role of Governance Engineer

Addressing this toil is the responsibility of a governance engineer, and the solutions require human judgment. The outcome, however, is less toil, achieved through permanent improvements to the execution of the governance process. It’s critical to recognize that governance engineering is software and systems engineering work.

Governance engineering requires the governance teams to adopt an engineering mindset. They must identify, test, validate, and continually improve solutions that address the root causes of governance and delivery toil.

Governance systems engineering requires the team to take a systems-thinking approach to understand, design, and consult on their new governance solutions. Governance software engineering requires the team to write scripts, code, and add features that allow the governance capability to grow unhindered by the number of changes or new systems.

Ok, let’s now take a closer look at the Service Reliability Hierarchy and how we can apply it to Governance.

The Governance Engineering Hierarchy

The SRE book outlines excellent practices that address toil issues. Part III – Practices introduces the Service Reliability Hierarchy. This hierarchy lists the many considerations that ensure reliability for a given service. These considerations also largely apply to governance. Many things go into ensuring effective governance execution. I have modified the Service Reliability Hierarchy to represent the governance tasks as I did with Ben’s quote.

The Service Reliability Hierarchy demonstrates seven (7) elements of the SRE practice from basic to most advanced: Monitoring, Incident Response, Postmortem/Root Cause Analysis, Testing + Release Procedures, Capacity Planning, Development, and Product.

Governance requires the same seven (7) practices for its similar domain. I’ve taken these practices and updated four specifically for Governance Engineering.

Governance Monitoring

General monitoring is required to tell you if something is working. Without monitoring, you’re simply navigating a room in the dark. This is the same with governance monitoring. This monitoring is set up to tell you:

Which systems you are governing
The current governance state of each system
The state of a system when a significant external governance event happens, such as a new policy or a change to existing policies
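
The three questions above can be sketched in code. This is only an illustration: the `SystemRecord` type, the `governance_state` function, and the system and policy names are all hypothetical, invented for this example rather than taken from any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class SystemRecord:
    """One governed system and the policies it currently satisfies (hypothetical model)."""
    name: str
    satisfied_policies: set = field(default_factory=set)


def governance_state(system: SystemRecord, required_policies: set) -> dict:
    """Report whether a system meets every required policy, and which ones it misses."""
    missing = required_policies - system.satisfied_policies
    return {
        "system": system.name,
        "compliant": not missing,
        "missing": sorted(missing),
    }


# Which systems are we governing? (made-up inventory)
inventory = [
    SystemRecord("payments-api", {"code-review", "vuln-scan"}),
    SystemRecord("reporting-ui", {"code-review"}),
]

# What is the current governance state of each system?
policies = {"code-review"}
before = [governance_state(s, policies) for s in inventory]

# What is the state after an external governance event, such as a new policy?
policies_after_event = policies | {"vuln-scan"}
after = [governance_state(s, policies_after_event) for s in inventory]

for report in after:
    print(report)
```

In this sketch, both systems are compliant before the new policy arrives, and only one remains compliant afterwards. That difference is the kind of signal governance monitoring is meant to surface.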

Governance Incident Response

The Governance Engineering team resolves governance issues hands-on so that they can understand what went wrong, how, and why. The team’s job is to address the incident and engineer a solution that significantly reduces the probability of this or a similar incident happening again.

Governance Procedures

Testing was replaced with Governance Procedures because they are ideologically similar. Software, or systems, tests are things you can run to assess a system’s state or behavior. Governance procedures are software tests that validate a system’s governance state or behavior.

The word “procedure” is used on purpose. I’ve borrowed it from the Secure Controls Framework – Integrated Controls Management (ICM) Overview. Page 12 of the ICM describes how cybersecurity and data protection documentation is generally organized. At the top of this pyramid are procedures. Procedures are the defined practices and steps which implement a standard or guideline.

A standard is a specific requirement, and a guideline is recommended but not required. Procedures should always be automated if possible, just like tests. A procedure will tell you if your system under test meets governance expectations.
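
To make the “procedure as test” idea concrete, here is a minimal sketch. It assumes a standard that says every change must be approved by someone other than its author. That standard, the `Change` record, and the `procedure_independent_review` function are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Minimal, hypothetical record of a change moving toward production."""
    change_id: str
    author: str
    approvers: tuple


def procedure_independent_review(change: Change) -> tuple:
    """Procedure for an assumed standard: a change needs an approver other than its author.

    Returns (passed, evidence) so the result can be kept as audit evidence.
    """
    independent = [a for a in change.approvers if a != change.author]
    if independent:
        return True, f"{change.change_id}: approved by {', '.join(independent)}"
    return False, f"{change.change_id}: no approver other than author {change.author}"


ok = procedure_independent_review(Change("CHG-1", "alice", ("bob",)))
bad = procedure_independent_review(Change("CHG-2", "alice", ("alice",)))
print(ok)
print(bad)
```

Like a software test, the procedure gives a pass or fail answer for the system under test. It also returns a line of evidence, so nobody has to collect that evidence by hand later.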

Control Planning

Capacity is critical to SREs as it’s the most vital resource needed to scale systems. The most critical resource for governance, relatively speaking, is controls. Controls are the actions you take to minimize the risk of a security issue. Control planning requires human creativity and ingenuity. It’s how people identify new or better steps to secure their systems.

Most companies spend very little time or focus here. Why? Because most of their effort is spent executing their manual governance processes. It’s spent on tactical, repeatable, and automatable things, like human box checking.

The goal of Governance Engineering is to minimize as much toil as possible so that the organization can reallocate these newly freed resources to address security and compliance opportunities creatively. We’ve already seen something similar with testing, where the automation of manual toil has freed up QA teams to do more exploratory testing.

Wrapping Up The Hierarchy

The way we solved reliability toil with Site Reliability Engineering gives us the guiding principles we need for removing the toil around Governance. I can take any chapter of the SRE book and start replacing SRE with Governance Engineering, and it maps out. Next, let’s look at the third concept I mentioned above – Service Level Objectives (SLOs) and Service Level Indicators (SLIs).

The Four Golden Signals of Governance Engineering

The SRE book has a whole chapter dedicated to SLOs, SLIs, and SLAs, and I’m going to propose a way of applying the SLO and SLI concepts to measuring governance. Most companies cannot measure their governance state in real time. But if we draw a parallel between SLIs and Governance Level Indicators (GLIs), and between SLOs and Governance Level Objectives (GLOs), organizations can start to define and measure their governance state in real time.

I’m going to briefly define SLIs and SLOs for those who may not be familiar with them. A Service Level Indicator is “a carefully defined quantitative measure of some aspect of the level of service that is provided.” A Service Level Objective is the target value or range for that indicator. For example, a common indicator is latency. The SLI would be the response latency of a system, and the SLO might be a latency between 25ms and 50ms.
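
The indicator and objective relationship is simple enough to sketch in a few lines. The `Objective` class below is a hypothetical helper, not an API from the SRE book. The 25ms to 50ms latency range comes from the example above, while the governance target of at most two touch points is a made-up number used only to show that a GLO could take the same shape as an SLO.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Objective:
    """Target range for an indicator; a bound of None means unbounded on that side."""
    name: str
    lower: Optional[float] = None
    upper: Optional[float] = None

    def is_met(self, value: float) -> bool:
        """True when the measured indicator value falls inside the target range."""
        if self.lower is not None and value < self.lower:
            return False
        if self.upper is not None and value > self.upper:
            return False
        return True


# SLO from the latency example: response latency between 25ms and 50ms.
latency_slo = Objective("response latency (ms)", lower=25, upper=50)

# A governance objective with the same shape (the target of 2 is illustrative only).
touch_points_glo = Objective("human touch points per change", upper=2)

print(latency_slo.is_met(42))        # measured SLI inside the range
print(latency_slo.is_met(75))        # measured SLI outside the range
print(touch_points_glo.is_met(3))    # measured GLI above the target
```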

Indicators are ALWAYS about something that the service’s users care about. For governance engineering, the users of our system are the stakeholders who may be interrupted in the event of a system failure: typically development, operations, and product or project management.

The Monitoring Distributed Systems chapter discusses four standard metrics for a user-facing system. These are latency, traffic, errors, and saturation. Using these golden metrics as an inspiration, here are Four Golden Signals of Governance Engineering: Human Touch Points, Software Delivery Takt Time, Control Ambiguity, and Control Coverage. Let’s define each of them in turn:

Human Touch Points

This is the number of times a human has to manually execute a part of the governance process for software between commit and delivery. This is a leading indicator of the indirect costs of the governance process. While this GLI may never be zero, it’s a good practice to manage this indicator via statistical process control.
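
One way to apply statistical process control to this indicator is a c-chart, a standard control chart for counts, with limits set three standard deviations around the mean count. The article does not prescribe a specific chart, so treat the choice of a c-chart as an assumption. The touch point numbers below are invented.

```python
import math


def c_chart_limits(baseline_counts):
    """Return (lower limit, center line, upper limit) for a c-chart of counts."""
    center = sum(baseline_counts) / len(baseline_counts)
    spread = 3 * math.sqrt(center)
    return max(0.0, center - spread), center, center + spread


def out_of_control(counts, limits):
    """Return the indexes of observations that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, c in enumerate(counts) if c < lower or c > upper]


# Hypothetical data: human touch points per change during a stable baseline period.
baseline = [4, 5, 3, 4, 6, 4, 5, 3, 4, 2]
limits = c_chart_limits(baseline)

# New observations: the fourth change needed far more manual steps than usual.
recent = [4, 5, 3, 12, 4]
print(limits)                          # (0.0, 4.0, 10.0)
print(out_of_control(recent, limits))  # [3]
```

The point of the chart is to separate ordinary variation from a real shift. A change that needs five touch points instead of four is noise, while one that needs twelve is worth investigating.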

Software Delivery Takt Time

“Takt time is a calculation of the available production time divided by customer demand.” I specifically use takt time instead of lead time. Why? Takt time is the time you have to produce something based on your resources. Think of takt time as the metronome or cadence of software delivery.

For example, if you have a team of 3 people working 8 hours a day, they can only administer 24 hours of work daily. Given 20 daily changes, your entire delivery process, including governance, from commit to production, cannot be over 1.2 hours (24 hrs / 20 changes).
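
The arithmetic in this example can be written as a small function. The function names are hypothetical, and the two lead times at the end are invented to show how the result might be compared against a measured commit-to-production time.

```python
def takt_time_hours(people: int, hours_per_person: float, changes_per_day: int) -> float:
    """Takt time = available production time / demand, here in hours per change."""
    if changes_per_day <= 0:
        raise ValueError("changes_per_day must be positive")
    return people * hours_per_person / changes_per_day


def needs_attention(lead_time_hours: float, takt_hours: float) -> bool:
    """True when delivering one change takes longer than the cadence allows."""
    return lead_time_hours > takt_hours


# 3 people x 8 hours = 24 available hours; 20 changes per day.
takt = takt_time_hours(people=3, hours_per_person=8, changes_per_day=20)
print(f"takt time: {takt:.1f} hours per change")  # takt time: 1.2 hours per change

# Hypothetical measured lead times, commit to production, in hours.
for lead_time in (0.9, 2.0):
    print(lead_time, needs_attention(lead_time, takt))
```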

Control Ambiguity

Control ambiguity is when you are unsure whether a specific control applies to your system. Not knowing whether a given control applies is worse than a well-known failing control. The causes of ambiguity vary widely, but having a definite yes or no, with the reasoning behind it, is crucial. Counting these ambiguities is vital because it allows you to address an issue before it arises.

Control Coverage

This is the count of controls that are 100% automated for your system. These may be controls you have automated directly, or automated controls that your system inherits from other subsystems. Understanding your control coverage, and how it changes as the number of controls grows, drives action.
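
Control Ambiguity and Control Coverage are both simple counts over a list of controls, as the sketch below shows. The `Control` record and the control identifiers are hypothetical. The one modeling assumption is that “does this control apply?” has three possible answers: yes, no, or not yet decided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Control:
    """Hypothetical control record for one system."""
    control_id: str
    applies: Optional[bool]  # True or False once decided; None while still ambiguous
    automated: bool = False  # True only when the control is fully automated


def control_ambiguity(controls) -> int:
    """Count controls with no definite yes or no on whether they apply."""
    return sum(1 for c in controls if c.applies is None)


def control_coverage(controls) -> int:
    """Count applicable controls that are fully automated."""
    return sum(1 for c in controls if c.applies and c.automated)


controls = [
    Control("CTRL-1", applies=True, automated=True),
    Control("CTRL-2", applies=True, automated=False),
    Control("CTRL-3", applies=None),   # nobody has decided yet
    Control("CTRL-4", applies=False),  # decided: does not apply
    Control("CTRL-5", applies=True, automated=True),
]

print("ambiguity:", control_ambiguity(controls))  # ambiguity: 1
print("coverage:", control_coverage(controls))    # coverage: 2
```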

Why These Four Golden Signals and Not Others?

As I looked at many other possible indicators, I kept in mind, “What is an indicator that would cause me to take action if it changed?” This was the key factor as I was making my decisions. There are other potential indicators, such as the total quantity of controls or software delivery lead time. These are interesting, but they don’t have the power to drive a decision.

For example, the total quantity of controls will likely increase over time, but just because controls are increasing doesn’t mean I need to take action. What will drive action is if my Control Ambiguity or Control Coverage changes.

Also, why Takt Time and not software delivery lead time? It’s good to know how long it takes for changes to get from commit to production and if that’s increasing or decreasing, but this information does not compel a specific action. Understanding the software delivery Takt Time drives explicit decisions. If the Takt Time reduces to such a level that my lead time is greater than my Takt Time, I need to figure out how to address my lead time.

A Call To The Governance Engineering Community

What was described in this piece is already happening across highly regulated industries. It may not be happening in exactly the ways outlined above, but there are successful examples of Governance Engineering practices in many companies now.

It’s happening for two reasons. One, individuals are taking the initiative and automating parts of the governance processes when they run into toil. Two, technology leaders are motivating their organizations to figure out how to implement this type of automation to unlock delivery potential while solidifying security and compliance.

The Governance Engineering concept is my attempt at defining all of the exciting work I have seen happening across the industry. My hope is that this article will serve as a sort of call to action for people already involved in Governance Engineering style practices in their organizations. I know there are lots of you out there already, but we don’t have a name or a place to share ideas and perspectives.

A new discipline is emerging with its own set of principles and approaches, and I’d love to see a community form around discovering and learning how to succeed with Governance Engineering. If this interests you, feel free to reach out to me directly or join the Governance Engineering LinkedIn group that emerged out of the recent DevOps Enterprise Summit in Amsterdam.

About The Author

Bill Bensing

Bill Bensing transforms Shadow IT into legitimate software development organizations. Bill's recent thought leadership is proving that software delivery velocity and highly secure and compliant software are not mutually exclusive. He lives in the Tampa Bay, FL, area.

2 Comments

Anonymous Jul 10, 2023 1:58 pm

Great article, Bill. Good to have some explanation of some of the concepts and models that can be applied to governance engineering.

Anonymous Jul 14, 2023 1:18 pm

Thank You! Can you elaborate more on your comment, "Good to have some explanation of some of the concepts and models that can be applied to governance engineering." Cheers, Bill
