October 3, 2022
This post has been adapted from the 2022 DevOps Enterprise Forum guidance paper Responding to Novel Security Vulnerabilities by Randy Shoup, Tapabrata Pal, Michael Nygard, Chris Hill, and Dominica DeGrandis.
Despite large and increasing investments in IT security, enterprises are still ill-prepared to respond to novel security vulnerabilities like Log4Shell. Companies often tend to invest in tools and processes optimized for “fighting the last war.” That is, we naturally create defenses against known vulnerabilities and attacks we experienced in the past. The last several years have shown that new classes of vulnerability continue to be discovered, such as Heartbleed and Log4Shell.
These challenge our organizations in several ways. First, the scope and impact of novel vulnerabilities may be unclear or difficult to determine. Our configuration management information may be incomplete, outdated, or fragmented across operational and development support systems. Each class of vulnerability requires dynamic reteaming involving different parts of the organization. For example, Heartbleed required close collaboration between security staff, OS administration, and operations. Log4Shell, on the other hand, needed the involvement of internal development teams, platform teams, and external vendors.
Second, novel vulnerabilities require large-scale redeployment of infrastructure and application software. Organizations that have not completely automated their operating system, application, or container deployment will struggle with the scale of (re)deployment required. Organizations that review changes manually (via change approval boards or other human reviews) will struggle with the volume of changes needed. In effect, these vulnerabilities act as a denial-of-service attack on the systems of software change management itself.
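A precondition for automated (re)deployment at this scale is knowing where a vulnerable component actually lives. Here is a minimal sketch in Python of the kind of check a dependency-scanning script might run; the artifact paths and the 2.17.1 “safe floor” are illustrative assumptions, not a prescription for any particular organization:

```python
import re

# log4j-core 2.17.1 is used as the patched floor for illustration;
# consult the vendor's advisory for the authoritative fixed versions.
SAFE_VERSION = (2, 17, 1)

JAR_PATTERN = re.compile(r"log4j-core-(\d+)\.(\d+)\.(\d+)\.jar$")

def is_vulnerable_jar(filename):
    """Return True if the filename looks like a log4j-core jar older
    than the patched version, False otherwise."""
    match = JAR_PATTERN.search(filename)
    if not match:
        return False  # not a log4j-core jar at all
    version = tuple(int(part) for part in match.groups())
    return version < SAFE_VERSION

# Example: flag vulnerable jars in a list of discovered artifacts.
artifacts = [
    "/srv/app/lib/log4j-core-2.14.1.jar",
    "/srv/app/lib/log4j-core-2.17.1.jar",
    "/srv/app/lib/commons-lang3-3.12.0.jar",
]
flagged = [path for path in artifacts if is_vulnerable_jar(path)]
```

In practice the artifact list would come from a filesystem or registry scan, but even this crude filename check illustrates why complete, automated inventory pays off when a novel vulnerability lands.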
Third, the scale of the threats makes our response critically urgent. Each vulnerability can hugely affect our customers, brand, and business. When faced with one of these vulnerabilities, our teams must work overtime to mitigate and remediate the threats. To the organization, this is an opportunity cost. Efforts that could go into economically productive work must instead go into work that, at best, keeps us in the same competitive and economic position. To our people, it means lost evenings, weekends, and holidays. Responding to these vulnerabilities takes a heavy emotional and psychological toll on our employees and their families.
We have every reason to expect that these will not be the last new vulnerability types to be uncovered, and cybersecurity insurance costs will rise accordingly. Procurement questionnaires will increasingly ask about your upgrade life cycle as security vulnerabilities continue to appear. As they do, the cost asymmetry between attacker and defender continues to worsen. The rigorous processes that protect us from existing, well-known vulnerabilities are necessary but insufficient. We also need to become better at responding to novel attacks and vulnerabilities by creating adaptive capacity within our organizations.
In our experience, organizations largely follow this sequence of events when responding to a vulnerability. But first, let’s define vulnerability: A vulnerability is broader than a single security incident. It affects many (potentially all) systems simultaneously but is not necessarily the site of an active intrusion. A vulnerability may also take a long time to remediate.
The Log4Shell vulnerability was like this. Some instances of that vulnerability were exploited—these became incidents. The Log4Shell vulnerability itself was much broader, comprising the incidents and the potential for incidents. The vulnerability life cycle consists of a series of actions and reactions (as illustrated in Figure 1) and is detailed in the next section.
This early part of the life cycle is when we begin to notice signs of a potential vulnerability. In the case of Log4Shell, this stage involved social media posts describing the vulnerability. This stage is characterized by uncertainty about the nature of the vulnerability and whether an organization needs to take action. At some point, the signal rises beyond a threshold where some people in the organization begin to look seriously at the vulnerability and their exposure to it. This leads to the next phase: the detection of the vulnerability.
The detection event should trigger triage and assessment of the vulnerability. It can also trigger reactions of shock and denial among the staff, the initial stages of the Kübler-Ross Change Curve, which we will discuss in detail later. After the vulnerability is detected, the organization must begin the action of assessing the risk.
Next, the organization determines whether and to what degree the vulnerability affects them. The unknowns have shifted from external to internal: How many of our systems might be affected? Which ones are affected? Assessing an IT system's exposure involves many steps, and it is important to work through them thoroughly and understand how each applies to your organization in particular. As we will see later, this stage can be protracted if individuals in the company disagree about the degree of risk. Part of the hesitancy is cost-based: it can be expensive to actively respond to a threat. But once the assessment is complete, the next step is declaring what, if any, action the organization will take.
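To make the assessment questions concrete (“How many of our systems might be affected? Which ones?”), a hedged sketch of querying a component inventory follows. The inventory rows, field names, and `affected_systems` helper are hypothetical stand-ins for whatever CMDB or SBOM tooling an organization actually has:

```python
# Hypothetical inventory rows, e.g. exported from a CMDB or SBOM store.
inventory = [
    {"system": "payments-api", "component": "log4j-core", "version": "2.14.1"},
    {"system": "batch-etl",    "component": "log4j-core", "version": "2.17.1"},
    {"system": "auth-service", "component": "logback",    "version": "1.2.11"},
]

def affected_systems(inventory, component, safe_version):
    """List systems running the named component below the patched version."""
    def parse(version):
        return tuple(int(part) for part in version.split("."))
    return [
        row["system"]
        for row in inventory
        if row["component"] == component
        and parse(row["version"]) < parse(safe_version)
    ]

exposed = affected_systems(inventory, "log4j-core", "2.17.1")
```

The harder organizational problem, as the text notes, is that real inventory data is often incomplete or fragmented, so a query like this is only as trustworthy as the records behind it.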
Initiating the formal declaration event often requires a high level of decision-making authority. Reaching the appropriate authority with actionable information can also be a source of delay. Further, some individuals’ emotional reactions of denial and frustration can also cause hesitancy and delay (more on this when we discuss the Kübler-Ross Change Curve). From declaration, the organization must now implement an active response.
Active response includes the early actions to triage the vulnerability. Triage includes assessing the scope of vulnerable systems and products as well as their impact on customers. Triage leads to prioritized steps to first mitigate and then remediate the vulnerability. The active response phase involves more people in the organization. As each cohort of people becomes involved, they also go through the unproductive stages of shock, denial, frustration, and depression (part of the Kübler-Ross Change Curve). Meanwhile, they’re being asked to ascertain the situation quickly and catch up to those who are already in the know. Be warned: a company might experience communication and collaboration challenges when part of the organization is working on mitigation while another part is still looking for reasons their system should not be affected (a denial reaction).
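Triage prioritization of the kind described above can be approximated with even a crude risk ordering. Below is a sketch assuming two hypothetical fields, internet exposure and data sensitivity; a real risk model would weigh many more factors:

```python
# Hypothetical triage records for affected systems; the field names
# are illustrative, not a standard schema.
affected = [
    {"system": "payments-api", "internet_facing": True,  "handles_pii": True},
    {"system": "batch-etl",    "internet_facing": False, "handles_pii": True},
    {"system": "status-page",  "internet_facing": True,  "handles_pii": False},
]

def triage_order(systems):
    """Order systems for mitigation: internet-facing systems first,
    then those handling sensitive data. A simple stand-in for a real
    risk-scoring model."""
    return sorted(
        systems,
        key=lambda s: (s["internet_facing"], s["handles_pii"]),
        reverse=True,
    )

queue = [entry["system"] for entry in triage_order(affected)]
```

Even a rough ordering like this gives the dynamically assembled response team a shared, defensible answer to “what do we patch first?” while the fuller assessment continues.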
An organization’s active response typically leads to one of two results (or a combination): mitigation or remediation of the vulnerability. Mitigation is temporary. It usually involves workarounds to prevent active exploitation of the vulnerability while the vulnerability itself is fully addressed. Remediation is the permanent solution to remove the vulnerability completely. We’ll detail the activities that take place during the mitigation and remediation phase later in this paper.
Many organizations’ response stops when remediation is completed, but this is a lost opportunity. The highest-performing and most secure organizations treat large-scale vulnerabilities as unplanned investments in learning about security. Retrospective techniques such as blameless post-mortems can help organizations examine what went well during the event and what went poorly, and look for opportunities to improve for the next vulnerability.
As John Allspaw says, “Incidents are unplanned investments, and they are also opportunities. Your challenge is to maximize the ROI on the sunk cost. To do that, the organization has to invest in really exploring and understanding these events, and share that understanding broadly and over time.” When mitigation and remediation are complete, do not skip over this necessary retrospective phase.
Related to the retrospective is the act of reflecting on and learning from an incident or vulnerability. There are many ways to reflect and learn; the full guidance paper includes a set of questions to prompt that reflection.
As in any retrospective, nothing is gained by scapegoating the developers who made decisions a decade ago (as was often the case with Log4Shell).
After the retrospective and reflection and learning phases, it’s time to actually implement adaptation actions. The vulnerability might be closed, but it is still fresh in the participants’ minds. This is the ideal time to improve tooling, processes, system data, design and testing practices, and deployment pipelines. Only once adaptation measures have been acted upon should an organization consider a vulnerability or incident “complete.”
Different companies handled the Log4Shell vulnerability in different ways. For such a sweeping issue, it’s no surprise that some organizations dealt with Log4Shell neatly while others floundered. In the full paper (download here), we’ll examine various paths some (anonymized) organizations could have taken in addressing Log4Shell.
In our next post, we’ll explore lessons learned along the way.