October 5, 2022

Novel Security Vulnerabilities: The Organizational Response

By IT Revolution

This post has been adapted from the 2022 DevOps Enterprise Forum guidance paper Responding to Novel Security Vulnerabilities by Randy Shoup, Tapabrata Pal, Michael Nygard, Chris Hill, and Dominica DeGrandis.

In our previous post, we looked at three specific examples of reactions to the Log4Shell/Log4J vulnerability.

Beyond human responses, organizational responses also play a critical role in detecting, assessing, mitigating, and remediating security vulnerabilities. This section outlines several lessons to learn from the varied organizational responses to Log4Shell. We found that the organizations that had richer interteam collaboration, more adaptive security processes, clearer prioritization mechanisms, and more robust continuous delivery pipelines achieved faster and better outcomes from their Log4Shell responses.

Interteam Collaboration

Many organizations struggled with Log4Shell due to a dysfunctional relationship between their engineering organization(s) and their security team. As an immediately exploitable zero-day vulnerability (i.e., a vulnerability that was not previously known), Log4Shell magnified the existing lack of trust or communication between Security and Engineering.

Shared Responsibility vs. “Not My Problem”

In the organizations with the most effective responses to Log4Shell, teams shared responsibility for remediating the vulnerability. In Examples 1 and 2 presented earlier in this paper, once the vulnerability was declared, there was no debate about whether it needed to be remediated or who was responsible for doing what. Any debate was about the how. In Example 3, development teams did not initially understand why they were even attending a meeting about a critical security vulnerability. This led to long delays and ineffective organizational response to the vulnerability.

Clear Roles and Responsibilities

In Example 2, the security team automatically generated tickets for development teams to remediate, and development teams started remediating them within hours. There was an understood division of labor: Each team had its clear role, and each team worked in its area of expertise with a clear interface between them.

Interteam Trust

In Example 3, the mitigation response was delayed by several days due to mistrust and lack of coordination between the development teams and the security team. Even a working solution was delayed by interteam approvals. In the other examples, by contrast, mitigation and remediation started almost immediately.

Shared Priorities vs. Conflicting Priorities

In Example 3, development teams resisted taking time away from committed features to remediate security vulnerabilities. They felt that they did not have the team capacity to fix the vulnerabilities in the face of standing, conflicting product and business priorities.

Shared Ownership and Structured Collaboration

In summary, the most effective relationships between an organization’s development teams and security team were characterized by:

  • A common understanding of the threat and its priority. There was essentially no debate about whether to remediate the Log4Shell vulnerability or when to do so.
  • Shared ownership for the response. Every team acted as if it was “our” problem instead of “their” problem.
  • A clear division of labor between development and security teams. Each team worked in its area of expertise and ownership.
  • A structured communication mechanism between teams about what to do and how to do it, with heavy levels of automation.
  • Rapid adaptation to changing understanding and circumstances. As new information was discovered, it was immediately incorporated into the shared understanding and acted upon.

Appropriate Level of Security Process

Inappropriate Process

For many organizations, overburdensome security processes are a significant drag on productivity and flow. In several Log4Shell examples, processes to control developer access directly hindered efforts to detect and remediate the vulnerability.

In one case, the security team’s ability to assess the extent of the vulnerability was hampered by the company’s approach to code repositories. This organization does not create a bill of materials or update its CMDB with information about libraries used in applications. As a result, in order to assess the scope of vulnerability to Log4Shell, the security team needed to scan source repositories for dependency information. In this “default closed” company, access to repos is controlled on a need-to-know basis. One team’s imperative to protect the company’s intellectual property impeded another team’s imperative to assess and remediate this vulnerability.
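The improvised scan that security team had to run can be illustrated with a short sketch. The following Python walks a directory of checked-out repositories looking for build files that declare a log4j-core dependency. The directory layout (`repos/<name>`) and the matching pattern are assumptions for the example, not the tooling the organization actually used.

```python
import re
from pathlib import Path

# Assumed layout: every repository is checked out under a common root,
# e.g. ./repos/<name>. This is an illustrative sketch, not the actual
# tooling used by any of the surveyed organizations.
LOG4J_PATTERN = re.compile(r"log4j-core")

def scan_repos(root: Path) -> dict[str, list[str]]:
    """Map repo name -> build files that declare a log4j-core dependency."""
    hits: dict[str, list[str]] = {}
    for repo in sorted(p for p in root.iterdir() if p.is_dir()):
        build_files = list(repo.rglob("pom.xml")) + list(repo.rglob("build.gradle"))
        for build_file in build_files:
            if LOG4J_PATTERN.search(build_file.read_text(errors="ignore")):
                hits.setdefault(repo.name, []).append(str(build_file.relative_to(repo)))
    return hits
```

Even a crude scan like this presupposes read access to every repository, which is exactly what the "default closed" policy denied.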

In security, as in other areas of software development, a natural reaction to a bad event is to institute additional processes. While not bad in and of themselves, policies and processes often build up over time as scar tissue from past failures. If not checked and reevaluated, accumulated processes from “fighting the last war” can overwhelm and choke an organization. Moreover, good-faith attempts to prevent the known sets of previous attacks can make remediating unknown threats more difficult.

No Process

Having no process is as bad as having a bad process. In both Example 1 and Example 3, the absence of a clear process for evaluating the Log4Shell threat, and of a clear process for remediating it, slowed the organizational response.

Enabling Process

In the best example, a minimal and effective security process streamlined the efficient and effective remediation of vulnerabilities. In Example 2, the established security process acted as an enabler. Security and development teams communicated in clear, simple, and automated ways. Prioritization of which applications to tackle first (the externally facing ones) was clear and decisive.

In summary, organizations achieve better outcomes when Security leverages the minimal amount of process to achieve its goals. As threats and capabilities evolve, regularly reevaluating processes for their cost-effectiveness can help to keep them optimal.

Automation and Continuous Delivery

One consistent theme that separated the high-performing organizations from the lower-performing ones was the extent and quality of automated processes. Organizations that had better outcomes were able to automatically:

  • Discover which applications were vulnerable and which teams were responsible for remediating them.
  • Inform the owning teams what to do to remediate.
  • Rebuild, test, and redeploy the applications.
  • Verify the applications had remediated the vulnerability in production.

Automated Discovery

The immediate next task after the declaration of the vulnerability was discovering which applications were using vulnerable versions of the Log4j library. Among the examples we surveyed, the ease of this task varied significantly between organizations.

In Example 2, the organization used their SCA tooling to discover all vulnerable repositories within an hour and could immediately begin remediating them. In Example 1, the organization needed to develop, on the fly, a capability to discover vulnerable applications. After various fits and starts, it took two weeks for the process to reach a passable state.
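For concreteness, the core of such a discovery check is matching resolved versions against the vulnerable range. The affected range for CVE-2021-44228 is log4j-core 2.x prior to 2.15.0; the simplified version parsing below (which ignores pre-release tags) and the inventory format are assumptions of this sketch, not any particular SCA tool's behavior.

```python
# Vulnerable range for CVE-2021-44228 (Log4Shell): log4j-core 2.x before 2.15.0.
# Version parsing is deliberately simplified (pre-release tags like "-beta9"
# are dropped), so treat this as a sketch rather than a production checker.

def parse_version(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("-")[0].split("."))

def is_vulnerable(version: str) -> bool:
    return (2, 0) <= parse_version(version) < (2, 15, 0)

def triage(inventory: dict[str, str]) -> list[str]:
    """Return the applications whose resolved log4j-core version is vulnerable.

    `inventory` maps application name -> resolved log4j-core version, the kind
    of data an SCA tool or dependency scan would produce."""
    return sorted(app for app, version in inventory.items() if is_vulnerable(version))
```

The organization in Example 2 effectively had this lookup available on day one; the others had to build the inventory itself first, which is where the weeks went.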

Worst of all, in Example 3, the organization had almost no broad-spectrum way of discovering applications and was therefore still in the process of identifying vulnerable applications more than eight weeks after the vulnerability was disclosed.

Automated Alerting

Similarly, the highest-performing organizations had automated mechanisms to inform teams about which applications were vulnerable. In Example 2, GitHub issues were being filed within an hour, and a separate set of automated communications over multiple communication channels went out in several hours. In Example 3, on the other hand, the organization had no alerting mechanism, which led to multiple days of chaos and confusion before remediation could begin in earnest.
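The ticket-generation step in Example 2 can be sketched as follows. The payload field names mirror the GitHub issues API ("title", "body", "labels"), but the actual filing (HTTP call, authentication) is omitted, and the target version and label names are assumptions of this example.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    repo: str
    owner_team: str
    current_version: str

# Assumed remediation target; the final safe version after the follow-on
# CVEs was 2.17.1 for Log4j 2.x on Java 8.
FIXED_VERSION = "2.17.1"

def issue_payload(finding: Finding) -> dict:
    """Build a tracking-ticket payload for one vulnerable repository.

    Field names follow the GitHub issues API shape, but sending the
    request is deliberately left out of this sketch."""
    return {
        "title": (
            f"[SECURITY] {finding.repo}: upgrade log4j-core "
            f"{finding.current_version} -> {FIXED_VERSION}"
        ),
        "body": (
            f"Team {finding.owner_team}: this repository resolves "
            f"log4j-core {finding.current_version}, which is vulnerable to "
            f"Log4Shell (CVE-2021-44228). Please upgrade to {FIXED_VERSION}, "
            f"rebuild, and redeploy."
        ),
        "labels": ["security", "log4shell", "priority-critical"],
    }
```

The point is less the payload than the routing: because ownership metadata already existed, the tickets landed with the right teams within the hour.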

Continuous Delivery

For some organizations, one of the hardest parts of the Log4Shell remediation was simply validating that changing the Log4j version did not break anything else. Because Log4j has been used extensively across the Java ecosystem for more than twenty years, many applications, both modern and legacy, depend on it.

The burden of rebuilding, testing (and retesting), and redeploying was staggering. Particularly problematic are those legacy applications still in production, which often have the following characteristics:

  • Missing or unclear build processes
  • Minimal to no automated testing or verification
  • Missing or flaky deployment pipelines
  • Minimal to no production monitoring

This burden was multiplied by the evolving CVEs arriving in rapid succession, requiring multiple waves of remediations, each with expensive manual verification or no verification at all.

In many cases, organizations had previously made intentional risk-benefit choices not to invest in robust processes and pipelines for legacy applications. After all, the applications themselves are no longer changing. This is a deeply flawed assumption in the modern environment—while the application itself might not change, its underlying dependencies and foundations absolutely will need to be upgraded over time. In the modern world, the risk associated with any application or repository is proportional to how long it has been since it was last built, assembled, and deployed.

In Example 2, the organization had made an ongoing investment in continuous delivery pipelines for all their relevant systems. As a consequence, remediation was able to begin within hours and was completed within two days for the externally facing applications. Teams with active, effective pipelines found Log4j remediation almost trivial. The authors believe that if any application runs in production, it must be live—both capable of being updated on demand and regularly updated.

Automated Verification

The highest-performing organizations had automated mechanisms to verify that the remediated applications no longer had the vulnerability in production. An organization could take a blanket approach and update dependencies and packages to invulnerable versions, but this would introduce a significant amount of adjacent unknown risk into the running systems, risk that could stay under the radar for a long time and surface later as bugs or subtle performance degradation. To avoid this, the highest-performing organizations used automation to continuously check SCA results against a centralized policy of required package versions.
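That verification loop can be sketched as follows. It assumes SCA results can be exported as a mapping of application to deployed package versions; the export format, the policy structure, and the simplified version parsing are all assumptions of the example rather than any specific tool's interface.

```python
# Centralized policy: minimum safe version per package. 2.17.1 was the
# final remediation target for log4j-core 2.x after the follow-on CVEs.
MINIMUM_SAFE = {"log4j-core": (2, 17, 1)}

def parse(version: str) -> tuple[int, ...]:
    # Simplified: drops pre-release tags such as "-beta9".
    return tuple(int(p) for p in version.split("-")[0].split("."))

def verify(sca_results: dict[str, dict[str, str]]) -> list[str]:
    """Return applications whose deployed packages still fall below policy.

    `sca_results` maps application -> {package: deployed version}, the
    shape an SCA export might take (an assumption of this sketch)."""
    failing = []
    for app, packages in sca_results.items():
        for package, version in packages.items():
            floor = MINIMUM_SAFE.get(package)
            if floor is not None and parse(version) < floor:
                failing.append(app)
                break
    return sorted(failing)
```

Run continuously against production inventory, a check like this turns "are we done yet?" from a meeting into a dashboard.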

In Example 2, several days after the remediations had begun, the organization was able to use its SCA tool to verify that all of the externally facing applications had successfully applied the first round of remediations. This, combined with the other automation mechanisms, made the next several rounds even easier than the first. Examples 1 and 3 never developed an automated verification mechanism, and were both still struggling weeks later.

Automation Everywhere

Prioritizing speed of vulnerability remediation is directly analogous to the more general DevOps recommendation to prioritize time-to-recover over incident prevention. Robust automation changes the entire cost-benefit analysis of vulnerability remediation. There is no need to be deeply nuanced about determining the severity of the vulnerability if it is trivial to fix; we just fix it. Moreover, confidence in automation improves time to remediate and substantially reduces team stress. Automation turned what was an all-hands-on-deck situation for many organizations into just another day at the office for the organization in Example 2.

Heavily leveraging both continuous delivery and various types of automation puts more data and evidence in the hands of a potentially emotional decision maker during an active cyber threat, allowing them to make more informed and rational decisions. For example, Dependabot will directly notify you that a vulnerable package is being used. However, the real value of a tool like Dependabot is that it helps organizations actively exercise their change muscle. As we saw in the examples, the organizations that maintain a strong change muscle are the first to remediate.
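For reference, enabling Dependabot on a Maven project takes only a minimal configuration file in the standard `.github/dependabot.yml` format; the schedule shown is one choice among several the format supports.

```yaml
# .github/dependabot.yml — minimal configuration asking Dependabot to
# check Maven dependencies daily and open upgrade pull requests.
version: 2
updates:
  - package-ecosystem: "maven"
    directory: "/"
    schedule:
      interval: "daily"
```

The low cost of turning this on is the point: each automated upgrade PR is a small, routine repetition of exactly the motion Log4Shell demanded at scale.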

In our next post, we’ll go beyond the organizational response and explore the human response.

