October 5, 2022
This post has been adapted from the 2022 DevOps Enterprise Forum guidance paper Responding to Novel Security Vulnerabilities by Randy Shoup, Tapabrata Pal, Michael Nygard, Chris Hill, and Dominica DeGrandis.
In our previous post, we looked at three specific examples of reactions to the Log4Shell/Log4J vulnerability.
Beyond human responses, organizational responses also play a critical role in detecting, assessing, mitigating, and remediating security vulnerabilities. This section outlines several lessons to learn from the varied organizational responses to Log4Shell. We found that the organizations that had richer interteam collaboration, more adaptive security processes, clearer prioritization mechanisms, and more robust continuous delivery pipelines achieved faster and better outcomes from their Log4Shell responses.
Many organizations struggled with Log4Shell due to a dysfunctional relationship between their engineering organization(s) and their security team. As an immediately exploitable zero-day vulnerability (i.e., a vulnerability that was not previously known), Log4Shell magnified the existing lack of trust or communication between Security and Engineering.
In the organizations with the most effective responses to Log4Shell, teams shared responsibility for remediating the vulnerability. In Examples 1 and 2 presented earlier in this paper, once the vulnerability was declared, there was no debate about whether it needed to be remediated or who was responsible for doing what. Any debate was about the how. In Example 3, by contrast, development teams did not initially understand why they were even attending a meeting about a critical security vulnerability. This led to long delays and an ineffective organizational response to the vulnerability.
In Example 2, the security team automatically generated tickets for development teams to remediate, and development teams started remediating them within hours. There was an understood division of labor: Each team had its clear role, and each team worked in its area of expertise with a clear interface between them.
In Example 3, the mitigation response was delayed by several days due to mistrust and lack of coordination between the development teams and the security team. Even a working solution was delayed by interteam approvals. In the other examples, by contrast, mitigation and remediation started almost immediately.
In Example 3, development teams resisted taking time away from committed features to remediate security vulnerabilities. They felt that they did not have the team capacity to fix the vulnerabilities in the face of standing, conflicting product and business priorities.
In summary, the most effective relationships between an organization’s development teams and security team were characterized by:
Rapid adaptation to the changing understanding and circumstances. As new information was discovered, it was immediately incorporated into the shared understanding and acted upon.
For many organizations, overburdensome security processes are a significant drag on productivity and flow. In several Log4Shell examples, processes to control developer access directly hindered efforts to detect and remediate the vulnerability.
In one case, the security team’s ability to assess the extent of the vulnerability was hampered by the company’s approach to code repositories. This organization does not create a bill of materials or update its CMDB with information about libraries used in applications. As a result, in order to assess the scope of vulnerability to Log4Shell, the security team needed to scan source repositories for dependency information. In this “default closed” company, access to repos is controlled on a need-to-know basis. One team’s imperative to protect the company’s intellectual property impeded another team’s imperative to assess and remediate this vulnerability.
In security, as in other areas of software development, a natural reaction to a bad event is to institute additional processes. While not bad in and of themselves, policies and processes often build up over time as scar tissue from past failures. If not checked and reevaluated, accumulated processes from “fighting the last war” can overwhelm and choke an organization. Moreover, good-faith attempts to prevent the known sets of previous attacks can make remediating unknown threats more difficult.
Having no process is as bad as having a bad process. In both Example 1 and Example 3, the lack of a clear process for evaluating the Log4Shell threat and the lack of a clear process for remediating it slowed down the organizational response.
In the best example, a minimal and effective security process streamlined the efficient and effective remediation of vulnerabilities. In Example 2, the established security process acted as an enabler. Security and development teams communicated in clear, simple, and automated ways. Prioritization of which applications to tackle first (the externally facing ones) was clear and decisive.
In summary, organizations achieve better outcomes when Security leverages the minimal amount of process to achieve its goals. As threats and capabilities evolve, regularly reevaluating processes for their cost-effectiveness can help to keep them optimal.
One consistent theme that separated the high-performing organizations from the lower-performing ones was the extent and quality of automated processes. Organizations that had better outcomes were able to automatically:
The immediate next task after the declaration of the vulnerability was discovering which applications were using vulnerable versions of the Log4j library. Among the examples we surveyed, the ease of this task varied significantly between organizations.
In Example 2, the organization used their SCA tooling to discover all vulnerable repositories within an hour and could immediately begin remediating them. In Example 1, the organization needed to develop on the fly a capability to discover vulnerable applications. With various fits and starts, it took two weeks for the process to reach a passable state.
Worst of all, in Example 3, the organization had next to no broad-spectrum way of discovering applications and was therefore still identifying vulnerable applications more than eight weeks after the vulnerability was discovered.
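To make the discovery step concrete, here is a minimal sketch of the kind of check an SCA tool automates: given an inventory of applications and their dependencies, flag every application pinned to a version below a chosen threshold. The inventory format, the function names, and the `MIN_SAFE_VERSION` value are assumptions made for illustration; they are not taken from the organizations in the examples or from any particular tool.

```python
import re

# Assumption: the first release this sketch treats as safe. Set it from your own policy.
MIN_SAFE_VERSION = (2, 17, 1)


def parse_version(text):
    """Turn '2.14.1' into (2, 14, 1). Suffixes such as '-beta9' are ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", text.split("-")[0]))


def find_vulnerable(inventory, artifact="log4j-core", min_safe=MIN_SAFE_VERSION):
    """Return {application: version} for every application pinned below min_safe.

    inventory is a hypothetical mapping of application name to a list of
    (artifact, version) pairs, for example exported from a bill of materials.
    """
    hits = {}
    for app, dependencies in inventory.items():
        for name, version in dependencies:
            if name == artifact and parse_version(version) < min_safe:
                hits[app] = version
    return hits


if __name__ == "__main__":
    inventory = {
        "storefront": [("log4j-core", "2.14.1"), ("guava", "31.0")],
        "billing": [("log4j-core", "2.17.1")],
    }
    print(find_vulnerable(inventory))
```

The hard part in practice is not the comparison but having the inventory at all, which is where the organizations in Examples 1 and 3 lost time.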
Similarly, the highest-performing organizations had automated mechanisms to inform teams about which applications were vulnerable. In Example 2, GitHub issues were being filed within an hour, and a separate set of automated communications over multiple communication channels went out in several hours. In Example 3, on the other hand, the organization had no alerting mechanism, which led to multiple days of chaos and confusion before remediation could begin in earnest.
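As a rough illustration of automated notification, the sketch below turns a list of vulnerable applications into one ticket per application, with externally facing applications first. The ticket fields, the priority labels, and the `build_tickets` name are hypothetical; a real setup would hand these records to whatever issue tracker the organization uses.

```python
def build_tickets(vulnerable, externally_facing):
    """Create one ticket per vulnerable application, externally facing ones first.

    vulnerable: hypothetical mapping of application name to the version found.
    externally_facing: set of application names exposed to the outside world.
    """
    # Sort key: internal applications (True) sort after external ones (False).
    ordered = sorted(vulnerable, key=lambda app: (app not in externally_facing, app))
    return [
        {
            "application": app,
            "title": f"Upgrade log4j-core in {app} (found {vulnerable[app]})",
            "priority": "P0" if app in externally_facing else "P1",
        }
        for app in ordered
    ]


if __name__ == "__main__":
    tickets = build_tickets(
        {"billing": "2.13.0", "storefront": "2.14.1"},
        externally_facing={"storefront"},
    )
    for ticket in tickets:
        print(ticket["priority"], ticket["title"])
```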
For some organizations, one of the hardest parts of the Log4Shell remediation was simply validating that changing the Log4j version did not break anything else. Because Log4j has been used extensively across the Java ecosystem for more than twenty years, many applications, both modern and legacy, depend on it.
The burden of rebuilding, testing (and retesting), and redeploying was staggering. Particularly problematic are those legacy applications still in production, which often have the following characteristics:
This burden was multiplied by the evolving CVEs arriving in rapid succession, requiring multiple waves of remediations, each with expensive manual verification or no verification at all.
In many cases, organizations had previously made intentional risk-benefit choices not to invest in robust processes and pipelines for legacy applications. After all, the applications themselves are no longer changing. This is a deeply flawed assumption in the modern environment—while the application itself might not change, its underlying dependencies and foundations absolutely will need to be upgraded over time. In the modern world, the risk associated with any application or repository is proportional to how long it has been since it was last built, assembled, and deployed.
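One simple way to act on this idea is to track how long it has been since each application was last deployed and flag the ones that exceed a threshold. The sketch below assumes a hypothetical record of last deployment dates and an arbitrary 90-day limit; neither comes from the paper.

```python
from datetime import date


def stale_applications(last_deployed, today, max_age_days=90):
    """Return (application, age_in_days) pairs older than max_age_days, oldest first.

    last_deployed: hypothetical mapping of application name to its last deploy date.
    max_age_days: assumed threshold; pick a value that matches your own risk appetite.
    """
    ages = {app: (today - when).days for app, when in last_deployed.items()}
    stale = [(app, age) for app, age in ages.items() if age > max_age_days]
    return sorted(stale, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    deployments = {
        "legacy-reports": date(2020, 1, 1),
        "storefront": date(2021, 12, 1),
    }
    print(stale_applications(deployments, today=date(2021, 12, 10)))
```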
In Example 2, the organization had made an ongoing investment in continuous delivery pipelines for all their relevant systems. As a consequence, remediation was able to begin within hours and was completed within two days for the externally facing applications. Teams with active, effective pipelines found Log4j remediation almost trivial. The authors believe that if any application runs in production, it must be live—both capable of being updated on demand and regularly updated.
The highest-performing organizations had automated mechanisms to verify that the remediated applications no longer had the vulnerability in production. An organization could take a blanket approach to update dependencies and packages to invulnerable versions, but this would introduce a significant amount of adjacent unknown risk into the running systems, which could remain under the radar for a long time. This may show up in the form of bugs or casual performance degradation. To avoid this, the highest-performing organizations use automation to continuously verify integration of updated packages from a centralized policy against SCA results.
In Example 2, several days after the remediations had begun, the organization was able to use its SCA tool to verify that all of the externally facing applications had successfully applied the first round of remediations. This, combined with the other automation mechanisms, made the next several rounds even easier than the first. Examples 1 and 3 never developed an automated verification mechanism, and were both still struggling weeks later.
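A verification pass of this kind can be sketched as a comparison between scan results and a centralized policy of minimum versions. The data shapes and the `unremediated` function are assumptions for illustration, and the sketch only handles purely numeric dotted versions.

```python
def unremediated(scan_results, policy):
    """Return sorted (application, artifact, version) triples that violate the policy.

    scan_results: hypothetical mapping of application to {artifact: deployed version}.
    policy: hypothetical mapping of artifact to the minimum acceptable version.
    """
    def as_tuple(version):
        # Assumption: versions are dotted integers such as '2.17.1'.
        return tuple(int(part) for part in version.split("."))

    failures = []
    for app, dependencies in scan_results.items():
        for artifact, version in dependencies.items():
            minimum = policy.get(artifact)
            if minimum is not None and as_tuple(version) < as_tuple(minimum):
                failures.append((app, artifact, version))
    return sorted(failures)


if __name__ == "__main__":
    scan = {
        "storefront": {"log4j-core": "2.17.1"},
        "billing": {"log4j-core": "2.15.0"},
    }
    print(unremediated(scan, policy={"log4j-core": "2.17.1"}))
```

Because the policy lives in one place, a new CVE only requires raising one minimum version and rerunning the check, which is part of why later rounds of remediation can be easier than the first.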
Prioritizing speed of vulnerability remediation is directly analogous to the more general DevOps recommendation to prioritize time-to-recover over incident prevention. Robust automation changes the entire cost-benefit analysis of vulnerability remediation. There is no need to be deeply nuanced about determining the severity of the vulnerability if it is trivial to fix; we just fix it. Moreover, confidence in automation improves time to remediate and substantially reduces team stress. Automation turned what was an all-hands-on-deck situation for many organizations into just another day at the office for the organization in Example 2.
Heavily leveraging both continuous delivery and various types of automation puts more data and evidence in the hands of a potentially emotional decision maker during an active cyber threat, allowing them to make more informed and rational decisions. For example, Dependabot will directly notify you that a vulnerable package is being used. However, the real value of something like Dependabot is that it helps organizations actively exercise their change muscle. As we saw in the examples, organizations that maintain a strong change muscle are the first to remediate.
In our next post, we’ll go beyond the organizational response and explore the human response.