June 15, 2021
While complete outages are still a key trigger for response, degraded service is also highly impactful in the minds of consumers.
As part of the new Incident Management Framework proposed in this white paper, organizations need to identify and remediate potential issues before customers are aware there is a problem.
Both the business and the incident-response team also require better visibility into the business impacts of service outages and degradation.
To help incident response teams identify and respond to incidents more quickly, and potentially before they become full outages, organizations need to provide incident-detection triggers that reflect customer usage patterns and the business impact customers experience. In essence, we need to identify issues before customers do, moving from reactive to proactive to predictive.
To do so, we’ve identified seven incident response triggers that organizations and incident response teams can use to identify and manage incidents earlier.
It is easy to create alerts for any event that can be tied to an outage, but it is imperative that all alerts be actionable. When developing alerts, it is important to first ask: What should an incident responder do when receiving this alert?
The severity of a trigger does not always indicate the response required from first responders. When the criteria for an alert are defined, the alert should inform the responder of what action is expected.
If a service is in a down state, for instance, the alert should inform the responder that the service may need to be restarted. Triggered alerts that are for informational purposes only, such as the restoration of a service, are still important but should be labeled as such. This can be done by explicitly stating “no action required” or “informational only.”
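As a rough illustration, an alert definition can carry the expected responder action as an explicit field so that routing can distinguish pages from informational notices. The sketch below assumes a generic, home-grown alerting pipeline; the Alert structure and route_alert function are hypothetical and not part of any specific tool.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str          # e.g., "critical", "warning", "info"
    expected_action: str   # what the responder should do, or "no action required"

# A down service: the alert tells the responder what to do next.
service_down = Alert(
    name="checkout-service down",
    severity="critical",
    expected_action="Verify health checks; restart checkout-service if unresponsive.",
)

# A restoration event: still worth recording, but explicitly informational.
service_restored = Alert(
    name="checkout-service restored",
    severity="info",
    expected_action="No action required (informational only).",
)

def route_alert(alert: Alert) -> None:
    """Page a responder only when the alert actually requires action."""
    if alert.expected_action.lower().startswith("no action"):
        print(f"[INFO] {alert.name}: {alert.expected_action}")
    else:
        print(f"[PAGE] {alert.name}: {alert.expected_action}")

route_alert(service_down)
route_alert(service_restored)
```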
It is tempting to monitor backend system metrics such as CPU, disk, and memory utilization. Although useful for catching infrastructure saturation, these alerts rarely represent the true impact on customers. Monitoring only these shallow metrics will also miss key degradation events that arise from multiple causal factors.
Instead, we recommend building a rich suite of business synthetics that represent true user scenarios and experiences, such as login response, page rendering time, and transaction response time, executed in the context of an end user and ideally from end-user locations.
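A business synthetic can be as simple as a scripted user action whose response time is measured from an end-user vantage point. The sketch below is a minimal example assuming the third-party `requests` library and a hypothetical login endpoint; a real synthetic would run on a schedule from multiple end-user locations and feed a monitoring system rather than print to the console.

```python
import time
import requests  # third-party: pip install requests

LOGIN_URL = "https://example.com/api/login"   # hypothetical endpoint
LATENCY_BUDGET_SECONDS = 2.0                  # illustrative threshold

def check_login_response() -> float:
    """Execute a login as an end user would and measure the response time."""
    start = time.monotonic()
    response = requests.post(
        LOGIN_URL,
        json={"username": "synthetic-user", "password": "not-a-real-secret"},
        timeout=10,
    )
    elapsed = time.monotonic() - start
    if response.status_code != 200 or elapsed > LATENCY_BUDGET_SECONDS:
        print(f"DEGRADED: login took {elapsed:.2f}s, status {response.status_code}")
    return elapsed

if __name__ == "__main__":
    check_login_response()
```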
If a triggered event results in an alert, and that alert requires action from the responder, it is important that the alert also provides the responder with the necessary information to act. This may be a link to a predefined playbook, a link to an automation resource, a package of log information, or a dashboard in a monitoring tool.
Alerts that require action but do not provide the necessary information to resolve the issue reduce the effectiveness of responders. This leads to longer response times and contributes to tribal knowledge, as only the more experienced responders are able to recognize the alert and respond quickly based on previous knowledge of the trigger.
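One way to attach that context is to bundle links and recent log lines into the alert payload itself, so the responder is not left searching. The helper below is a hypothetical sketch; the runbook and dashboard URLs are placeholders, not real endpoints.

```python
def build_alert_payload(name: str, recent_logs: list) -> dict:
    """Assemble an actionable alert with the context a responder needs to act."""
    return {
        "alert": name,
        "runbook": "https://wiki.example.com/runbooks/" + name,    # placeholder link
        "dashboard": "https://grafana.example.com/d/" + name,      # placeholder link
        "recent_logs": recent_logs[-20:],  # last 20 log lines for quick triage
    }

payload = build_alert_payload(
    "checkout-service-high-error-rate",
    recent_logs=["500 POST /checkout", "timeout calling payment-gateway"],
)
print(payload)
```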
Organizations must also increase the effectiveness with which they detect incidents and degraded states. No one wants to hear about an outage from a customer first. The ideal state is to proactively identify a potential issue and remediate it before customers are impacted. This can be done in a variety of ways, but the best method is to create triggers and alerts based on known thresholds and on events with cascading impacts.
For example, full-capacity conditions for storage, memory, and CPU still cause unexpected outages regularly today. Modern monitoring tools are well suited to tracking utilization state and consumption trends. Effort should be made to apply resource-utilization alerts that give enough notice to address a utilization constraint before it impacts system availability.
Automated responses can be applied to these triggers to programmatically flush cache, delete logs, or autoscale resources. The same methodology should be applied to other regular incident conditions in order to increase availability and minimize incident-response toil.
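A simple version of this pattern checks utilization against a threshold that leaves time to act and kicks off remediation before the resource is exhausted. The sketch below uses Python's standard `shutil.disk_usage`; the 80% threshold and the remediation step (pruning rotated logs) are illustrative assumptions, and a production version would notify responders as well as act.

```python
import shutil
from pathlib import Path

DISK_ALERT_THRESHOLD = 0.80   # alert well before 100% so there is time to act

def check_disk_and_remediate(mount: str = "/", log_dir: str = "/var/log/myapp") -> None:
    usage = shutil.disk_usage(mount)
    used_fraction = usage.used / usage.total
    if used_fraction < DISK_ALERT_THRESHOLD:
        return  # healthy: no alert, no action

    print(f"[ALERT] {mount} is {used_fraction:.0%} full; starting automated cleanup")
    # Illustrative remediation: prune the oldest rotated log files first.
    log_files = sorted(Path(log_dir).glob("*.log.*"), key=lambda p: p.stat().st_mtime)
    for old_log in log_files[:10]:
        old_log.unlink()
        print(f"removed {old_log}")

if __name__ == "__main__":
    check_disk_and_remediate()
```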
Ultimately, an organization's goal should be to predict when service degradation or outages may occur based on historical data. This is a more complicated pattern to implement because it often requires sophisticated tooling with a comprehensive understanding of complex system behavior over time. Emerging AIOps platforms are working to provide this capability, helping teams predict issues in advance based on previous patterns.
Organizations should review the learning capabilities of their tooling or adopt new tooling that permits them to train the system to understand ideal state and what should be considered normal operating parameters. As they develop trust in the system’s ability to predict potential service interruptions, they should begin developing automated remediation where possible, providing higher levels of availability and reducing toil on responders.
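Dedicated AIOps platforms do this with far more sophistication, but the underlying idea can be illustrated with a simple statistical baseline: learn what "normal" looks like from history and flag measurements that drift outside it. The sketch below uses a mean-plus-standard-deviation band as a stand-in for a learned model; the thresholds and sample latency data are illustrative.

```python
import statistics

def learn_baseline(history: list) -> tuple:
    """Derive 'normal operating parameters' from historical measurements."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomalous(value: float, mean: float, stdev: float, sigmas: float = 3.0) -> bool:
    """Flag values that drift outside the learned normal band."""
    return abs(value - mean) > sigmas * stdev

# Illustrative example: historical p95 latency in milliseconds.
history = [120, 118, 125, 130, 122, 119, 127, 124]
mean, stdev = learn_baseline(history)

latest = 210.0
if is_anomalous(latest, mean, stdev):
    print(f"Latency {latest}ms is outside the normal range ({mean:.0f}±{3 * stdev:.0f}ms); "
          "possible degradation ahead of a full outage.")
```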
As mentioned above, systems have increased in complexity and interdependence, and many application services are distributed across multiple infrastructure resources. This provides greater resilience and allows services to continue functioning under most scenarios, albeit in a degraded state. As such, it is necessary to begin monitoring these services based on their overall availability rather than a binary online/offline state.
Organizations should adopt a Service Level Objective (SLO) for services in their environments in order to define the target operating boundaries. These objectives set targets for the Service Level Indicators (SLIs) of those services; common examples of SLIs are latency, transaction rate, and throughput. Consider your service-level indicators carefully and ensure that they accurately reflect the intended state for your target operating window or service-level objective.
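As a concrete illustration of the SLI/SLO relationship, the sketch below computes a latency-based SLI (the fraction of requests served within a threshold) and compares it against an SLO target; the 300 ms threshold and 99.5% target are illustrative values, not recommendations.

```python
def latency_sli(latencies_ms: list, threshold_ms: float = 300.0) -> float:
    """SLI: fraction of requests served within the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

SLO_TARGET = 0.995  # 99.5% of requests within the threshold (illustrative)

latencies = [120, 180, 250, 320, 90, 400, 150, 200, 175, 260]
sli = latency_sli(latencies)

if sli < SLO_TARGET:
    print(f"SLI {sli:.1%} is below the SLO target {SLO_TARGET:.1%}: "
          "the service is degraded even though it is technically 'online'.")
else:
    print(f"SLI {sli:.1%} meets the SLO target {SLO_TARGET:.1%}.")
```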
Adopting service-level monitoring also increases an organization’s ability to understand and monitor the business impact, both positive and negative, of system health. Teams should look to connect their service-level objectives with business metrics so they can track business performance alongside system performance in real time. Triggers should also be developed to notify business stakeholders when service degradation or outages are affecting target business objectives, giving them an opportunity to initiate their own responses to the incident in partnership with the technology responders.
These patterns may not encompass 100% of outage scenarios, because hardware fails unexpectedly, software fails unexpectedly, and there is always a backhoe looking for a fiber line to cut. Focus on managing impact scope and use the outlined patterns to accelerate identification of the cause, resolve the issue, and surface system gaps that could be addressed to prevent a similar incident in the future. Successful adoption of these patterns will ensure that your responders can act quickly when necessary, keep them from becoming overwhelmed with unnecessary information, and ensure that everyone has a better understanding of how a triggered event and system performance affect business performance.
Using these seven triggers, organizations can move faster to remedy incidents and even predict them before they occur. But identifying incident triggers is only part of a larger incident management framework. To read the full framework, download the free white paper here.