While complete outages remain a key trigger for response, consumers also regard degraded service as highly impactful.
Both the business and the incident-response team also require better visibility into the business impacts of service outages and degradation.
To help incident response teams identify and respond to incidents more quickly, and potentially before they become full outages, organizations need incident detection triggers that reflect customer usage patterns and the business impact customers experience. In essence, we need to identify issues before customers do, moving from reactive to proactive to predictive.
To do so, we’ve identified seven incident response triggers that organizations and incident response teams can use to identify and manage incidents earlier.
Make Alerts Actionable
It is easy to create alerts for any event that can be tied to an outage, but it is imperative that all alerts be actionable. When developing alerts it is important to first ask: What should an incident responder do when receiving this alert?
The severity of a trigger does not always indicate the response required from first responders. When the criteria for an alert are being defined, the alert should tell the responder what action is expected.
If a service is in a down state, for instance, the alert should inform the responder that the service may need to be restarted. Triggered alerts that are for information purposes only, such as restoration of a service, are still important but should be defined as such in the alert. This can be done by explicitly reporting “no action required” or “informational only.”
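As a minimal sketch of this idea, the Python below attaches an explicit expected action to every alert, including an "informational only" value. The `Alert` and `Action` types and the `checkout-api` service name are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical set of responder actions; extend to fit your environment.
    RESTART_SERVICE = "restart the service"
    NONE = "no action required (informational only)"


@dataclass(frozen=True)
class Alert:
    service: str
    summary: str
    action: Action

    def render(self) -> str:
        # Every rendered alert states the expected action explicitly.
        return f"[{self.service}] {self.summary} | Expected action: {self.action.value}"


down = Alert("checkout-api", "service is down", Action.RESTART_SERVICE)
restored = Alert("checkout-api", "service restored", Action.NONE)
print(down.render())
print(restored.render())
```

Making the action a required field means an alert cannot be defined without someone answering the question of what the responder should do.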
Leverage Business Synthetics that Mimic Customer Use Patterns
It is tempting to monitor only backend system metrics such as CPU, disk, and memory. Although useful for catching infrastructure saturation, alerts on these metrics rarely represent the true impact on customers. Additionally, monitoring only these shallow metrics will miss key degradation events that stem from multiple causal factors.
Instead, we recommend having a rich suite of business synthetics that represent true user scenarios and experiences, such as page login response, page rendering times, and transaction response times that are executed in the context of an end user and ideally from end-user locations.
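A business synthetic can be as simple as a scripted user scenario that is timed and compared with a response-time budget. The sketch below assumes a hypothetical `fake_login` step standing in for a real scripted login, and the two-second budget is an arbitrary example value.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class SyntheticResult:
    scenario: str
    seconds: float
    succeeded: bool
    within_budget: bool


def run_synthetic(scenario: str, step: Callable[[], bool], budget_seconds: float) -> SyntheticResult:
    """Time one user scenario and compare it with its response-time budget."""
    start = time.perf_counter()
    try:
        succeeded = bool(step())
    except Exception:
        # A scenario that raises counts as a failed user experience.
        succeeded = False
    elapsed = time.perf_counter() - start
    return SyntheticResult(scenario, elapsed, succeeded, succeeded and elapsed <= budget_seconds)


def fake_login() -> bool:
    # Stand-in for a real scripted login; replace with a call to your application.
    time.sleep(0.01)
    return True


result = run_synthetic("page login", fake_login, budget_seconds=2.0)
print(result)
```

In practice the step would run from end-user locations and exercise the same path a customer takes, so that a slow or failing result reflects what customers actually experience.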
Notify Responders with Necessary Information to Resolve the Issue
If a triggered event results in an alert, and that alert requires action from the responder, it is important that the alert also provides the responder with the necessary information to act. This may be a link to a predefined playbook, a link to an automation resource, a package of log information, or a dashboard in a monitoring tool.
Alerts that require action but do not provide the necessary information to resolve the issue reduce the effectiveness of responders. This leads to longer response times and contributes to tribal knowledge, as only the more experienced responders are able to recognize the alert and respond quickly based on previous knowledge of the trigger.
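One way to enforce this is to check alert definitions before they go live. The sketch below flags actionable alerts that carry none of the resources listed above. The dictionary layout, the key names, and the example URL are assumptions made for illustration, not the schema of any particular tool.

```python
# Hypothetical keys for the resources a responder may need.
CONTEXT_KEYS = ("playbook_url", "automation_url", "log_bundle", "dashboard_url")


def lacks_context(alert: dict) -> bool:
    """Return True when an alert requires action but offers nothing to act with."""
    if not alert.get("action_required", False):
        # Informational alerts do not need response resources.
        return False
    return not any(alert.get(key) for key in CONTEXT_KEYS)


alerts = [
    {"name": "disk_usage_high", "action_required": True,
     "playbook_url": "https://wiki.example.com/playbooks/disk-usage"},
    {"name": "queue_backlog", "action_required": True},
    {"name": "service_restored", "action_required": False},
]

needs_work = [a["name"] for a in alerts if lacks_context(a)]
print(needs_work)
```

Running a check like this in review or CI keeps response knowledge in the alert itself rather than in the heads of the most experienced responders.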
Move from Reactive to Proactive
Increase the effectiveness of detecting incidents and degraded states. No one wants to hear about an outage from a customer first. The ideal state is to proactively identify a potential issue and remediate it before customers are impacted. This can be done in a variety of ways, but the best method is to create triggers and alerts based on known thresholds and events with cascading impacts.
For example, full-capacity conditions for storage, memory, and CPU still show up regularly today, causing unexpected outages. Present-day monitoring tools are better able to track utilization state and consumption trends. Teams should apply resource-utilization alerts that give enough notice to address the utilization constraint before it impacts system availability.
Automated responses can be applied to these triggers to programmatically flush cache, delete logs, or autoscale resources. The same methodology should be applied to other regular incident conditions in order to increase availability and minimize incident-response toil.
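As an illustration of a trend-based utilization trigger, the sketch below fits a straight line to recent usage samples and estimates how many hours remain before the resource is full. The linear model, the sample data, and the 24-hour lead time are simplifying assumptions. Real consumption is often not linear.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]], capacity: float = 100.0) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches capacity.

    samples holds (hour, percent_used) pairs. Returns None when usage is
    flat or falling, because the fitted line never reaches capacity.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples need distinct timestamps")
    # Ordinary least-squares slope and intercept.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


disk = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]  # growing 2 points per hour
remaining = hours_until_full(disk)
LEAD_TIME_HOURS = 24.0  # assumed notice period responders need
should_alert = remaining is not None and remaining < LEAD_TIME_HOURS
print(remaining, should_alert)
```

The same estimate could feed an automated response, such as log cleanup or autoscaling, instead of paging a responder.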
Move from Proactive to Predictive
The goal for an organization should be to predict when service degradation or outages may occur based on historical data. This is a more complicated pattern to implement because it often requires sophisticated tooling that has a comprehensive understanding of complex system behavior over time. Emerging AIOps platforms are working to provide this capability, helping teams predict issues in advance based on previous patterns.
Organizations should review the learning capabilities of their tooling or adopt new tooling that permits them to train the system to understand ideal state and what should be considered normal operating parameters. As they develop trust in the system’s ability to predict potential service interruptions, they should begin developing automated remediation where possible, providing higher levels of availability and reducing toil on responders.
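To make "normal operating parameters" concrete, the sketch below learns a baseline (mean and standard deviation) from historical readings and flags values that deviate strongly from it. This is a deliberately simple statistical stand-in, not how any particular AIOps platform works. The latency figures and the threshold of three standard deviations are assumed example values.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def learn_baseline(history: Sequence[float]) -> Tuple[float, float]:
    """Summarize historical readings as (mean, sample standard deviation)."""
    return mean(history), stdev(history)


def is_anomalous(value: float, baseline: Tuple[float, float], threshold: float = 3.0) -> bool:
    """Flag a reading more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # With no historical variation, any change is a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical response times in milliseconds during normal operation.
history = [120, 118, 122, 121, 119, 120, 123, 117]
baseline = learn_baseline(history)

print(is_anomalous(121, baseline))
print(is_anomalous(140, baseline))
```

A production system would also account for seasonality and correlated signals. The point here is only that "normal" is learned from data rather than hard-coded.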
Adopt New Service-Level Objectives
As mentioned above, systems have increased in complexity and interdependence. Also, many application services can be distributed across multiple infrastructure resources. This provides greater resilience and allows services to continue functioning under most scenarios, although in a degraded state. As such, it’s necessary to begin monitoring these services based on their overall availability rather than a binary online/offline state.
Organizations should adopt a Service Level Objective (SLO) for services in their environments in order to define the target operating boundaries. Each objective sets a target for a Service Level Indicator (SLI), which is the measured behavior of the service. Common examples of SLIs are latency, transaction rate, and throughput. Consider your service-level indicators carefully and ensure that they accurately reflect the intended state for your target operating window or service-level objective.
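The relationship between an SLI and an SLO can be sketched in a few lines. The example below computes a latency SLI as the fraction of requests served within a threshold, then reports how much of the error budget implied by the SLO remains. The 300 ms threshold, the 99% objective, and the request data are assumed values for illustration.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget left: 1.0 is untouched, below 0 is overspent."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_bad = 1 - slo
    actual_bad = 1 - sli
    return 1 - actual_bad / allowed_bad


# 1,000 hypothetical requests: 995 fast, 5 slow.
latencies = [120.0] * 995 + [900.0] * 5
sli = latency_sli(latencies, threshold_ms=300.0)
budget = error_budget_remaining(sli, slo=0.99)
print(f"SLI={sli:.3f}, error budget remaining={budget:.0%}")
```

Alerting on the remaining budget, rather than on a binary up/down check, lets a degraded but still functioning service trigger a response before it becomes an outage.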
Develop a Macro View of System Health to Track Business Health
Adopting service-level monitoring also increases an organization’s ability to understand and monitor the impact, both positive and negative, of system health on the business. Teams should look to connect their service-level objectives with business metrics to track business performance in real time alongside system performance. Also, triggers should be developed to notify business stakeholders when service degradation or outages are impacting target business objectives. This information gives them an opportunity to initiate their responses to the incident in partnership with the technology responders.
These patterns may not encompass 100% of outage scenarios, because hardware fails unexpectedly, software fails unexpectedly, and there is always a backhoe looking for a fiber line to cut. Focus on managing impact scope and on using the outlined patterns to accelerate identification of the cause. The patterns should also provide the information needed both to resolve the issue and to identify system gaps that could be addressed to prevent a similar incident in the future. Successful adoption of the patterns will ensure that your responders are able to act quickly when necessary, keep them from becoming overwhelmed with unnecessary information, and give everyone a better understanding of how a triggered event and system performance impact business performance.
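A trigger that ties system health to a business objective might look like the sketch below. It notifies technology responders when the SLI falls below its SLO, and adds business stakeholders when a business metric is also below target. The `HealthSnapshot` fields, the orders-per-minute metric, and the audience names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HealthSnapshot:
    slo_target: float          # e.g. 0.99 of requests within the latency threshold
    sli_value: float           # currently measured value of the indicator
    orders_per_minute: float   # hypothetical business metric
    orders_target: float       # business objective for that metric


def audiences_to_notify(snapshot: HealthSnapshot) -> List[str]:
    """Decide who should hear about the current state of the service."""
    audiences: List[str] = []
    if snapshot.sli_value < snapshot.slo_target:
        audiences.append("technology responders")
        # Involve the business only when its own objective is also at risk.
        if snapshot.orders_per_minute < snapshot.orders_target:
            audiences.append("business stakeholders")
    return audiences


degraded = HealthSnapshot(slo_target=0.99, sli_value=0.95,
                          orders_per_minute=40.0, orders_target=60.0)
print(audiences_to_notify(degraded))
```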
Using these seven triggers, organizations can move faster to remedy incidents and even predict them before they occur. But identifying incident triggers is only part of a larger incident management framework. To read the full framework, download the free white paper here.