June 9, 2021
In many organizations, when an incident occurs, the lack of clear roles and responsibilities among the teams leads to poor collaboration, poor communication, and work overload. This in turn leads to missed tasks, redundant work, loss of information, delays, and frustration within the team. Without clear roles, incident response and resolution are delayed, and the quality of the response suffers for both the team and the customer.
As part of the Prepare/Respond/Review Incident Management Framework, the following incident management response team helps eliminate these challenges.
For a smooth and effective incident response, organizations must create clear roles and responsibilities ahead of time that increase the core team’s effectiveness. These roles should divide the work so that no individual team member is overwhelmed. The incident response team also gives the business and external stakeholders a clear picture of whom to contact and which protocols to follow when doing so.
Effective incident response teams designate clear roles and responsibilities. Team members know what the different roles are, what they are responsible for, and who is in which role during an incident.
To help remediate incidents as quickly and effectively as possible, it is critical to assign clear roles and responsibilities to the team working on the incident. This practice avoids confusion and delays throughout incident management and resolution. The most common and important roles are the Incident Commander, the Primary Subject Matter Expert (SME), the Customer Liaison, and the Scribe.
Sometimes called the Incident Manager, the Incident Commander is responsible for driving the incident to resolution and facilitating all activities during the incident. They act as the primary point of contact during the end-to-end incident, from identification to postmortem. The Incident Commander ensures that all activities are coordinated and move along in a timely manner. They secure consensus from key team members and stakeholders before taking action and adjust the course of decisions and action based on feedback.
During the incident-preparation planning phase, the Incident Commander identifies the appropriate communication channels and the internal and external stakeholders who need to be included in communications. They share communication best practices with the team and help them communicate effectively with all stakeholders.
During the incident, they are responsible for driving the resolution activities to completion. They assemble the incident-management team and provide them with a channel for real-time communication, such as a common war room or a Slack channel.
Throughout the incident-management process, the Incident Commander gathers information, details, impediments, and delays from each team member in order to maintain clear visibility into the resolution status. They gather all recommendations about fixes that need to be made, and they contribute to and validate incident-resolution actions. This information is compiled and shared with key stakeholders to provide them with status updates, the actions currently underway, and anything the response team needs.
While the Incident Commander owns the critical and central role of driving the incident to resolution by facilitating activities across all stakeholders, they are not responsible for solving the incident or applying any fixes. They may not have the domain knowledge or system access required to resolve the incident. Instead, they partner closely with their primary Subject Matter Expert who leads the incident identification, analysis, and resolution. The Incident Commander collects feedback from the SMEs working on the incident and serves as the single source of information and the single authority on both the incident and the system status.
After the incident is resolved, the Incident Commander is responsible for designing the postmortem activities, developing any artifacts required, and facilitating the sessions in a safe manner. During the postmortem, they run the session, gather the team’s feedback, and report out on the feedback and improvements proposed by the team. The Incident Commander then drives the implementation of the prioritized incident-management process improvements and applies those improvements during future incidents.
The Primary Subject Matter Expert is the domain expert or owner of the applications and systems being investigated. Primary SMEs are responsible for running any system diagnostics, identifying the source of issues, proposing a course of action to the Incident Commander, and quickly fixing all issues found. The Primary SME generally writes the Condition, Actions, Needs (CAN) report. As its name indicates, this report covers the current condition of the incident, the actions underway, and what the response team needs.
The Primary Subject Matter Expert will also be responsible for identifying if an SME from another team is needed to help diagnose or resolve the issue and will report that need to the Incident Commander.
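For teams that keep their incident notes in tooling, the CAN report described above can be captured as a small structured record. The following is only a minimal sketch: the `CanReport` class, its field names, and the plain-text layout produced by `render` are assumptions made for illustration, not a format defined by the framework.

```python
from dataclasses import dataclass, field


@dataclass
class CanReport:
    """Hypothetical container for a Condition, Actions, Needs (CAN) report."""

    condition: str                                    # current state of the incident
    actions: list[str] = field(default_factory=list)  # work currently underway
    needs: list[str] = field(default_factory=list)    # help or resources requested

    def render(self) -> str:
        """Format the report as plain text for the Incident Commander."""

        def section(title: str, items: list[str]) -> list[str]:
            # Show an explicit "(none)" so an empty section is not misread as missing.
            body = [f"  - {item}" for item in items] or ["  - (none)"]
            return [f"{title}:"] + body

        lines = [f"CONDITION: {self.condition}"]
        lines += section("ACTIONS", self.actions)
        lines += section("NEEDS", self.needs)
        return "\n".join(lines)


# Example values are invented for illustration.
report = CanReport(
    condition="Checkout API returning errors for some requests",
    actions=["Rolling back the latest deploy"],
    needs=["Database SME to review connection pool settings"],
)
print(report.render())
```

Keeping the three sections separate makes it easy for the Incident Commander to lift the "needs" items, such as a request for an SME from another team, straight into a stakeholder update.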
The Customer Liaison is responsible for interacting with customers to keep them accurately informed about the status of the incident resolution in a timely manner. Because they manage the relationship with customers, they are often a member of the customer-support team and should be well trained in communication, customer care, and support. The Customer Liaison is exclusively focused on the customer in order to support them and provide them with the best possible experience throughout the incident-management process, from identification to postmortem improvement items. They write and publish all customer-facing and public-facing status communications using internal communication channels or external channels (Twitter, the company’s website, etc.).
As part of their responsibilities, the Customer Liaison works closely with the Incident Commander. They inform the Incident Commander about which customers are impacted by the incident, and the Incident Commander keeps them up to date on the status of the resolution progress. The Customer Liaison needs to receive continuous, timely, and granular information about the resolution plan in order to provide customized communication to their various customer groups.
The Scribe is identified and nominated by the Incident Commander based on skills and available resources. Typically, anyone can fill this role. The Scribe’s responsibilities are to document the incident-management events and activities as they unfold while recording important information and decisions. They are responsible for sharing timely notes and important data with the team and may use a dedicated Slack channel to communicate in real time with all stakeholders. They capture notes shared by the Incident Commander and the SMEs.
Team members need to know what the different roles are, what they’re responsible for, and who is in which role during an incident. When everyone on the team knows what their responsibilities are, they cooperate better and faster. An incident is no time to have multiple people doing duplicate work. It’s also a terrible time to have important tasks ignored all because everyone thought somebody else was working on the problem. Incidents are made worse when incident-response team members can’t communicate, can’t cooperate, and don’t know what the others are working on. Work gets repeated, work gets ignored, and customers and the business suffer.
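One lightweight way to make "who is in which role" explicit is to keep a roster and check it at the start of an incident. The sketch below assumes the four roles named in this post; the `Role` enum, the helper functions, and the example names are hypothetical and not part of the framework.

```python
from enum import Enum


class Role(Enum):
    INCIDENT_COMMANDER = "Incident Commander"
    PRIMARY_SME = "Primary Subject Matter Expert"
    CUSTOMER_LIAISON = "Customer Liaison"
    SCRIBE = "Scribe"


def unfilled_roles(roster: dict[Role, str]) -> list[Role]:
    """Return the roles that have no named person, in declaration order."""
    return [role for role in Role if not roster.get(role)]


def overloaded_people(roster: dict[Role, str]) -> dict[str, list[Role]]:
    """Return people who hold more than one role at the same time."""
    held: dict[str, list[Role]] = {}
    for role, person in roster.items():
        if person:
            held.setdefault(person, []).append(role)
    return {person: roles for person, roles in held.items() if len(roles) > 1}


# Example roster with invented names: one role is unassigned, one person is doubled up.
roster = {
    Role.INCIDENT_COMMANDER: "avery",
    Role.PRIMARY_SME: "blake",
    Role.SCRIBE: "avery",
}
print([role.value for role in unfilled_roles(roster)])
print({p: [r.value for r in roles] for p, roles in overloaded_people(roster).items()})
```

A check like this only surfaces gaps and overlaps; deciding whether one person can reasonably cover two roles in a small incident remains a judgment call for the team.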
During the incident, the team may need to review some events, how fixes were applied, who applied which fixes, etc. Because incidents need to be resolved as quickly as possible, each role needs to stay focused on its own responsibilities, so team members may not record the timeline of events and information themselves. It is therefore valuable to have a dedicated Scribe who captures the chronology of events, activities, and decisions as they unfold, so that anyone can retrieve and consult the notes when needed.
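The chronology the Scribe maintains can be as simple as an append-only list of timestamped entries. The following is a minimal sketch under stated assumptions: the `TimelineEntry` and `IncidentTimeline` names, the `kind` labels ("event", "decision", "fix"), and the sample notes are all invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    author: str   # who reported or performed the activity
    kind: str     # assumed labels: "event", "decision", or "fix"
    note: str


@dataclass
class IncidentTimeline:
    """Hypothetical append-only log kept by the Scribe."""

    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, kind: str, note: str,
               timestamp: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time when the Scribe logs in real time.
        entry = TimelineEntry(timestamp or datetime.now(timezone.utc), author, kind, note)
        self.entries.append(entry)
        return entry

    def chronology(self) -> list[TimelineEntry]:
        """Entries sorted by time, even if some were logged late."""
        return sorted(self.entries, key=lambda e: e.timestamp)

    def by_kind(self, kind: str) -> list[TimelineEntry]:
        """Filter entries, e.g. to see who applied which fixes."""
        return [e for e in self.entries if e.kind == kind]


# Example with fixed timestamps so the ordering is easy to follow.
start = datetime(2021, 6, 9, 12, 0, tzinfo=timezone.utc)
timeline = IncidentTimeline()
timeline.record("blake", "fix", "Restarted the worker pool", start + timedelta(minutes=5))
timeline.record("avery", "decision", "Agreed to roll back the deploy", start)
for entry in timeline.chronology():
    print(entry.timestamp.isoformat(), entry.author, entry.kind, entry.note)
```

Sorting on read rather than on write lets the Scribe add an entry after the fact without disturbing what was already captured.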
The Incident Commander leverages the Scribe’s notes and recordings to prepare for the postmortem and uses them during the postmortem to identify any performance opportunities and improvement areas. The team reviews the Scribe’s documentation and collaboratively shares feedback about any area that could have been handled differently and better.
With an incident management team in place and roles clearly defined, organizations are setting themselves up for success when the next inevitable incident or outage occurs.
But the incident management team is only one piece of the Prepare/Respond/Review Incident Management Framework. Download the white paper to read the full details of the framework.