June 9, 2021

Building an Incident Management Response Team

By IT Revolution

When many organizations are faced with an incident, the lack of clear roles and responsibilities among the teams leads to poor collaboration, communication, and work overload. This in turn leads to missed tasks, redundant work, loss of information, delays, and frustration within the team. Without clear roles, incident response and resolution can be delayed, and the quality of response for the team and the customer is poor.

As part of the Prepare/Respond/Review Incident Management Framework, the following incident management response team helps eliminate these challenges.

Effective Incident Response Teams

For a smooth and effective incident response, organizations must define clear roles and responsibilities ahead of time to increase the core team’s effectiveness. These roles should divide the work so that no individual team member is overwhelmed. The incident response team also gives the business and external stakeholders a clear picture of whom to contact and the proper protocol for doing so.

Roles and Responsibilities

Effective incident response teams designate clear roles and responsibilities. Team members know what the different roles are, what they are responsible for, and who is in which role during an incident.

To help remediate incidents as quickly and effectively as possible, it is critical to assign clear roles and responsibilities to the team working on the incident. This practice avoids creating any confusion and delays throughout the incident management and resolution. The most common and important roles are:

  • Incident Commander
  • Primary Subject Matter Expert (SME)
  • Customer Liaison
  • Scribe
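
As an illustration only, the four roles above can be tracked in a small roster so that everyone can see who is in which role during an incident. This is a minimal Python sketch, not part of the framework itself: the `IncidentRoster` class, its method names, and the people's names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    """The four roles named in this article."""
    INCIDENT_COMMANDER = "Incident Commander"
    PRIMARY_SME = "Primary Subject Matter Expert"
    CUSTOMER_LIAISON = "Customer Liaison"
    SCRIBE = "Scribe"


@dataclass
class IncidentRoster:
    """Hypothetical record of who holds which role during one incident."""
    assignments: dict = field(default_factory=dict)

    def assign(self, role: Role, person: str) -> None:
        # One person per role; reassigning replaces the previous holder.
        self.assignments[role] = person

    def who_is(self, role: Role) -> str:
        return self.assignments.get(role, "unassigned")

    def unfilled_roles(self) -> list:
        # Roles are returned in the order they are declared in Role.
        return [role for role in Role if role not in self.assignments]


# Example names are placeholders, not taken from the article.
roster = IncidentRoster()
roster.assign(Role.INCIDENT_COMMANDER, "Avery")
roster.assign(Role.PRIMARY_SME, "Blake")
missing = [role.value for role in roster.unfilled_roles()]
```

Checking `unfilled_roles()` at the start of an incident is one simple way to catch a role nobody has picked up before work is missed or duplicated.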

 

Incident Commander

Sometimes called the Incident Manager, the Incident Commander is responsible for driving the incident to resolution and facilitating all activities during the incident. They act as the primary point of contact throughout the incident, from identification to postmortem. The Incident Commander ensures that all activities are coordinated and move along in a timely manner. They secure consensus from key team members and stakeholders before taking action and adjust decisions and actions based on feedback.

During the incident-preparation planning phase, the Incident Commander identifies appropriate communication channels while identifying the internal and external stakeholders that need to be included in communications. They share best communication practices with the team and help them communicate effectively across all stakeholders.

During the incident, they are responsible for driving the resolution activities to completion. They assemble the incident-management team and provide them with a communication channel. That could be a common war room or a Slack channel to ensure real-time communication.

Throughout the incident-management process, the Incident Commander gathers information, details, impediments, and delays from each team member in order to compile and maintain clear visibility on the resolution status. They gather all recommendations about fixes that need to be made, and they contribute to and validate incident-resolution actions. This information is compiled and shared with key stakeholders to provide them with status updates, current actions underway, and any needs the response team has.

While the Incident Commander owns the critical and central role of driving the incident to resolution by facilitating activities across all stakeholders, they are not responsible for solving the incident or applying any fixes. They may not have the domain knowledge or system access required to resolve the incident. Instead, they partner closely with their primary Subject Matter Expert who leads the incident identification, analysis, and resolution. The Incident Commander collects feedback from the SMEs working on the incident and serves as the single source of information and the single authority on both the incident and the system status.

After the incident is resolved, the Incident Commander is responsible for designing the postmortem activities, developing any artifacts required, and facilitating the sessions in a safe manner. During the postmortem, they gather the team’s feedback, run the session, and report out on the feedback and improvements proposed by the team. The Incident Commander drives implementing the prioritized incident-management process improvements and leverages those improvements during future incidents.

Primary Subject Matter Expert

The Primary Subject Matter Expert is the domain expert or owner of the applications and systems being investigated. Primary SMEs are responsible for running any system diagnostics, identifying the source of issues, proposing a course of action to the Incident Commander, and quickly fixing all issues found. The Primary SME generally writes the Condition, Actions, Needs (CAN) report. This report contains the following information:

  • Condition: What is the current state of the system and service?
  • Actions: What actions need to be taken if the service is not in a healthy state?
  • Needs: What support does the resolver need to perform any resolution action?

The Primary Subject Matter Expert will also be responsible for identifying if an SME from another team is needed to help diagnose or resolve the issue and will report that need to the Incident Commander.
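
The three CAN fields described above map naturally onto a small structured record. The following Python sketch is one possible shape, offered as an assumption rather than a prescribed format: the `CanReport` class, its `render` method, and the example incident details are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CanReport:
    """Hypothetical container for a Condition, Actions, Needs report."""
    condition: str                               # current state of the system and service
    actions: list = field(default_factory=list)  # actions to take if the service is unhealthy
    needs: list = field(default_factory=list)    # support the resolver needs

    def render(self) -> str:
        """Format the report as plain text for a status update."""
        lines = [f"Condition: {self.condition}", "Actions:"]
        lines.extend(f"  - {item}" for item in (self.actions or ["none"]))
        lines.append("Needs:")
        lines.extend(f"  - {item}" for item in (self.needs or ["none"]))
        return "\n".join(lines)


# Example values are invented for illustration.
report = CanReport(
    condition="Checkout API is returning errors for some requests",
    actions=["Roll back the latest deploy"],
    needs=["SME from the database team"],
)
text = report.render()
```

Listing an SME from another team under Needs gives the Incident Commander a concrete request to act on, which matches the reporting path described above.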

Customer Liaison

The Customer Liaison is responsible for interacting with customers to keep them accurately informed about the status of the incident resolution in a timely manner. Because they manage the relationship with customers, they are often a member of the customer-support team and should be well trained in communication, customer care, and support. The Customer Liaison is exclusively focused on the customer in order to support them and provide them with the best possible experience throughout the incident-management process, from identification to postmortem improvement items. They write and publish all customer/public-facing status communications using internal communication channels or external channels (Twitter, the company’s website, etc.).

As part of their responsibilities, the Customer Liaison works closely with the Incident Commander. They inform the Incident Commander about which customers are impacted by the incident, and the Incident Commander keeps them up to date on the status of the resolution progress. The Customer Liaison needs to receive continuous, timely, and granular information about the resolution plan in order to provide customized communication to their various customer groups.

Scribe

The Scribe is identified and nominated by the Incident Commander based on skills and available resources. Typically, anyone can fill this role. The Scribe’s responsibilities are to document the incident-management events and activities as they unfold while recording important information and decisions. They are responsible for sharing timely notes and important data with the team and may use a dedicated Slack channel to communicate in real time with all stakeholders. They capture notes shared by the Incident Commander and the SMEs.

Team members need to know what the different roles are, what they’re responsible for, and who is in which role during an incident. When everyone on the team knows what their responsibilities are, they cooperate better and faster. An incident is no time to have multiple people doing duplicate work. It’s also a terrible time to have important tasks ignored all because everyone thought somebody else was working on the problem. Incidents are made worse when incident-response team members can’t communicate, can’t cooperate, and don’t know what the others are working on. Work gets repeated, work gets ignored, and customers and the business suffer.

During the incident, the team may need to review some events, how fixes were applied, who applied which fixes, etc. Because incidents need to be resolved as quickly as possible, each role needs to maintain focus on its own responsibilities, so the people doing the work may not record the timeline of events and information themselves. Therefore, it is valuable to have a dedicated Scribe who captures the chronology of events and all activities and decisions as they unfold, so that anyone can retrieve and consult the notes when needed.
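
To make the idea of a chronology concrete, here is a minimal Python sketch of the kind of timeline a Scribe might keep. It is an illustration under stated assumptions: the `IncidentTimeline` and `TimelineEntry` names, the `kind` labels, and the example entries are all hypothetical, and a real team might keep the same record in a shared document or chat channel instead.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEntry:
    timestamp: datetime
    author: str   # who reported or performed the activity
    kind: str     # hypothetical labels such as "event", "decision", "fix"
    note: str


@dataclass
class IncidentTimeline:
    """Hypothetical Scribe log of events, decisions, and fixes."""
    entries: list = field(default_factory=list)

    def record(self, author, kind, note, timestamp=None):
        # Default to the current UTC time when no timestamp is supplied.
        when = timestamp or datetime.now(timezone.utc)
        entry = TimelineEntry(when, author, kind, note)
        self.entries.append(entry)
        return entry

    def chronological(self):
        # Notes may be written down late, so sort by when things happened.
        return sorted(self.entries, key=lambda e: e.timestamp)

    def by_kind(self, kind):
        return [e for e in self.chronological() if e.kind == kind]


# Example entries are invented; the second one is recorded out of order.
timeline = IncidentTimeline()
t0 = datetime(2021, 6, 9, 14, 0, tzinfo=timezone.utc)
t1 = datetime(2021, 6, 9, 14, 12, tzinfo=timezone.utc)
timeline.record("Blake", "fix", "Rolled back the latest deploy", timestamp=t1)
timeline.record("Avery", "decision", "Agreed to roll back", timestamp=t0)
ordered_notes = [e.note for e in timeline.chronological()]
```

Recording who did what alongside each timestamp answers the questions raised above (how fixes were applied and who applied them) without pulling the other roles away from their work.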

The Incident Commander leverages the Scribe’s notes and recordings to prepare for the postmortem and uses them during the postmortem to identify any performance opportunities and improvement areas. The team reviews the Scribe’s documentation and collaboratively shares feedback about any area that could have been handled differently and better.

Conclusion

With an incident management team in place and roles clearly defined, organizations are setting themselves up for success when the next inevitable incident or outage occurs.

But the incident management team is only one piece of the full Prepare/Respond/Review Incident Management Framework. Download the full white paper here to read the full details of the framework.

- About The Authors

IT Revolution

Trusted by technology leaders worldwide. Since publishing The Phoenix Project in 2013, and launching DevOps Enterprise Summit in 2014, we’ve been assembling guidance from industry experts and top practitioners.
