2021-05-28

On the Incident Commander

As described in my last post, it’s important that stakeholders can get information on an ongoing incident, and trust that the information that they have is relevant, recent, and reputable.

Without a clear understanding of your team’s incident management protocols, or without trust in the information that they are presented with, it’s understandable that management and other stakeholders will attempt to get status info in the best way that they know how.

As I described last time, this will mean interrupting engineers who are actively working on a problem.  This adds stress to the situation, and clogs communication channels, both of which can easily exacerbate the problem.

Enter the Incident Commander

The Incident Commander (IC) is a role that exists to coordinate and delegate effort, and to provide communications.  They are the single source of truth for the status of an incident.  They are the hub for all of an incident’s communications.

One facet of their purpose is to serve as a buffer between stakeholders and the various subject matter experts (SMEs) who are working to resolve the incident.  They are responsible for updating the outward-facing communications channels that the stakeholders rely on for status, whether that is a status website, a chat channel, a call bridge, or something else entirely.  Even when the status of an incident hasn’t meaningfully changed in a while, such as during a long-running fix, the IC will still update the status.  Seeing an updated timestamp on the status, even if that status is still “we’re working on it,” helps to reassure the stakeholders that they are being provided with current information.
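To make the “updated timestamp” idea concrete, here is a minimal sketch of a status log that flags when stakeholders have gone too long without an update.  The names (StatusBoard, max_silence) and the 30-minute cadence are my own assumptions for illustration, not part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class StatusUpdate:
    posted_at: datetime
    message: str


@dataclass
class StatusBoard:
    # Hypothetical cadence: the longest stakeholders should go without an update.
    max_silence: timedelta = timedelta(minutes=30)
    updates: List[StatusUpdate] = field(default_factory=list)

    def post(self, message: str, now: datetime) -> StatusUpdate:
        """Record an update, even if the message itself has not changed."""
        update = StatusUpdate(posted_at=now, message=message)
        self.updates.append(update)
        return update

    def is_stale(self, now: datetime) -> bool:
        """True when there is no update yet, or the latest one is too old."""
        if not self.updates:
            return True
        return now - self.updates[-1].posted_at > self.max_silence


# Usage: the IC re-posts the same message to refresh the timestamp.
start = datetime(2021, 5, 28, 12, 0, tzinfo=timezone.utc)
board = StatusBoard()
board.post("We're working on it", start)
if board.is_stale(start + timedelta(minutes=31)):
    board.post("We're working on it", start + timedelta(minutes=31))
```

The point of the sketch is that the message can stay the same; it is the fresh timestamp that tells stakeholders the information is current.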

When engineers are troubleshooting an incident, it’s important that they stay as focused as possible, and delegate as many communication tasks as possible.  When SMEs determine that additional people (teams, skills, etc.) are needed, they will call upon the IC to pull in the needed assistance.  The IC will be responsible for “paging” out to different teams and individuals, bringing them up to speed on the incident, and having them join the channels where the incident is being coordinated.

When an incident is declared, someone capable of taking on the role of IC should ask the group (whether in a call or chat) “Is there an Incident Commander here?”  If nobody identifies as the IC, that person will announce “I assume the role of Incident Commander”, make note of that in the incident timeline, begin gathering status, and communicate outward to stakeholders.
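The handshake above can be sketched in a few lines of code.  This is only an illustration under my own assumptions: the Incident class, its fields, and the claim_command method are hypothetical names, and a real incident tool would track far more than this.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    commander: Optional[str] = None
    timeline: List[str] = field(default_factory=list)

    def claim_command(self, candidate: str) -> bool:
        """Ask whether an IC exists; if nobody holds the role, assume it.

        Returns True when the candidate became the IC, False when
        someone else already holds the role.
        """
        self.timeline.append(f"{candidate}: Is there an Incident Commander here?")
        if self.commander is not None:
            # Someone already identified as the IC, so nothing changes.
            return False
        self.commander = candidate
        # Note the assumption of command in the incident timeline.
        self.timeline.append(f"{candidate}: I assume the role of Incident Commander")
        return True


# Usage: the first capable person to ask takes the role; later askers do not.
incident = Incident()
incident.claim_command("alice")
incident.claim_command("bob")
```

After these two calls, alice holds the role and the timeline shows both questions plus her announcement.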

This raises the question: “what makes someone capable of taking on the role of Incident Commander?”  At this time there are no specific qualifications for someone to take on that role.  Some organizations have more rigorous internal programs, and PagerDuty has provided some guidelines at https://response.pagerduty.com/training/incident_commander.  Characteristics to look for in an IC are the ability to communicate effectively, the ability to delegate, and familiarity with the organization.

In the case of larger or more complex incidents, an IC may find themselves in need of assistance.  A common tactic is to delegate some or all of the communications duties.  The IC is still the “hub” of all communications, but the communications are executed by a Communications Manager.

It’s important to note that the IC does not actively work to resolve incidents; they aren’t hands-on-keyboard working to resolve the issue.  Should it become necessary for the IC to go hands-on to resolve the issue, it is incumbent upon them to pass the IC duties to someone else.
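That rule can be expressed as a simple invariant: the person holding command is never also hands-on.  The sketch below is one way to model it, with hypothetical names (IncidentRoles, go_hands_on) that I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class IncidentRoles:
    commander: str
    hands_on: Set[str] = field(default_factory=set)

    def go_hands_on(self, person: str, successor: Optional[str] = None) -> None:
        """Mark someone as hands-on; the IC must name a successor to do so."""
        if person == self.commander:
            if successor is None or successor == person or successor in self.hands_on:
                raise ValueError(
                    "the IC must pass command to someone who is not hands-on"
                )
            # Hand off the IC duties before the former IC starts working the issue.
            self.commander = successor
        self.hands_on.add(person)


# Usage: alice is the IC, needs to jump in, and hands command to carol.
roles = IncidentRoles(commander="alice")
roles.go_hands_on("bob")
roles.go_hands_on("alice", successor="carol")
```

Calling go_hands_on for the current IC without a valid successor raises an error, which mirrors the expectation that command is never left vacant.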

In the end, as PagerDuty says, the single sentence description of the Incident Commander’s purpose is to “Keep the incident moving towards resolution.”
