2021-06-21

On the Artifacts of an Incident

 “Never let a crisis go to waste.”

-- Apocryphal


The dust has settled, service has been restored, the monitors are green, the status update announcing the issue is resolved has been posted, and the Incident Commander has thanked everyone for their effort.


The issue may be resolved, but the incident is not yet done.

The Artifacts of an Incident

While addressing an incident, actions will be taken in the course of diagnosis, troubleshooting, and experimentation.  How things are done is almost as important as what is done to resolve the issue.  If, for instance, isolated changes are made manually and aren’t reflected in the codebase, they will likely be overwritten by the next deployment and lost.


It’s less than ideal to have to make changes that way, but it happens in real life.  Sometimes a config change needs to be made on the fly that isn’t well supported by your existing deployment systems.  That is why it’s vitally important to record every change as it is made, in the running chat log of your incident.  Each change should be tagged for review during the postmortem to determine whether it should be reverted, or captured and made a permanent part of the environment.


Further, if that change is needed in Production, should it be rolled out to other environments as well?


Equally important is determining how the change affected the incident, and whether reverting it should be treated as an experiment in its own right.


For these reasons, in the course of your incident, note in the timeline which actions are taken and how (#review, changed config manually), which code was deployed out-of-band and needs to be properly merged (#todo, implement the bar limits to foo), and any other housekeeping and hygiene noticed along the way (#todo, database is running high on its connection count) (#todo, add a bounding limit to the workers queue).


While these items should become tickets in whatever work-tracking system that you use, an active incident isn’t the time to do that.  This is part of what a postmortem is for.


If you personally have a lull in the action, it’s ok to create these kinds of tasks then, and include them in the chat.  That will make it easier to track them during the postmortem.
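
As a minimal sketch of how those tags can be harvested afterward (assuming your chat tool can export the incident channel as a plain-text log, and that people used the #todo / #review convention above), something like the following is usually enough:

    # extract_tags.py: collect #todo and #review lines from an exported
    # incident chat log so they can be walked through during the postmortem.
    # The file name, export format, and tag set are assumptions; adjust to
    # whatever your chat tooling actually produces.
    import sys

    TAGS = ("#todo", "#review")

    def tagged_lines(path):
        """Yield (tag, line) pairs for every chat line that carries a tag."""
        with open(path, encoding="utf-8") as log:
            for line in log:
                lowered = line.lower()
                for tag in TAGS:
                    if tag in lowered:
                        yield tag, line.strip()

    if __name__ == "__main__":
        # Usage: python extract_tags.py <exported-incident-chat.log>
        for tag, line in tagged_lines(sys.argv[1]):
            print(f"{tag:8} {line}")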


The Postmortem

One of the hallmarks of a healthy postmortem culture is the expectation that after the incident is resolved, a report of the incident is produced and published.

A Brief Note on Terminology

In most cases, software outages are not responsible for a loss of life.  For this reason, the term “Postmortem” is often recommended against.  While I agree that terms like “Incident Review” or “After Action” are more appropriate, I currently find “postmortem” more convenient.  I may revisit my stance in the future, but for now the term will stand.

Conducting the Postmortem

I won’t be covering the culture of “Blameless Postmortems”, as that topic and its benefits are well-covered elsewhere.  Let us assume that the incident was not the result of a bad actor, and that the people who were involved were authorized to act in the ways that they did, and were working with good intentions.  The short version is that a postmortem should identify contributing factors to an incident and identify remediations; it should rarely result in someone losing their job.


In fact, conducting a postmortem is a large topic in itself.  In the interest of not doing it a disservice, I will recommend that you look elsewhere for guidance on running one.


In brief: all of the participants should come to the postmortem with any supplemental materials they saw, used, or created during the remediation of the incident.  These will be used as supporting documentation to build the timeline of events and to draw conclusions from the findings made during troubleshooting.

Postmortem Supplements

There are a variety of supplements that should be used during a postmortem meeting, and that may be included in the report.  Here are some of the most common.


  • Notifying Signals: These can be anything that draws attention to the fact that a problem exists.  Typical examples could include a pager notification or other alert, or an observation from a customer support team that there is an uptick in customer complaints.

  • Metrics, Traces, Logs, etc: These may be noticed during the course of other activities, but any metrics, traces, or logs that were helpful for identifying contributing factors should be included.

  • Chat Logs: While the chat logs themselves won’t be included in the report, they are helpful for constructing the overall incident timeline.

  • Working Notebooks:  In organizations that use a shared notebook or other co-edited document, these can be helpful for drafting the postmortem in real-time, as the incident is being worked on.  These can also include avenues of approach that don’t ultimately pan out, but they can help tell a more complete story.

  • Other Timeline Events: These can include anything else that happened during the course of the incident but falls outside of the above categories, such as pulling in additional people, or publishing updates to customers.

Actions Taken

In many of the incidents that I’ve participated in, it was very common to attempt small fixes, often somewhat manually, to see if a local experiment would show improvement.  These are among the items that should have been tagged by individuals as they communicated during the incident.  They should be collected and reviewed so that experiments that didn’t yield results are cleaned up (undoing changes or re-deploying fresh assets), and so that changes that stabilized the incident are committed to the relevant repositories and aren’t erased by the next deployment.

Long-term Remediations

No team is immune to a “quick solution that we’ll replace later” that lives much longer than intended.  The postmortem is also a time to surface work items that may have been identified in the past but have never been prioritized.  It is a good time to scan the backlog for old issues that your team had intended to circle back to, “and do it right this time”...

Monitors and other Tests

A healthy question that should be asked in any postmortem is “what could have helped us discover this problem earlier?”  In fact, this question should be asked for each of the contributing factors to the incident.


This is something to keep in the back of your mind as you work an incident and discover contributing factors.  As they come up, add #todo notes in chat or the working notebook for monitors that would have alerted you to the problem earlier.
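
As an illustration only, a check for the connection-count example above might look something like the sketch below; the metric name, threshold, and the query_metric / page_oncall helpers are all assumptions standing in for whatever your monitoring stack actually provides.

    # A hypothetical check for the "database is running high on its
    # connection count" example above.  query_metric() and page_oncall()
    # stand in for your monitoring stack; the limit and ratio are made up.
    CONNECTION_LIMIT = 500   # assumed hard limit on database connections
    WARN_RATIO = 0.8         # alert once 80% of the limit is in use

    def check_db_connections(query_metric, page_oncall):
        """Page the on-call engineer when connection usage nears the limit."""
        current = query_metric("db.connections.active")
        if current >= CONNECTION_LIMIT * WARN_RATIO:
            page_oncall(
                f"Database connections at {current}/{CONNECTION_LIMIT}; "
                "investigate before the pool is exhausted."
            )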


Additionally, if a software fix was needed, it’s possible that a test for that behavior should be added to the application’s test suite.
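
For instance, if the fix was the hypothetical “bounding limit on the workers queue” noted earlier, a small regression test can keep that behavior from quietly regressing; here the standard library’s queue.Queue stands in for whatever queue the workers actually use.

    # A sketch of a regression test for the hypothetical "bounding limit on
    # the workers queue" fix noted earlier.  The standard library's
    # queue.Queue stands in for whatever queue your workers actually use.
    import queue

    import pytest

    def test_workers_queue_rejects_work_beyond_its_bound():
        work_queue = queue.Queue(maxsize=2)
        work_queue.put_nowait("job-1")
        work_queue.put_nowait("job-2")
        # The third job must be rejected rather than growing the queue
        # without bound, which was the failure mode that fed the incident.
        with pytest.raises(queue.Full):
            work_queue.put_nowait("job-3")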

Where we got lucky

Sometimes foresight, or an accidental decision made in the past, kept the incident from being worse than it could have been.  It’s a healthy practice to call these things out.  Keeping them in mind as you build new products helps bring good practices to the fore, and contributes to building more defensive and resilient applications from the beginning.

The Report

Once all is said and done, you have a detailed, involved document that captures the timeline, the actions taken, and the tasks to be completed.  This is a document that you can take pride in.  It is also a document that very few people will likely read.


And that is alright.


The document should be widely available, and attached or linked to a broadly-sent announcement of the postmortem, but the announcement itself will contain what most people actually consume: the summary.


The summary doesn’t need to be complicated: a couple of paragraphs covering the core dates and times, a broad overview of what broke and how it was fixed, some highlights of the longer-term fixes that need to be made, and a link to the full postmortem report.

Fixes!

The final artifacts that should come out of an incident are closed tasks!  It should go without saying that the remediation items identified during the incident and postmortem get prioritized, but it is worth calling out.


Those contributing factors have shown themselves capable of contributing to an incident!  The mitigations that can be put in place should be.  In fact, I would go so far as to say that they should take priority over new-feature development.


If you should be challenged on that, I would offer that availability is the most important feature of any service!


The Purpose

This may seem like a lot of effort to “rehash yesterday’s news”.  However, postmortems can be a key to your future.  If your customers see repeated mistakes, that will quickly erode their confidence in you, regardless of whether they are internal or external customers.  Postmortems are an opportunity to pause the often frantic pace of development and reflect on past decisions, and should be used to improve the resilience of your systems.


Almost every mistake that I have made in my career has been forgiven.  There have been instances where I’ve been rewarded for creating “new and novel” mistakes!  But no matter how much goodwill you have, it will surely erode if you keep repeating the same mistakes.

