2021-06-21

On the Artifacts of an Incident

 “Never let a crisis go to waste.”

-- Apocryphal


The dust has settled, service has been restored, the monitors are green, the status update announcing the issue is resolved has been posted, and the Incident Commander thanked everyone for their effort.


The issue may be resolved, but the incident is not yet done.

The Artifacts of an Incident

When addressing the incident, actions will be taken in the course of the diagnosis, troubleshooting, and experimentation to solve the issue.  How things are done is almost as important as what is done to resolve the issue.  If, for instance, isolated changes are made manually that aren’t reflected in codebases, it is likely that they will be overridden in the next deployment, and lost.


It’s less than ideal to have to make changes that way, but it is something that happens in real life.  It’s possible that a config change needs to be made on the fly that isn’t well supported by your existing deployment systems.  This is why it’s vitally important to make a record of the changes being made.  That record should go in the running chat log of your incident.  Each change should be tagged for review (likely during the postmortem) to determine whether it should be reverted, or captured and made a permanent part of the environment.


Further, if that change is needed in Production, should it be rolled out to other environments as well?


Equally important is determining how the change impacted the incident, and whether it should be treated as an experiment and reverted.


For these reasons, note in the timeline, as the incident unfolds, which actions are taken and how (#review, changed config manually), which code was deployed out-of-band and needs to be properly merged (#todo, implement the bar limits to foo), and any other housekeeping and hygiene items noticed along the way (#todo, database is running high on its connection count) (#todo, add a bounding limit to the workers queue).


While these items should become tickets in whatever work-tracking system that you use, an active incident isn’t the time to do that.  This is part of what a postmortem is for.


If you personally have a lull in the action, it’s ok to create these kinds of tasks then, and include them in the chat.  That will make it easier to track them during the postmortem.
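The #review and #todo tags described above lend themselves to a small script that pulls the tagged lines out of a chat export ahead of the postmortem.  What follows is a minimal sketch, not a definitive tool.  The line format ("HH:MM author: message"), the names extract_tagged_items and TaggedItem, and the sample log are all assumptions made for illustration; a real chat export will need its own parsing.

```python
import re
from dataclasses import dataclass

# Assumed (hypothetical) export format: "<HH:MM> <author>: <message>".
LINE_RE = re.compile(r"^(?P<time>\d{2}:\d{2}) (?P<author>[^:]+): (?P<message>.*)$")
# The two tags used in this post; extend the alternation for others.
TAG_RE = re.compile(r"#(todo|review)\b", re.IGNORECASE)


@dataclass
class TaggedItem:
    time: str
    author: str
    tag: str
    message: str


def extract_tagged_items(log_lines):
    """Return every chat line carrying a #todo or #review tag, in log order."""
    items = []
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that don't fit the assumed format
        for tag in TAG_RE.findall(match["message"]):
            items.append(
                TaggedItem(
                    time=match["time"],
                    author=match["author"].strip(),
                    tag=tag.lower(),
                    message=match["message"],
                )
            )
    return items


# Illustrative log only; not taken from a real incident.
sample_log = [
    "14:02 alex: restarting worker pool",
    "14:05 sam: #review changed config manually",
    "14:11 alex: #todo implement the bar limits to foo",
    "14:20 sam: #todo add a bounding limit to the workers queue",
]

for item in extract_tagged_items(sample_log):
    print(f"[{item.tag}] {item.time} {item.author}: {item.message}")
```

The output is a short list that can be walked through in the postmortem and turned into tickets there, rather than during the incident itself.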


The Postmortem

One of the hallmarks of a healthy postmortem culture is the expectation that after the incident is resolved, a report of the incident is produced and published.

A Brief Note on Terminology

In most cases, software outages are not responsible for a loss of life.  It is for this reason that in many cases the term “Postmortem” is recommended against.  While I agree that terms like “Incident Review” or “After Action” are more appropriate, I currently find the term “postmortem” more convenient.  I may review my stance at some time in the future, but for now the use of the term will stand.

The Postmortem

I won’t be covering the culture of “Blameless Postmortems”, as that topic and its benefits are well-covered elsewhere.  Let us assume that the incident was not the result of a bad actor, and that the people who were involved were authorized to act in the ways that they did, and were working with good intentions.  The short version is that a postmortem should identify contributing factors to an incident and identify remediations; it should rarely result in someone losing their job.


In fact, conducting a postmortem is a large topic by itself.  In the interest of not doing it a disservice, I will recommend that you look elsewhere for recommended practices for performing a postmortem.


Very briefly, all of the participants should come to the postmortem with any supplemental materials that they saw, used, or created during the remediation of the incident.  These will be used as supporting documentation in order to build the timeline of events and draw conclusions from the findings made during troubleshooting.

Postmortem Supplements

There are a variety of supplements that should be used during a postmortem meeting, and may be included in the report.  Here are some of the most common ones.


  • Notifying Signals: These can be anything that draws attention to the fact that a problem exists.  Typical examples could include a pager notification or other alert, or an observation from a customer support team that there is an uptick in customer complaints.

  • Metrics, Traces, Logs, etc: These may be noticed during the course of other activities, but any metrics, traces, or logs that were helpful for identifying contributing factors should be included.

  • Chat Logs: While the individual chat logs won’t be included in the report itself, they are helpful for constructing the overall incident’s timeline.

  • Working Notebooks:  In organizations that use a shared notebook or other co-edited document, these can be helpful for drafting the postmortem in real-time, as the incident is being worked on.  These can also include avenues of approach that don’t ultimately pan out, but they can help tell a more complete story.

  • Other Timeline Events: These can include anything else that happened during the course of the outage, but outside of the above components.  These may include things such as pulling in additional people, or publishing updates to customers.
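
Building the timeline from these supplements is mostly a matter of merging events from several sources and ordering them by time.  Here is a minimal sketch of that idea.  The TimelineEvent type, the build_timeline function, the source labels, and the sample events are all hypothetical, made up for illustration; real alerts, chat exports, and status pages each need their own conversion into a common shape first.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain


@dataclass(frozen=True)
class TimelineEvent:
    timestamp: datetime
    source: str  # assumed labels, e.g. "alert", "chat", "status-page"
    description: str


def build_timeline(*sources):
    """Merge any number of event lists into one list ordered by timestamp."""
    return sorted(chain.from_iterable(sources), key=lambda event: event.timestamp)


# Illustrative events only; not taken from a real incident.
alerts = [
    TimelineEvent(datetime(2021, 6, 1, 14, 0), "alert", "pager: elevated error rate"),
]
chat = [
    TimelineEvent(datetime(2021, 6, 1, 14, 5), "chat", "#review changed config manually"),
    TimelineEvent(datetime(2021, 6, 1, 14, 2), "chat", "additional engineer pulled in"),
]
comms = [
    TimelineEvent(datetime(2021, 6, 1, 14, 10), "status-page", "customer update published"),
]

timeline = build_timeline(alerts, chat, comms)
for event in timeline:
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.description}")
```

Keeping the source on each event makes it easy to see, during the meeting, which signal surfaced a fact first.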

Actions Taken

In many of the incidents that I’ve participated in, it was very common to attempt small fixes, often somewhat manually, to see if a local experiment shows improvement.  These are among the items that should have been tagged by individuals as they communicate during the incident.  These should be collected and noted to either clean up experiments that didn’t yield results (undoing changes or re-deploying fresh assets), or be committed and made part of the relevant repositories so that the next deployment doesn’t erase the changes that stabilized the incident.

Long-term Remediations

No team is immune to a “quick solution that we’ll replace later” that lives much longer than intended.  The postmortem is also a time to surface work items that may have been identified in the past but have never been prioritized.  It is a good time to scan the backlog for old issues that your team had intended to circle back to, “and do it right this time”...

Monitors and other Tests

A healthy question that should be asked in any postmortem is “what could have helped us discover this problem earlier?”  In fact, this question should come up for each of the contributors to the incident.


This is something to keep in the back of your head as you work on an incident and you discover contributing factors.  As these come up during the incident, add #todo notes in chat or the working notebook for monitors that would have alerted you to the problem earlier.


Additionally, if a software fix was needed, it’s possible that a test for that behavior should be added to the application’s test suite.

Where we got lucky

Sometimes there was foresight or an accidental decision made in the past that kept this incident from being worse than it could have been.  It’s a healthy practice to call out these things.  Keeping these in mind as you build new products can help bring good practices to the fore, and contribute to building more defensive and resilient applications from the beginning.

The Report

Once all is said and done, you have a thorough, involved document that details the timeline, actions taken, and tasks to be completed.  This is a document that you can take pride in.  It is a document that most likely very few people will read.


And that is alright.


The document should be widely available, and attached or linked to a broadly-sent announcement of the postmortem, but the announcement will contain what most people will consume: The summary.


The summary doesn’t need to be complicated.  A couple paragraphs describing the core dates and times, a broad overview of what was broken and how it was fixed, some highlights of the longer-term fixes that need to be made, and the link to the full postmortem report.

Fixes!

The final artifact that should come out of an incident is a set of closed tasks!  It should go without saying that the remediation items identified during the incident and postmortem should be prioritized, but it is worth calling out.


Those contributing factors have shown themselves capable of contributing to an incident!  The mitigations that can be put in place, should be.  I would go so far as to say that they should take priority over new-feature development, in fact.


If you should be challenged on that, I would offer that availability is the most important feature of any service!


The Purpose

This may seem like a lot of effort to “rehash yesterday’s news”.  However, postmortems can be a key to your future.  If your customers see repeated mistakes, that will quickly erode their confidence in you, regardless of whether they are internal or external customers.  Postmortems are an opportunity to pause the often frantic pace of development and reflect on past decisions, and should be used to improve the resilience of your systems.


Almost every mistake that I have made in my career has been forgiven.  There have been instances where I’ve been rewarded for creating “new and novel” mistakes!  No matter how much goodwill you have, repeating the same mistakes will surely erode it.


2021-05-28

On the Incident Commander

As described in my last post, it’s important that stakeholders can get information on an ongoing incident, and trust that the information that they have is relevant, recent, and reputable.

Without a clear understanding of your teams’ incident management protocols, or without trust in the information that they are presented with, it’s understandable that management and other stakeholders will attempt to get status info in the best way that they know how.

As I described last time, this will mean interrupting engineers who are actively working on a problem.  This adds stress to the situation, and clogs communication channels, both of which can easily exacerbate the problem.

Enter the Incident Commander

The Incident Commander (IC) is a role that exists to coordinate and delegate effort, and to provide communications.  They are the single source of truth for the status of an incident.  They are the hub for all an incident’s communications.

One facet of their purpose is to serve as a buffer between stakeholders and the various subject matter experts (SMEs) who are working to resolve the incident.  They are responsible for updating the outward-facing communications channels that the stakeholders rely on for status, whether that is a status website, a chat channel, a call bridge, or something else entirely.  Even when the status of an incident hasn’t meaningfully changed in a while, such as during a long-running fix, the IC will still update the status.  Seeing an updated timestamp on the status, even if that status is still “we’re working on it”, helps to reassure the stakeholders that they are being provided with current information.
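
The habit of refreshing the status even when nothing has changed can be backed by a simple reminder.  The sketch below only illustrates the idea; the 30-minute interval and the update_is_due function are assumptions of mine, not part of any particular tool or of PagerDuty’s guidance.

```python
from datetime import datetime, timedelta

# Assumed cadence; pick whatever interval your stakeholders expect.
UPDATE_INTERVAL = timedelta(minutes=30)


def update_is_due(last_update, now, interval=UPDATE_INTERVAL):
    """Return True when the last status update is at least `interval` old."""
    return now - last_update >= interval


last_update = datetime(2021, 5, 28, 10, 0)

# 20 minutes since the last update: nothing to do yet.
print(update_is_due(last_update, datetime(2021, 5, 28, 10, 20)))  # False
# 45 minutes since the last update: time to post, even if it is
# still "we're working on it".
print(update_is_due(last_update, datetime(2021, 5, 28, 10, 45)))  # True
```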

When engineers are troubleshooting an incident, it’s important that they stay as focused as possible, and delegate as many communication tasks as possible.  When SMEs determine that people (teams, skills, etc) are needed, they will call upon the IC to pull in the needed assistance.  The IC will be responsible for “paging” out to different teams and individuals, bringing them up-to-speed on the incident, and having them join in the channels where the incident is being coordinated.

When an incident is declared, someone capable of taking on the role of IC should ask the group (whether in a call or chat) “Is there an Incident Commander here?”  If nobody identifies as the IC, that person will announce “I assume the role of Incident Commander”, make note of that in the incident timeline, begin gathering status, and communicate outward to stakeholders.

This raises the question “what makes someone capable of taking on the role of Incident Commander?”  At this time there are no specific qualifications for someone to take on that role.  Some organizations have more rigorous internal programs, but PagerDuty has provided some guidelines at https://response.pagerduty.com/training/incident_commander.  The characteristics to look for in an IC are the ability to communicate effectively and to delegate, and familiarity with the organization.

In the case of larger or more complex incidents, an IC may find themselves in need of assistance.  A common tactic is to delegate some or all of the communications duties.  The IC is still the “hub” of all communications, but the communications are executed by a Communications Manager.

It’s important to note that the IC does not actively work to resolve incidents; they aren’t hands-on-keyboard working to resolve the issue.  Should it become necessary that the IC becomes hands-on to resolve the issue, it is incumbent upon them to pass on the IC duties to someone else.

In the end, as PagerDuty says, the single sentence description of the Incident Commander’s purpose is to “Keep the incident moving towards resolution.”

2021-04-16

On Incident Management

 If you’re making software, you’re breaking software.

You are probably familiar with the yarn from the DevOps community that the source of friction between the historical groups is that Developers want to make changes, and Operators want stability.


The customers want both.


When your company is small enough to fit around a conference room table, and most people can hold the state of the entire environment in their head, incidents don’t require a capital-P “Process”.  It’s a very different story when that’s not the case anymore.


When you grow beyond that - where changes are being made without your knowledge, and you can’t predict the impact that one service’s changes might have on another service - that is when you need to consider a more formal process for managing incidents.

Story Time

Once upon a time, I worked for a large company.  (Well, this has happened several times…)  This company had a lot of customers, and a lot of products, and ran a lot of services with various levels of interdependence.  The team that I worked on was a relatively large systems/operations engineering team, and we supported a vast array of services.  We were responsible for everything from the OS on up, databases, queues, caching, app routing, monitoring, config management, CI tooling, deployment workflows and tooling, and the availability of the in-house applications themselves.


We also supported a vast number of development teams, all of whom were constantly creating new services, and updating existing ones.


While we had an “Incident Management Process”, it was relatively chaotic.  Not for any technical reason.  Well, for lots of technical reasons, but what made things worse was the non-technical reasons.  As you might expect, this company had many, many layers of leadership hierarchy.


What typically would happen is that when the incident was called, a phone bridge and chat channel would be opened for the engineers to collaborate on the issue.  Almost invariably, someone from leadership would join the call asking for a status update.  This is not a problem; they should be able to get the current status.


The problem was that they felt the need to “go to the source”.  The reason that it was a problem was that demanding to be updated in the channels meant that they were clogging the channels where people were trying to diagnose and solve the issue.


Even that by itself isn’t a huge problem.  But as the scale of the company increases, so too do the numbers of stakeholders.  And they all want to “go to the source”.


More leaders would join the call, and each would ask for the status, and then for updates.


Eventually, the engineers would be flooded out.  They’d start ignoring the requests for updates, and take up backchannels to actually get work done.

What can we take from this?

Those experiences that I described above aren’t unique, but the layers of management strongly exacerbated the problem.


Lack of communications discipline, lack of trust in the update mechanisms, and not trusting that postmortems would be produced all contributed to the frustrations experienced by all involved.


In my next posts, I will describe what an Incident Management process looks like, the artifacts that should come out of an incident, and I will describe a special role that can mitigate these issues.


2020-06-04

On Brownness

If you wonder about the protests:

This is rage at a system that has been abusing its unbalanced power since time immemorial.

This is a symbol of unjust killings and lesser mistreatment by those who should be protecting all of us!

And I say this as a straight, white, CIS, disabled war veteran of the USMC.

If you are not with the protesters, you aren’t paying attention.

If you don’t live in fear of extrajudicial murder, you’re not paying attention.

If you’re not sure that your death would be adequately investigated, you’re not paying attention.

I have close family members who are police. I love them.

I have partaken of lethal-force training, and riot-control training, from which I still have physical scars. My (minor) injuries are far more than today’s “law enforcement” have sustained before attacking peaceful protesters. (To be clear: I suffered a severe abrasion on top of my right wrist when detaining/wrestling a resisting “protester”. An extremely minor injury.)

I’m ashamed for our current military. I ache for the soul of our law enforcement.

As the palest among us, I proudly say that “We are all Brown.”

What happens to any of us can happen to all of us.

We are all brown.

2019-04-24

On Windermere Sound, Part 2

If you have the opportunity to move into Windermere Sound in the Orlando area, don’t.

Also, specifically Cindy Riggs of Wemert Realty, you should know that she is a toxic enabling person who is not to be trusted.

You can do better.  And not far away, either.

2018-05-23

On Windermere Sound

A little over a month ago, I moved to a new house.

When I mentioned that I was moving to people in the community (such as doctors, dentists, and other professionals in my area), the topic of "Why" would come up.  The answer was "Problems with the HOA".

This person would inevitably sigh knowingly and heavily, and launch into a story about the HOA notice that they received saying that their tree bark was exactly the wrong shade of grey-brown, but that tree had been planted at least 7 years ago.

That is not the kind of "HOA Troubles" that I was having.

I can honestly say that it's refreshing where "Crazy HOA Stories" involve notices that the Internet Cable is hung in an unsightly way, rather than "The HOA Board are having parties that exceed 80 decibels as measured inside my closed house, or setting off illegal fireworks", or "The HOA President is setting off a siren while a hurricane is approaching, or he's physically threatening residents".  It's refreshing.  I haven't had to call the Orange County Sheriff's in over a month.

If you are looking to move to the Orlando area, I would caution you to be aware of the HOA that you are moving into.  The HOA's behavior and demeanor can make life either wonderful or awful.  When looking at a home, ask who the board members are.

If the board includes a man whose first name sounds like he could be Dr. Frankenstein's assistant and whose last name sounds a bit like "Thor's Son", move on.

If the board includes a woman whose first name sounds like "Sindy", and whose last name might include either a Japanese surname and/or sound like Murtaugh's partner, move on.

Windermere Sound is a nice neighborhood with the exception of a few people, but those people happen to be on the never-voted-upon HOA Board.  Life is too short, and these people and those houses are not worth it.  Some of their friends are quite morally compromised, too.  They're obviously not complete fuck-ups, because they have great kids with better morals than their parents, but at this point being involved with the Windermere Sound HOA in Windermere, FL is a condemnation.

I hope that this changes, and that the board becomes filled with responsible adults.  I'd love to be able to take this post down with a massive "EDIT" at the beginning.  But that will require the existing HOA to follow its own rules, provide proper notice, and run a legitimate Board vote.  Unfortunately, it is not in their own best interests to do so.

Central Florida is nice.  I like it here.  I used to love it, and would tell everyone I met how great my neighborhood was.   This was until the people who became the HOA turned toxic.

I'm happy with my new house.  The neighbors are quiet, respectful adults.  Which is to say "they act like Adults".  Their children are well-behaved.  We can let our kids play outside without fear of an unbalanced person harming them.  I haven't called the Sheriff's department in over a month.

Things are good in Windermere, FL.  Just mind where in Windermere you are.

2017-11-12

On Perfume

Perfume.  Originally created to cover the fact that you may have never bathed, and that you may have one or more infections that are rotting your flesh away.

Now, it serves mainly to warn the herd that someone named "Chad" has wandered into your proximity, and may try to corner you into listening about Crossfit.

Men and women, if you MUST, here's how to apply it:

1) If there is a pump that sprays the perfume out, point the nozzle away from you, pump once, then walk into the cloud.  Now put the perfume away.  Do not pump more than once.  Do not repeat.  You have enough.

2) If there is no pump, with the cap on the bottle, shake the bottle.  Remove the cap, and prepare to stick one finger into the cap.  NOTE: Not the bottle!  Now, barely touch the part of the inside of the cap that is wet. Put the cap back on the bottle, and put the perfume away.  Lightly touch the perfumed part of your finger to your collarbone, or behind the ear.  Do not do this more than once.  Do not repeat.  You have enough.

Here are a couple Don'ts:

* Do not wear perfume to the office.  You're not there to pick up anyone unless you're a walking HR Violation.  It's only distracting, and not in the good way.

* Do not wear perfume out to dinner.  You're fucking with everyone else's sense of smell and taste.

=====

This part is for the men.

Guys, look, I get it.  I used to bathe myself in perfume because I thought women were into guys that smelled like the inside of bottles that were shown in the foreground of cowboys.  I didn't want to have "too little" perfume on, and miss out on them being unable to keep their hands off of me.  I get it.

I even had one girlfriend who Loooooved a particular perfume, so bought me a bottle of "Curve".  Yes, she did like that perfume.

Here's the thing: If you want to attract the kinds of women that are attracted to a perfume, you're going to be easily replaced.  Shallow can be fun for a while.  But if some bottled chemicals are what separate you from the next guy, I feel sad for you, bro.

So, that girlfriend?  She tried to play a bunch of headgames to get me to... fuck, I don't know what she was doing.  Maybe get me to try to prove to her that I was worthy, or mental judo to convince myself that putting in that much effort meant that she must be worth it.  Needless to say, it didn't last, and the way she ended things won her no friends.  She bought the new boyfriend some Curve, too.

Perfumes might be a cheat against pheromones, but that's all they'll be.  If you stink, you'll be better off getting clean.

If you want a woman to miss your smell when you're not there, you want that smell to be you, not something from a cosmetic.  If "your smell" that she misses is a perfume, you're just a vibrator and a heated blanket away from being completely replaceable.

In the days before I went to Iraq for most of 2005, I set aside a t-shirt, and wore it at night while I slept.  As I was doing my final packing up, I put it into a gallon ziplock bag, and set it aside in a drawer.  After I got in-country and could make a call, I told my wife what I'd done and where it was.  This was so that when she missed me the most, she could open that bag, and have a little bit of my smell.  I don't know how long she kept it, but it was appreciated, especially as time passed and my smell faded from our house.

As a side note, if you take exception to it being called "perfume" and insist on calling it "cologne" or "body spray", I challenge you to tell me the fucking difference, especially if yours wasn't created and packaged in a particular region of France.

=====

Those ads with hordes of women swarming a guy because he's spraying himself are appealing.  What if you look at it this way:  What if they're not drawn to him?  What if they're rushing at him to beat him to death?  He has marked himself, and they figure that one murder shared among them is worth not having to hear any more goddamn Crossfit stories.