2021-04-16

On Incident Management

If you’re making software, you’re breaking software.

You are probably familiar with the old yarn from the DevOps community: the friction between the two camps has historically been that Developers want to make changes, and Operators want stability.


The customers want both.


When your company is small enough to fit around a conference room table, and most people can hold the state of the entire environment in their head, incidents don’t require a capital-P “Process”.  It’s a very different story when that’s not the case anymore.


When you grow beyond that - when changes are being made without your knowledge, and you can’t predict how one service’s changes might affect another - that is when you need to consider a more formal process for managing incidents.

Story Time

Once upon a time, I worked for a large company.  (Well, this has happened several times…)  This company had a lot of customers, and a lot of products, and ran a lot of services with various levels of interdependence.  The team that I worked on was a relatively large systems/operations engineering team, and we supported a vast array of services.  We were responsible for everything from the OS on up: databases, queues, caching, app routing, monitoring, config management, CI tooling, deployment workflows and tooling, and the availability of the in-house applications themselves.


We also supported a vast number of development teams, all of whom were constantly creating new services, and updating existing ones.


While we had an “Incident Management Process”, it was relatively chaotic.  Not for any technical reason.  Well, for lots of technical reasons, but what made things worse were the non-technical ones.  As you might expect, this company had many, many layers of leadership hierarchy.


What would typically happen is that when an incident was called, a phone bridge and a chat channel would be opened for the engineers to collaborate on the issue.  Almost invariably, someone from leadership would join the call asking for a status update.  This is not a problem; they should be able to get the current status.


The problem was that they felt the need to “go to the source”.  Demanding updates in those channels clogged the very channels where people were trying to diagnose and solve the issue.


Even that by itself isn’t a huge problem.  But as the scale of the company increases, so too does the number of stakeholders.  And they all want to “go to the source”.


More leaders would join the call, and each would ask for the current status and then for ongoing updates.


Eventually, the engineers would be flooded out.  They’d start ignoring the requests for updates and retreat to backchannels to actually get work done.

What can we take from this?

The experiences I described above aren’t unique, but the layers of management hierarchy strongly exacerbated the problem.


A lack of communications discipline, a lack of trust in the update mechanisms, and a lack of confidence that postmortems would be produced all contributed to the frustration felt by everyone involved.


In my next posts, I will describe what an Incident Management process looks like, the artifacts that should come out of an incident, and a special role that can mitigate these issues.

