A Place For My Thoughts :: gWaldo: Work

2021-06-21

On the Artifacts of an Incident

“Never let a crisis go to waste.”

-- Apocryphal

The dust has settled, service has been restored, the monitors are green, the status update announcing the issue is resolved has been posted, and the Incident Commander thanked everyone for their effort.

The issue may be resolved, but the incident is not yet done.

The Artifacts of an Incident

When addressing the incident, actions will be taken in the course of the diagnosis, troubleshooting, and experimentation to solve the issue. How things are done is almost as important as what is done to resolve the issue. If, for instance, isolated changes are made manually that aren’t reflected in codebases, it is likely that they will be overridden in the next deployment, and lost.

It’s less than ideal to have to make changes that way, but it is something that happens in real life. It’s possible that a config change needs to be made on the fly that isn’t well supported by your existing deployment systems. This is why it’s vitally important to make a record of the changes being made. Record them in the running chat log of your incident, and tag them for review during the postmortem, to determine whether they should be reverted or captured and made a permanent part of the environment.

Further, if that change is needed in Production, should it be rolled out to other environments as well?

Equally important is determining how the change impacted the incident, and whether it should be treated as an experiment and reverted.

For these reasons, note in the timeline during the course of your incident the actions that are taken and how (#review, changed config manually), code deployed out-of-band that needs to be properly merged (#todo, implement the bar limits to foo), and other housekeeping and hygiene noticed along the way (#todo, database is running high on its connection count) (#todo, add a bounding limit to the workers queue).

While these items should become tickets in whatever work-tracking system you use, an active incident isn’t the time to do that. This is part of what a postmortem is for.

If you personally have a lull in the action, it’s ok to create these kinds of tasks then, and include them in the chat. That will make it easier to track them during the postmortem.
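As a rough illustration of why consistent tagging pays off, here is a minimal sketch that pulls the tagged items back out of a chat export. The "HH:MM author: message" line layout, the sample messages, and the names `TaggedItem` and `extract_tagged_items` are all assumptions made for the example; your chat tool's export format will differ.

```python
import re
from dataclasses import dataclass

# Matches annotations written like "(#todo, some note)" or "(#review, some note)".
TAG_PATTERN = re.compile(r"\(#(todo|review),\s*([^)]*)\)", re.IGNORECASE)


@dataclass
class TaggedItem:
    timestamp: str
    author: str
    tag: str
    note: str


def extract_tagged_items(chat_lines):
    """Collect tagged annotations from lines shaped like 'HH:MM author: message'."""
    items = []
    for line in chat_lines:
        try:
            timestamp, rest = line.split(" ", 1)
            author, message = rest.split(": ", 1)
        except ValueError:
            continue  # skip lines that don't fit the assumed layout
        for tag, note in TAG_PATTERN.findall(message):
            items.append(TaggedItem(timestamp, author, tag.lower(), note.strip()))
    return items


# Hypothetical chat excerpt, for illustration only.
chat = [
    "14:02 alice: bumped the pool size by hand (#review, changed config manually)",
    "14:05 bob: seeing lots of retries",
    "14:09 alice: (#todo, add a bounding limit to the workers queue)",
]
items = extract_tagged_items(chat)
for item in items:
    print(item.timestamp, item.author, f"#{item.tag}", item.note)
```

A list like this is a convenient starting agenda for the postmortem: each #review item needs a keep-or-revert decision, and each #todo item is a candidate ticket.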

The Postmortem

One of the hallmarks of a healthy postmortem culture is the expectation that after the incident is resolved, a report of the incident is produced and published.

A Brief Note on Terminology

In most cases, software outages are not responsible for a loss of life. It is for this reason that in many cases the term “Postmortem” is recommended against. While I agree that terms like “Incident Review” or “After Action” are more appropriate, I currently find the term “postmortem” more convenient. I may review my stance at some time in the future, but for now the use of the term will stand.

The Postmortem

I won’t be covering the culture of “Blameless Postmortems”, as that topic and its benefits are well-covered elsewhere. Let us assume that the incident was not the result of a bad actor, and that the people who were involved were authorized to act in the ways that they did, and were working with good intentions. The short version is that a postmortem should identify contributing factors to an incident and identify remediations; it should rarely result in someone losing their job.

In fact, conducting a postmortem is a large topic by itself. Rather than do it a disservice here, I recommend that you look elsewhere for recommended practices for performing a postmortem.

In brief, all of the participants should come to the postmortem with any supplemental materials that they saw, used, or created during the remediation of the incident. These will be used as supporting documentation to build the timeline of events and to draw conclusions from the findings made during troubleshooting.

Postmortem Supplements

There are a variety of supplements that should be used during a postmortem meeting, and may be included in the report. Here are many of the most common ones.

Notifying Signals: These can be anything that draws attention to the fact that a problem exists. Typical examples could include a pager notification or other alert, or an observation from a customer support team that there is an uptick in customer complaints.
Metrics, Traces, Logs, etc: These may be noticed during the course of other activities, but any metrics, traces, or logs that were helpful for identifying contributing factors should be included.
Chat Logs: While the individual chat logs won’t be included in the report itself, they are helpful for constructing the overall incident’s timeline.
Working Notebooks: In organizations that use a shared notebook or other co-edited document, these can be helpful for drafting the postmortem in real-time, as the incident is being worked on. These can also include avenues of approach that don’t ultimately pan out, but they can help tell a more complete story.
Other Timeline Events: These can include anything else that happened during the course of the outage, but outside of the above components. These may include things such as pulling in additional people, or publishing updates to customers.
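To make the timeline-building step concrete, here is a small sketch that merges events from several of these supplements into one chronological list. The `(timestamp, source, description)` tuple shape, the function name `build_timeline`, and every sample event are hypothetical, chosen only for the example.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge (timestamp, source, description) events into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


# Hypothetical events gathered from three different supplements.
alerts = [(datetime(2021, 6, 21, 14, 0), "alert", "pager: error rate high")]
chat = [
    (datetime(2021, 6, 21, 14, 2), "chat", "changed config manually (#review)"),
    (datetime(2021, 6, 21, 14, 20), "chat", "error rate back to normal"),
]
comms = [(datetime(2021, 6, 21, 14, 10), "status page", "posted customer update")]

timeline = build_timeline(chat, comms, alerts)
for when, source, what in timeline:
    print(when.strftime("%H:%M"), f"[{source}]", what)
```

Keeping the source label on each event makes it easy to trace a timeline entry back to the chat log, alert, or notebook it came from.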

Actions Taken

In many of the incidents that I’ve participated in, it was very common to attempt small fixes, often somewhat manually, to see if a local experiment shows improvement. These are among the items that should have been tagged by individuals as they communicate during the incident. These should be collected and noted to either clean up experiments that didn’t yield results (undoing changes or re-deploying fresh assets), or be committed and made part of the relevant repositories so that the next deployment doesn’t erase the changes that stabilized the incident.

Long-term Remediations

No team is immune to a “quick solution that we’ll replace later” that lives much longer than intended. The postmortem is also a time to surface work items that may have been identified in the past but have never been prioritized. It is a good time to scan the backlog for old issues that your team had intended to circle back to, “and do it right this time”...

Monitors and other Tests

A healthy question that should be asked in any postmortem is “what could have helped us discover this problem earlier?” In fact, this question should come up for each of the contributors to the incident.

This is something to keep in the back of your head as you work on an incident and you discover contributing factors. As these come up during the incident, add #todo comments in chat or the working notebook for monitors that would have alerted you to the problem earlier.

Additionally, if a software fix was needed, it’s possible that a test for that behavior should be added to the application’s test suite.

Where we got lucky

Sometimes there was foresight or an accidental decision made in the past that kept this incident from being worse than it could have been. It’s a healthy practice to call out these things. Keeping these in mind as you build new products can help bring good practices to the fore, and contribute to building more defensive and resilient applications from the beginning.

The Report

Once all is said and done, you have a thorough, involved document that details the timeline, actions taken, and tasks to be completed. This is a document that you can take pride in. It is a document that most likely very few people will read.

And that is alright.

The document should be widely available, and attached or linked to a broadly-sent announcement of the postmortem, but the announcement will contain what most people will consume: The summary.

The summary doesn’t need to be complicated. A couple paragraphs describing the core dates and times, a broad overview of what was broken and how it was fixed, some highlights of the longer-term fixes that need to be made, and the link to the full postmortem report.

Fixes!

The final artifacts that should come out of an incident are closed tasks! It should go without saying that the remediation items identified during the incident and postmortem should be prioritized, but it is worth calling out.

Those contributing factors have shown themselves capable of contributing to an incident! The mitigations that can be put in place, should be. I would go so far as to say that they should take priority over new-feature development, in fact.

If you should be challenged on that, I would offer that availability is the most important feature of any service!

The Purpose

This may seem like a lot of effort to “rehash yesterday’s news”. However, postmortems can be a key to your future. If your customers see repeated mistakes, that will quickly erode their confidence in you, regardless of whether they are internal or external customers. Postmortems are an opportunity to pause the often frantic pace of development and reflect on past decisions, and should be used to improve the resilience of your systems.

Almost every mistake that I have made in my career has been forgiven. There have been instances where I’ve been rewarded for creating “new and novel” mistakes! No matter how much goodwill you have, repeating the same mistakes will surely erode it.

2021-04-16

On Incident Management

If you’re making software, you’re breaking software.

You are probably familiar with the yarn from the DevOps community that the source of friction between the historical groups is that Developers want to make changes, and Operators want stability.

The customers want both.

When your company is small enough to fit around a conference room table, and most people can hold the state of the entire environment in their head, incidents don’t require a capital-P “Process”. It’s a very different story when that’s not the case anymore.

When you grow beyond that - where changes are being made without your knowledge, and you can’t predict how one service’s changes might impact another service - that is when you need to consider a more formal process for managing incidents.

Story Time

Once upon a time, I worked for a large company. (Well, this has happened several times…) This company had a lot of customers, and a lot of products, and ran a lot of services with various levels of interdependence. The team that I worked on was a relatively large systems/operations engineering team, and we supported a vast array of services. We were responsible for everything from the OS on up, databases, queues, caching, app routing, monitoring, config management, CI tooling, deployment workflows and tooling, and the availability of the in-house applications themselves.

We also supported a vast number of development teams, all of whom were constantly creating new services, and updating existing ones.

While we had an “Incident Management Process”, it was relatively chaotic. Not for any technical reason. Well, for lots of technical reasons, but what made things worse were the non-technical reasons. As you might expect, this company had many, many layers of leadership hierarchy.

What typically would happen is that when the incident was called, a phone bridge and chat channel would be opened for the engineers to collaborate on the issue. Almost invariably, someone from leadership would join the call asking for a status update. This is not a problem; they should be able to get the current status.

The problem was that they felt the need to “go to the source”. The reason that it was a problem was that demanding to be updated in the channels meant that they were clogging the channels where people were trying to diagnose and solve the issue.

Even that by itself isn’t a huge problem. But as the scale of the company increases, so too do the numbers of stakeholders. And they all want to “go to the source”.

More leaders would join the call, and each would ask for a status update.

Eventually, the engineers would be flooded out. They’d start ignoring the requests for updates, and take up backchannels to actually get work done.

What can we take from this?

Those experiences that I described above aren’t unique, but the layers of management strongly exacerbated the problem.

Lack of communications discipline, lack of trust in the update mechanisms, and not trusting that postmortems would be produced all contributed to the frustrations experienced by all involved.

In my next posts, I will describe what an Incident Management process looks like, the artifacts that should come out of an incident, and I will describe a special role that can mitigate these issues.

2011-08-10

Chef Explosion

Here at Agora, we use a product from Opscode called Chef to manage our server environments. Chef allows us to reliably manage our infrastructure by providing us with the ability to write code that describes how a server should be configured. While not perfect, it has served us well.

Chef leverages CouchDB for its datastore. CouchDB is a "NoSQL" database product, similar in concept to MongoDB. CouchDB provides a lot of features and usability, but as a tradeoff for versioning, speed, and convenience it sacrifices disk space. In Opscode's documentation, they do helpfully point out in the "CouchDB Administration for Chef Server" page that you should periodically run a Compaction. Basically what this does is remove some of the older versions of documents.

Following their advice, we set it up as a weekly cron (in our Cron cookbook, naturally), and so it looks like this:


cron 'Compact Chef DB' do
  user 'nobody'
  weekday '1'
  hour '4'
  minute '0'
  command 'curl -X POST http://localhost:5984/chef/_compact'
end

which results in a crontab entry that looks like this:


# Chef Name: Compact Chef DB
0 4 * * 1 curl -X POST http://localhost:5984/chef/_compact

One fine summer morning I came in to several thousand emails saying "Chef Run Failed." This, as you may understand, severely degraded my opinion of the morning.

Cue the Swedish Chef crying "Bork Bork Bork!"

After I determined that a full disk was the problem and deleted an old unneeded backup file to get some headroom, I found that the biggest contributor was the /var/lib/couchdb/0.10.0/.chef_design directory.


root@chefserver:/var/lib/couchdb/0.10.0/.chef_design# ls -lh
total 96G
-rw-rw-r-- 1 couchdb couchdb  30M 2011-07-21 18:26 07ccb0c12664d1f1ca746003182b521a.view
-rw-r--r-- 1 couchdb couchdb 1.7G 2011-05-11 12:03 178087e2a7c06ff437482555acf60bab.view
-rw-rw-r-- 1 couchdb couchdb 8.5G 2011-07-22 08:24 18757f7428c465dd0504ac3d5d7ce577.view
-rw-rw-r-- 1 couchdb couchdb 8.9G 2011-07-22 08:24 367772ed026257ff1f88a1011576c9c3.view
-rw-rw-r-- 1 couchdb couchdb 6.6M 2011-07-21 15:52 3970d32b6acb424bb4d19684bdf9aff1.view
-rw-r--r-- 1 couchdb couchdb 8.6M 2011-07-22 08:11 91188e3c7d61bdf079eee6ca719be05c.view
-rw-rw-r-- 1 couchdb couchdb 6.0G 2011-03-16 16:44 9f39fce5f578a23cc8cad7b3fe9b8ce9.view
-rw-r--r-- 1 couchdb couchdb 1.4G 2011-07-22 08:24 af280ad217f6edca6276d1d1bcbc069d.view
-rw-rw-r-- 1 couchdb couchdb  19G 2011-05-11 12:00 b96879fe1377e2b91f228109f3aac384.view
-rw-rw-r-- 1 couchdb couchdb 565K 2011-07-20 09:31 be708387555557a5b4886292346da6bb.view
-rw-rw-r-- 1 couchdb couchdb 3.0M 2011-07-20 11:27 d381d1f4b207dc3d9624720a7e88f881.view
-rw-r--r-- 1 couchdb couchdb  51G 2011-07-22 08:20 fe06cf9119d23dd7fec2492b79e7ebef.view

I was surprised that there was so much disk use, since we had been running the Chef Compactions, and expected this kind of thing to be taken care of. Wondering if it was throwing some kind of error that we weren't seeing (since it's running as a cron), I ran it manually:


root@chefserver:~# curl -H "Content-Type: application/json" -X POST http://localhost:5984/chef/_view_cleanup
{"ok":true}

Which yielded:


root@chefserver:/var/lib/couchdb/0.10.0/.chef_design# ls -lh
total 70G
-rw-rw-r-- 1 couchdb couchdb  30M 2011-07-22 08:43 07ccb0c12664d1f1ca746003182b521a.view
-rw-rw-r-- 1 couchdb couchdb 8.5G 2011-07-22 08:43 18757f7428c465dd0504ac3d5d7ce577.view
-rw-rw-r-- 1 couchdb couchdb 8.9G 2011-07-22 08:43 367772ed026257ff1f88a1011576c9c3.view
-rw-rw-r-- 1 couchdb couchdb 6.6M 2011-07-22 08:43 3970d32b6acb424bb4d19684bdf9aff1.view
-rw-r--r-- 1 couchdb couchdb   51 2011-07-22 08:42 7bbcbf585caef33abc0733282f40a22a.view
-rw-r--r-- 1 couchdb couchdb 8.6M 2011-07-22 08:43 91188e3c7d61bdf079eee6ca719be05c.view
-rw-r--r-- 1 couchdb couchdb 1.4G 2011-07-22 08:42 af280ad217f6edca6276d1d1bcbc069d.view
-rw-rw-r-- 1 couchdb couchdb 573K 2011-07-22 08:43 be708387555557a5b4886292346da6bb.view
-rw-rw-r-- 1 couchdb couchdb 3.0M 2011-07-22 08:43 d381d1f4b207dc3d9624720a7e88f881.view
-rw-r--r-- 1 couchdb couchdb  51G 2011-07-22 08:26 fe06cf9119d23dd7fec2492b79e7ebef.view

Well, that was a significant but only partial win. Why do I still have 70GB in .view files?

What Opscode hadn't told us is that CouchDB has a thing called "Views", and these can - over time - come to take up space. A lot of space. (CouchDB views are the "primary tool used for querying and reporting on CouchDB documents" according to the CouchDB Wiki.) Opscode also hadn't mentioned that CouchDB says that these, too, need to be compacted.

The good folks on the internet came through, notably the CouchDB docs and a StackOverflow question, "CouchDB .view file growing out of control".

Among our findings we came upon this link to the Compaction page in the CouchDB Documentation.

My compatriot Jeff Hagadorn and I were both looking into identifying the design view names, and he beat me to the solution:


bash -c 'for x in checksums clients cookbooks data_bags environments id_map nodes roles sandboxes users; do curl -H "Content-Type: application/json" -X POST http://localhost:5984/chef/_compact/$x ; done'

(I had found a posting on the couchdbkit Google Group describing a script a user had written to solve this very problem here, if you prefer a Python-based solution which doesn't require you to know your view names.)
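For a sense of what such a Python approach might look like, here is a minimal sketch. It is not the couchdbkit script mentioned above. It assumes the design documents can be listed with CouchDB's `_all_docs` endpoint using the `"_design/"` to `"_design0"` key range, and that this behaves the same on the CouchDB version in use, which was not verified here. The function names are made up for the example; the database URL is the one from the cron entries.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5984/chef"  # same database URL as the cron entries


def design_doc_names(all_docs_response):
    """Pull design document names out of a parsed _all_docs response."""
    prefix = "_design/"
    return [
        row["id"][len(prefix):]
        for row in all_docs_response.get("rows", [])
        if row["id"].startswith(prefix)
    ]


def list_design_docs(base_url=BASE_URL):
    """Ask CouchDB for every design document in the database (needs network)."""
    query = urllib.parse.urlencode(
        {"startkey": '"_design/"', "endkey": '"_design0"'}
    )
    with urllib.request.urlopen(f"{base_url}/_all_docs?{query}") as response:
        return design_doc_names(json.load(response))


def compact_views(base_url=BASE_URL):
    """POST to _compact/<name> for each design document, like the bash loop."""
    for name in list_design_docs(base_url):
        request = urllib.request.Request(
            f"{base_url}/_compact/{urllib.parse.quote(name, safe='')}",
            data=b"",
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            print(name, response.read().decode())


# Offline demonstration with a made-up response body; no server is contacted.
sample = {"rows": [{"id": "_design/nodes"}, {"id": "_design/roles"}]}
print(design_doc_names(sample))
```

Calling `compact_views()` on the Chef server would then do the same work as the bash loop, without a hard-coded list of names.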

After running that loop, our disk was in a much healthier state, and our chef-db-compact recipe now looks like this:


cron 'Compact Chef DB' do
  user 'nobody'
  weekday '1'
  hour '4'
  minute '0'
  command 'curl -X POST http://localhost:5984/chef/_compact'
end

cron 'Compact Chef Views' do
  user 'nobody'
  weekday '1'
  hour '5'
  minute '0'
  command 'bash -c \'for x in checksums clients cookbooks data_bags environments id_map nodes roles sandboxes users; do curl -H "Content-Type: application/json" -X POST http://localhost:5984/chef/_compact/$x ; done\''
end

which produces a crontab that looks like this:


# Chef Name: Compact Chef DB
0 4 * * 1 curl -X POST http://localhost:5984/chef/_compact
# Chef Name: Compact Chef Views
0 5 * * 1 bash -c 'for x in checksums clients cookbooks data_bags environments id_map nodes roles sandboxes users; do curl -H "Content-Type: application/json" -X POST http://localhost:5984/chef/_compact/$x ; done'

Now, you may suggest that we mount this location on a separate disk. The answer is that we had. /var/lib/couchdb is a separate 100GB physical disk. The problem was that /var/log is on the / partition, and that is only a 7GB disk. Once the views had filled their disk, the couchdb and chef logfiles had swelled with errors, and even mighty logrotate could only hold them off for so long.

Bear in mind that there was no impact to Production during this event; the only outcome was that new changes would not have been able to be pushed out via Chef, and a couple of filled inboxes. Nevertheless, this highlighted some of our flaws. The most important of which is that our monitoring of the server was imperfect, and we missed the alerts that the CouchDB disk was filling. Had we not missed those alerts we could have diagnosed this before it was a problem.

As an aside, in addition to just alerting when disk reaches a certain capacity, you should also watch for sudden increases in utilization. If a particular disk normally runs at 20% capacity, but overnight a logfile swells the disk to 73%, it won't trigger your "75% Full" alert, but there is very likely a problem. One way to solve this is to record a "previous percentage" and compare that to the "current percentage" and alert in the event that there is a sudden increase.
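A minimal sketch of that comparison follows. The 75% threshold comes from the example above; the 25-point jump threshold and the function names are assumptions for illustration. How the "previous percentage" is stored between runs (a state file, your monitoring system) is left to the caller.

```python
import shutil


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_alerts(current_pct, previous_pct, full_at=75.0, jump_by=25.0):
    """Return alert messages for a nearly full disk and for a sudden increase."""
    alerts = []
    if current_pct >= full_at:
        alerts.append(f"disk is {current_pct:.0f}% full (threshold {full_at:.0f}%)")
    # previous_pct is None on the first run, when there is nothing to compare to.
    if previous_pct is not None and current_pct - previous_pct >= jump_by:
        alerts.append(
            f"disk usage jumped from {previous_pct:.0f}% to {current_pct:.0f}%"
        )
    return alerts


# The overnight scenario from the text: 20% yesterday, 73% today.
overnight = disk_alerts(current_pct=73.0, previous_pct=20.0)
print(overnight)
```

In a real check, `percent_used("/var/lib/couchdb")` would supply the current value on each run, and the result would be saved for the next comparison.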

(NOTE: Server Names have been changed to protect the guilty.)

UPDATE: I'm notified by the Senior Systems Admin at Opscode that they have added these compactions to the chef-server recipe. Their implementation is quite a bit different than ours, but no matter.

NOTE: This has been republished here with permission from my employer. Original post here

-Waldo
@gwaldo

2011-05-12

On White Box Computers

TechRepublic recently ran an article "Why consultants should not sell generic PCs and servers", which my "shame company" (the regret on my resume - my first job out of the Marine Corps) desperately needs to read.

For years, they've been selling 'white box' servers by Seneca Data to their clients. The main reason was that the owner was on Seneca Data's Board of Advisors. (He apparently never advised them to "stop selling us shit", "stop being stupid about fixing the broken shit you send us", and later "stop charging more than retail to your business partner.") It used to be that we would get a moderate discount for buying things from them (5-10% usually) versus buying from that paragon of deals, Staples. But at one point, Seneca Data determined that we were no longer spending sufficient money with them to warrant discounts, so they started quoting us 10% or more over Staples or Best Buy prices for things like laserjet printers. The servers that they sold were technically cheaper for equivalent specs, however they left out little things like redundant, hot-swappable power supplies and hot-swappable hard drives, which come included on all but the most bottom-of-the-line name-brand servers.

They also left out little things like compatible drivers, crash-free machines (upon delivery), and firmware that didn't run the fans at full speed for the entire time the server was on. Without exaggerating I can say that in that particular incident, the server was louder than my phone's ring, and the department supervisor dismissed it as "not that bad", then retreated to his office on the other side of a partition wall and closed his office door.

Then, good luck getting somebody useful on the phone - support and the personal touch are supposed benefits from using a local company. I got a lot more of the personal touch (but I wished they'd consider using lube once in a while) than I did actual Support. It took a ridiculous amount of time to get anything fixed through them.

Don't get me wrong; working at EXEControl wasn't all bad. Just most of it. I learned plenty of valuable lessons there, but sadly they were mostly anti-patterns.

(This is by no means a complete account of everything that is wrong at that company. I'm honestly surprised that they still exist. But let us say that companies don't publish their newsletter when they don't have anything good to say...)

-Waldo

2010-11-09

On Being Wanted, Again

I got the gig, and that makes me happy. It's a pay cut, but I can at least rationalize it.

I've been using Linux for years, but this will be my first time administering it. Fortunately I already have a beard.

This will be my fourth job in 2010. I think that I will hereafter refer to this year as "The Year of the Resume".

-Waldo

2010-05-05

On Professional Communications

In these days of always-on always-available communication, it's more important than ever to be clear in your communications so that you're not wasting people's time. As I see it (and that's really the only opinion that matters), Instant Messenger and Texting have each allowed lazy people (and/or morons) to waste my time more effectively.

In my opinion, IM and SMS have been huge contributors in lazy communications due to their immediacy and the casual attitude associated with their use. If you're at work, try to be professional. Sure, they facilitate communications, and they're often used for informal communications, but there's no reason that you should be unclear. Being unclear is ineffective and wasteful.

The only thing that I desire from communications is to communicate effectively. Seems like circular reasoning, doesn't it? I assure you that it isn't. Many of the people with whom I interact in a given day just don't seem to get that. When people reach out to me, they usually need my help.

As one whose job it is to help people, here's a hint:
Help me to help you!

The single easiest way to do this is to communicate clearly. Fix bad spelling. Use punctuation. Attempt to use context. Be descriptive. Before sending, read what you've typed and correct anything that's wrong.

No, the occasional numeral 0 instead of the letter 'o' won't bother me, but runningwordstogether or gorsslee misspelling words isn't helpful neither is having multiple clauses or sentences without punctuation or capitalization dusnt make it EESEEER 4 me 2 REED!!1!! LOLZ

Abusing my time and brainpower in these ways will make me hate you more.

Really, if you must interrupt what I was trying to accomplish, don't waste my time and energy trying to figure out what you want. Before you came by, I was probably happily getting my own work done. Yes, you're an interruption; I hope what you want is important.

Sure, the occasional emoticon (smiley) or tag can certainly help tell me that you're being sarcastic or attempting a joke. Please keep them to a minimum.

=====

Once I've gotten the gist of what you're asking for, I'm probably going to have follow-up questions for you. This is never a time to get pissy because you 'just want him to fix it, damnit. Gosh!' Chances are, you don't really know what you're talking about, but if you do, I likely don't know everything about your system. (This is actually true; the more you know about your system means that I consequently need to know less about it. Unless of course we're actually peers in this subject, in which case this post probably isn't for you; you probably already know how to communicate effectively with me.)

Now is not the time to get snide, snotty or sarcastic. Now is the time to be more helpful. Remember that you asked me for help; I was happy without your interruption. Asking questions means that I am going to try to help you, but I don't yet have all the information that I need to do so effectively. Sure, I may be able to discover the answers myself, but you providing useful answers saves my valuable time.

If I present follow-up responses, be they questions, suggestions, or recommendations, it is never, EVER acceptable to respond to me with only punctuation. Today, a particular individual customer (I'm refraining from calling him a jackass, but that's what I'm thinking) responded to my four response statements with a single question mark.

Yes, he typed "?" and sent that to me.

In no way was that an acceptable response. The only intelligent conclusions that I can draw from a response like that are either that it is a stray piece of punctuation left over from the dozens that had been ignored thus far in the conversation, or that he typed an entire intelligible question but his keyboard's keys have all stuck except for Shift, the question mark, and Enter. (They are right next to each other. The Right Control probably works, too....) So, he's still guilty of not checking what he typed before sending. This tells me that you don't value me or my time. So I hate you more...

=====

This is all exacerbated by the fact that I need SMS for my job in order to receive pages, and my last two jobs have required that I stay logged into IM while at work. Any person who happens to come across or guess my cell number or look up my ID at work can now waste my time, 'round the clock! It's these types of interactions that make me firmly believe that IM is a wonderful technology for facilitating the interruption and distraction of otherwise productive people.

Additionally, communicating effectively means that I don't need to post another rant and spread vitriol.

-Waldo

2009-11-17

On Trust (or "Whither Trust?")

So, what happens when you think that your two managers are lying to you? And when you call them on it, it seems like they deflect...?

I could go into great depth, but I don't think that it's necessary. In short, it was said that a former coworker cited me (personality, communication style, etc.) as a major reason to find employ elsewhere.

Now, I fully admit that I'm not the easiest person to get along with. (My bride is a freaking saint!) I just don't have time or patience for bullshit, so I can come off as surly or brusque at times. My friends know and usually appreciate this, whether or not they got it when we first met. (As is typical, most of my friends are from work, and usually first get to know me for being good at what I do. After they get to know me professionally, they then get to know me socially. By this point they're already attuned to me.) When it's time to bullshit, I'm all fun. When it's work, and I like you and have a cycle or two to spare, I'll joke while I work. But yes, otherwise I'm brusque.

Anyway, apparently my managers didn't know that he and I are still on good terms, and didn't figure that I would bother to check with him. They seemed stunned when I said that I had, but asked "why would he tell the truth when you called him on this?"

To me it seems simple:
-He has nothing to gain or lose by lying to me.
-My managers do stand to gain by this lie by keeping me on the defensive.

Well, we'll see what the future brings... (It just now occurs to me what a Bleeding obvious statement that is...).

I know that my work is held in high regard (having been told a few times recently that it is appreciated that my work comes in on-time and is of universally high quality, including today's meeting), but that I need to work on being "friendlier"...

Ah... I'll work on that...

One of the most frustrating parts of this whole thing is that being good at my job isn't enough. I'm catching flak for not playing some game.

-Waldo

2009-11-04

Rather be an Inventor than a Firefighter

Yesterday two separate people told me how much more relaxed I've been this week than the previous month or two.

Before this week, I'd been primarily responsible for fixing at least three or four major crises in a row. Fortunately, each waited until the previous had been resolved before kicking in, but still... In fact, in each instance there was a half to a full day between one and the next. It was just long enough to remember what Urgent task I was supposed to be working on, figure out where I'd left off, and start getting back into that when a cooling device and excrement would again meet. This has been going on since the beginning of September. A lot of things (including this blog and personal correspondence, as well as a significant amount of coding) had been set aside for the duration because of exhaustion.

I realized today how to explain what I enjoy and dislike about my career in IT: "I'd rather be an Inventor or Engineer than a Firefighter."

-Waldo

2009-09-17

Bad Week

It's been an exceptionally bad week. Unfortunately, I can't talk about it except in vague terms. But here we go...

A huge bombshell was dropped last Thursday in the early afternoon. (A week ago today, but feels much longer.) No one's fault; well, it was SOMEONE's fault, but there was no negligence. It was the kind of bomb that demands that all other work ceases and efforts are diverted to this activity. In addition, I worked a 15-hour day on Friday, and significant hours on Saturday and Sunday. It's also consumed almost my entire week.

Things are pretty much under control, but there is still stuff to do. On top of everything, this weekend is a scheduled server room outage, so odd hours for me and plenty of 'em. Oh, and I'm on pager duty this week, too.

Now, while this may sound like a gripe-fest, it isn't. I know that I'll be compensated for my time (in comp time), and all of the right kinds of attention have been garnered. More credit than I deserve (in my opinion) has been laid upon my shoulders, but it is nice to hear praise bestowed in my presence to the higher-ups. Not to mention the renewed respect of my peers.

No, crises sometimes aren't all bad, but I prefer to keep them to a minimum. I have been exhausted, and likely irritable. I just wish that my Bride and Childe didn't have to suffer from my work. Jen's been a saint.

The other casualties of work besides my peace-of-mind and my family include a perl script RSS-feed creator (freaking almost done), a DNS server that needs to be configured (shouldn't take too long, but still), and Jonathan's Project. I really hate that this last has sat just immediately to the right of my hand. I want to work on this so much, but I can't get more than 15 minutes of time without a damn interruption.

Enough people have commented on how I look or sound that I think that it may be prudent to take some time off. A weekend may be nice...

-Waldo