2016-12-23

That Product Team Really Brought The Room Together

As with last year, I participated in SysAdvent again, a tradition in which Systems Administrators submit and write a blog article of their choosing.  Similar to a Conference, a Call for Proposals is published, and if interested, you propose a Topic (a title and brief description are all that are required, if I recall correctly).  My proposal was accepted, an editor was assigned, and due dates were set.

I'd had a lot of notes on things that I wanted to say, but I struggled to construct a coherent narrative. I had lots of smallish topics that would overlap heavily in a Venn diagram, but they didn't present linearly in my mind.  Again, being something of a followup to last year's article/talk, there's a lot of related material, but also a lot omitted for time and space reasons.  (I needed to end it _somewhere_...)

Again, I offer my thanks to my editor Cody and the rest of the SysAdvent team for maintaining this tradition, and I sincerely appreciate the work that goes into keeping it going.  While I'd had this bunch of ideas floating around in my head (largely as a conference presentation), it was their work that made me sit down and write it out as prose, rather than a slide deck.

Now, with permission from the author, I present:  



Day 23 - That Product Team Really Brought The Room Together



Written by: H. “Waldo” Grunenwald (@gwaldo)
Edited by: Cody Wilbourn (cody@codywilbourn.com)

There are plenty of articles talking about DevOps and Teamwork and Aligning Authority with Responsibility, but what does that look like in practice?
Having been on many different kinds of teams, and having run a Product Team, I will talk about why I think that Product Teams are the best way to create and run products sustainably.

HEY, DIDN’T YOU START WITH “DEVOPS DOESN’T WORK” LAST TIME?

Yes, (yes I did). And I believe every word of it. I consider Product Teams to be a definitive implementation of “Scaling DevOps” which so many people seem to struggle with when the number of people involved scales beyond a conference room.
To my mind, Product Teams are the best way to ensure that responsibility is aligned with authority, ensuring that the applications that you need are operated sustainably, and minimizing the likelihood that a given application becomes “Legacy”.

What do you mean “Legacy”?

There is a term that we use in this industry, but I don’t think that I’ve ever seen it be well-defined. In my mind, a Legacy Product is:
  1. Uncared For: Not under active development. Any releases are rare, using old patterns, and are often the result of a security update breaking functionality, causing a fire-drill of fixing dependencies.
  2. In an Orphanage: The people who are responsible for it don’t feel that they own it, but are stuck with it.
If there is a team that actively manages a legacy product, they might not be really equipped to make significant changes. Most of the time they are tasked only with keeping this product barely running, and may have a portfolio of other products in similar state. This “Legacy Team” might have some connotation associated with it of being “second-string” engineers, and it might be a dumping ground for many apps that aren’t currently in active development.

What are we coming from?

The assumed situation is there is a product or service that is defined by “business needs”.
A decision is made that these goals are worthwhile, and a Project is defined.
This may be a new product or service, or it may be features to an existing product or service. At some point this Project goes into “Production”, where it is hopefully consumed by users, and hopefully it provides value.

Here’s where things get tricky.
In most companies, the team that writes the product is not the same team that runs the product. This is because many companies organize themselves into departments. Those departments often have technical distinctions like “Development” or “Engineering”, “Quality Assurance”, and “Operations” and/or “Systems” groups. In these companies, people are aligned along job function, and each group is responsible for only one phase of a product’s lifecycle.
And this is exactly where the heart of the problem is:
The first people who respond to a failure of the application aren’t the application’s developers, creating a business inefficiency:
Your feedback loop is broken.

As a special bonus, some companies organize their development into a so-called “Studio Model”, where a “studio” of developers work on one project. When they are done with that project, it gets handed off to a separate team for operation, and another team will handle “maintenance” development work. That original Studio team may never touch that original codebase again! If you have ever had to maintain or operate someone else’s software, you might well imagine the incentives that this drives, like assumptions that everything is available, and latency is always low!
See, the Studio Model is patterned after Movie and Video Game Studios. This can work well if you are releasing a product that doesn’t have an operational component. Studios make a lot of sense if you’re releasing a film. Some applications like single-player Games, and Mobile Apps that don’t rely on Services are great examples of this.
If your product does have an operational component, this is great for the people on the original Studio team, for whom work is an evergreen pasture. Unfortunately it makes things more painful for everyone who has to deal with the aftermath, including the customers. In reality it’s a really efficient way of turning out Legacy code.
Let’s face it, your CEO doesn’t care that you wrote code real good. They care that the features and products work well, and are available so that they bring in money. They want an investment that pays off.
Having Projects isn’t a problem. But funding teams based on Projects is problematic. You should organize around Products.

OK, I’LL BITE. WHAT’S A PRODUCT TEAM?

Simply put, a Product Team is a team that is organized around a business problem. The Product Team is comprised of people such that it is largely Self-Contained, and collectively the team Owns its own Products. It is “long-lived”, as the intention behind it is that the team is left intact as long as the product is in service.
Individuals on the team will have “Specialties”, but “that’s not my job” doesn’t exist. The QA Engineer specializes in determining ways of assuring that software does what it’s expected to. They are responsible for the writing of useful test cases, but they are not limited to the writing of tests. Notably, they’re not solely responsible for the writing of tests. Likewise for Operations Engineers, who have specialties in operating software, infrastructure automation, and monitoring, but they aren’t limited to or solely responsible for those components. Likewise for Software Engineers…
But the Product Team doesn’t only include so-called “members of technical staff”. The Product Team may also need other expertise! Design might be an easy assumption, but perhaps you should have a team member from Marketing, or Payments Receivable, or anyone who has domain expertise in the product!
It’s not a matter of that lofty goal of “Everyone can do everything.” Even on Silo teams, this never works. This is “Everyone knows enough to figure anything out”, and “Everyone feels enough ownership to be able to make changes.”
The people on this team are on THIS team. Having or being an engineer on multiple teams is painful and will cause problems.

You mentioned “Aligning Authority with Responsibility” before…

By having the team be closely-knit, and long-lived, certain understandings need to be had. What I mean is that if you want to have a successful product, and a sustainable lifecycle, there are some understandings that need to take place with regards to the staffing:
  • Engineers have a one-to-one relationship to a Product Team.
  • Products have a one-to-one relationship with a Product Team.
  • A Product Team may have a one-to-many relationship with its Products.
  • A Product Team will have a one-to-one relationship with a Pager Rotation.
  • An Engineer will have a one-to-one membership with their Pager Rotation.
Simply put, having people split among many different teams sounds great in theory, but it never works out well for the individuals. The teams never seem to get the attention required from the Individual Contributors, and an Individual Contributor is put in the position of effectively doubling their number of bosses and having to appease them all.
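
As a rough illustration, the relationships listed above can be written down as a small data model and checked mechanically. This is only a sketch in Python: the `ProductTeam` class, its field names, and the `check_alignment` helper are hypothetical names invented for this example, not part of any real tool.

```python
from dataclasses import dataclass, field


@dataclass
class ProductTeam:
    """Hypothetical model of one Product Team."""
    name: str
    engineers: list = field(default_factory=list)
    products: list = field(default_factory=list)        # one team, possibly many products
    pager_rotation: list = field(default_factory=list)  # exactly one rotation per team


def check_alignment(teams):
    """Return a list of human-readable violations of the staffing rules."""
    problems = []
    engineer_team = {}  # engineer -> the single team they belong to
    product_team = {}   # product -> the single team that owns it
    for team in teams:
        for eng in team.engineers:
            if eng in engineer_team:
                problems.append(
                    f"{eng} is on both {engineer_team[eng]} and {team.name}")
            else:
                engineer_team[eng] = team.name
        for product in team.products:
            if product in product_team:
                problems.append(
                    f"{product} is owned by both {product_team[product]} and {team.name}")
            else:
                product_team[product] = team.name
        # The pager rotation is the team: nobody outside it, nobody exempt.
        for person in team.pager_rotation:
            if person not in team.engineers:
                problems.append(
                    f"{person} carries the {team.name} pager but is not on the team")
        for eng in team.engineers:
            if eng not in team.pager_rotation:
                problems.append(
                    f"{eng} is on {team.name} but not in its pager rotation")
    return problems
```

Calling `check_alignment` on a list of teams returns an empty list when every engineer, product, and pager rotation lines up with exactly one team.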

Pager

Some developers might balk at being made to participate in the operation of the product that they’re building. This is a natural reaction.
They’ve never had to do that before. Yes, exactly.
That doesn’t mean that they shouldn’t have to. That is the “we’ve always done it this way” argument.

This topic has already been well-covered in another article in this year’s SysAdvent, in Alice Goldfuss’ “No More On-Call Martyrs”, itself well-followed up by @DBSmasher’s “On Being On-Call”.
In this regard, I say that if one’s sleep is on the line - if you are on the hook for the pager - you will take much more care in your assumptions when building a product than if that is someone else’s problem.
The last thing that amazes me is that this is a pattern that is well-documented in many of the so-called “Unicorn Companies”, whose practices many companies seek to emulate, but somehow “Developers-on-Call” is always argued to be “A Bridge Too Far”.
I would argue that this is one of their keystones.

WHO’S IN CHARGE

Before I talk about anything else, I have to make one thing perfectly clear. If you have a role in Functional Leadership (Engineering Manager, Operations Director, etc), your role will probably change.
In Product Teams, the Product Owner decides work to be done and priorities.
Within the team you have the skills that you need to create and run it, delegating functions that you don’t possess to other Product Teams. (DBAs are somewhat rare, and “DB-as-a-Service” is somewhat common.)
Many Engineering and Operations managers were promoted because they were good at Engineering or Ops. Unfortunately, it’s only then that it sets in that, in Lindsay Holmwood’s words, “It’s not a promotion, it’s a career change”. This is also addressed in this year’s SysAdvent article “Trained Engineers - Overnight Managers (or ‘The Art of Not Destroying Your Company’)” by Nir Cohen.
How many of you miss Engineering, but spend all of your time doing… stuff?
Under an org that leverages Product Teams, Functional Leaders have a fundamentally different role than they did before.

Leadership Roles

Under the Product Team paradigm, Product Managers are responsible for the work, while Functional Managers are responsible for the passing of knowledge and for overseeing the career growth of Individual Contributors.
Product Managers          Functional Managers
Owns Product              IC’s Professional Development
Product Direction         Coordinate Knowledge
Assign Work & Priority    Keeper of Culture
Hire & Fire from Team     Involved in Community
Decide Team Standards     Bullshit Detector / Voice of Reason

Product Managers

The Product Manager “Owns the Product”. They are ultimately responsible for the product successfully meeting business needs. Everything else is in support of that. I must stress that it isn’t necessary that a Product Manager be technical, though it does seem to help.
The Product Owner is the person who understands the business goals. With that knowledge and those stakes, they assign work and priorities such that they are aligned with those business goals.
Knowing the specific problems that they’re solving and the makeup of their team, they are responsible for hiring and firing from the team.
Because the Product Team is responsible for their own success, and availability (by which I mean, of course, the Pager), they get to make decisions locally. They get to decide themselves what technologies they want to use and suffer.
Finally, the Product Manager evangelizes their product for other teams to leverage, and helps to on-board them as customers.

Functional Managers

At this point, I expect that the Functional managers are wondering “well what do I do?” Functional Managers aren’t dictating what work is done anymore, but there is still a lot of value that they bring. Their job becomes The People.
I don’t know a single functional manager who has been able to attend to their people’s professional development like they feel that they should.
Since technology decisions are made within the Product Team, the Functional Management has a key role in coordinating knowledge between the members of their Community, keeping track of who’s-using-what, and the relevant successes and pitfalls. When one team is considering a new tool that another is using, or a team is struggling with a tech, the functional manager is well-equipped for connecting people.
Functional Managers are the Keepers of Culture, and are encouraged to be involved in Community. That community-building is both within the company and in their physical region.
Functional managers are crucial for Hiring into the company, and for helping Product Managers hire for skills that they aren’t strong in. For instance, I would run a developer candidate by a development manager for a sanity-check, but for a DBA, I’d be very reliant on a DBA Manager’s expertise and opinion!
Relatedly, the Functional Manager serves as a combination Bullshit Detector and Voice-of-Reason when there are misunderstandings between the Product Owners and their Engineers.

The Reason for Broad Standards

Broad standards are often argued for one of two main reasons: either for “hiring purposes”, where engineers may be swapped relatively interchangeably, or because there is a single Ops team responsible for many products, which doesn’t have the ability to cope with multiple ways of doing things. (After all, any one Engineer might be called upon to fix many apps in the dark of the night.)
Unfortunately, app development can often be hampered by those Standards that don’t fit their case and needs.
Hahahaha I’m kidding! What really happens is that Dev teams clam up about what they’re doing. They subvert the “standards” and don’t tell anyone, either pleading ignorance or claiming that they can’t go back and rewrite because of a deadline. Best case is that they run a request for an “exemption” up the flagpole, where Ops gets overridden. And Operations is still left with a “standard” and a pile of “one-offs”.

Duplicate Effort

Another claimed reason for broad “Standards” is to “reduce the amount of duplicated effort”. While this is a great goal, again, it tends to cause more friction than is necessary.
The problem is the fallacy that comes from assuming that the way that a problem was solved for one team will be helpful to another. That solution may be helpful, but assuming that it will be, and making it mandatory, is going to cause unnecessary effort.
At one company, my team ran ELK as a product for other teams to consume. A new team was spun up, and asked about our offerings, but also asked my opinion of them using a different service (an externally-hosted ELK-as-a-Service). I was thrilled, in fact! I wanted to see if we were solving the problem in the best way, or even a good way, and to be able to come back later for some lessons-learned!

Scaling Teams

At some point, your product is going to get bigger than everyone can keep in their head. It may be time to split up responsibilities into a new team. But where to draw boundaries? Interrogate them!
A trick that I learned a long time ago for testing your design in Object-Oriented Programming is to ask the object a question: “What are you?” or “What do you do?” If the answer includes an “And”, you have two things. This works well for evaluating both Class and Method design. (I think that this tidbit was from Sandi Metz’s “Practical Object-Oriented Design in Ruby” (aka “POODR”), which I was exposed to by Mark Menard of Enable Labs.)
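
To make the “And” test concrete, here is a minimal sketch in Python. The invoice classes are hypothetical names made up for this illustration; they are not taken from POODR or from any real codebase.

```python
class InvoiceService:
    """What do you do? "I total an invoice AND I format it for email."

    The "and" in that answer is the hint that two things live here.
    """

    def total(self, line_items):
        return sum(line_items)

    def email_body(self, line_items):
        return f"Amount due: {sum(line_items):.2f}"


class InvoiceCalculator:
    """What do you do? "I total an invoice." """

    def total(self, line_items):
        return sum(line_items)


class InvoiceFormatter:
    """What do you do? "I format a total for email." """

    def email_body(self, total):
        return f"Amount due: {total:.2f}"
```

The same interrogation applies to a team: if the honest answer to “What do you do?” needs an “And”, that may be where the boundary belongs.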

What Doesn’t Work

Because this can be a change to how teams work, it’s important to be clear about the rules. If there is a misunderstanding about where work comes from, or who the individual contributors work for, or who decides the people who belong to what team, this begins to fall apart.
Having people work for multiple sets of managers is untenable.
Having people quit is an unavoidable problem in any company. Having a functional manager decide by themselves that they’re going to reassign one of your people away from you is worse, because they’re not playing by the rules.

WARNING: Matrix Organizations Considered Harmful

If someone proposes a Matrix Org, you need to be extremely careful. It’s important that you keep a separation of Church and State. Matrix Organizations instantly create a conflict between the different axes of managers, with the tension being centered on the individual contributor who just wants to do good work. A Matrix Org actively adds politics.
All Work comes from Product Management. Functional Management is for Individual Careers and Sharing Knowledge.
This shouldn’t be hard to remember, as the Functional Leaders shouldn’t have work to assign. But it will be hard, because they’ll probably have a lot of muscle-memory around prioritizing and assigning work.
Now, I’m sure a lot of you are skeptical about how a product team actually works. You might just not believe me.
If you properly staff a team, give them direction, authority, and responsibility, they will amaze you.

GETTING STARTED

As with anything, the hardest thing to do is begin.

Identifying Products

An easy candidate is a new initiative for development that may be coming down the pipeline, but if you aren’t aware of any new products, you probably have many “orphaned” products already running within your environment.
As I discussed last year, there are plenty of ways of finding products that are critical, but not actually maintained by anyone. Common places to look are tools around development, like CI, SCM, and Wikis. Also commonly neglected are what I like to call “Insight Tools” like Logging, Metrics, and Monitoring/Alerting. These all tend to be installed and treated as appliances, not receiving any maintenance or attention unless something breaks. Sadly, it means that there’s a lot of value left on the table with these products!

Speaking with Leadership

If you say “I want to start doing Product Team”, they’re going to think of something along the lines of BizDev. A subtle but important difference is to say that you want to organize a cross-functional team, that is dedicated to the creation and long-term operation of the Product.
I don’t know why, but it seems that executives go gooey when they hear the phrase “cross-functional team”. So, go buzz-word away. While you’re at it, try to initiate some Thought Leadership and coin a term with them like “Product-Oriented Development”! (No, of course it doesn’t mean anything…)
What you’re looking for is a commitment to fund the product long-term. The idea is that your team will focus on solving a set of related problems. The team is made up of “Your People”, who become a “we”. Oddly enough, when you have a team focused and aligned together, you have really built a capital-T “Team”.

SUSTAINED

The Product Team should be intact and in-development as long as the product is found to be necessary. When the product is retired, the product team may be disbanded, but nobody should be left with the check. Over time, the features should stabilize, the bugs will disappear, and the operation of the application should stabilize to a low level of effort, even including external updates.
That doesn’t mean that your engineers need to be farmed out to other teams; you should take on new work, and begin development of new products that aid in your space!

CONCLUSION

I believe that organizing work in Product Teams is one of the best ways to run a responsible engineering organization. By orienting your work around the Product, you are aligning your people to business needs, and the individuals will have a better understanding of the value of their work. By keeping the team size small, they know how the parts work and fit. By everyone operating the product, they feel a sense of ownership, and by being responsible for the product’s availability, they’re much more likely to build resilient and fault-tolerant applications!
It is for these reasons and more, that I consider Product Teams to be the definitive DevOps implementation.

GRATITUDE

I’d like to thank my friends for listening to me rant, and my editor Cody Wilbourn for their help bringing this article together. I’d also like to thank the SysAdvent team for putting in the effort that keeps this fun tradition going.

CONTACT ME

If you wish to discuss with me further, please feel free to reach out to me. I am gwaldo on Twitter and Gmail/Hangouts and Steam, and seldom refuse hugs (or offers of beverage and company) at conferences. Death Threats and unpleasantness beyond the realm of constructive Criticism may be sent to:


Waldo  
c/o FBI Headquarters  
935 Pennsylvania Avenue, NW  
Washington, D.C.  
20535-0001

2016-11-01

On The DevOps Drinking Game

Just for laughs, as part of my "Fear and Loathing in Systems Administration" conference talk, I saw a real gap in our community's resources.  My research turned up nothing, so I took it upon myself to "be the change that I want to see in the world."

Thus was born a new GitHub project: The DevOps Drinking Game.

Enjoy, and Be Safe!

-Waldo

2016-08-29

On the Loss of a Star

Someone that I don't know died today. Apparently many of my friends have close personal relationships with famous actors and musicians. Apparently they don't invite me along when they hang out with their celebrity-friends. I can't blame them.
Fortunately for me, the works that I knew these celebrities from have been recorded, so that I may continue to enjoy them as I always have.
For those of you who had a relationship with the deceased, I am sorry for your loss.

2015-12-31

Fear and Loathing in Systems Administration

This year I participated in SysAdvent, a tradition in which Systems Administrators submit and write a blog article of their choosing.  Similar to a Conference, a Call for Proposals is published, and if interested, you propose a Topic (a title and brief description are all that are required, if I recall correctly).  My proposal was accepted, an editor was assigned, and due dates were set.

After fleshing out some further notes and constructing an outline, I set out to procrastinate until a couple of days before my due-to-editor deadline.  My editor, the fantastic Shaun Mouton (@sdmouton), promptly reviewed my content for style, sense, and sanity, and when we came to consensus that it was good enough, he submitted it for publication.

I offer my thanks to Shaun and the rest of the SysAdvent team for maintaining this tradition, and I sincerely appreciate the work that goes into keeping it going.  While I'd had this bunch of ideas floating around in my head (largely as a conference presentation), it was their work that made me sit down and write it out as prose, rather than a slide deck.

Now, with permission from the author, I present:  


Fear and Loathing in Systems Administration

Written by H. “Waldo” Grunenwald (@gwaldo)
Edited by Shaun Mouton (@sdmouton)

“DEVOPS DOESN’T WORK”

The number of times that I’ve heard this is amazing. The best thing about this phrase is that the people who say it are often completely right, even if for very wrong reasons.

Who Says This?

Well, let’s talk about the people who most commonly have this reaction: SysAdmins. I’m going to use the term “SysAdmins” as a shorthand for a broad group. The people in this group have widely varying titles, but it is most commonly “Systems”, “Network”, or “Operations” followed by “Administrator”, “Engineer”, “Technician”, or “Analyst”.

In some companies, these folks have the best computers in the place. In others, they have to live with the worst. Their workspace probably isn’t very nice, and almost certainly has no natural light. If there is a pager rotation, they are almost certainly on it. If there isn’t a rotation, they’re basically on-call all of the time.

During the course of a normal day they might have to switch contexts between disaster planning, calculating power and HVAC needs for a new datacenter, scrambling to complete an outage-driven SAN migration, rushing to address urgent requests to help people with their email, to troubleshoot a printing problem, or suss out why someone can’t get to their electric company’s bill pay website. They may be the sole person with database expertise in the company, or they may work on a team of dozens.

The work is largely invisible except when something fails, in which case it’s highly visible and widely impacting.

Bug vs Incident

These are typically cynical people, because there are only so many times that you can miss the team/department/company party for ostensibly “celebrating our successes” because something’s broken, and you’re left to clean up after the “success”. There are only so many times that one can watch a new project be announced and hiring begin for it. When asked who’s going to support the new project, the response is a blank look and “you are”. The “…of course.” may not be vocalized, but it’s probably there. When asked how many people they get to hire to help with the workload, the response is a combination of “sorry, but there wasn’t anything left in the budget”, “it won’t be that much more work”, or a variation of the “team player / good soldier” speech. There are only so many times one can take having requests for training or conference budget rejected out of hand or laughed out of the room.

They probably have basic working knowledge of a half-dozen programming languages, but most likely they often think in Shell. They probably know at least three ways of testing if a port is open, and probably have a soft spot in their heart for a couple of shell commands.

They may have seen or participated in a DevOps initiative that consisted of a team or position rename, or helpfully suggested that they install some Config Management and Monitoring software so that “we can DevOps now…” or “so we can do Agile”. When they hear “DevOps” or “Agile”, what they are hearing is: “Let’s take the same people who can’t handle a planned release schedule or make whatever effort that they need to squeak by the Change Board and Release Management requirements, and give them unfettered access to Production. Clearly, I’m not paged often enough.”

So what is one to do? How is one to maintain their sanity in the face of increasing job scope, increasing demand for access and velocity, and little hope for an effective new-hire count? Not to mention continuing to juggle the existing volume of requests, and continuing to grease the existing gears to keep the machine running.

GET HELP

Get Help

Please note that I’m not saying “just”. There’s nothing just about this situation; there is nothing simple about any of this, and Justice hasn’t been seen in a long time in an environment where this is the norm. Most of these changes are difficult. They will take work, and will require convincing other teams to join in your cause.

Admitting you have a Problem

The problem (probably) isn’t technical. It’s almost entirely social.

Because SysAdmins are typically responsible for the environment, the easiest way to assure that the state is stable is to lock everyone else out of it. While this helps with the goal of “keeping out unexpected changes”, it has a number of side effects.

First, a kind of learned helplessness has set in. Your customers and teammates have become so used to being “hands-off” that they don’t have the wherewithal to meet reasonable expectations. Since they’re uncomfortable making any changes, all changes must be made by the SysAdmins. This leads to your time being taken up by lots of low-value tasks.

Some teams settle on the pattern of “hands off Production, but you have access to Staging”, but this is fraught with peril. The most common problem that stems from this is “Configuration Drift.” Config Drift is when you have different settings in one environment (or server) than the others. When the cost to discover what Production looks like is high, it’s more likely that people will either use defaults, make assumptions, or use the same configs that they use in their IDEs. “Works on my machine”, indeed.
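
Config Drift as described above can be surfaced mechanically once everyone is allowed to see each environment’s settings. The following is a minimal sketch in Python, assuming each environment’s configuration has already been loaded into a plain dictionary; the `find_drift` function and the example keys are hypothetical, and a real Configuration Management tool would do far more than this.

```python
def find_drift(environments):
    """Map each drifting setting to the value it has in every environment.

    `environments` maps an environment name to a dict of its settings.
    A setting that is missing from an environment is reported as None.
    """
    all_keys = set()
    for settings in environments.values():
        all_keys.update(settings)

    drift = {}
    for key in sorted(all_keys):
        values = {env: settings.get(key) for env, settings in environments.items()}
        # More than one distinct value means the environments disagree.
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Example data, invented for illustration only.
example = {
    "staging": {"pool_size": "5", "timeout": "30"},
    "production": {"pool_size": "50", "timeout": "30", "tls": "on"},
}
```

Running `find_drift(example)` reports `pool_size` and `tls` as drifting and leaves `timeout` out, since both environments agree on it.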

This is a problem well-solved by Configuration Management tools, but you still need to be willing to trust your peers and give them access. If you want to be part of the process of validating changes, you could put in place the structures that allow a pull-request and code-review workflow, something that your Software Engineering peers should be very accustomed to! Granting access to see the existing configs and the ability to propose changes also shares responsibility for your team’s environments and contributes to feelings of ownership. Denying colleagues the ability to effect necessary configuration changes contributes to the root problems of configuration drift and learned helplessness.

Stop Feeding the Machine

Don't feed the machine

Your value is not in doing the work, but rather being able to make the decision to do the work.
I’ll be the first to say that “Automating ALL THE THINGS” is a flawed goal. At work, it’s usually said in the context of a Project, rather than part of a philosophy of Continual Improvement (Think Toyota Kata). You shouldn’t have to engage in an “Automation Project” to improve your environment. Build into your schedule time to solve one problem. Pick something that is rough, manual, and repeatable. Remove a small piece of friction. Move on to the next one. Hint: Logging into a server to make a configuration change should be a cue to implement configuration management!

While I agree with everything being automated, not everything should be automatic. Decision-making is complex, and attempting to codify all of the possible decision-making points is a fantastic way to make yourself insane. Not to mention that documenting your decision-making processes may be an unwanted look inside your brain. Caveat Implementor. (Or perhaps that’s just me…)

All of the units of work should be automated. But the decision to run the now-automated tasks can be left to a human. When you find that there is a correlation between steps, those pieces should be wrapped together. Automation isn’t a project unto itself. It should be iterative. Pick something that’s painful. Make it a little smoother. Repeat. Ideally, you have time blocked out for Continuous Improvement. If not, create a meeting, or create a weekly project to do so. Review the issues that you’ve experienced lately, and pick something to make better. It might be worth making into a project, but it won’t be an ALL-THE-THINGS project. Create a scope of effort. Take the time to plan goals and requirements.
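
One way to picture “automated, but not automatic” is a runner that executes scripted units of work only after a human approves each one. This is a minimal sketch in Python; `rotate_logs`, `restart_service`, `ask_operator`, and `run_with_approval` are hypothetical placeholders standing in for whatever tasks you have actually automated.

```python
def rotate_logs():
    """Placeholder for an automated unit of work."""
    return "logs rotated"


def restart_service():
    """Placeholder for another automated unit of work."""
    return "service restarted"


def ask_operator(step_name):
    """Leave the decision to a human at the keyboard."""
    return input(f"Run {step_name}? [y/N] ").strip().lower() == "y"


def run_with_approval(steps, confirm=ask_operator):
    """Run each automated step only if `confirm` approves it."""
    results = []
    for step in steps:
        if confirm(step.__name__):
            results.append(step())
        else:
            results.append(f"skipped {step.__name__}")
    return results
```

An operator would call `run_with_approval([rotate_logs, restart_service])` and answer the prompts. When two steps always turn out to be approved together, that is the cue to wrap them into a single unit.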

Whatever you don’t automate must be documented. Beyond the typical benefits of documentation, it also serves as “Functional Requirements” for someone else to pick up when they can help you with providing a solution. Try to recognize whether documenting or automating takes longer. Perhaps this piece of documentation will be better served by “Executable Documentation” (i.e. code).

Clarify Your Role

Role-Playing Group

You should attempt to pick apart the parts of your work, and attempt to describe them. One way to make this a fun exercise is to use other job titles to describe the work.

Are you an “Internet Plumber”? How much of your job could be described as “Spelunking” into the deep dark caverns of Legacy systems?

If you want, you could ascribe Superhero names to these parts of your work. The added bonus is that it not only describes a role, but also a demeanor associated with them. When ‘bad code’ makes it to Production, do you go “Wolverine” on that dev team?

Could you describe part of your role as “Production Customs Official”? Are you the gateway to Production? If so, are you actually equipped to do that? Here’s a quick test: When you say “no, that can’t go”, do you get overridden?

More importantly, is this what you want to do?

Prepwork

You will need to prepare for this. Most SysAdmin teams do not have a healthy relationship with the rest of the business. You will need to initiate the healing.

Take someone to lunch. Preferably someone who you don’t know well. Ask questions, and listen to the answers. It is not time to defend yourself or your team. It’s time to find out what the business needs from someone else’s perspective. Ask what they think your team’s role is in achieving that success. Ask what they think your team does well, and where there are gaps between what you have now and excellence.

Speak their language

You probably recognize their words, but you need to go out of your way to speak them. To communicate your message, you will need to meet them on their turf. This may seem terribly unfair - “Why can’t they meet me on my terms?!” - but I’m guessing that has not been working out well for you so far.

Not only do you need to use their language, but you need to communicate over their medium. And identifying who they are is step one in learning to speak it. It’s probably not IRC, and only writing it in email is a good way for it to be ignored.

If you’re speaking to management, be prepared to write a presentation. Executives especially like to see a slide-deck. It doesn’t have to be slick. It probably shouldn’t have sounds or much in the way of transitions, but a presentation can help to lay the groundwork for a conversation.

Discuss Scope, Staffing, and Priorities

Now that you have described your role, we also need to describe everything that you support.
What Products do you support? It’s entirely possible (likely, even) that the people and teams that you support don’t actually know what you’re responsible for. It could be argued that most of them shouldn’t need to know. But if you have been saying “no” to protect yourself, it’s a sign that you are significantly overextended. You need to have a real discussion with your leadership about your role, scope, and staffing.

In order to have this discussion, you need to prepare. You need to come up with a fairly comprehensive list of the products and teams that you support. This is a list of every team, their products, and the components and tasks that belong to you for each. Don’t forget all of the components that “nobody owns” but somehow people come to you to fix or implement (CI, SCM, Ticketing, Project, and Wiki tools seem to be common examples). Are you also responsible for Directory Services? Virtualization platform? Mail/Chat/Phones? Workstation Purchasing and provisioning? Printers? Do you manage the Storage, Networking, etc? Don’t be afraid of getting into details. It can help to provide clearly written descriptions of the potential impact to the company if some of these “hidden” services stop working. Your leadership might not know what LDAP or Directory Services are, but they’ll understand if nobody can log into their machines, they can’t pull information to build reports, and by-the-way nobody can deploy code because it relies on validating credentials…

What is most important to the company? What do you need to succeed? How much more staff do you need? What tooling or equipment would help you work more efficiently? Does code deploy even when it fails testing? How many outages have arisen due to this happening?

Demonstrate Cost and Value and Revisit Priorities


In order to have meaningful discussions with people in your company who aren’t necessarily technical, you need to be able to relate to a language that they speak. Regardless of team duties, the lingua franca of most teams is money. As Engineers, most of us prefer to think in terms of the tech itself, but in order to describe an impact, a unit of monetary value is a proxy for impact that most non-technical people can understand, even if they don’t grasp the details.

It is a helpful (if difficult and uncomfortable) habit to get into, but I encourage you to consider the components of cost that go into every incident or task.

What is the cost of a main-site outage? How much revenue does this feature bring in? Why are you spending so much on infrastructure and effort to make that component Highly-Available? Why does it matter that you do that piece of maintenance? Show the negative value of doing things the way they are (Opportunity Cost), versus investing time to improve the automation around it. Describe how doing this maintenance work reduces your context switching, unplanned outages, and lost reputation of your company. Describe the benefit in increased visibility to the business, and Agency to be gained by your peers on other teams.
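
To make "the cost of an outage" concrete, a back-of-the-envelope calculation is often enough. The sketch below is only that: the function name, its parameters, and every figure in it are made up for illustration, and it deliberately leaves out harder-to-quantify costs such as reputation.

```python
def outage_cost(minutes_down: float, revenue_per_hour: float,
                responders: int, loaded_hourly_rate: float) -> float:
    """Rough cost of one outage: lost revenue plus the time of the people fixing it."""
    hours = minutes_down / 60
    lost_revenue = hours * revenue_per_hour
    labor = hours * responders * loaded_hourly_rate
    return lost_revenue + labor

# Invented figures: 90 minutes down, $4,000/hour in revenue,
# 3 responders at a loaded rate of $100/hour.
cost = outage_cost(90, 4000, 3, 100)
print(f"${cost:,.2f}")  # prints $6,450.00
```

Even a crude number like this gives a non-technical audience something to weigh against the cost of the maintenance or automation work that you are proposing.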

Why put in place these tools to let product teams self-serve? Describe that the features that the company’s teams spend so much time and effort (read: “money”) creating mean nothing if those features aren’t available for customers to use, and that having those features unavailable costs money in terms of feature billing and reputation. If they claim that they’re doing Agile, but can’t do Continuous Delivery, they’re not really Agile, and the whole point of that framework is to improve delivery of value to the customer and the business!

Further, show how systems relate. It doesn’t have to be terribly detailed. Describe that the features that the customers use are reliant on x, y, and z components of infrastructure. Draw the lines from LDAP to storage to your CI tool to testing code to artifacts delivered to Production. Then show some of the other systems that have similar dependencies.
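
If drawing those lines by hand gets unwieldy, the same picture can be kept as a small dependency map. This is a sketch under assumptions: the component names and their relationships below are invented, and every dependency is assumed to also appear as a key in the map.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical map: each component lists what it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "ldap": [],
    "storage": [],
    "ci": ["ldap", "storage"],
    "artifacts": ["ci", "storage"],
    "production-deploys": ["artifacts", "ldap"],
    "reporting": ["ldap"],
}

def impacted_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return everything that directly or indirectly relies on `failed`."""
    # Invert the edges: component -> the things that depend on it.
    dependents: Dict[str, List[str]] = {}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk outward from the failed component.
    seen: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(impacted_by("ldap", DEPENDS_ON)))
# prints ['artifacts', 'ci', 'production-deploys', 'reporting']
```

Asking "what breaks if LDAP goes away?" and getting back nearly everything is the same argument as the diagram, in a form that is easy to keep current.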

Once the picture emerges showing how everything is reliant on unexciting things like LDAP, your Storage cluster, and that janky collection of angry shell and perl scripts that keep everything working, realization will begin to dawn.

Congratulations, you’ve just effectively communicated Value.

Align Responsibility with Authority

Are you held responsible for apps written by other people? Who gets paged when “the app” goes down? How does that make sense?

Get Devs on-call for their apps. SysAdmins should be escalated to. Devs can triage and troubleshoot their own apps more readily than you can. They get to call in the cavalry when they get stuck. They don’t need to know everything about the systems, and they don’t need to resolve everything. When a fault occurs and they need help, they stay on the call, pairing with you as you diagnose, troubleshoot, and resolve. That way, they don’t need to escalate to you for that thing the next time it occurs, and can collaborate on automating a permanent fix.

When teams aren’t responsible for their products - When they aren’t paged when it fails - they are numb to the pain that they inflict. They’re not trying to cause pain; they just don’t feel it. It’s especially easy to argue this for teams that proclaim that they use Agile development methods: If they claim to want “continuous feedback”, there is nothing more visceral for providing feedback than the feeling of being awoken by a pager in the middle of the night. When the inevitable exclamation comes that “we can’t interrupt our developers”, ask if it makes sense to interrupt someone else.
Even being aware of the pain (say, hearing how many times you were paged last night) can elicit sympathy, but that’s a far cry from the experience of being paged yourself.

Further, this is what that list of responsibilities is for. Even after asking each team to take responsibility for their own products, you will still likely have a hefty list of services that you provide that you are on-call for. As these set in, point out the staffing numbers. This may be a matter of the places that I have worked, but I have never seen a Developer-to-SysAdmin ratio of less than 5-1. In most places it is much higher. By adding these teams to pager rotations, they drastically reduce the load on you. By not adding them to pager rotations, they are complicit in your burnout.

Stop saying “No”


SysAdmins have a reputation for saying “No”. The people who are asking are probably not trying to make your life worse; they’re probably just trying to get their work done. They might not know what their “simple request” involves, and that it probably isn’t necessary.

But by not having Responsibility aligned with Authority, you may have been stuck with the pain of other people’s wishes. You know that fulfilling their request will cause you pain, so understandably, you say “no”. What often happens next is that they escalate until they hit someone important enough to override you.

This is the basis for why SysAdmins feel steamrolled by everyone else, and everyone else feels held hostage by SysAdmins.

But all hope is not lost.

Stop saying “No”.

“Yes, but …” is a very powerful thing.

“Yes, but …” can be used to get you help.

“Yes, I can set that up for you, but we don’t have capacity to run it for you.” What happened there? You agreed that the request is reasonable. You set expectations of the level of support that you can give. You left the requestor with several options to continue the conversation.
  • They might have hiring reqs that they can’t fill. You can negotiate for some of them to go to your team, as you’re clearly understaffed.
  • Some of their engineers may join your team as a lateral move. They’ll need mentorship and training, but this kind of cross-training is invaluable. It’s a force multiplier. It also sets precedent.
  • They might take the responsibility for the Thing. They run it. They get paged for it. Of course you will probably have to be an escalation point to assist when it fails, but it’s their product. This again sets precedent.

Delegate

Most SysAdmins are stuck doing tasks that provide very little value because they restrict access to their peers. To my mind, there is one perfect example: “Playing Telephone”.

When I say “Playing Telephone”, I’m talking about the situation where someone (let’s say a Developer) wants logs from the application, but they don’t have access to get them. They request the logs from you. You fetch the log requested and provide it to them. “No, not that log, this log…” You fetch. “Hmm, I’m not seeing what I’m looking for, could you check in here for something that says something like this …?” And so on, and so on…

I don’t know what you’re hoping to prevent by restricting access, but if this scenario ever happens, you should know that you’re providing Negative Value. Again, let’s try to remember that your peers are not out to get you, and can probably be trusted to be reasonable humans if you meet them mid-way.

With that framework in mind, it’s time that you demonstrate some trust, and Delegate to them. Give them access. Your value is not in the logon credentials that you have, otherwise you’re just a poorly-implemented “Terminal-as-a-Service”.

Even better than giving access, is giving Tooling. Logging into a server should be an antipattern for most work! You need some better tooling. So, with the example of logging, let’s talk tooling.

Logging

First, logging into boxes to get logs is just dumb. Sure, you could wrap a tail command in a Rundeck job, but let’s Centralize those logs while we’re at it.

SysLog is better than nothing, but not by much. Shipping logs is easy, but consuming them as something useful is not. Batteries not included.

If your company wants to spend the money on Splunk, then encourage that. Splunk is a fantastic suite of tools, but I might wave you away from it if you’re not going to use it for everything. It’s going to be expensive, and if you’re not going to spend enough to use it for everything, there will be confusion as to what’s in there, and what’s stored elsewhere.

ELK (ElasticSearch + Logstash + Kibana, sometimes mistakenly simplified to “Logstash”), or a “Cloudy Elk” / “ELK-as-a-Service” is a good middle-ground. ELK is Free (as-in-beer), and very featureful.

Take your Centralized logging of choice, and provide your customers with the url to the web interface. Send them links to the “How to use” docs, and get out of their way!

Terminal-as-a-Service

If someone asks you to “run this command for me”, you need to put a button on it.

You don’t need to RUN-ALL-THE-THINGS!

Rundeck is a fantastic tool to “Put a button on it”. Other people use their CI tools (like Jenkins or Bamboo) for this. My friend Jeremy Price gave an Ignite Talk at DevOpsDays NYC 2015 that describes this.

Personally I like Rundeck, because it’s pretty easy to make HA, tie it into LDAP for credentials, manage permissions, and by shipping its logs (see what I did there?), you get Auditing of who ran what and when!

If you have some data that Must be restricted, try to isolate those cases from the rest of your environment. You shouldn’t have to restrict Everything just because Something does need isolation.

Deploying Code. Yes, to Production

Why would you want to have to deploy other people’s code?! Do you really provide any value in that activity? If the deployment doesn’t go well, you’re launching another game of “Telephone”.
What if you make it easy for them to do it? Empower them with trust and tooling, making it easy to do the right thing! Give them tooling to see that the deploy succeeded! Logs are a start, but Metrics Dashboards that show changes in performance conditions and error rates will make it plain to see if a deployment was successful!
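
As one hedged example of what "plain to see" could mean, a dashboard or post-deploy check might compare error rates from equal time windows before and after the deploy. The function names and the one-percentage-point tolerance below are assumptions for illustration, not a recommendation for any particular metrics tool.

```python
from typing import Tuple

def error_rate(errors: int, requests: int) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0

def deploy_looks_healthy(before: Tuple[int, int], after: Tuple[int, int],
                         tolerance: float = 0.01) -> bool:
    """before/after are (errors, requests) from equal windows around the deploy.

    The deploy passes if the error rate did not rise by more than `tolerance`.
    """
    return error_rate(*after) <= error_rate(*before) + tolerance

print(deploy_looks_healthy((5, 1000), (8, 1000)))   # prints True
print(deploy_looks_healthy((5, 1000), (40, 1000)))  # prints False
```

A check like this does not replace looking at the dashboards; it just turns "did the deploy go well?" into a question the deploying team can answer for themselves.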

This Freedom doesn’t come free. Providing tooling doesn’t absolve the development teams of the need to communicate; in fact, it’s likely that they’ll have to communicate more. They will need to be watching those dashboards and logs to see for themselves the success of every deploy. They will also be more readily on-hand to help triage the inevitable instances when it doesn’t go swimmingly.

US

I say “They” in this article a lot. And that is because, by default, most organizations that I have been a part of or heard stories of have had a strong component of “Us-Versus-Them.” It’s only natural for there to be an “Us” and a “Them”, but thinking in those terms should be a very short-term use of the language. Strive for the goal of a “We” in your interactions at work, and reinforce that language wherever possible. While it may not be My job to do “foo”, it is Our job to ensure the team and company is successful.

While that may sound like some happy-go-lucky, tree-hugging, pop-psychological nonsense (and it is… :), the goal here is to get you, the beleaguered SysAdmin, the help that you need, in order to improve the capabilities of the business.

CODA

There is so much more to this topic, particularly the shift away from a Systems team supporting a bunch of Project teams to a series of largely self-sustaining Product teams, but that will have to wait for another day.

The psychological damage done to SysAdmins by their peers can make us bitter and cynical. I encourage my people to try to see that “They” aren’t trying to make life difficult for you, but it’s very likely that Authority and Responsibility are misaligned. I likewise encourage my people to take steps to make their lives better. A ship’s course is changed in small degrees over time.

When someone says “DevOps Doesn’t Work”, they’re absolutely correct. DevOps is a concept, a philosophy, a professional movement based in trust and collaboration among teams, to align them to business needs. A concept doesn’t do work, and a philosophy does not meet goals - people do. I encourage you to seek out ways of working better with your fellow people.

GRATITUDE

I’d like to thank my friends for listening to me rant, and my editor Shaun Mouton for their help bringing this article together. I’d also like to thank the SysAdvent team for putting in the effort that keeps this fun tradition going.

CONTACT ME

If you wish to discuss with me further, please feel free to reach out to me. I am gwaldo on Twitter and Gmail/Hangouts, and seldom refuse hugs (or offers of beverage and company) at conferences. Death Threats and unpleasantness beyond the realm of constructive Criticism may be sent to:

Waldo
c/o FBI Headquarters 
935 Pennsylvania Avenue, NW
Washington, D.C.
20535-0001

2015-12-15

On SysAdvent 2015

Knowing that I'd regret it if I didn't, I took on yet another task, and wrote a thing for SysAdvent.

"Fear and Loathing in Systems Administration"

This is a title that I'd had kicking around in my head for a while, and I have started shopping it around as a conference talk.  I'm hoping to get some feedback on the article content in order to tweak the presentation content.

2014-07-02

On Raymond Ramos

I want to talk about my friend Raymond Ramos.  My friend is dead.  He's been dead for nine years, and I still miss him.

One night while deployed to Iraq, I found myself "on duty".  At this particular instance, my duty was to inflate the ego of the CO and SgtMaj, and relay useful news to anyone who happened by the desk.  While armed.  In theory, if Al Qaeda came by, it was my job to kill them, but in reality there was no way for them to not have the jump on me, and the best that I could do is die a loud death.  Even in Iraq, Egos must be stroked.  And they were.

While I sat bored, uncomfortable, and awaiting zealous but non-specific rage (or a passing ego), someone who knew both of us told me that Ray had died.  My friend, who had been honorably discharged months earlier, had died in Alabama while traveling for a job interview.  My assistant took over the watch, as I walked away in tears.

He had made it through his enlistment.  He was supposed to have been safe.  That's probably what had hit me the hardest; "unsafe" was a deployment to a hotspot, or a mission.  "Unsafe" meant armed, briefed, and geared-up.  He was supposed to have been safe.  He made it through his enlistment intact.  Game over, you win, enter initials.

Nope.  See, "unsafe" is a sliding scale.  While I was on the far side of the scale (but not all the way over), in my mind, he was swaddled safe in civilian life.

While our CO was a raving asshole (which contrary to public image, is absolutely not normal), he did say exactly one useful thing when we landed in-theatre.  I don't recall exactly what he said, but what I absorbed is this: "Death can be a lightning strike on a clear day.  It can come at any time, from any direction.  When it's your time, there's nothing to do about it."

I'd absorbed this for myself, and my friends and colleagues in-theatre.  But Ray had been safe.

I never got an official story, but this is what I recall being told.  Ray had been driving for a job interview, and lost control of his car, went off the road, and was severely injured.  I seem to remember that he'd left messages for his girlfriend, but she didn't get them until he had died.  I wonder if he didn't have cell reception, and just left voice memos.  As I'd heard it, the last one (possibly the third), he was resigned to death, and said goodbye.

When a Marine departs a unit (whether a change of duty or when leaving service), it is customary - at least for those in good standing - to receive a plaque commemorating their service.  Naturally, the more rank that one has, the more extravagant the plaque.  Those who won't be missed get something generic, but a good plaque is apparent.  Ray was a Corporal when he got out.  He may have picked up Sergeant right before he got out, but it would have been on his way out.

Normally a collection is taken from among everyone in a given unit for a given Marine's plaque.  You pay into each plaque as one of the best and most direct examples of "paying it forward" that there is in this world.  Plaques usually come from a local awards company (where you'd go have bowling trophies made), and the standard no-effort plaque has the Marine's name and dates with the unit, and the unit logo.  Perhaps a motivational phrase or a quote that the Marine was fond of saying.

Now, Ray wasn't what you could call a "Marine's Marine."  He wasn't "hard-core" or "gung-ho", or any of that bullshit.  You couldn't imagine him shouting "Let's take that fucking hill" outside of a joke.  In fact, Ray's demeanor could be said to piss some people off; Sergeants Major either loved him or hated him.  You could reliably predict how much a given Staff NCO would like him based on how comfortable they were with technology newer than, say, the typical ballpoint pen.  If they were the kind to send a "runner" rather than an email, they hated him; if they theoretically knew the difference between "Reply" and "Reply All", they probably loved him.  (NOTE: Senior Staff NCOs are very rarely more savvy than this.)

No, Ray was never going to win a "Marine / NCO of the Quarter" board.  He didn't really know Marine Corps history.  He knew the day-to-day customs & courtesy, but not much beyond that.  He shuffled when he walked, and never re-soled his boots, so much so that he and a few of our friends referred to them as his "Combat Slippers"; it didn't help that they weren't polished, and were only barely what you could describe as "black".  His uniform looked like it had once met an iron, but they weren't more than casual acquaintances.  This was very foreign to me coming from an occupational field that was very starch-and-polish, as well as dirty-in-the-field; I was used to being one or the other.

Please keep in mind that this was circa-2004.  Google was around, but not in common use yet.  Technet and MSDN were around, but search sucked, and information was categorized.  (Which meant invariably that the taxonomy made sense, but not to you.)  There was no Stack Overflow.  While Windows XP/2003 were out, most of our domains were NT, groupware was Exchange 5.5, and workstations were either NT or 2000.  The only Unix was in the specialized Intelligence boxes, and I was probably one of 10 people on the base who'd even installed Linux.  This is what we had.

Ray and I were SysAdmins.  SysAdmins in the Marine Corps don't code.  Programmers code, and SysAdmins aren't Programmers.  Being a Programmer in the Marine Corps means that you will be spending 80% of your time working on the CO's Pet Project, which is usually some stupid website that nobody will really use and won't survive your departure.  The consequence of this is that the work that you should be doing falls to your compatriots, who are few in the first place, and already barely keeping things afloat.

But he had this insatiable curiosity, and this provided a great intellectual focus for him.  Unfortunately the big boss found out, and the next thing you knew, a Pandora's Box of pet projects exploded in our office.

So, one day Ray's hacking on something like a CMS for shit nobody cares about, or an in-browser chat client, and he's pondering.  Pondering hard.  "How can I....?"  I don't remember if he exclaimed "EUREKA!", but it might as well have been.  Next comes "I KNOW!  I'LL DO IT WITH A COOKIE!"  This was the first time I'd seen demonstrated the pure joy that comes when a solution presents itself.

So, when it came time for him to leave, knowing that the unit would get him something generic, I took care of it.  I wanted him to have something special.

Ray loved him some Star Trek.  Original, Next Generation, whatever.  So, I went to a hobby shop, picked up a roughly 10" version of NCC-1701, assembled, painted, and decaled it.  I took it into the awards shop with an idea and a quote, and they said "no problem."

While the quote that Ray was most fond of was to reply to statements with "...But does that make it right?", what I had inscribed on his plaque was "I'll do it with a Cookie!"  

A going-away party in the Marine Corps can range from a don't-let-the-door-hit-you-on-the-ass-on-the-way-out just about all the way through a weekend-long Roman orgy of an alcohol-and-grilling bender.  Most of the time, it's making sure that your immediate office and some unit friends have lunch before you leave.  Often it's with said Marine's car packed, and they're about to drive off base for the last time.

Ray's going-away party was a lunch at El Cerro Grande, the local Mexican establishment.  We had a pretty good turnout.  There were stories told, and a few last-minute items he'd forgotten.  After the food and nostalgia finished, and the awkward and anxious quiet settled, some of us gave mini-speeches.  MSgts Brown and Ayo said something, and they turned it over to me.  I said a few things, reflecting on our past few years.  Then I presented his plaque.

He cried and hugged me.  I cried.

This is how I remember my friend Ray.

Ray indirectly taught me much more important things than he directly taught me.  The most useful to my life and career is that being autodidactic is very useful.  Being able to state a problem, ask an actionable question, and research an answer is amazingly useful.  He also taught me that Programming is not (necessarily) evil, and formal training is not required to learn it.

Let us end with a toast.  "To Absent Friends..."

-Waldo