Keep calm and use the runbook (cortex.io)
120 points by kiyanwang on July 4, 2022 | 50 comments


It is obvious to me that keeping the runbook up to date is really important. But how do you foster a culture in which everyone keeps it up to date?

Many people don't even read the readme, let alone correct it. Is it cultural? If you are so privileged that your runbooks actually get updated, please share your ways!


Runbooks, checklists, playbooks, SOPs (standard operating procedures), broadly fall into two categories:

- those that always need to be followed

- those that provide helpful and welcome direction on how to do an unknown or rarely occurring task

It needs to be clear which is which.

An example of one that always needs to be followed is the deployment process. It changes from time to time. The person running it may perfectly remember all the steps from the last time it was run. But this time, a new step may have been added. Reading and following it step by step is critical.

How do you keep these up to date? When things break, do a retrospective. Often, the action item needed to prevent a recurrence will be an update to the runbook.

For the runbooks that are "nice to have", you should hear about problems from the people that run them. I regularly hear that such-and-such instructions no longer work, or such-and-such step needs to be changed. People tend to appreciate these types of instructions more as they can see how they make their lives easier. Because of that, they tend to take care of them more.


> But how do you foster a culture in which everyone keeps it up to date?

Every service has a runbook. To ensure that they're kept up to date, teams are required to do twice-yearly Disaster Recovery (DR) drills where they fail their service out of one region or data center and into another. (This is for each service. They can be rescheduled - modulo some nagging - if absolutely necessary, but you still need to do them eventually as a part of OKRs / KTLO.)

DR exercises by necessity require that the runbook is up to date and that capacity measures and load testing are recent and accurate. By doing these semi-regularly, your team retains full familiarity with the operational characteristics of the system. Your service, your data stores, your dependencies, and your dependents. Hopefully you (or adjacent teams) won't discover any unknowns, but if you do, they also belong in a runbook or postmortem with action items set to address the issues.

Furthermore, every team keeps a weekly "oncall log", which is essentially a human aggregated and summarized log of pages received. The oncall engineer logs anything that crops up, including new failure classes, and triages fixing them (or gets assistance if they're junior). Novel happenings can easily be discussed during oncall handoff or briefly during standup. Some of these items will naturally find their way into the relevant runbooks.

And as another commenter suggested, use Google Docs. These should be super easy to edit with zero friction. The requirement of creating a PR increases the odds that something will get skipped.

FWIW, several of my previous teams required more than five nines of uptime as we'd lose millions for outages. We had a very mature process.


It's cultural.

Many companies will block folks from releasing "incomplete" work, so you end up with either no runbooks, or "complete" runbooks that go out of date incredibly quickly.

Optimising for editing (https://onlineornot.com/incident-management/incident-respons...) helps.



Yes, it's cultural, but you can change culture.

Simple checklists that the team agrees upon for tasks that should be completed as part of a certain process. Just an example: after each incident, do a postmortem, part of which is an update of the runbooks and a quick share of those updates within the team. It doesn't have to be bureaucratic or fancy; everything can be short and skipped when it's really not relevant. But when everybody on the team agrees with the checklists, and you have some sort of peer review baked in, it will be a lot easier.

Because the same people who don't update the runbooks do actually feel it would be better if they did.


I joined a new team about two years back.

They were very reliant on runbooks for operational things.

Problem was, at least half of them were out of date.

Need to access logs on a machine when they aren't being sent someplace like {Datadog, Cloudwatch, etc.}? Here are the SSH instructions. But wait, this key is no longer in the secure vault. etc. etc. with no updated information on how to get the new IAM or whatever permissions needed for SSH access.

I think calling it a culture problem is a cop-out, unfortunately.

Documentation woes, in and of themselves, exist everywhere in this industry. I don't know what the solution is, but I just don't think it's a culture thing.

If anything, it should be a mental health thing. You're on-call and get paged at 3AM about some metric drastically falling and you can't get the logs on your usual log system and need to log into a machine or cluster directly. And since your team is heavily runbook-reactive (new term I just coined to describe a team reliant on runbooks to solve "known" problems rather than learned knowledge), you're totally stuck and under a lot of pressure precisely because the runbook is no longer applicable. This is never fun for anyone.

One possible approach is setting up a type of Red vs Blue team monthly or quarterly event. Half the team spends the week randomly "breaking" things (ideally in a stage or other, more isolated environment) and the other half gets the pages, learns which runbooks are out of date or otherwise no longer correct, and fixes the issues both operationally and in the runbooks. Additionally, in the process one team learns about what kinds of issues can break the system(s) and the other team learns about what kinds of fixes can return the system(s) to working order.

Then switch off roles every other time.


In the onboarding welcome call, one of the first things that I share is that if the new team member notices an issue in any of the documentation, they should either correct it, or at least raise a flag. You need to build the culture from the very start.


Make it part of the onboarding process! Fresh eyes catch things in a way yours haven't in ages - it also doubles as a way to initiate new hires into your processes. If there's a problem, they might even notice it and learn something along the way!


This is a great idea, but unfortunately many runbooks can only be meaningfully evaluated in a situation in which they are useful. That is - you need to instigate an outage (ideally, in a controlled reversible way[0]) before you can "test" a runbook.

[0] https://sre.google/


If you're following Google's SRE practices, all of your infrastructure should be pretty easy to replicate with a sample of production data. You can then try to crash it to your heart's content, away from production systems. You might even learn (or teach) some things through the process of creating and attempting to destroy a test bed of this nature. They (https://sre.google) cover this in chapter 17.


> But how do you foster a culture in which everyone keeps it up to date?

give it, along with other tasks that ultimately increase entire team cadence, more value at review time than short term quarterly goals. recognize that at a certain scale, it's true 10x work.

i strongly believe in doing stuff like this, yet it seems that every time i've actually done it in recent memory, it has worked against me in performance reviews.


It's kind of crazy that we live in a world where questions like this are even a thing.

> how do you foster a culture in which everyone keeps it up to date

You hire the people who will do the stuff you need them to do, and you don't hire the people who won't.

Given all the well-known strife over hiring games that are the source of constant complaints, you'd think most orgs' resident personnel experts would at least have the basics covered: ensuring that the people they hire will actually manage to do the stuff they're supposed to do. And yet here we are.

How much market inefficiency and dysfunction in the industry would disappear if people abandoned any illusions about the efficacy of riddles to attempt to measure aptitude (usually by approaching it sideways), and just focused on these two things instead: "We are a company that does X. We need to hire someone to take care of Y. Part of that involves doing Z. Can you do Z (or figure out how)? Will you do Z?"


> Can you do Z (or figure out how)? Will you do Z?

You've replaced one problem with one just as hard! Predicting how people will behave is no easy feat.


It's not really a call to predict anything.

If you're hiring someone who's going to need to mop the store, and during the interview you make it clear that they're going to need to mop the store, and you ask them if they will mop the store, and it turns out that they don't mop the store, that's considered a sufficient reason to get rid of them.

Having an explicit discussion on the topic of "Will you do this thing that we expect you to do?" means that (a) the interview tracks what actual day-to-day expectations are, and (b) if they don't do the thing, there's no need to feel awkward about wanting to not keep them around after promising something and failing to follow through on it.


And ... will you update documentation? Do you have examples of prior work maintaining any documentation?


Post mortems/RCAs with after action items (that get followed)


Round robin on-call


This is the best suggestion imho. Most developers simply don't enjoy documentation because it doesn't "do anything". And good luck to those who say "employ people who will do it": we all want to do that, but when people don't do it and you fire them, you have an even bigger issue with morale and rehiring.

Putting people on an on-call rota gives them lots more experiences and takes them out of their silos and allows them to see the effect that their work has on everybody else. It also gives them the time to be updating docs without feeling guilty about not coding.


What do you do in an org where developers aren’t included in the on call rotation and SREs are stuck with it alone?


Water flows down the easiest path. Make runbooks (and their maintenance) easy. This requires management buy-in to create incentives that make runbooks the easiest path.

In my experience, getting management on-board is the most difficult part since engineers will be lazy (I mean this here in a positive, efficient way) and do the easier thing.

I’ve set up runbooks on a few different teams at different companies and I’ve found some strategies that help. Feel free to pick and choose what would make sense from this list:

- Give runbooks (and on-call) time. Subtract the number of expected on-call weeks from your quarterly plan. This is where management buy-in is the most important since on the surface it impacts “throughput” or whatever metric is in vogue. As an aside, it’s unfortunate that most managers (IME) don’t track morale or operational burden as metrics…

- When an engineer is on-call they can do housekeeping, sweep runbooks, dashboards, etc. You want the on-call engineer to have time to do RCAs, and make reasonable changes in response to incidents. Downtime can be spent cleaning and updating. We’ve called out “operational improvements” during standup.

- Every alert links to a runbook. No alerts can be merged without a runbook. Dashboards, etc., are linked from the runbook. The engineer that created the system/alert might not be the first to encounter it at 3am, and the person on-call should know what to do. (A rough sketch of how this rule could be enforced follows this list.)

- Every time you get an alert, scan the runbook. Even if you are a system expert. Something may have changed or someone may have modified a process to be more automated. You could be pleasantly surprised that an alert is now easier to manage due to someone else’s operational improvements that you forgot about.

- If a runbook is wrong or out of date, fix it right then or flag it as a problem and bring it up during standup. The runbook is a part of the software “package” and issues should be treated like bugs that should be addressed.

- Do an on-call review and track the on-call engineer's sentiment along with metrics around how many wake-ups, weekend alerts, etc. Up-to-date, clear runbooks make on-call much less frustrating for systems the on-call engineer didn't directly write. It's also a good forum for a team member who feels they are taking on more of the operational upkeep.

- Call out and celebrate excellent, up to date runbooks! If you were on call and a teammate’s runbook was great and let you effectively manage an incident in a part of the system you don’t have as much experience in, let them know!
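
As a rough illustration of the "no alert without a runbook" rule above, a pre-merge check could reject alert definitions that lack a runbook link. This is only a sketch; the alerts.yml layout and the runbook_url field are assumptions for illustration, not any particular monitoring product's schema:

  # ci_check_runbooks.py -- hypothetical pre-merge lint (file layout and
  # field names invented): fail the build if any alert has no runbook link.
  import sys
  import yaml  # pip install pyyaml

  def main(path="alerts.yml"):
      with open(path) as f:
          alerts = yaml.safe_load(f) or []
      missing = [a.get("name", "<unnamed>") for a in alerts
                 if not a.get("runbook_url")]
      if missing:
          print("Alerts missing a runbook_url:", ", ".join(missing))
          return 1
      return 0

  if __name__ == "__main__":
      sys.exit(main(*sys.argv[1:]))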

Runbook development/maintenance is a part of software development/maintenance (IMO). If someone isn't pulling their weight with runbooks, then they aren't doing all of their job. I see it the same as someone who only wants to work on greenfield projects. There's some percentage of shit you need to deal with as a software engineer, and operations is part of it. At least for online systems.


It reminded me of a simple but not so obvious concept - do-nothing scripts.

https://blog.danslimmon.com/2019/07/15/do-nothing-scripting-...


This resembles autonomation in the Toyota Production System: https://en.wikipedia.org/wiki/Autonomation

Basically, you define a process, do it manually, then slowly automate bits of it.
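
A minimal do-nothing script in that spirit (the steps below are placeholders made up for illustration, not the linked post's example) might look like:

  # do-nothing script: every step is still manual, but the procedure now
  # lives in code, so individual steps can be automated one at a time later.
  def wait_for_enter():
      input("Press Enter when done...")

  steps = [
      "Create the new user's account in the identity provider.",
      "Add the user to the on-call rotation tool.",
      "Send the welcome email with links to the runbooks.",
  ]

  for i, step in enumerate(steps, start=1):
      print(f"Step {i}: {step}")
      wait_for_enter()

  print("All steps complete.")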


This is a pretty neat idea. I'm an amateur programmer in addition to my job in medicine and I'm working in a new administrative role. There is so much being done by hand that I can see would be easily automated but, just as the blog post describes, I've been paralyzed by the all-or-none attitude towards doing so. This is a great way for me to start chipping away at some of these tasks.


Runbooks are very common in safety critical environments. E.g. aviation is completely driven by checklists. Everything has a checklist. Even amateur pilots tend to stick to their checklists.

There are a few useful properties with checklists:

- They stop you from having to improvise actions when time is too short for that. People under stress make bad decisions. It's better for them to stick to a pre-defined plan.

- They stop you from making preventable mistakes. People forget stuff. Even trained people. Especially routine stuff.

- They ensure some level of uniformity to people's actions. So, they can check each other's work and do some cross-checking.

For IT operations, run books are great. Especially if you can automate them. CI/CD is basically an automated run book. It simplifies the decision process for humans: do I want to put this live: yes/no?

But what do you do when the database goes down or there is some major hardware failure and you have to restore from a backup? In a lot of organizations, this is still a major crisis. There might be a run book but chances are nobody has practiced this in production in ages. Thinking through ahead of time what you would do is more than half of the success. It doesn't matter if it involves some manual steps as long as you know what they are. In aviation, a lot of stuff is not automated at all. And they still use run books and checklists. These are a great substitute for automation. Newer planes self-check a lot of stuff that you used to have to do manually.

Ideally you have both automation and run books. And automation can fail of course. What do you do when your cloudformation stack is corrupted or when your terraform is out of sync with what remains of your production environment? It happens and it's a bad time for experimentation. Usually such events lead to runbook updates. You learn some new stuff about exciting failure modes you did not consider before.

Having the practice of documenting non-obvious things is what separates a good engineer from an excellent one. I actually do it to help my future self as well, because I know I'll otherwise go down a rabbit hole of reinventing things I already figured out a few months down the line.


The “preventable mistakes” point reminds me of this great video about “normalization of deviance” [from routine/checklists] https://youtu.be/Ljzj9Msli5o


I think my new favourite way of managing runbooks is to actually build them into a file tree of a bunch of simple python subcommand scripts, and have a run.sh script that scans the file system and uses argparse to construct a cli to call each script.

  # call ./runbooks/stack/update_secret.py
  # could update a secret in a vault, or update it in your deployed app
  ./run.sh stack update_secret --env=dev --name=foo --file=secret.txt
Most of the time my python scripts are glorified CLI commands like `docker service update` that are called through subprocess, so you shouldn't need to install dependencies beyond what you'd be typing in the CLI. It's also easy to add a verbose option to print out the commands it runs so you can do it manually.

  # call ./runbooks/services/build.py
  ./run.sh services build -v
  > #--- Building images ---
  > #> DOCKER_BUILDKIT=1 docker build --build-arg BUILDKIT_INLINE_CACHE=1 --label "myapp" -t example-admin-ui:local "./admin-ui"
  > #> DOCKER_BUILDKIT=1 docker build --build-arg BUILDKIT_INLINE_CACHE=1 --label "myapp" -t example-frontend:local "./front-end"
  > #> DOCKER_BUILDKIT=1 docker build --build-arg BUILDKIT_INLINE_CACHE=1 --label "myapp" -t example-nginx:local "./nginx"

Anything that can't be automated prints out an input line that gives instructions on what to do and just waits for you to input "yes/no"

  # call ./runbooks/get_crash_report.py
  ./run.sh get_crash_report --out=./crashes/
  > # Copying crashes from AWS to './crashes/'
  > # Manual Step: Fill out crashes spreadsheet: docs.google/example_sheet
  > Continue [y/n]? 
  
The other really nice thing with this setup is that the run.sh script is able to build up --help commands that can print out what actions are available and what params they use, because it's just python argparse. Makes discovery of what to do or looking up params really quick.
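
A minimal sketch of such a dispatcher (assuming each runbook script exposes a register(parser) hook for its flags and a main(args) entry point; the real scripts may be wired differently):

  #!/usr/bin/env python3
  # Sketch of the dispatcher run.sh execs: mirror the ./runbooks/ tree as
  # nested argparse subcommands so `./run.sh stack update_secret --env=dev`
  # ends up calling runbooks/stack/update_secret.py.
  import argparse
  import importlib.util
  import pathlib
  import sys

  ROOT = pathlib.Path(__file__).resolve().parent / "runbooks"

  def load(path):
      spec = importlib.util.spec_from_file_location(path.stem, path)
      mod = importlib.util.module_from_spec(spec)
      spec.loader.exec_module(mod)
      return mod

  def add_tree(parser, directory):
      sub = parser.add_subparsers(dest=f"_cmd_{directory.name}", required=True)
      for entry in sorted(directory.iterdir()):
          if entry.name.startswith(("_", ".")):  # skip __pycache__ etc.
              continue
          if entry.is_dir():
              add_tree(sub.add_parser(entry.name), entry)
          elif entry.suffix == ".py":
              mod = load(entry)
              p = sub.add_parser(entry.stem, help=(mod.__doc__ or "").strip())
              if hasattr(mod, "register"):
                  mod.register(p)  # each script declares --env, --name, etc.
              p.set_defaults(_run=mod.main)

  def main():
      parser = argparse.ArgumentParser(prog="run.sh")
      add_tree(parser, ROOT)
      args = parser.parse_args()
      sys.exit(args._run(args) or 0)

  if __name__ == "__main__":
      main()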

At this point, the only culture you need to build is one where everyone's supposed to use the run.sh scripts and not do things manually. This forces people to fix the scripts when something changes.

YMMV, but I've found this has simplified a lot of processes for myself at least.


Runbooks: the key ingredient to making any job unbearable. Sorta /s


No /s from me. If you want a chore that will never pay off, write a runbook.


Can you explain more about this? Most of the runbooks at my current job exist because it was too painful to leave them unwritten, and get used frequently.


Runbooks are, in my experience at least, generally sticking plasters over systems that are convoluted, unreliable, or otherwise broken. They're a symptom of teams that are running so hot that they can only put out fires, not prevent them. Again, just my experience, but whenever I see an extensive list of runbooks I run a mile.


generally sticking plasters over systems that are convoluted, unreliable, or otherwise broken

So pretty much every system not written from scratch in the past five years?


No, many older systems work perfectly well and are sensible and intuitive. Vertically scaling a single box or HA pair, for instance, is generally easier than working with distributed systems, and is perfect for many (not all) use cases. The idea that recent stuff isn't terrible is... well, your anecdote against mine, I guess.


So what's your experience with them?


Generic corp stuff, TBH. I'm not sure I've ever worked on an official "ITIL Runbook", but it's all the same stuff with different names. I think they are a good idea in theory, but they get taken to religious extremes by middle-management and project-manager types who want to make them part of their excuse for having a job.

Despite that, they are still never updated properly or accurately, because when someone is tapping you on the shoulder asking "where is the runbook update" while you are buried in more work than you can already do this week, people just put something in there. The people "enforcing" don't know anything (or care to), so they just mark it done in their own checklist/playbook.

They also tend to be a "people are replaceable cogs" champion for the sociopaths in the company.


Interesting experience, thanks.


This reminds me of the book The Checklist Manifesto and the show Air Crash Investigation.


My first thought when I read the title was "Sounds a lot like the non-normal checklists they use in aviation" :D


I was also about to come here and say "this sounds a lot like a checklist". The Checklist Manifesto is one of those books that should be really boring but is both valuable and well written and I recommend to everyone.


The book is overlong for what it has to say, and stuffed with things that are not really checklists (architect's building plans).

Fortunately there is a better version with not just all the meat, but also the best anecdotes included: Gawande's original article in The New Yorker.

https://www.newyorker.com/magazine/2007/12/10/the-checklist


I think this is an outdated approach. If you have the wherewithal to write down precise procedures to operate your system, then you should just be automating them.


In the ideal world, yes.

Some systems are, for any number of reasons, strongly resistant to automation. The obvious case is where security policies may require a human in the loop. Then there are the Rube Goldberg systems where the interface to the system involves VNC to a MacOS machine that runs a Virtualbox with a Linux guest where a native application modifies some database you have no other access to. Etc.


Stop publishing our trade secrets


A European car company (that I can't name) proposes to take the owner's manual out of the glovebox and, thanks to the Internet of Things, put it online. Their cars are exported to countries where there are Internet dead zones.

What happens if you are in a dead zone and you need to look at the owner's manual in an emergency?

Too bad.


Who needs to look at their owner's manual in an emergency? Nearly 99% of car manuals collect dust in the glovebox. Now that I think about it, I have never, not even once, seen someone pull a manual out of their glovebox. The only person I know who has even read through the first few pages is me, because I used to read my parents' manuals when I got bored as a kid.


What is the difference (if any) between a runbook and a process? Genuine question; I am interested to hear people's answers.


IMHO:

- Process: follow it in order when you want result X, all steps need to be completed, each time every time. Routine stuff.

- Runbook: follow it in order when X happens, some steps may not apply depending on the situation or you may need to adapt. Non-routine stuff.


You can have a process without a runbook, but not the reverse. i.e. a process can work, but be undocumented. Whereas having a runbook means you've defined the process (or at least gotten a good start at it..)


Runbooks give more scope to human judgement and experience. But I don't know a formal difference.


It's not nice but a perfectly usable runbook includes all required passwords :-)



