Let's Stop Talking About "Backups" (joelonsoftware.com)
83 points by JoelSutherland on Dec 14, 2009 | 82 comments


It's not us you need to tell, Joel; it's your business partner. And if this is your way of telling him, isn't it a little passive-aggressive to do it in a blog post?


I think that a lot of people think they have backups, but they've never restored them, so I thought it would be good practice if everyone started thinking in terms of "have we restored" rather than "are we backing up." Of course, there's no question that the thought process started when Jeff Atwood's personal blog was lost, but don't think for a minute that the only way I communicate with him is by blogging... we talk all the time, over Skype, over email, over FogBugz, and sometimes, when there's something other people can learn, in public on the internet.


The weird part is not that you made a blog post about this issue now.

The weird part is that you didn't mention the 400 pound gorilla in the room. ("I've been thinking about this for a long time, but it really hit home recently when...")

To my mind, that's why the post seems odd or passive aggressive. (It's a relatively short post as well. It just feels clipped somehow. The reader inevitably says, "What's not here?")


It may be difficult to remember now that Twitter is all the rage, but essayists often aim for that "timeless" quality. You want the essay to seem as relevant three years from now as it is today.

The web has more than enough content that feels stale after a week. Better to aim for something with a slightly longer shelf life.


Yea, but how is "my friend/business partner recently lost his entire blog due to poor backups, and it got me thinking..." not 'timeless?'


To be honest, I don't give a shit about his friend/business partner.

But I do like his idea. So I appreciate that he omits pointless personal details and just sticks to the meat of the article.

When a writer expects me to read a bunch of self-indulgent cliche boilerplate I usually hit the back button.


Not poor backups, poor restores. You're missing the point of the article!


It's also a poor backup. They thought the backup was being handled by their hosting company, while in reality it was failing silently.


A good backup is one that can be restored.


A poor backup becomes a poor restore, while a good backup isn't necessarily a good restore. It's like quibbling over whether I said, "his car was completely totaled in the race" rather than, "he didn't win the race." No one thinks that not crashing your car implies that you've won a race, but you can't win a race with a crashed car.


This doesn't make any sense. Unless you have some other reason to suspect there's animosity between them, why would you invent drama between the lines?


Check the URL in your browser....and the author of the article being discussed. Sufficient explanation for anything that will transpire here.


It's generally better if you try to avoid putting your business partners or employees down in public.


"Let’s stop asking people if they’re doing backups, and start asking if they’re doing restores."

No, because restores wouldn't have helped Jeff Atwood. You can back up a database onto the same server that hosts the database and still successfully restore it later.

We should instead be talking about the concept of Continuity of Business. Decide how important it is for your data to continue to be available (constant availability vs. several days downtime vs. don't care if it disappears entirely?), then make a plan that gives you that availability.

What I'm describing isn't necessarily complicated. For example, I keep all my backup data (photos, music, etc) on a USB drive. Every morning it's rsynced to a hot spare. Every few weeks, I swap the spare with another one in a safe deposit box at the bank. (This means that we can recover from both immediate accidental deletions, and also ones that we don't catch for a few days, and disasters like theft or fire). I check every few days to make sure the drives are still working, and my wife knows how to switch the drives and has a key to the safe deposit box, just in case something happens to me.
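
For the curious, a minimal sketch of the morning rsync-to-hot-spare step described above, assuming the primary archive is mounted at /mnt/archive and the spare at /mnt/spare (both paths are placeholders, not the actual setup):

    #!/bin/sh
    # Mirror the primary backup drive onto the hot spare each morning.
    # --delete keeps the spare an exact copy; drop it if the spare should
    # also retain files deleted from the primary.
    rsync -a --delete /mnt/archive/ /mnt/spare/ \
        && touch /mnt/spare/.last-sync-ok   # timestamp makes the "is it still working?" check trivial

The trailing slashes matter: "/mnt/archive/" copies the drive's contents rather than creating an extra /mnt/spare/archive directory.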

Anyway, we feel that these are appropriate steps for protecting the data that we don't want to lose. But the important point is that everyone should make a plan that's appropriate for them.


In the article, I wrote

"If you’re running a web service, you need to be able to show me that you can build a reasonably recent copy of the entire site, in a reasonable amount of time, on a new server or servers without ever accessing anything that was in the original data center."


Yes, in many ways, Jeff "took one for the team" in a big way, in that he prompted a lot of people to think, "Er... I'll just check my backups and make sure they're working."

We're tech types, but e.g. I know a programmer who never backed up his laptop (with important stuff) because he'd never had a hard drive go bad on him. You'll never guess what happened...

The hard part is getting 'normal' people to back up. It's really hard. They have all their data/documents on that laptop, and all the digital photographs from the first 18 months of their baby's life or whatever, but if you advise them to back up, you sound like some crazy lady who throws cats at people in the street.


Lots of people want to, but it can seem overwhelming. I'm quite technical as a programmer, but even I avoid IT muck whenever possible.

That's probably why companies such as Dropbox are doing so well. Making backups/restores easy is sweet.

(ps - thanks superduper! http://www.shirt-pocket.com/SuperDuper/SuperDuperDescription... "heroic system recovery for mere mortals")


[ ] Your normal people do use Mac OS X Leopard
[X] Your normal people do not use Mac OS X Leopard

Time Machine is the solution: everyone gets it (at least the normal people I introduced it to).


Also, I can see your business partner saying that the chance of a fire in the datacenter is so small that he might as well worry about meteors at that point. I don't think that event is all that rare, though; just recall the last one in Seattle...


What's ironic (and somewhat irksome) is the preachy tone of your post.


Good blog post, particularly since it agrees with something I noted a little while ago: http://news.ycombinator.com/item?id=990903


Unless I've missed further developments in the story, his business partner trusted the "backups" of the hosting company, which did not restore properly.

So this is aimed one step farther up the chain.

Or, rather, it's aimed at all of us. It is a lesson that a lot of people need to learn.


Well, Jeff was asking his host "do you back up?", not "can you restore?". That's what the post is about. (Reading comprehension FTW!)


I think this was directed more at his business partner's hosting company.


Who chose them and what due diligence was performed? This is business 101.


Give me a break! (Hosting) companies will throw more references and specs at you than you know what to do with, and you have no choice but to take them at their word. The only thing you can do on your part is keep your own backups, independent of them.


> The only thing you can do on your part is keep your own backups, independent of them.

That's just common sense. What happens if your hosting company decides to be dicks about some billing dispute and holds your data hostage? Even if the hosting company's backups work, you're still hosed without your own.


Exactly. Companies can say one thing and do another. My argument was against the accusatory tone of "whoever did the due diligence".


Doing your own backups isn't rocket science. Why you would outsource such a valuable thing is beyond me.


With backups being such a common thing, and something you probably could do but would rather not worry about, I don't see why you wouldn't outsource them.

Simple backups are easy. Understanding when something is sufficiently backed up, and how to know whether it is backed up, can be quite complex, particularly if large amounts of data are involved.

USB drives work great for a desktop with 1 TB, but not so well for 20 TB. Then there is the question of how you know whether everything is sufficiently backed up, whether it is free from corruption, and how you will move the backup data to a live server in the event of a restore.
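
As a rough illustration of the "free from corruption" part: one common approach is to keep a checksum manifest next to the backup and verify the restored copy against it (the paths here are made up):

    # Record a relative-path checksum for every file before backing up
    # (run from the directory being backed up).
    cd /data && find . -type f -print0 | xargs -0 sha256sum > /backups/manifest.sha256

    # After a test restore, verify the restored tree against the manifest;
    # --quiet prints only the files that fail.
    cd /restore-test && sha256sum -c --quiet /backups/manifest.sha256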


I find it disturbing how many people decided to comment or up-vote on the Joel-Atwood angle, or insult each other.

Remember: great people discuss ideas, normal people discuss events, shallow people discuss other people.


What I find somewhat charming/quaint about Joel's posts on "Operations" over the six years that I've been reading him is that he is, slowly but surely, discovering the "Art of Operations", albeit at a glacial pace.

Most people who work in production operations environments of any scale discover, in the first two or three years of their career, what has taken Joel the better part of a decade.

I almost feel like that airplane passenger sitting beside Brooks Jr.: Brooks saw him reading his book, "The Mythical Man-Month", and asked the guy (who had no idea who he was sitting beside) what he thought of it. The gentleman responded that it was basically a summary of things he knew already. Joel is a giant in the industry, but he does have a tendency to discover/restate the obvious.

"It's not backups, but the restores that matter" - is kind of the mantra of every single person who has ever been responsible for backups.

Then you go to _any_ class on running a production environment and you discover things like RPO (Recovery Point Objective), RTO (Recovery Time Objective), dress rehearsals, etc., and the whole "it's restores that matter" begins to look quaint.


What I find amusing about the whole thing is that it's a microcosm of how developers always think operations is trivial and unimportant... until they have to do it themselves! :-)


Totally agree on these points. The Joel-Atwood experience is not that of two programmers starting a company. It's a story of two programmers learning about System Administration.


the "Art of Operations" suggests a book or similar that you are referring to - is there one, or am I reading too much into some double quotes?


Yes there is a book: The Practice of System and Network Administration, Second Edition. This is the best Operations book I've ever bought. Worth every penny.



This is good food for thought. Let's also add a concept from a different realm: everybody has at least two DNS servers listed in their /etc/resolv.conf, right? The reason is that if one of them goes down, there is still the other one.

So that seems like a good lesson to apply to backups. Maybe keep several copies? One with your hosting provider, one at tarsnap, one on a separate DAT tape, one on a USB stick?

A good point, though, is that even something as big as a DAT tape looks pretty small by the standards of what we need to back up today.
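
A rough sketch of what keeping several independent copies might look like in practice; the paths and hostnames are placeholders, and the tarsnap line assumes a key and cache directory are already configured:

    #!/bin/sh
    today=$(date +%Y-%m-%d)

    # Copy 1: local USB drive
    rsync -a /srv/data/ /media/usb-backup/data-$today/

    # Copy 2: off-site, encrypted, via tarsnap
    tarsnap -c -f "data-$today" /srv/data

    # Copy 3: a machine in a different building
    rsync -a /srv/data/ offsite-host:/backups/data-$today/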


It's helpful to boost the signal on this message. It's an old message, but planning and practicing system restores can be as expensive in terms of equipment and manpower as actually making the backups, which leads to a lot of neglect.


Also, a similar take on the horror of backups gone wrong: http://www.penny-arcade.com/comic/2005/8/10/

Don't blindly rely on your partner to do it... Trust, but verify.


...shouldn't it be common practice to test your backup system to make sure that the restore procedure meets the requirements of the client (company, etc.)?

The IT company that I work for creates a backup system based on the requirements of our clients and then demonstrates the whole backup and restore procedure to make sure that it falls in line with what the client actually wants. It's really not difficult to do. Sure, some of the restore procedures may be slower (depending on other requirements, such as cost), but the client knows that will be the case and signs off on it.


Common sense and obvious. In the 1980s I worked on a large DARPA project where a huge hit was taken because our admins never tried to restore from backups. It is the kind of lesson that is (hopefully) learned with just one bad experience.

This is another reason why I like EC2 deployments: it is fairly easy to take your backups (automated deployment scripts, application, data) and spin up another copy of your whole system (except for flipping the DNS). Make sure those EBS-backed EC2 AMIs are really bootable and functioning :-)
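
As a rough illustration, a "spin up a copy and prove it works" drill might look something like this; note it uses the present-day AWS CLI rather than the EC2 tools of the era, and the AMI ID and key name are made up:

    # Launch a fresh instance from the backup AMI and wait until it's running.
    instance_id=$(aws ec2 run-instances \
        --image-id ami-0123456789abcdef0 \
        --instance-type t3.micro \
        --key-name restore-drill \
        --query 'Instances[0].InstanceId' --output text)

    aws ec2 wait instance-running --instance-ids "$instance_id"

    # Grab its public address and poke the app to confirm it actually serves traffic.
    host=$(aws ec2 describe-instances --instance-ids "$instance_id" \
        --query 'Reservations[0].Instances[0].PublicDnsName' --output text)
    curl -fsS "http://$host/" > /dev/null && echo "restore drill OK"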


What happens if Amazon runs out of machines, has a system-wide corruption that was noticed too late or decides to be dicks about something or other?


Good questions.

I would think that if they are making money with AWS then they will keep buying more servers.

I usually trust S3 for backups and restores but periodically back my own data off of AWS to local storage.

I expect both Amazon and Google infrastructure services to experience outages from time to time. However, they have far more resources and expertise than I do to provide scalable services for a low cost.


I was thinking about making a setup where my S3 backups are automatically mirrored to Rackspace or something like that. That would be a really neat setup.


I tend to be a little paranoid about backups, so I have a few different disks backing up my main desktop machine. I also use one of my backups to sync data to my laptop; it's not quite a full "restore" due to a big size difference in the respective drives, but generally speaking the two machines are in sync and I can be sure that at least one of my backups works reasonably well.


This whole ordeal is getting me motivated to actually buy a cloud backup service (personal use, not business use). I was thinking of Carbonite or Backblaze. Anyone have any experience with those?


I used Carbonite in early 2006 and the software was horrible. Eventually I tried to uninstall it and the process failed. I got a half-installation that wouldn't work and couldn't be removed. I didn't try too hard after that, because I was planning to format and start over.


Full disk image backups are a good solution for this problem. No worries about partial backups or a complex restoration process. It's totally inefficient but storage is cheaper than man hours.


I'm assuming you're talking about some kind of atomic file- or block-level backup such as LVM snapshots? Large files such as databases can change while you're reading them over a long period of time, so a standard disk image or file copy wouldn't be reliable for a live system.
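
If LVM snapshots are indeed what's meant, the usual pattern is roughly the following (volume names and paths are placeholders; a database would still need to be flushed or locked around the snapshot for full consistency):

    # Take a point-in-time snapshot of the live volume.
    lvcreate --size 5G --snapshot --name data_snap /dev/vg0/data

    # Back up from the frozen snapshot while the real volume stays in service.
    mount -o ro /dev/vg0/data_snap /mnt/snap
    rsync -a /mnt/snap/ /backups/data-$(date +%F)/

    # Clean up.
    umount /mnt/snap
    lvremove -f /dev/vg0/data_snap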


zfs send/receive is amazing. I wish other filesystems had it.
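
For anyone who hasn't seen it, the basic send/receive pattern looks roughly like this (pool, dataset, and host names are placeholders):

    # Snapshot the dataset, then stream it to another machine.
    zfs snapshot tank/data@nightly-2009-12-14
    zfs send tank/data@nightly-2009-12-14 | ssh backuphost zfs receive backup/data

    # Subsequent nights only send the delta between snapshots.
    zfs snapshot tank/data@nightly-2009-12-15
    zfs send -i tank/data@nightly-2009-12-14 tank/data@nightly-2009-12-15 \
        | ssh backuphost zfs receive backup/data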


As if I didn't lose enough sleep pondering backups already.


Testing your (off-site) backups is an obvious first item in many lists: http://watsec.com/article/49


What he's saying is, "We failed miserably at having good process and procedures. Because of this we are going to lecture others, and point the blame at everybody but ourselves, in hopes that they'll stop pointing out how much credibility we lost over this."


Pointless write-up about linguistics. He says "restore" is the important thing, not the "backup". Well, duh.


Not really. That's actually one of my favorite "health check" questions: when was the last time you restored?

Most places have very reliable backup procedures. Most of those have very poor restore procedures - I'd say about half fail when put to the test.
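
A minimal sketch of what such a periodic restore check can look like, assuming nightly mysqldump files in /backups and a throwaway database on a scratch host (the hostnames, table name, and paths are all made up):

    # Load the most recent dump into a scratch database...
    latest=$(ls -1t /backups/*.sql.gz | head -n 1)
    gunzip -c "$latest" | mysql -h scratch-db -u restore_test -p"$RESTORE_PW" restore_check

    # ...and run a sanity query; an empty or truncated dump fails loudly here.
    mysql -h scratch-db -u restore_test -p"$RESTORE_PW" \
        -e "SELECT COUNT(*) FROM users;" restore_check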


To me that is like saying, "You can put money into your bank account, but you should make sure you can get it back." Of course you have to test your backups.


Running a restore is also a very effective way to discover that your backup procedure wasn't actually very reliable after all.


If you think this post is about linguistics, you're missing the point.

Since the restore is the important thing, that's the one you have to test. And if you haven't tested restoring, your backup is (quite possibly) worthless.


It's a "well, duh," but still most people fail to test restores. This may be the greatest thing about distributed version control: every clone is a restore. A limited restore, but still a restore.


Backups are for suckers. Keep the data on a few different spinning disks. If you can solve data sync between two sites, just keep your data synced.

It's much better to ask yourself how long it takes to replicate your existing system than how to back up. PXE boot to a kernel that you can install over the network with, use bcfg2 to get the thing up to spec, and start copying data.

A lot of machines can be back up and configured in 5 minutes.

That said, I'm not you. I don't have terabytes of data to do statistics on. Maybe there are other horrible details I'm forgetting. Fast rebuilding is a pretty awesome strategy for a lot of cases.


Maybe there are other horrible details I'm forgetting

Yes.

Why should I bother to write this? I'll outsource the task to the authors of High Performance MySQL, Second Edition, page 475:

Backup Myth #1: "I Use Replication As a Backup"

This is a mistake we see quite often. A replication slave is not a backup. Neither is a RAID array. To see why, consider this: will they help you get back all your data if you accidentally execute DROP DATABASE on your production database? RAID and replication don't pass even this simple test. Not only are they not backups, they're not a substitute for backups. Nothing but backups fill the need for backups.
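
For contrast, a minimal sketch of a backup that does pass that DROP DATABASE test, provided the dump ends up somewhere the database server itself can't destroy (paths and hostnames are placeholders):

    # Consistent logical dump of every database (no long table locks on InnoDB),
    # recording the binlog position so the binary logs can roll it forward.
    mysqldump --single-transaction --all-databases --master-data=2 \
        | gzip > /backups/mysql-$(date +%F).sql.gz

    # Ship it off the machine; a copy that lives only on the db server isn't a backup.
    scp /backups/mysql-$(date +%F).sql.gz backuphost:/backups/mysql/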


This is the incremental backup script I use on my Linux box at home, a quick-and-dirty imitation of what Time Machine does. Obviously $HOME/backup is a different physical disk. Feel free to improve on this.

--------

    #!/bin/bash
    # Incremental backup of /home and /etc onto a second physical disk mounted
    # under $HOME/backup. --link-dest hard-links unchanged files against the
    # previous snapshot, so each run only costs the space of what changed.

    HOME=        # value elided in the original; set to the directory that holds the backup mount

    date=`date "+%Y-%m-%dT%H:%M:%S"`

    rsync -aP --link-dest=$HOME/backup/current /home $HOME/backup/back-$date
    rsync -aP --link-dest=$HOME/backup/current /etc $HOME/backup/back-$date

    # repoint the "current" symlink at the snapshot we just made
    rm $HOME/backup/current
    ln -s back-$date $HOME/backup/current

    # see if the backup disk (sdb1) is getting full: grab the Use% column from df
    USED=`df -lk | grep sdb1 | awk -F" " '{print $5}' | awk -F"%" '{print $1}'`

    # alert me if the backup disk is more than $T percent full
    # ($myaddress should be set to your alert address; it was left undefined in the original)
    T=80
    if [ "$USED" -gt "$T" ]
    then
        df -lk | mail -s "disk alert: backup disk over $T% used" $myaddress
    else
        echo "backup disk is less than $T% full"
    fi


You should think about keeping an offsite backup if possible. A second disk won't necessarily help if your computer gets dropped in a pool, your house catches fire, etc.


Completely reasonable response, forgive me for being an inarticulate noob in my post.

I thought delayed replication was one of the main strategies they advocated in that book. I don't have it on hand. My mistake.


That is very interesting; I thought replication was as near to a realtime "backup" as one could get. Is there any other way to do incremental, realtime backups for, say, MySQL/Postgres?
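
For what it's worth, the standard answer here is periodic full backups plus continuous log archiving; a rough sketch under that assumption (file names and paths are illustrative only):

    # MySQL: periodic full dumps plus the binary log for point-in-time recovery.
    #   my.cnf:  log-bin = mysql-bin
    mysqldump --single-transaction --all-databases --flush-logs > full.sql
    # ...later, roll forward from the dump by replaying the archived binlogs:
    mysqlbinlog mysql-bin.000042 | mysql

    # PostgreSQL: continuous WAL archiving.
    #   postgresql.conf:  archive_mode = on
    #                     archive_command = 'cp %p /backups/wal/%f'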


Obligatory xkcd link: "Bobby Tables" http://xkcd.com/327/


Replication doesn't defend you against deleting the wrong file, or messing up an update to your database, or any of a large class of PEBKAC issues.

It's also not useful if your main files get corrupted and you diligently propagate the corruption. See: ma.gnolia


Setting aside the database, file problems are fairly easily solved via svn, .snapshot or a good configuration engine.

As far as the database goes, I'm big on stored procs + archive tables, but I'll leave that to the grown-ups ;)


...until you forget to check in important changes, or neglect to add a seemingly trivial configuration file to your SCM.

Whereas, a policy of automatically backing up everything except for your exclude list would have saved your bacon.


What if you hit rm -rf * and that gets synced?

I don't mean to be rude, I don't know anything about you, but if you were a system administrator working for me, today would be your last day.


Of course it would be his last day, he's a sysadmin and he just typed rm -rf * on /


You know, you tell yourself that you understand rm -rf. And yet somehow you always find a way to type it on something you shouldn't. Computers are perverse objects.

It's like reading an airline crash report, which I'm told often ends up sounding like a comedy of errors. Most airline crashes have an entire handful of causes, all of which are individually innocent, but on the one extremely rare occasion when they all happen at the same time they add up to disaster.


I fat-fingered the destination directory in a script once and accidentally shrunk my entire photo archive to 128x128 thumbnails. Thankfully I did have backups of all but a handful.

I'm always amazed at the number of people who spend their time thinking about replication strategies without understanding that data can be accidentally deleted in production too. I guess backups look "easy", so it's not as sexy an area of architecture planning.


People just don't always grasp the big picture. People who see RAID as a backup strategy are only looking at "hard drive failure" as a potential scenario. The same goes for machine replication: they are only trying to prevent what happens if "the machine dies." As long as they realize that this doesn't protect against "I accidentally deleted the files," then it's all good.


In a previous job I had a so-called system administrator laugh at my paranoia when I suggested that his awesome backup system -- protecting the equivalent of several dozen person-years of scientific data, millions of dollars' worth -- wouldn't protect us against a fire in the building. Or, for that matter, a thief who liked to steal computer hardware.

Job security note for sysadmins: When someone suggests a disaster scenario, don't open your response with a laugh.


Well, you know that you're talking to someone that doesn't plan for the future when, "What happens if the building burns down?" is responded to with, "You're just being paranoid." I guess the appropriate response would be to point out that using his logic tons of taxpayer money could be saved by nixing the fire departments (and tons of corporate money could be saved by not paying for fire protection -- alarms, detectors, escapes, etc).


If you can solve data sync between two sites, just keep your data synced.

LOL! And if you get a corrupt block on your primary site, what're you going to do? All your standbys are instantly tainted!

Better leave this one to the grownups.


Replication is not backup. It only protects against hardware failure.

Spinning disks are good though. A lovely spot for backups. Just put them in a different building.

Edit: In the time it took me to write this 5 other people also lambasted this poor fellow. Ouch.


It doesn't matter if your data is replicated across 3 different continents. If all of your data is accessible behind the same admin login, or if it is all vulnerable to the corruption in the original, then it still has only 1 point of failure. A proper "spinning" backup is an immutable journaling system that can't be trimmed except via extremely secure methods (e.g. physical access to the machine).


A real-time mirror won't help if your data gets corrupted (via a fat-fingered shell command, errant script, or a compromised system). When it's appropriate, it's awesome, but I don't think it's a replacement for a static "offline" backup.


Sorry, this is not only WRONG but actually HARMFUL advice.

Redundancy is not a backup. If someone with full admin control to your system can destroy all of your data then you do not have backups. A proper backup is physically separate from your primary data and, preferably, can't be destroyed with mere admin access to the system. The number of sites that have had catastrophic data loss due to relying on mirroring instead of true backups is quite significant.
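
A rough sketch of one way to get that separation: have a dedicated backup host pull from production, so that admin access to the production box alone can't reach the copies (hostnames, paths, and the key name are placeholders):

    #!/bin/sh
    # Runs on the backup host. Production has no credentials for this machine,
    # and the SSH key used here only permits reading the exported directory.
    dest=/backups/prod-web/$(date +%F)
    mkdir -p "$dest"
    rsync -a -e "ssh -i /root/.ssh/pull_only_key" prod-web:/var/www/ "$dest/"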



