This line: "I would have sent out an email to the mailing lists earlier; but since at each point I thought I was 'one change away' from fixing the problems, I kept on delaying said email until it was clear that the problems were finally fixed" describes such a common situation for most people, but I tend to see it with engineers especially. I find I struggle with it an incredible amount. In some ways, I guess it seems healthy or reassuring that incredibly smart people like Colin Percival suffer from similar challenges around fully understanding the scope of the problem and the solution.
All that being said, I really respect the detailed response from a technical perspective, as well as owning up to a spell of degraded performance (and explaining the decisions that went into it).
Later edit because I don't want to spam the comments: I'd love some context (maybe from cperciva himself?) around the performance enhancement of integrating the new Intel AESNI instructions. This is well beyond my depth, and while Colin mentions that it didn't necessarily increase performance, I'm wondering if the hope is that it would in the long term? Or were there other benefits to such an integration?
I'd love some context (maybe from cperciva himself?) around the performance enhancement of integrating new Intel AESNI instructions.
I was using OpenSSL for that (which was using a software implementation). The code (you can see it in spiped) now detects the CPU feature and selects between AESNI or OpenSSL automatically. Given that the tarsnap server code was spending about 40% of its time running AES, it's a nontrivial CPU time saving.
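For the curious: AESNI support is advertised via CPUID (leaf 1, ECX bit 25). A simplified sketch of that kind of check, not the actual spiped code, assuming x86 and GCC/Clang's <cpuid.h>:

    /* Minimal sketch of AESNI runtime detection (not the actual spiped code).
     * Assumes x86 and the GCC/Clang <cpuid.h> builtin. */
    #include <cpuid.h>
    #include <stdio.h>

    static int
    have_aesni(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1: ECX bit 25 advertises AESNI support. */
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return (0);
        return ((ecx & (1u << 25)) != 0);
    }

    int
    main(void)
    {

        printf("AES backend: %s\n",
            have_aesni() ? "AESNI" : "OpenSSL software");
        return (0);
    }

The point is that the selection happens once, at startup, so there's no per-request cost to supporting both code paths.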
I should probably have been clearer in my writeup though -- using AESNI was never a "once I roll this out everything will be good" fix. Rather, it was a case of "I have this well-tested code available which will help a bit while I finish testing the real fixes".
I would have sent out an email to the mailing lists earlier; but since at each point I thought I was "one change away" from fixing the problems, I kept on delaying said email until it was clear that the problems were finally fixed
This ties in to the last lesson I mentioned at the bottom:
5. When performance drops, it's not always due to a single problem; sometimes there are multiple interacting bottlenecks.
Every time I identified a problem, I was correct that it was a problem -- my failing was in not realizing that there were several things going on at once.
> Every time I identified a problem, I was correct that it was a problem -- my failing was in not realizing that there were several things going on at once.
Very common! One thing that's been helpful for us is establishing predefined system performance thresholds that, if exceeded, initiate the chain of events that will lead to customer communication. "If X% of requests are failing, then we had better advertise that the system is degraded." Discussing and setting these thresholds in advance and the expectation that they'll result in communication helps drive the right outcome. It's not perfect, because one is always tempted to make a judgment call in the circumstance, which is vulnerable to the same effect, but it's a good start.
That totally jibes with what I found "reassuring" in a sense. That even very smart people sometimes get hit with inadvertent "multiple problems looking like a single issue" situations.
I tend to get to debug problems like this (usually in 3rd party code I don't know the internals of) pretty frequently. My experience has been it tends to follow a curve: most of the time, the problem is simple and you can quickly dispatch it. The scary (or fun, depending on your perspective) part hits when you pass the first level and there are still problems, and you don't know if it's two or ten levels deeper. Then you get into that crazy test/optimize cycle and crawl out two weeks later wondering when you last ate.
This "it's almost fixed, I'll email the client soon" pattern is something I have personally struggled with a lot, and I agree it appears to be common with engineers.
My workaround has been to make something else responsible for sending the email. In a team, this could be a manager setting a cut-off point after which communication must be made. When working on my own, I set an alarm for X minutes. When that alarm goes off I ignore the internal voice which says "just try one more thing, then send the email", and send an update to let the relevant people know my current progress, ETA to fix, and when they can expect the next update.
I think this is similar to how GTD encourages us to use systems for storing to-do lists instead of trying to remember them - our fragile human brains are not always to be trusted.
Very much of the time I feel, "If I knew what the problem[s] [was|were], it'd be solved by now!" That's not exactly true, of course, but diagnosis is a large part of the total solution.
This type of answer that Colin gave above does not exactly win friends and influence people in most situations where you're part of a team or hierarchy. Can anyone share what they've done to give better answers in these cases? I understand why people want the answers, but I don't have them to give right away, particularly when it's Someone Else's system.
One trick that I've learned (though I still have trouble routinely applying it myself) for these situations is: less is more.
That is, as engineers we tend to want details. All the details. We want to know what happened, why it happened, how it's going to be fixed, and how long that will take. Because we want all that detail for ourselves, we hesitate to contact our customers/boss until we have all the details. Combine that with a desire to fix problems as they come up, and you end up with, "I never told you there was a problem because I was always one fix away from the solution."
But most people are not engineers. They want to be acknowledged. They want to feel informed, even if they have fewer details than you would like to provide. Sometimes, something as simple as, "We've noticed that there is an issue and are currently working on a fix," goes a long way. Also don't be afraid to pull out, "Users have been reporting issues with backup performance. We do not currently believe this represents a service failure, but we are working to return performance to normal levels."
Your users trust you (otherwise they wouldn't pay you). If you "believe" something, they will too.
Just to be clear, when Tarsnap users wrote to me I told them everything I could. The "I think it will be fixed soon" delay in sending out an email to the lists affected only people who didn't notice or noticed but didn't ask about the issue.
In case any other customer is wondering "Wait, I didn't hear anything from my monitoring about that and I'm retroactively worried. How worried should I be?" like I was: I just pulled our logs and reconstructed them, and they show that over the last ~30 days the worst-case run time of our daily backup (~150 MB per day delta, ~45 GB total post deduplication) was about 40% longer than our typical case. This didn't trip our monitoring at the time because the backups all completed successfully.
n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have had less performance impact than many customers. I have an old habit of "Schedule all cron jobs to start predictably but at a random offset from the hour to avoid stampeding any previously undiscovered SPOFs." That's one of the Old Wizened Graybeard habits that I picked up from one of the senior engineers at my last real job, which I impart onto y'all for the same reason he imparted it onto me: it costs you nothing and will save you grief some day far in the future.
Explicit support for randomizing timers across multiple hosts is a really nice feature of the timers provided by systemd:
"AccuracySec=" in *.timer files lets you specify the amount of slack systemd has in firing timers. To quote the documentation: "Within this time window, the expiry time will be placed at a host-specific, randomized but stable position that is synchronized between all local timer units."
You may still want to randomize timers locally on a host too, but the above makes automated deployment of timers that affect network services very convenient.
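For example, a hypothetical backup.timer (unit names and values are just illustrative):

    # backup.timer (hypothetical example)
    [Unit]
    Description=Nightly backup

    [Timer]
    OnCalendar=daily
    # Up to an hour of slack; the firing time within the window is
    # host-specific and randomized, but stable from run to run.
    AccuracySec=1h

    [Install]
    WantedBy=timers.target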
the worst-case run time of our daily backup (~150 MB per day delta, ~45 GB total post deduplication) was about 40% longer than our typical case
Yes, that sounds about right. I had maybe half a dozen people write to me who had noticed performance problems, and after the initial "backups failed because the server hit its connection limit" issue, it was people whose backups were already very long-running -- if your daily backups normally take 20 hours to complete, a 40% slowdown is painful.
I run my backups overnight and get a status email each morning, and I didn't even realise there were performance issues until now. As you said, unless you run your backups multiple times per day, or have long-running backups, it may not have had a lot of impact.
FWIW, I live in Australia (so an 'off-peak' timezone), and schedule my cronjob on an odd minute offset, so it may not have been an issue for me anyway!
Hear hear on said Old Wizened Graybeard habit. The amount of pain inflicted from twenty jobs all starting up at :00 (or even :30, :45, etc.) when they could easily run at :04 or :17 can be huge. Anecdotally I once "lost" a sandbox server to a ton of developer sandbox jobs starting at :00 and not completing before the next batch started.
Funny part to that, was on a project with multiple teams with multiple crontabs. Each team took that advice to heart for some jobs. Sadly, we had too many Hitchhiker fans and :42 became a bit too common.
Or schedule your cron job for :00, but add "sleep `jot -r 1 0 3600` &&" to the start of the command. (jot is a BSDism, but I assume you can do the same with GNU seq.)
This is a pain when deciphering a series of events later, though, because you don't know when a particular job was supposed to start. I'd prefer the delay to be stable on a per-host basis.
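One sketch of how to get that (hostname hashing is just one option; note that % is special in crontab, so this belongs at the top of the backup script rather than in the crontab line itself):

    # Derive a stable 0-3599 second offset from the hostname: predictable
    # for a given host, but spread out across a fleet of machines.
    sleep $(( $(hostname | cksum | cut -d' ' -f1) % 3600 ))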
We just went with a single group text file with all the jobs and which ones could be spread out. Saves the programming and gives the sys admins / DBAs an idea what goes when.
One way to think about your fear is, shouldn't that just be a tarsnap feature?
Add some metadata for a machine that tarsnap should expect a once a day/week/month backup from this machine, and if it doesn't get one, to send you an email?
Until the day when Colin considers it in-scope for Tarsnap, I recommend Dead Man's Snitch for this purpose. I literally spend more on DMS to monitor Tarsnap than I spend on Tarsnap. No, I don't think that is just, either.
Don't you have some other servers running other services? So you must already have some monitoring and alerting system like Nagios, to which you can add one more little "passive check" that does the same thing, for no incremental cost?
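For example, a hypothetical Nagios service that goes critical if no passive backup result has arrived recently (directive values are illustrative, and it assumes a check_dummy command and a generic-service template are already defined):

    define service {
        use                     generic-service
        host_name               backuphost
        service_description     tarsnap-backup
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     93600      ; go critical after ~26 hours silence
        check_command           check_dummy!2
    }

The backup job then submits a passive PROCESS_SERVICE_CHECK_RESULT when it finishes, and the freshness check fires if that ever stops happening.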
I have roughly fourish separate monitoring systems for Appointment Reminder. DMS is the one which is least tied to me, so I use it for Tarsnap (the most critical thing about AR that can fail "quietly") and as the fourthish line of defense for the core AR functionality.
(This may be slightly overbuilt, but I felt it justified to get peace of mind, given AR's fair degree of importance to customers/myself and the enterprise-y customer base. In particular, I would not have been happy with any monitoring solution which would fail if I lost network connectivity at the data center.)
$15 a month is far below my care floor for making sure that my backups are working and that I do not get sued into bits.
I ran into this recently, backing up munin data to s3. I ran it at a time point offset from an hour to avoid those 'on-the-hour' rushes, but I was getting problems with the copy. Took me a moment to realise I was doing it on a 5-minute boundary, and munin fires on a 5-minute boundary - the data was being updated as I was copying it...
It's the picodollars - tarsnap was the second business I fell in love with on HN (the late Kiko was #1) purely because of the awesome vibe I felt emanating from your enterprise (which I'm assuming is a reflection on you as well).
Years later, you've also become a cause celebre for holding true to a clear business and lifestyle vision (again, perceived at distance), in spite of the recommendations and 'support' provided by Patrick and others, including myself. Keep being true, and I suspect the community will keep learning from you Colin.
In all seriousness, the picodollars do an excellent job of attracting exactly the sort of customers I want... and turning away the customers I don't want. They were originally part joke and part a way to avoid arguments with customers who don't understand that 1 GB < 1 GiB, but now it's way more than that.
in spite of the recommendations and 'support' provided by Patrick and others
Don't be too harsh on Patrick. His vision for Tarsnap is not my vision for Tarsnap, but he has helped me to orient myself: The projection of "business" onto the subspace "geek" doesn't look very much like "business", but it's not the same as "kid right out of university who has never had a real job" either, and that's what you would see if I hadn't had advice (from Patrick, Thomas, various YC people, and the rest of HN).
Advice can be very valuable even if you don't follow it to the letter.
Hey man, awesome writeup. I have a suggestion for you: try and architect off those EBS volumes -- as you unfortunately learned the hard way, they just aren't that consistent. DynamoDB is a good option, or adding some redundancy so that you can just use the ephemeral disk would be even better (and probably cost neutral compared to the "consistent" I/O EBS volumes).
Yeah, that has been a work in progress for a long time. FWIW, I started using piops volumes when they were the only SSD option available -- they beat the crap out of spinning ephemeral disks.
For those that want to run a similar service using their own systems, I found that Attic [1] is a great open source backup tool that works in a very similar way, including deduplication and compression.
I backup some VPS servers to my NAS at home using attic over an SSH tunnel. Incremental backups are quite small and it's easy to automate with a simple cron job.
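The script the cron job runs is basically just this (a hypothetical sketch; repo name, host, and paths are illustrative):

    # Nightly archive named by host and date, pushed to the NAS over SSH.
    # (attic also has to be installed on the NAS end for remote repos.)
    attic create backup@nas.example.com:servers.attic::$(hostname)-$(date +%F) /etc /var/www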
As an AWS user this type of thing gives me cause for concern:
> At 2015-04-01 00:00 UTC, the Amazon EC2 "provisioned I/O" volume on which most of this metadata was stored suddenly changed from an average latency of 1.2 ms per request to an average latency of 2.2 ms per request. I have no idea why this happened -- indeed, I was so surprised by it that I didn't believe Amazon's monitoring systems at first -- but this immediately resulted in the service being I/O limited.
A sudden doubling of latency can have dire consequences for any system. Knowing that such unexpected changes are possible makes it hard to build trust in your environment, even if it is running fine today.
It's getting to the point where, when I see a post mortem like this, I am just waiting for the AWS problems. Between this and the downtime that AWS has, I'm kind of amazed that people use it-- you pay too much and you get less. (Compared to a lot of other choices, such as raw metal boxes from Hetzner)
This is why I don't use AWS for anything non-trivial, and I am wary of people who put critical infrastructure on it. (E.g., I don't care about Netflix; that service can run on AWS fine. But Coinbase, for instance: if I were their customer and they ran on AWS, I would stop being their customer.)
Whenever AWS problems come up people talk about how "AWS is so much more efficient, you just outsource that stuff to the experts".
But that seems to imply that hosting on your own hardware in your own office is the only alternative. Of course we stopped doing that in the 1990s.
With AWS you have to know Linux and have ops people; that's true everywhere. With AWS you have the additional burden of learning the AWS APIs and learning how to use AWS, which isn't transferable, so that's a higher cost. With AWS you have to architect around the limitations of the way AWS is built, and your architecture becomes AWS-specific if you use those APIs, so that's an additional cost. You don't need any fewer ops people, probably more, than going with another hosting service like DigitalOcean or Rackspace. And if you go with something like Hetzner you pay 1/5th to 1/10th for machines with a lot more performance and local storage. (Though you get the additional latency of being located in Europe, if your primary customers are in the USA.)
Of course, I'm also prejudiced. I worked at Amazon and saw how the sausage was made and was not impressed. When AWS was announced as "running on the same infrastructure that powers Amazon.com!!!" as if it was a feature, I cringed. Amazon.com was having outages of parts or major components on a weekly basis at that time. Much of AWS is actually running on bespoke software (so not actually tested by Amazon.com when introduced, though I'm sure portions have been moved over at gunpoint) ... which actually makes it worse. People were trusting their data to a service that pretended to be backing a major e-commerce site but was actually untested outside of the company at the time.
And what have we seen since? An unacceptable level of failures. (in my opinion, of course)
But people seem to be very forgiving. When it's happening, everyone's in "how can we fix this" mode, and then when it's fixed everyone forgets and goes back to thinking of AWS as always running.
To this day I still do not get why you would use AWS, the entire user experience is clunky and the pricing is crazy for what you get. Azure isn't much better with regards to downtime, but if you want something more than just a VPS I'd choose it any day over AWS for the significantly better UX in both the admin console and the command line tools + SDK.
Ultimately though, even with Azure or AWS you're going to need people knowledgeable enough to administer your compute instances anyway, so why not just run your full stack on a bunch of VMs from DigitalOcean or Linode, or rent a couple of dedicated servers and throw oVirt on them, saving yourself a significant chunk of money at the same time.
It wasn't missing its guaranteed # of I/Os per second, so I figured the slowdown was just "one of those things" and not an out-of-spec issue. Happy to send you the volume ID if you think someone would want to investigate (and still has data from the start of April) though.
DevOps/Infrastructure engineer here! I see this happen frequently in AWS. Never expect either your instance networking latency or the latency of the underlying EBS storage layer to be consistent.
If you absolutely need guaranteed IO performance, use an instance store or move to dedicated hardware. Them be the breaks of cloud computing.
Sorry if this is offtopic, but can anybody explain the value proposition of tarsnap to me? It seems like a nice service and all, but the pricing is an order of magnitude more expensive than S3. If you are storing a few GB, this might not matter ("over half of Tarsnap users spend under $1 per month on storing their backups"), but if you have that little data, why not just dump it on a free Dropbox/Gdrive/etc account?
For more data, why not just use one of the many compressed, deduplicated, encrypted, incremental backup systems (attic comes to mind, I'm sure there are others) then just sync to S3 at a tenth the cost?
So you're saying you trust a single developer to both write an encryption tool and run the servers it talks to more than the combined possibilities using existing open source tools to create backups by encrypting data locally and storing it remotely via ssh/sftp?
Yes, when it comes to crypto I'd put my trust in highly talented people over trusting my own ability to glue together a collection of OSS tools any day.
I didn't suggest you should write your own encryption tool. There are numerous open source tools for creating encrypted backups, some do deduplication first too.
If the tool doesn't happen to support remote storage, a simple rsync or scp fills that part.
Literally the only thing unique about this service is the use of the term picodollars and the single individual it's all reliant on.
When Drew first did a "Show HN" [0] (before it was a thing, actually), there were a lot of responses about how it didn't do anything new that couldn't already be done by a technically inclined person (see the first two top comments in the post).
To make a comparison with tarsnap: while it's probably possible to do encrypted backup manually with a combination of shell scripts and such, there are just too many moving pieces that can go wrong. Where do you store the backup? Someone mentions S3, but even managing backups on S3 with deduplication is not trivial, and managing the encryption process is definitely not something most of us can say with confidence we won't mess up. I can imagine a thousand ways that I encrypt something and am then unable to decrypt it back.
And then maintenance is also an issue: if I'm using a set of OSS tools, I would have to make sure each tool is being maintained, and follow any potential disclosures, bug fixes, updates, etc. With Tarsnap, I know I will get an email from cperciva if something comes up.
As I already stated I never said you or I or most people should write our own encryption tools.
There are many open source backup tools. They offer a wide range of features such as data deduplication, references/hard links to simulate total backups without copying unchanged files, data compression, data encryption, logging, reporting, remote storage and/or remote sync.
Not all tools offer all features. Not all features work the same way, but there are many options.
Those that don't offer remote storage/sync can be set up very simply to back up locally and then sync/copy to your remote file store of choice - another server, S3, rsync.net, etc.
The majority of these tools are shipped as part of Linux distribution repos, so there are almost certainly many more people using them, and multiple people with a vested interest in maintaining them.
And for reference, I agree with the comment(s) about Dropbox. The only difference is that they offer a more intuitive GUI which so far is lacking in open solutions.
I didn't mention writing encryption tools; I was simply saying that plugging all the available tools together is non-trivial.
Tarsnap is to encrypted backup what Dropbox is to file syncing (to a certain extent, obviously). I can understand why for someone knowledgeable like you the benefit isn't obvious, just like we don't see the benefit of Dropbox over other tools. But certain demographics will see tarsnap/Dropbox as value added, and are willing to pay for them (with good reason, too).
I know a lot of developers who have never spun up an EC2 instance, can't find their way around setting up a server, and certainly are not interested in maintaining an offsite server for backup. To them, tarsnap with its command line provides enough simplicity to be usable (of course, it can be much better, as patio11 and a lot of people pointed out).
I'm talking about working tools. They either do everything when invoked, or write to a file/dir on disk that can then have rsync invoked to copy offsite.
I'm talking about maybe a 4-line shell script, if that.
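Something like this hypothetical sketch of that shape (tool choice, paths, and host are illustrative, and you'd obviously want to test restores):

    # Deduplicated archive into a local repo, then mirror the repo offsite.
    attic create /srv/backup/repo.attic::$(date +%F) /etc /home
    rsync -a --delete /srv/backup/repo.attic/ backup@offsite.example.com:repo.attic/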
If someone can't handle that amount of setup, maybe they shouldn't be the person setting up mission critical backups?
Try in 18 hours. Can you call him when something fails?
I'm not saying he isn't responsive; I'm saying that depending on a one-man band who is responsible for the client software, the server software, and the underlying storage system (i.e. he is the owner of the S3 account) seems like a huge risk.
"While the Tarsnap code itself has not been released under an open source license, some of the "reusable components" have been published separately under a BSD license"
Finally, numbers other than picodollars and gigabyte-months and unpredictable deduplication. This convinces me I don't want to store 4TB there at a huge cost ($12,000 a year if it's really $300 a year for 100GB) compared to buying two 4TB drives (~€250 per 3-4 years) and placing them at a friend's with free bandwidth.
Don't get me wrong: managed, off-site encrypted backups are very attractive, and I might be willing to pay a premium, especially for software from a trusted person, but not a hundred times the cost price.
Tarsnap isn't intended to be used as a one-time backup like that, and it's super expensive if used that way. It's very cheap when used to back up (almost) the same 4GB for 1000 days in a row, which is what a lot of people/businesses need from their backup solutions.
I guess, I haven't really looked at it yet. And I'd have to find my own software to encrypt it before uploading. Tarsnap's software is one of the major selling points, at least to me.
Backup tools like attic (which I use) include automatic deduplication. There are surely minor differences in implementation, but tarsnap isn't the only implementation of deduplicating backup.
Tarsnap can have different write/read keys for each backup archive, so (unlike with Attic) you don't need to worry about a compromised host deleting its entire backup history.
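You generate the limited key with tarsnap-keymgmt, something along these lines (paths and archive names are illustrative; see the man page for the exact options):

    # Create a write-only key for the host; keep the full (master) key offline.
    tarsnap-keymgmt --outkeyfile /root/tarsnap-write.key -w /root/tarsnap-master.key
    # Back up with the write-only key; it can create archives but not read
    # or delete them, so a compromised host can't destroy its own history.
    tarsnap --keyfile /root/tarsnap-write.key -c -f backup-$(date +%F) /home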
Oh wow. Yes! Now I remember it was Graham. I was his student around then. Saw your photo on Twitter and you looked exactly like him! Hahaha. How is Graham?
Graham is in Japan right now, but returning to Canada soon; he'll be joining me in the West Coast Symphony for our June concert.
I don't feel that I should really be talking too much about my family in a public forum, but if you'd like to send me an email I can forward it to him.
I don't believe that you have too many active connections for threading to work. Passive connections can be handled by a single or small number of threads. Modern Linux on modern hardware has no problem with many thousands of threads and the overhead is minimal in $$$ compared to the time you wasted debugging a scheduling problem.
As for concurrent systems being harmful, you just have to design your program with threading in mind. Minimize shared state and be very careful when you can't.
> the overhead is minimal in $$$ compared to the time you wasted debugging a scheduling problem.
Better than time wasted debugging the races and deadlocks that only threads can cause. These are much harder to debug because they are so much harder to examine without changing behaviour.
Sure, there's a trade-off - so there's no point in pretending that there are no downsides to using threads.
I would redesign your protocol to be request/response based akin to http. Achieve performance by using multiple connections in the client. Simplicity > efficiency especially if you don't have the engineering resources of a company like Google.
And I'm out. The reply rate limiting is infuriating.
It's really easy to glibly criticize someone else's design decisions when 1) you don't have a full understanding of their problem, architecture, or rationale for that architecture, and when 2) the medium of the conversation doesn't lend itself well to providing you a satisfactory explanation.
It seems as though you've gotten the tiniest glimpse of some details about the system and went on to assume he made a boneheaded decision and you know better. Do you have some secret evidence that he's incompetent and doesn't have a good reason for his decision?
Good description, but I'm missing lesson learned #0: Do not wait too long before informing your users, even if only to tell them "we know about it and are working on it"