This line: "I would have sent out an email to the mailing lists earlier; but since at each point I thought I was 'one change away' from fixing the problems, I kept on delaying said email until it was clear that the problems were finally fixed" describes such a common situation for most people, but I tend to see it with engineers especially. I find I struggle with it an incredible amount. In some ways, I guess it seems healthy or reassuring that incredibly smart people like Colin Percival suffer from similar challenges around fully understanding the scope of the problem and the solution.
All that being said, I really respect the detailed response from a technical perspective, as well as owning up to a spell of degraded performance (and explaining the decisions that went into it).
Later edit because I don't want to spam the comments: I'd love some context (maybe from cperciva himself?) around the performance enhancement of integrating the new Intel AESNI instructions. This is well beyond my depth, and while Colin mentions that it didn't necessarily increase performance, I'm wondering if the hope is that it would in the long term? Or were there other benefits to such an integration?
I'd love some context (maybe from cperciva himself?) around the performance enhancement of integrating new Intel AESNI instructions.
I was using OpenSSL for that (which was using a software implementation). The code (you can see it in spiped) now detects the CPU feature and selects between AESNI or OpenSSL automatically. Given that the tarsnap server code was spending about 40% of its time running AES, it's a nontrivial CPU time saving.
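For the curious: AESNI support is advertised via CPUID (leaf 1, ECX bit 25). A simplified sketch of that kind of check, not the actual spiped code, assuming x86 and GCC/Clang's <cpuid.h>:

    /* Minimal sketch of AESNI runtime detection (not the actual spiped code).
     * Assumes x86 and the GCC/Clang <cpuid.h> builtin. */
    #include <cpuid.h>
    #include <stdio.h>

    static int
    have_aesni(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1: ECX bit 25 advertises AESNI support. */
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return (0);
        return ((ecx & (1u << 25)) != 0);
    }

    int
    main(void)
    {

        printf("AES backend: %s\n",
            have_aesni() ? "AESNI" : "OpenSSL software");
        return (0);
    }

The point is that the selection happens once, at startup, so there's no per-request cost to supporting both code paths.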
I should probably have been clearer in my writeup though -- using AESNI was never a "once I roll this out everything will be good" fix. Rather, it was a case of "I have this well-tested code available which will help a bit while I finish testing the real fixes".
I would have sent out an email to the mailing lists earlier; but since at each point I thought I was "one change away" from fixing the problems, I kept on delaying said email until it was clear that the problems were finally fixed
This ties in to the last lesson I mentioned at the bottom:
5. When performance drops, it's not always due to a single problem; sometimes there are multiple interacting bottlenecks.
Every time I identified a problem, I was correct that it was a problem -- my failing was in not realizing that there were several things going on at once.
> Every time I identified a problem, I was correct that it was a problem -- my failing was in not realizing that there were several things going on at once.
Very common! One thing that's been helpful for us is establishing predefined system performance thresholds that, if exceeded, initiate the chain of events that will lead to customer communication. "If X% of requests are failing, then we had better advertise that the system is degraded." Discussing and setting these thresholds in advance and the expectation that they'll result in communication helps drive the right outcome. It's not perfect, because one is always tempted to make a judgment call in the circumstance, which is vulnerable to the same effect, but it's a good start.
That totally jibes with what I found "reassuring" in a sense. That even very smart people sometimes get hit with inadvertent "multiple problems looking like a single issue" situations.
I tend to get to debug problems like this (usually in 3rd party code I don't know the internals of) pretty frequently. My experience has been it tends to follow a curve: most of the time, the problem is simple and you can quickly dispatch it. The scary (or fun, depending on your perspective) part hits when you pass the first level and there are still problems, and you don't know if it's two or ten levels deeper. Then you get into that crazy test/optimize cycle and crawl out two weeks later wondering when you last ate.
This "it's almost fixed, I'll email the client soon" pattern is something I have personally struggled with a lot, and I agree it appears to be common with engineers.
My workaround has been to make something else responsible for sending the email. In a team, this could be a manager setting a cut-off point after which communication must be made. When working on my own, I set an alarm for X minutes. When that alarm goes off I ignore the internal voice which says "just try one more thing, then send the email", and send an update to let the relevant people know my current progress, ETA to fix, and when they can expect the next update.
I think this is similar to how GTD encourages us to use systems for storing to-do lists instead of trying to remember them - our fragile human brains are not always to be trusted.
Very much of the time I feel, "If I knew what the problem[s] [was|were], it'd be solved by now!" That's not exactly true, of course, but diagnosis is a large part of the total solution.
This type of answer that Colin gave above does not exactly win friends and influence people in most situations where you're part of a team or hierarchy. Can anyone share what they've done to give better answers in these cases? I understand why people want the answers, but I don't have them to give right away, particularly when it's Someone Else's system.
One trick that I've learned (though I still have trouble routinely applying it myself) for these situations is: less is more.
That is, as engineers we tend to want details. All the details. We want to know what happened, why it happened, how it's going to be fixed, and how long that will take. Because we want all that detail for ourselves, we hesitate to contact our customers/boss until we have all the details. Combine that with a desire to fix problems as they come up, and you end up with, "I never told you there was a problem because I was always one fix away from the solution."
But most people are not engineers. They want to be acknowledged. They want to feel informed, even if they have fewer details than you would like to provide. Sometimes, something as simple as, "We've noticed that there is an issue and are currently working on a fix," goes a long way. Also don't be afraid to pull out, "Users have been reporting issues with backup performance. We do not currently believe this represents a service failure, but we are working to return performance to normal levels."
Your users trust you (otherwise they wouldn't pay you). If you "believe" something, they will too.
Just to be clear, when Tarsnap users wrote to me I told them everything I could. The "I think it will be fixed soon" delay in sending out an email to the lists affected only people who didn't notice or noticed but didn't ask about the issue.
In case any other customer is wondering "Wait, I didn't hear anything from my monitoring about that and I'm retroactively worried. How worried should I be?" like I was: I just pulled our logs and reconstructed them, and they show that over the last ~30 days the worst-case run time of our daily backup (~150 MB per day delta, ~45 GB total post deduplication) was about 40% longer than our typical case. This didn't trip our monitoring at the time because the backups all completed successfully.
n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have had less performance impact than many customers. I have an old habit of "Schedule all cron jobs to start predictably but at a random offset from the hour to avoid stampeding any previously undiscovered SPOFs." That's one of the Old Wizened Graybeard habits that I picked up from one of the senior engineers at my last real job, which I impart onto y'all for the same reason he imparted it onto me: it costs you nothing and will save you grief some day far in the future.
Explicit support for randomizing timers across multiple hosts is a really nice feature of the timers provided by systemd:
"AccuracySec=" in *.timer files lets you specify the amount of slack systemd has in firing timers. To quote the documentation: "Within this time window, the expiry time will be placed at a host-specific, randomized but stable position that is synchronized between all local timer units."
You may still want to randomize timers locally on a host too, but the above makes automated deployment of timers that affect network services very convenient.
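For example, a hypothetical backup.timer (unit names and values are just illustrative):

    # backup.timer (hypothetical example)
    [Unit]
    Description=Nightly backup

    [Timer]
    OnCalendar=daily
    # Up to an hour of slack; the firing time within the window is
    # host-specific and randomized, but stable from run to run.
    AccuracySec=1h

    [Install]
    WantedBy=timers.target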
the worst-case run time of our daily backup (~150 MB per day delta, ~45 GB total post deduplication) was about 40% longer than our typical case
Yes, that sounds about right. I had maybe half a dozen people write to me who had noticed performance problems, and after the initial "backups failed because the server hit its connection limit" issue, it was people whose backups were already very long-running -- if your daily backups normally take 20 hours to complete, a 40% slowdown is painful.
I run my backups overnight and get a status email each morning, and I didn't even realise there were performance issues until now. As you said, unless you run your backups multiple times per day, or have long-running backups, it may not have had a lot of impact.
FWIW, I live in Australia (so an 'off-peak' timezone), and schedule my cronjob on an odd minute offset, so it may not have been an issue for me anyway!
Hear hear on said Old Wizened Graybeard habit. The amount of pain inflicted from twenty jobs all starting up at :00 (or even :30, :45, etc.) when they could easily run at :04 or :17 can be huge. Anecdotally I once "lost" a sandbox server to a ton of developer sandbox jobs starting at :00 and not completing before the next batch started.
Funny part to that, was on a project with multiple teams with multiple crontabs. Each team took that advice to heart for some jobs. Sadly, we had too many Hitchhiker fans and :42 became a bit too common.
Or schedule your cron job for :00, but add "sleep `jot -r 1 0 3600` &&" to the start of the command. (jot is a BSDism, but I assume you can do the same with GNU seq.)
This is a pain when deciphering a series of events later, though, because you don't know when a particular job was supposed to start. I'd prefer the delay to be stable on a per-host basis.
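One sketch of how to get that (hostname hashing is just one option; note that % is special in crontab, so this belongs at the top of the backup script rather than in the crontab line itself):

    # Derive a stable 0-3599 second offset from the hostname: predictable
    # for a given host, but spread out across a fleet of machines.
    sleep $(( $(hostname | cksum | cut -d' ' -f1) % 3600 ))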
We just went with a single group text file with all the jobs and which ones could be spread out. Saves the programming and gives the sys admins / DBAs an idea what goes when.
One way to think about your fear is, shouldn't that just be a tarsnap feature?
Add some metadata for a machine that tarsnap should expect a once a day/week/month backup from this machine, and if it doesn't get one, to send you an email?
Until the day when Colin considers it in-scope for Tarsnap, I recommend Dead Man's Snitch for this purpose. I literally spend more on DMS to monitor Tarsnap than I spend on Tarsnap. No, I don't think that is just, either.
Don't you have some other servers running other services? So you must already have some monitoring and alerting system like Nagios, to which you can add one more little "passive check" that does the same thing, for no incremental cost?
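For example, a hypothetical Nagios service that goes critical if no passive backup result has arrived recently (directive values are illustrative, and it assumes a check_dummy command and a generic-service template are already defined):

    define service {
        use                     generic-service
        host_name               backuphost
        service_description     tarsnap-backup
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     93600      ; go critical after ~26 hours silence
        check_command           check_dummy!2
    }

The backup job then submits a passive PROCESS_SERVICE_CHECK_RESULT when it finishes, and the freshness check fires if that ever stops happening.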
I have roughly fourish separate monitoring systems for Appointment Reminder. DMS is the one which is least tied to me, so I use it for Tarsnap (the most critical thing about AR that can fail "quietly") and as the fourthish line of defense for the core AR functionality.
(This may be slightly overbuilt, but I felt it justified to get peace of mind, given AR's fair degree of importance to customers/myself and the enterprise-y customer base. In particular, I would not have been happy with any monitoring solution which would fail if I lost network connectivity at the data center.)
$15 a month is far below my care floor for making sure that my backups are working and that I do not get sued into bits.
I ran into this recently, backing up munin data to s3. I ran it at a time point offset from an hour to avoid those 'on-the-hour' rushes, but I was getting problems with the copy. Took me a moment to realise I was doing it on a 5-minute boundary, and munin fires on a 5-minute boundary - the data was being updated as I was copying it...
It's the picodollars - tarsnap was the second business I fell in love with on HN (the late Kiko was #1) purely because of the awesome vibe I felt emanating from your enterprise (which I'm assuming is a reflection on you as well).
Years later, you've also become a cause celebre for holding true to a clear business and lifestyle vision (again, perceived at distance), in spite of the recommendations and 'support' provided by Patrick and others, including myself. Keep being true, and I suspect the community will keep learning from you Colin.
In all seriousness, the picodollars do an excellent job of attracting exactly the sort of customers I want... and turning away the customers I don't want. They were originally part joke and part a way to avoid arguments with customers who don't understand that 1 GB < 1 GiB, but now it's way more than that.
in spite of the recommendations and 'support' provided by Patrick and others
Don't be too harsh on Patrick. His vision for Tarsnap is not my vision for Tarsnap, but he has helped me to orient myself: The projection of "business" onto the subspace "geek" doesn't look very much like "business", but it's not the same as "kid right out of university who has never had a real job" either, and that's what you would see if I hadn't had advice (from Patrick, Thomas, various YC people, and the rest of HN).
Advice can be very valuable even if you don't follow it to the letter.
Hey man, awesome writeup. I have a suggestion for you: try and architect off those EBS volumes -- as you unfortunately learned the hard way, they just aren't that consistent. DynamoDB is a good option, or adding some redundancy so that you can just use the ephemeral disk would be even better (and probably cost neutral compared to the "consistent" I/O EBS volumes).
Yeah, that has been a work in progress for a long time. FWIW, I started using piops volumes when they were the only SSD option available -- they beat the crap out of spinning ephemeral disks.
For those that want to run a similar service using their own systems, I found that Attic [1] is a great open source backup tool that works in a very similar way, including deduplication and compression.
I backup some VPS servers to my NAS at home using attic over an SSH tunnel. Incremental backups are quite small and it's easy to automate with a simple cron job.
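The script the cron job runs is basically just this (a hypothetical sketch; repo name, host, and paths are illustrative):

    # Nightly archive named by host and date, pushed to the NAS over SSH.
    # (attic also has to be installed on the NAS end for remote repos.)
    attic create backup@nas.example.com:servers.attic::$(hostname)-$(date +%F) /etc /var/www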
As an AWS user this type of thing gives me cause for concern:
> At 2015-04-01 00:00 UTC, the Amazon EC2 "provisioned I/O" volume on which most of this metadata was stored suddenly changed from an average latency of 1.2 ms per request to an average latency of 2.2 ms per request. I have no idea why this happened -- indeed, I was so surprised by it that I didn't believe Amazon's monitoring systems at first -- but this immediately resulted in the service being I/O limited.
A sudden doubling of latency can have dire consequences for any system. Knowing that such unexpected changes are possible makes it hard to build trust in your environment, even if it is running fine today.
It's getting to the point where, when I see a post mortem like this, I am just waiting for the AWS problems. Between this and the downtime that AWS has, I'm kind of amazed that people use it-- you pay too much and you get less. (Compared to a lot of other choices, such as raw metal boxes from Hetzner)
This is why I don't use AWS for anything non-trivial, and I am wary of people who put critical infrastructure on it. (E.g., I don't care about Netflix; that service can run on AWS fine. But Coinbase, for instance: if I were their customer and they ran on AWS, I would stop being their customer.)
Whenever AWS problems come up people talk about how "AWS is so much more efficient, you just outsource that stuff to the experts".
But that seems to imply that hosting on your own hardware in your own office is the only alternative. Of course we stopped doing that in the 1990s.
With AWS you have to know Linux and have ops people; that's true everywhere. With AWS you have the additional burden of learning the AWS APIs and learning how to use AWS, which isn't transferable, so that's a higher cost. With AWS you have to architect around the limitations of the way AWS is built, and your architecture becomes AWS-specific if you use those APIs, so that's an additional cost. You don't need any fewer ops people, probably more, than going with another hosting service like DigitalOcean or Rackspace. And if you go with something like Hetzner you pay 1/5th to 1/10th for machines with a lot more performance and local storage. (Though you get the additional latency of being located in Europe, if your primary customers are in the USA.)
Of course, I'm also prejudiced. I worked at Amazon and saw how the sausage was made and was not impressed. When AWS was announced as "running on the same infrastructure that powers Amazon.com!!!" as if it was a feature, I cringed. Amazon.com was having outages of parts or major components on a weekly basis at that time. Much of AWS is actually running on bespoke software (so not actually tested by Amazon.com when introduced, though I'm sure portions have been moved over at gunpoint) ... which actually makes it worse. People were trusting their data to a service that pretended to be backing a major e-commerce site but was actually untested outside of the company at the time.
And what have we seen since? An unacceptable level of failures. (in my opinion, of course)
But people seem to be very forgiving. When it's happening, everyone's in "how can we fix this" mode, and then when it's fixed everyone forgets and goes back to thinking of AWS as always running.
To this day I still do not get why you would use AWS, the entire user experience is clunky and the pricing is crazy for what you get. Azure isn't much better with regards to downtime, but if you want something more than just a VPS I'd choose it any day over AWS for the significantly better UX in both the admin console and the command line tools + SDK.
Ultimately though, even with Azure or AWS you're going to need people knowledgeable enough to administer your compute instances anyway, so why not just run your full stack on a bunch of VMs from DigitalOcean or Linode, or rent a couple of dedicated servers and throw oVirt on them, saving yourself a significant chunk of money at the same time.
It wasn't missing its guaranteed # of I/Os per second, so I figured the slowdown was just "one of those things" and not an out-of-spec issue. Happy to send you the volume ID if you think someone would want to investigate (and still has data from the start of April) though.
DevOps/Infrastructure engineer here! I see this happen frequently in AWS. Never expect either your instance networking latency or the latency of the underlying EBS storage layer to be consistent.
If you absolutely need guaranteed IO performance, use an instance store or move to dedicated hardware. Them be the breaks of cloud computing.
Sorry if this is offtopic, but can anybody explain the value proposition of tarsnap to me? It seems like a nice service and all, but the pricing is an order of magnitude more expensive than S3. If you are storing a few GB, this might not matter ("over half of Tarsnap users spend under $1 per month on storing their backups"), but if you have that little data, why not just dump it on a free Dropbox/Gdrive/etc account?
For more data, why not just use one of the many compressed, deduplicated, encrypted, incremental backup systems (attic comes to mind, I'm sure there are others) then just sync to S3 at a tenth the cost?
So you're saying you trust a single developer to both write an encryption tool and run the servers it talks to more than the combined possibilities using existing open source tools to create backups by encrypting data locally and storing it remotely via ssh/sftp?
Yes, when it comes to crypto I'd put my trust in highly talented people over trusting my own ability to glue together a collection of OSS tools any day.
I didn't suggest you should write your own encryption tool. There are numerous open source tools for creating encrypted backups, some do deduplication first too.
If the tool doesn't happen to support remote storage, a simple rsync or scp fills that part.
Literally the only thing unique about this service is the use of the term picodollars and the single individual it's all reliant on.
When Drew first did a "Show HN" [0] (before it was a thing, actually), there were a lot of responses about how it didn't do anything new that couldn't already be done by a technically inclined person (see the first two top comments in the post).
To make a comparison with tarsnap: while it's probably possible to do encrypted backup manually with a combination of shell scripts and such, there are just too many moving pieces that can go wrong. Where do you store the backup? Someone mentions S3, but even managing backups on S3 with deduplication is not trivial, and managing the encryption process is definitely not something most of us can say with confidence we won't mess up. I can imagine a thousand ways that I encrypt something and am then unable to decrypt it back.
And then maintenance is also an issue: if I'm using a set of OSS tools, I would have to make sure each tool is being maintained, and follow any potential disclosures, bug fixes, updates, etc. With Tarsnap, I know I will get an email from cperciva if something comes up.
As I already stated I never said you or I or most people should write our own encryption tools.
There are many open source backup tools. They offer a wide range of features such as data deduplication, references/hard links to simulate total backups without copying unchanged files, data compression, data encryption, logging, reporting, remote storage and/or remote sync.
Not all tools offer all features. Not all features work the same way, but there are many options.
Those that don't offer remote storage/sync can be set up very simply to back up locally and then sync/copy to your remote file store of choice - another server, S3, rsync.net, etc.
The majority of these tools are shipped as part of Linux distribution repos, so there are almost certainly many more people using them, and multiple people with a vested interest in maintaining them.
And for reference, I agree with the comment(s) about Dropbox. The only difference is that they offer a more intuitive GUI which so far is lacking in open solutions.
I didn't mention writing encryption tools; I was simply saying that plugging all the available tools together is non-trivial.
Tarsnap is to encrypted backup what Dropbox is to file syncing (to a certain extent, obviously). I can understand why for someone knowledgeable like you the benefit isn't obvious, just like we don't see the benefit of Dropbox over other tools. But certain demographics will see tarsnap/Dropbox as value added, and are willing to pay for them (with good reason, too).
I know a lot of developers who have never spun up an EC2 instance, can't find their way around setting up a server, and certainly are not interested in maintaining an offsite server for backup. To them, tarsnap with its command line provides enough simplicity to be usable (of course, it can be much better, as patio11 and a lot of people pointed out).
I'm talking about working tools. They either do everything when invoked, or write to a file/dir on disk that can then have rsync invoked to copy offsite.
I'm talking about maybe a 4-line shell script, if that.
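Something like this hypothetical sketch of that shape (tool choice, paths, and host are illustrative, and you'd obviously want to test restores):

    # Deduplicated archive into a local repo, then mirror the repo offsite.
    attic create /srv/backup/repo.attic::$(date +%F) /etc /home
    rsync -a --delete /srv/backup/repo.attic/ backup@offsite.example.com:repo.attic/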
If someone can't handle that amount of setup, maybe they shouldn't be the person setting up mission critical backups?
Try in 18 hours. Can you call him when something fails?
I'm not saying he isn't responsive; I'm saying that depending on a one-man band who is responsible for the client software, the server software, and the underlying storage system (i.e. he is the owner of the S3 account) seems like a huge risk.
"While the Tarsnap code itself has not been released under an open source license, some of the "reusable components" have been published separately under a BSD license"
Finally, numbers other than picodollars and gigabyte-months and unpredictable deduplication. This convinces me I don't want to store 4TB there at a huge cost ($12,000 a year if it's really $300 a year for 100GB) compared to buying two 4TB drives (~€250 per 3-4 years) and placing them at a friend's with free bandwidth.
Don't get me wrong: managed, off-site encrypted backups are very attractive, and I might be willing to pay a premium, especially for software from a trusted person, but not a hundred times the cost price.
Tarsnap isn't intended to be used as a one-time backup like that, and it's super expensive if used that way. It's very cheap when used to back up (almost) the same 4GB for 1000 days in a row, which is what a lot of people/businesses need from their backup solutions.
I guess, I haven't really looked at it yet. And I'd have to find my own software to encrypt it before uploading. Tarsnap's software is one of the major selling points, at least to me.
Backup tools like attic (which I use) include automatic deduplication. There are surely minor differences in implementation, but tarsnap isn't the only implementation of deduplicating backup.
Tarsnap can have different write/read keys for each backup archive, so (unlike with Attic) you don't need to worry about a compromised host deleting its entire backup history.
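You generate the limited key with tarsnap-keymgmt, something along these lines (paths and archive names are illustrative; see the man page for the exact options):

    # Create a write-only key for the host; keep the full (master) key offline.
    tarsnap-keymgmt --outkeyfile /root/tarsnap-write.key -w /root/tarsnap-master.key
    # Back up with the write-only key; it can create archives but not read
    # or delete them, so a compromised host can't destroy its own history.
    tarsnap --keyfile /root/tarsnap-write.key -c -f backup-$(date +%F) /home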
Oh wow. Yes! Now I remember it was Graham. I was his student around then. Saw your photo on Twitter and you looked exactly like him! Hahaha. How is Graham?
Graham is in Japan right now, but returning to Canada soon; he'll be joining me in the West Coast Symphony for our June concert.
I don't feel that I should really be talking too much about my family in a public forum, but if you'd like to send me an email I can forward it to him.
I don't believe that you have too many active connections for threading to work. Passive connections can be handled by a single or small number of threads. Modern Linux on modern hardware has no problem with many thousands of threads and the overhead is minimal in $$$ compared to the time you wasted debugging a scheduling problem.
As for concurrent systems being harmful, you just have to design your program with threading in mind. Minimize shared state and be very careful when you can't.
> the overhead is minimal in $$$ compared to the time you wasted debugging a scheduling problem.
Better than time wasted debugging the races and deadlocks that only threads can cause. These are much harder to debug because they are so much harder to examine without changing behaviour.
Sure, there's a trade-off - so there's no point in pretending that there are no downsides to using threads.
I would redesign your protocol to be request/response based akin to http. Achieve performance by using multiple connections in the client. Simplicity > efficiency especially if you don't have the engineering resources of a company like Google.
And I'm out. The reply rate limiting is infuriating.
It's really easy to glibly criticize someone else's design decisions when 1) you don't have a full understanding of their problem, architecture, or rationale for that architecture, and when 2) the medium of the conversation doesn't lend itself well to providing you a satisfactory explanation.
It seems as though you've gotten the tiniest glimpse of some details about the system and went on to assume he made a boneheaded decision and you know better. Do you have some secret evidence that he's incompetent and doesn't have a good reason for his decision?
Good description, but I'm missing lesson learned #0: Do not wait too long before informing your users, even if only to tell them "we know about it and are working on it"