B2's pricing is pretty amazing, especially compared to S3 and similar competitors: https://www.backblaze.com/b2/cloud-storage-pricing.html
Be aware, though, that those savings come with some downsides. The major one for us has been their maintenance window every Thursday from 2:00-3:00pm Pacific Time. Usually there's no outage, but sometimes there is, and there's no warning; it's just down sometimes during that window. So if uptime is important for your data, factor in the cost of implementing a fallback solution to cover your production use during those maintenance windows. https://www.backblaze.com/scheduled-maintenance.html
They have four datacenters, three in the US and one in the EU. Details are not given regarding how the 20 shards that comprise your data are distributed geographically, but they state eight 9s of reliability.
Just to clarify, Backblaze states eight 9s of durability, not reliability.
Durability refers to the idea that your data will still be retrievable (i.e. no corruption), similar to S3's claim of eleven 9s of durability.
Reliability, on the other hand, would be something like the actual availability of the web UI or API server that you download your data through. If that were down, it wouldn't affect the actual integrity of the data itself.
For anyone interested, anything more than about five 9s of reliability is basically impossible anyway when it comes to human intervention. As an example, six nines of reliability would allow you about 31.5 seconds of unavailability a year, or roughly 2.6 seconds a month.
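If you want the numbers for other targets, the arithmetic is simple; a quick sketch in plain Python (nothing B2-specific here):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def allowed_downtime(nines: int) -> float:
    """Seconds of downtime per year permitted by an N-nines availability target."""
    availability = 1 - 10 ** (-nines)
    return SECONDS_PER_YEAR * (1 - availability)

for n in range(3, 8):
    per_year = allowed_downtime(n)
    print(f"{n} nines: {per_year:9.2f} s/year  ({per_year / 12:8.2f} s/month)")
# 5 nines ≈ 315 s/year (~5.3 min), 6 nines ≈ 31.6 s/year, 7 nines ≈ 3.2 s/year
```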
From a user's point of view, being "unavailable" includes everything from going through a tunnel and having your mobile connection drop out to a shark biting one of the undersea cables in the middle of the ocean.
As you might imagine, with a human involved, they couldn't even acknowledge an alert fast enough to meet that deadline, let alone actually do any diagnosis and repairs :)
It could be spread over multiple instances and redundant hardware as well but as with any system being touched by humans, it's near guaranteed that something will go wrong eventually.
At that scale, a complete outage is unlikely. I have services which haven't gone down _at all_ for longer than a year. But we lose requests every now and then -- during a deploy, or due to a bug. So we've moved from a time-based view of outage to a request-based view.
This helps, too, as it lets us build out services to be more reliable in combination, rather than less reliable. With retries and fail-over, an outage in an entire region may not necessarily result in any user requests failing.
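A minimal sketch of what that looks like in practice, assuming two hypothetical regional endpoints (the hostnames, timeout, and retry budget here are made up):

```python
import requests

# Hypothetical regional endpoints; in a real setup these would be two
# independent deployments of the same service.
ENDPOINTS = ["https://us.example.com/api", "https://eu.example.com/api"]

def fetch_with_failover(path: str, attempts_per_endpoint: int = 2) -> requests.Response:
    """Try each region in turn; a full regional outage only fails the user's
    request if every region is down within the retry budget."""
    last_error = None
    for base in ENDPOINTS:
        for _ in range(attempts_per_endpoint):
            try:
                resp = requests.get(f"{base}{path}", timeout=2)
                if resp.status_code < 500:
                    return resp
            except requests.RequestException as err:
                last_error = err
    raise RuntimeError(f"all endpoints failed: {last_error}")
```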
For scale, pre-pandemic our published figures claimed >100M MAU.
I find https://andrewaylett.github.io/multi-burn-rate-calculator/ helpful for visualising error rates -- largely cribbed from the project it's forked from :) but with the tweakables switched around and the time between alert and error budget exhaustion in the tooltip.
It's worth noting that we only evaluate our alerts at most once a minute.
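For anyone who hasn't played with burn rates: the time to exhaust an error budget is just the SLO window divided by the burn rate. A rough sketch of the arithmetic (my own simplification, not the calculator's actual code):

```python
def hours_to_exhaustion(slo: float, window_days: float, error_rate: float) -> float:
    """How long until the window's entire error budget is gone,
    if the current error rate continues."""
    budget = 1.0 - slo                   # e.g. 0.001 for a 99.9% SLO
    burn_rate = error_rate / budget      # 1.0 means "exactly on budget"
    return (window_days * 24) / burn_rate

# Example: 99.9% SLO over 30 days, currently failing 1% of requests.
print(hours_to_exhaustion(slo=0.999, window_days=30, error_rate=0.01))  # 72.0 hours
```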
OVH didn't suffer a complete outage. If you were relying on that single DC, then you're probably not sufficiently large for this to apply to you.
But perhaps my point wasn't clearly enough made: a claim of "100% uptime" on a service level isn't particularly _useful_ when our users still only see a 99.9% success rate.
I think the weak point is their domain name. Cloud providers should have a second domain, with a different registrar and managed completely independently, so that if one is subject to a problem (hijacking, DNS outage, etc.) clients can fall back to the alternative domain.
You say that, but telephony used to be obsessed with uptime. These are the people who invented live patching of a running system, after all. Case in point, the #1 ESS telephone switch was designed for less than 2 hours of cumulative downtime over a 40-year service life. In practice, most achieved less than 1 hour. (How many nines is that?)
Which is to say, individual subscriber lines can have all sorts of faults, but they only affect that subscriber. Trunks between offices can fail, but they only affect a certain number of circuits. The central call processing ability, of the switch to react appropriately to lines changing state and numbers being dialed, for calls to be completed if both stations are available, is what had to meet that target.
I was skeptical when I first heard this, so next time I was in an office that still had a #1 (actually a 1A, it had been upgraded in the late 80s), I talked to the switchman for a bit, and he showed me the downtime counter. It's a mechanical thing like an odometer, and once a second when the switch is executing its main loop, it touches a register that keeps the counter from incrementing. If the processor halts or isn't processing calls for some reason, the counter starts counting.
The switch was installed in the mid-70s, and from the moment it took over for its crossbar predecessor, including an in-place processor upgrade, it had logged less than an hour. Most of that came a few seconds at a time when swapping between active processors during software upgrades, he said. At the time (this was around 2002 I think) it was slated to be replaced with a DMS100, but the replacement activity hadn't commenced yet. I don't know what sort of reliability numbers the DMS machines achieve, but they'd do well to match their predecessors.
Depends. You're only thinking of one half of an SLA, the metric, rather than the measurement period.
What's the SLA penalty vs the extra the customer is willing to pay? If I think I can achieve 6 x 9's on a monthly basis, but probably I'll only achieve it 10 months out of 12, I can offer the customer an SLA of 6 x 9's for 100 USD per month.
My penalty can be 50 USD for failing to meet it, and then I as a supplier walk away with 10 x 100 + 2 x (100 - 50) = 1,100 USD for offering something I knew I couldn't achieve (consistently).
> For anyone interested, anything more than about five 9s of reliability is basically impossible anyway when it comes to human intervention. As an example, six nines of reliability would allow you about 31.5 seconds of unavailability a year, or roughly 2.6 seconds a month.
Not sure what the qualifier “when it comes to human intervention” means, but if I ignore it then five nines is quite standard in certain sectors. For example phone switch SLA (back when I was in the biz) was measured in minutes per decade (as in “cumulative unavailability under 1/2 minutes per decade”). Large baseline power plants can and must run uninterrupted for decades.
Of course it’s a systems issue not a point solution.
I realise in hindsight that I was rambling from the point of view of offering five nines for software that is inherently flaky/unreliable: companies where developers cycle through and knowledge is lost, technical debt accumulates, and systems are used almost counter to their intended purpose (e.g. Redis as a database).
In that sense, it's like constantly massaging applications to stay alive or at a reasonable level of service, hence the assumption that someone will be paged and respond in order to preempt a failure or restore service during an outage.
Given that the parity can only tolerate losing 3 of the 20 shards, a datacenter loss would always result in data loss (spreading 20 shards over four datacenters puts at least 5 in each), so there's no reason to distribute them and we should assume that all 20 shards are always going to be in the same place.
Oh how new is that? Last time I was looking through they just said their datacenter was in a bunker so nothing outside of a major natural disaster would affect it.
A fire like that in SBG a few days ago is extremely rare. The data lost there compared to all data stored in data centers probably justifies eight 9s. The problem with these numbers is that it's extremely unlikely to lose data but if you do, the event is so severe that everything might be gone.
That's why I feel more comfortable storing data across two "unreliable" providers with a lot of physical space in between them rather than one super reliable provider.
You also have to consider that data loss can result from simple things such as an account that gets blocked for stupid reasons. If you want to be safe you always need to have your data with at least 2 providers.
The EU and US data centers are completely separated. You can't even use both from the same account. To change, you need to open a completely new account selecting EU at the start. So it's not really easy to use both at the same time and there's definitely no redundancy across data centers if you're a customer in the EU.
A limitation I ran across when using B2 was that their upload URL generation doesn't allow you to set a file-size limit, nor does it let you fix the file name; it simply gives you a URL to upload to. So if you are using B2 as storage for, let's say, image uploads from the browser, a malicious user can modify the network request with whatever file name or file size they want. Next thing you know, you have 5GB "image" uploads happening....
This pretty much prevents me from using B2 for now.
I ran into the same limitation! IIRC, there also wasn't a way to expire a signed upload URL sooner than whatever the default was, which was hours or maybe a day. I had the exact use case you mentioned, too - image uploads bypassing my backend server. I didn't want the generation of a signed url to, say, upload a profile photo, give carte blanche to create a hidden image host when combined with the limitation that you highlighted. All sorts of bad things could come of that. I ended up just going back to S3 - costs more, but still worth it.
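For comparison, this is the sort of thing S3 lets you do: a presigned POST can pin the object key and cap the upload size server-side. A rough sketch with boto3 (the bucket name, key, and limits are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Presigned POST that only accepts this exact key and rejects bodies over 5 MB.
post = s3.generate_presigned_post(
    Bucket="my-uploads",                       # placeholder bucket name
    Key="avatars/user-123.jpg",                # object key is fixed server-side
    Fields={"Content-Type": "image/jpeg"},
    Conditions=[
        ["content-length-range", 1, 5 * 1024 * 1024],
        {"Content-Type": "image/jpeg"},
    ],
    ExpiresIn=300,                             # URL dies after 5 minutes
)
# `post["url"]` and `post["fields"]` go to the browser, which POSTs the file
# directly to S3; anything outside the conditions is rejected by S3 itself.
```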
I'm not so sure. It's better than a late night window in that it forces their customers to actually deal with the maintenance window, rather than just cross their fingers.
Also, your office is full of awake engineers at that time. Which is better than a handful of on-call sleepy engineers and crossing your fingers the rest of the team wakes up when you call them if something goes completely haywire.
With the B2 target use case of backup and archive data I suspect it's actually a good time to do it for their customer base (and it also then happens to be good for their engineering team too, awake and alert!).
Thursday is the best maintenance day. If it goes haywire, you have a full business day to fix before losing a weekend. And if you can't, well, you've given your coworkers a "free Friday," which is far less likely to result in complaints than screwing up M-Th.
Ever go into work on a Friday and "the system is down"? You can't get anything done, because the tools you use to do your job aren't working, and the fix is out of your hands. Your coworkers are all affected, too. First people are frustrated, because they have tasks and deadlines, and those tasks won't get done and the deadlines won't be met. A few are really freaked out and start calling bosses and getting VPs to yell at other VPs. But soon, most people in the company realize that everybody else is in the same boat, and nobody will be meeting their OKRs this week, and the status reports probably won't be filed.
Then they relax.
If they're in offices, they group up, maybe in the break room. A rousing game of ping-pong breaks out.
Remote coworkers ping each other on Slack. Maybe a few start a round of Among Us. Bread dough is kneaded. Kids get a little more help with their schoolwork.
Everybody takes a very long lunch.
By 2pm, people realize the entire day is gone. Almost everybody has left or signed off by 3. Some roll out to bars; others go home to their kids, or to their gardens or garages or battlestations. Everybody beats the traffic.
Come Monday, the system is fixed. People are a little stressed out, since there's so much catching-up to do, status reports to be filed, widgets to be tracked and poked. But everybody agrees that was an amazing couple of days, and they got lots of rest, and it sure was nice. And hey, I had this great idea over the weekend—
I'm a big fan of Digital Ocean and run a bunch of droplets. B2 is way cheaper than Spaces for storage (1/4 the price). I tried using Spaces anyway, because I wanted something with faster throughput for streaming video, but Spaces was even slower than B2, even within the same Digital Ocean datacenter. All these S3 clone storage systems are clearly throttled, and there seems to be at least soft collusion to keep the bandwidth about the same between them, and just enough to prevent video streaming. I'll go sit in the corner and adjust my tinfoil hat now.
I guess it depends on how likely you are to need to do that. Looking at B2 vs Glacier Deep Archive, it seems that as long as you don't need the data more often than every two years or so, Glacier still works out cheaper even with the high bandwidth costs.
But Glacier also has a minimum storage duration. With S3, you'll need to use a tiered system unless you want to store all backups for several months (often that only makes sense for weekly or monthly backups).
In the end, S3 can be cheaper but you have to make a lot of assumptions beforehand. Backblaze is cheap enough to just throw everything in there and work with their lifecycle rules. You don't need to make assumptions about download volumes or storage duration beforehand (esp if you can retrieve via cloudflare).
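A rough break-even sketch of the "once every two years" claim above. The B2 figures ($0.005/GB/month storage, $0.01/GB egress) are mentioned elsewhere in this thread; the Deep Archive figures are my own approximate list-price assumptions, so plug in current numbers before trusting the answer:

```python
# All prices in USD per GB; treat these as assumptions.
B2_STORAGE, B2_EGRESS = 0.005, 0.01            # B2 egress is free if fronted by Cloudflare
DEEP_ARCHIVE_STORAGE = 0.00099                 # assumed Glacier Deep Archive storage price
DEEP_ARCHIVE_RESTORE = 0.02 + 0.09             # assumed retrieval + internet egress

storage_saving_per_month = B2_STORAGE - DEEP_ARCHIVE_STORAGE
restore_premium = DEEP_ARCHIVE_RESTORE - B2_EGRESS

months = restore_premium / storage_saving_per_month
print(f"Deep Archive wins if you restore less than once every {months:.0f} months")
# ~25 months, i.e. roughly the "once every two years" figure mentioned above
```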
Spaces was not fully compatible with S3 at one point. It was nearly impossible to download your entire bucket when it was huge. Rclone was able to, but it was horribly slow. AWS CLI would only grab up to 1000 items. It seems like they did finally fix that though.
I'm actually a huge fan of Bunny now. The CDN piece is about as cheap as it gets (for any utility-based service), its optimizer and other things work well, and it works seamlessly with their storage system too, which is super cheap itself and allows you to control how much it's replicated (and where). Just waiting for them to deliver S3 compatibility so all the existing tools work, or some other type of CLI tool.
Backblaze downloads are $.01/GB, not $.02/GB as stated there. And via Cloudflare they're free. That makes a big difference vs. AWS where you have no chance to get that data to another provider for free (unless you have a very special deal with them).
Yev from Backblaze here -> typically not - we've built most of our systems to keep data up and flowing during those maintenance windows - if we anticipate longer windows or are doing things that can impact performance we typically announce it on our blog and twitter!
>if we anticipate longer windows or are doing things that can impact performance we typically announce it on our blog and twitter!
As a customer is there any way to opt-into a more proactive notification of an anticipated delay, like an email? I understand such things are necessary sometimes, but "always pay attention to some blog or twitter for a rare occurrence" doesn't seem particularly busy-stressed-admin friendly :).
Thanks for building Backblaze. If you're able to share some feedback with the team - this is an extremely important factor. If you can ensure downtime is avoided during maintenance windows, it will make your service much more viable for production systems
This is an unreasonable expectation. The whole point of a maintenance window is to allot an expected time when there might be downtime.
Otherwise, the maintenance window effectively becomes 24x365, since "ensuring downtime is avoided during the maintenance window" literally means making the maintenance window have the same uptime as a non-maintenance window.
It's not necessarily unreasonable, it just depends on what kind of product they want to offer. S3, Google Cloud Storage, etc. do not have a maintenance window that I'm aware of. That's not to say they would never go down, but if they do you would expect an alert and an apology, at least. Many applications require this kind of expected uptime, though of course there are others (backup, etc.) that do not.
If you have some umm other useful tips for that 100% uptime let the rest of the world know. I am more happy with realistic and upfront statements from B2 than some wishful thinking from potential users.
Not often, I only remember a few times in the last couple of years. If you're just hosting backups, you're unlikely to even notice. But if you're serving live production data from B2, it can bring down your whole service, which is quite painful especially if you have a large customer base.
Same here. We're midway through a migration, and this has us re-thinking the whole move. I wonder whether, if we had a duplicate copy of our data in the EU data center, we could fall back to it during US downtime, or whether the entire 'cloud' goes down at once.
Replying to my own comment as I just talked to their support about this: "The maintenance window does affect all our data simultaneously. As we push the updates through one data center to another."
Welp, so much for that.
Side question: Apart from this maintenance window, is B2 Cloud reliable? I've heard of problems with the S3 API. Is the "native" API more stable? Would love to know your insight, it will potentially save me a lot of time!
Several years ago Backblaze lost all of my wife's data. Their dashboard said it was all there, and we trusted their systems to be accurate. When attempting to download the data it turned out that none of it was there. When my wife contacted support they tried to blame her.
Obviously this was a few years ago, but a backup provider failing at their one job and then blaming the customer left a really bad impression that keeps me from using them.
You were probably a victim of their 30 day deletion policy. If for any reason (firewall, etc) you did not connect to the backup servers your data would just be purged without a grace period. For that reason I built my own backup sync using B2 directly instead of their backup service (and it’s a lot cheaper).
We actually checked that: the day we went to get the laptop repaired, we confirmed that it was active and backed up, and a week later the restore failed.
Backblaze eventually admitted that their dashboards aren't realtime, and they had a bug which was showing us (and their client) files that didn't exist.
Depends entirely on how much data you have. If it's less than 1TB then $.005/GB/month is less than $5/month. These days most people would blow past 1TB pretty easily, so for most people the unlimited Backblaze backup would be cheaper (if you're on Mac or Windows, where it is even an option).
Do people really blow past 1TB easily? Even with all my pictures I don't really get past that mark. Many people I know only have a laptop and all their data on there, it's rare to see laptops with more than 1TB capacity. So apart from people with a high amount of pictures or videos I wouldn't expect many to have more than 1TB backup needs.
Yeah, most is probably an overstatement. I know for me, between photos, videos, music, VM images, etc. I have way more than 1TB of data that I want to be backed up.
I wonder what the version of the old "if you don't test restoring you don't have backups" rule is for these new services. Restoring costs money and laptop users don't have good ways of doing complete restores just for test as it's a lot of downtime.
Maybe it needs a kind of stochastic, automated approach: a program that finds a sufficiently small (relative to restore costs) sample of files on your computer (some old, some recently changed, etc.), tries restoring them, and verifies the results.
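Something like this sketch, where `restore_file` is a hypothetical stand-in for whatever your backup provider's restore call looks like:

```python
import hashlib
import random
from pathlib import Path

def spot_check_backup(root: Path, restore_file, sample_size: int = 20) -> list:
    """Restore a random sample of files and compare checksums against the
    local copies.  `restore_file(relative_path) -> bytes` is a placeholder for
    the provider-specific restore call.  Note: files modified since the last
    backup run will show up as mismatches, so filter by mtime in practice."""
    candidates = [p for p in root.rglob("*") if p.is_file()]
    mismatches = []
    for path in random.sample(candidates, min(sample_size, len(candidates))):
        restored = restore_file(path.relative_to(root))
        local_digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if hashlib.sha256(restored).hexdigest() != local_digest:
            mismatches.append(path)
    return mismatches
```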
> Restoring costs money and laptop users don't have good ways of doing complete restores just for test as it's a lot of downtime.
At least for the standard Backblaze service you can download for free. For a USB drive you float the cost of the drive (they reimburse you when you return it)--maybe you pay shipping?
Downloading rarely makes sense for a full restore, but is perfect for smaller restores or tests. Even if it didn't blow away your quota, they only keep the packaged restores around for 7 days, and I've found it difficult to restore a large amount of data over home Internet in that time.
To restore it to a drive all you pay out of pocket is return shipping of the drive. The one time I had to use it I was slightly over the 30 days (I was waiting on a repair before I could restore the data) and it wasn't an issue.
I haven't used it a lot but it's been a solid endpoint for backups.
I did some medium-intensity benchmarking a while back and decided not to put certain server data on it because I was getting a few 20+ second timeouts per thousand read requests. I can handle server errors, and I have retry logic, but this was something where I needed to be able to access the data within a second or two. Maybe it would have worked better if I set a very aggressive timeout, I'm not sure. Deeper testing is something I'll worry about some other time if the data actually grows past a couple hundred gigabytes.
This was mostly with the S3 API, I don't remember if I ever succeeded in getting the program to use the native one.
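If anyone wants to try the aggressive-timeout approach, the idea is just to cap each attempt well below the observed 20-second stalls and retry rather than wait; a sketch (the one-second threshold and retry count are made-up numbers):

```python
import requests

def get_with_deadline(url: str, per_attempt_timeout: float = 1.0, attempts: int = 3) -> bytes:
    """Give up on any single attempt after `per_attempt_timeout` seconds and
    retry, so one slow request can't stall the caller for 20+ seconds."""
    last_error = None
    for _ in range(attempts):
        try:
            resp = requests.get(url, timeout=per_attempt_timeout)
            resp.raise_for_status()
            return resp.content
        except requests.RequestException as err:
            last_error = err
    raise TimeoutError(f"gave up after {attempts} attempts: {last_error}")
```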
Have had live production data on there for a couple of years and it's been very solid outside the maintenance windows.
(The exception being the recent outages GoDaddy caused for them, but since they've moved to using Cloudflare as their registrar, I don't anticipate further issues there: https://news.ycombinator.com/item?id=26119619 )
I love B2's free ingress and egress when you use CloudFlare, they utterly destroy the competition on price.
But my true dream would be for backblaze to someday offer ZFS as a service.
I want to `zfs send -i my_pool@2020-03-11 | b2zfs recv some_bucket_id`, then be able to view my snapshots and files within in the backblaze web UI, and restore with `b2zfs send -i some_bucket_id/my_pool@2020-03-11 | zfs recv my_pool`.
You can already mount B2 as a FUSE filesystem with something like ExpanDrive, then write ZFS raw file vdevs to the B2 FUSE mount, but it's horrifically slow and probably too janky for any real use.
Rsync.net looks great but the storage cost is 5x B2 unfortunately
EDIT: As mentioned below this is for the most expensive (lowest capacity) tier. I’d been comparing for my own home use and so I would be unlikely to exceed 10TB but if you’re looking at higher capacity then maybe the calculus is different.
Without B2 adding anything ZFS-specific, that should be possible with a regular large file upload, if you can deal with each upload part being buffered in a file or in memory locally before upload.
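Something along these lines, using the S3-compatible endpoint so boto3's multipart upload does the heavy lifting. The endpoint URL, bucket, and 100 MB part size are placeholders, credentials come from the usual boto3 config, and the restore direction is left out:

```python
import subprocess
import boto3

PART_SIZE = 100 * 1024 * 1024  # buffer 100 MB per part in memory (placeholder size)

# Example B2 S3-compatible endpoint; substitute your bucket's region.
s3 = boto3.client("s3", endpoint_url="https://s3.us-west-002.backblazeb2.com")

def backup_snapshot(pool_snapshot: str, bucket: str, key: str) -> None:
    """Pipe `zfs send` into an S3-compatible multipart upload, one buffered part at a time."""
    proc = subprocess.Popen(["zfs", "send", pool_snapshot], stdout=subprocess.PIPE)
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts, part_number = [], 1
    try:
        while True:
            chunk = proc.stdout.read(PART_SIZE)
            if not chunk:
                break
            resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                                  PartNumber=part_number, Body=chunk)
            parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
            part_number += 1
        s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"],
                                     MultipartUpload={"Parts": parts})
    except Exception:
        s3.abort_multipart_upload(Bucket=bucket, Key=key, UploadId=upload["UploadId"])
        raise
```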
It's interesting because it feels like a small turf war. The "Amazon Provider Team" wants to own the s3 provider backend, and you want to make it pluggable. Is this to avoid an explicit "b2" provider? I can see how it might be confusing to use terraform and see an s3 provider and then have it...not use s3. :)
Terraform doesn't allow any backends other than the built-in ones. So even if you wanted to provide a "b2" backend, it'd have to be built into Terraform itself.
Using the S3 backend with B2 should work fine since B2 exposes an S3-compatible API, but it's made more difficult because the "Amazon Provider Team" is the one that maintains the "s3" backend for Terraform, and they want to do additional validation that matches AWS's expectations.
Also, was there a reason it's not against the B2 API? Not a judgment, just curious about the design tradeoffs in a professional-talking-to-professional sense.
That design choice prevents their provider from working with official Terraform container images. They should make the API calls directly in Go, it's just weird to do it via Python... They have very few resources exposed on their API so it's not like writing a Go client would be all that hard.
Inside the article is a section "How to get started using Backblaze B2 in Terraform" with a link [1] to the getting started guide which should have everything you need.
I can see why they put it on a separate page to not clutter up the article.
I was less interested in the code sample as a "how-to" and more for skimmability – rather than reading a bunch of words I just wanted to see what it would look like.
> Terraform is an open-source infrastructure as code (IaC) tool
Maybe it was in the beginning, but Terraform is far more powerful than that now. Terraform is a monad that neatly separates pure declarative configuration from the I/O (side effects) that are factored out into providers. Terraform used at its most powerful is not limited to infrastructure, it also sets up the platforms and applications running on that infrastructure for you, by separating the configuration of the platforms and applications from the generic API calls that apply that configuration. Terraform's dependency graph ensures that the calls are made in the right order, no matter if they are made to infrastructure APIs, platform APIs, or APIs belonging to layers further up the stack
For large downloads, does BB support the Range header? If the user is on a connection that is not suitable for long downloads, could the Range header be used to download a large file in several parts?
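For what it's worth, HTTP range requests are easy to test from the client side; a quick sketch against a placeholder download URL (a server that honors Range answers with a 206 status and a Content-Range header):

```python
import requests

url = "https://example.com/big-file.bin"  # placeholder download URL
chunk = 16 * 1024 * 1024                  # fetch in 16 MB pieces

# Ask for the total size first, then fetch one ranged piece at a time.
size = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
with open("big-file.bin", "wb") as out:
    for start in range(0, size, chunk):
        end = min(start + chunk, size) - 1
        resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"})
        assert resp.status_code == 206, "server did not honor the Range header"
        out.write(resp.content)
```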
Popped in to say backblaze built out a tf provider at our request and they were great about it! Got a quick early build out to test, GA build a few months later and are very responsive to feedback. Pleasure to deal with
I was just looking at this new Terraform provider yesterday; how timely. Nice to see the quickstart guide, this will be helpful for managing the buckets and application keys.
You could use NixOps[0] for Nix but I'm not sure you can directly compare Terraform and Guix/Nix? My set up involves Terraform for infrastructure and Nix for provisioning, and it's working for me so far.
I think this too, but at the same time the reality is a tool like Terraform is really complicated to implement well, and, importantly, it always has to work. All the time. There are really high standards for this, and in my personal experience, alternative solutions like NixOps don't quite stack up in reliability or broad utility versus Terraform. The design is good in theory, but it needs just a huge amount of work to be trusted.
For the most part, I provision things with Terraform and then instantiate the servers with NixOS/Nix itself, and this mostly works. For bonus points you can use Nix to generate the HCL that Terraform reads in (because Nix can write JSON, and HCL is just JSON in a trenchcoat) if you want to put some veneer on it.
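As an illustration of the JSON-in-a-trenchcoat point (in Python rather than Nix, since Terraform reads *.tf.json as equivalent to HCL). The b2_bucket resource and attribute names here are assumptions based on the provider's docs, so double-check them against the real schema:

```python
import json

# Any language that can emit JSON can generate Terraform configuration.
# Resource/attribute names below are assumptions; verify against the B2 provider docs.
config = {
    "terraform": {
        "required_providers": {
            "b2": {"source": "Backblaze/b2"}
        }
    },
    "resource": {
        "b2_bucket": {
            "backups": {
                "bucket_name": "my-generated-bucket",   # placeholder name
                "bucket_type": "allPrivate",
            }
        }
    },
}

with open("main.tf.json", "w") as f:
    json.dump(config, f, indent=2)
```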
Are there any providers that have such integration? I have only lightly used Nix. Having a hard time understanding what a cloud provider integration would look like.