Silent Data Corruption Is Real

elfchief · on March 12, 2017

It really bugs me (and has for a while) that there is still no mainstream linux filesystem that supports data block checksumming. Silent corruption is not exactly new, and the odds of running into it have grown significantly as drives have gotten bigger. It's a bit maddening that nobody seems to care (or maybe I'm just looking in the wrong places)

(...sure, you could call zfs or btrfs "mainstream", I suppose, but when I say "mainstream" I mean something along the lines of "officially supported by RedHat". zfs isn't, and RH considers btrfs to still be "experimental".)

nayuki · on March 12, 2017

Btrfs has experienced some data loss bugs in recent memory. It looks like ZFS is the only remaining option.

https://www.phoronix.com/scan.php?page=news_item&px=Btrfs-Da...

https://www.spinics.net/lists/linux-btrfs/msg59190.html

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...

seiferteric · on March 12, 2017

Yes I have tried to use btrfs several times for work projects and personally because I was very excited about it but every time I have ran into severe bugs even though it was said to be "stable". I have given up for now, maybe I will check back in a couple more years.

Dylan16807 · on March 12, 2017

I really like btrfs for my backup machine, but it sometimes manages to hang when it's cleaning up deleted snapshots. This is a problem that's much worse when on a hard drive or fragmented file system, but I've gotten it to happen even on a recently-balanced drive with a small amount of snapshots.

I recently had a server be crippled by running snapper on default settings for a few months. And after a couple days of balancing (which desperately needs a throttle control) it wasn't much better, so I gave up on having it run btrfs.

I think the record I managed was slightly over two minutes of btrfs blocking all disk I/O. Something is deeply wrong with how it organizes transactions.

dom0 · on March 13, 2017

Funny, my btrfs story is somewhat similar. I don't hit severe bugs, though, just severe performance issues. E.g. btrfs on / always sooner or later means for me that the entire IO to the disk that's on will be _thrashed_ during system updates or package installs / uninstalls.

djsumdog · on March 12, 2017

The last time I was using btrfs was around 2014, and I was wondering why my hard drive was always showing 100% utilization even after I moved/deleted a ton-o-stuff.

Turns out at the time, re-balancing still had to be run manually. I'm not sure if that still holds true.

delroth · on March 13, 2017

It is still a problem, I just had one of my personal servers hit issues where I couldn't even manually re-balance because the metadata was full. Had to apply weird workarounds to be able to write to that filesystem again...

tombrossman · on March 12, 2017

Any idea if ZFS plays well with Ubuntu's full-disk encryption? I've used the FDE option at install for years and every time I upgrade (I wipe & reinstall every year or so) I try to understand how to first set up ZFS, then FDE, and then I realize it's far too complicated for me. Any good tutorials or setup guides that even a moron could understand?

I've got a pretty good setup now with a fairly complex fstab, multiple SSDs, backup drives, and everything fully encrypted and auto mounting at boot. I'd really love to move this to a file system more resistant to data corruption.

zxv · on March 12, 2017

Native ZFS encryption may become available soon. It is pending code review: https://github.com/zfsonlinux/zfs/pull/5769

There was a presentation of this by Tom Caputi at the most recent OpenZFS Developer Summit.

Slides: https://drive.google.com/file/d/0B5hUzsxe4cdmU3ZTRXNxa2JIaDQ... Video: https://youtu.be/frnLiXclAMo Conference: http://open-zfs.org/wiki/OpenZFS_Developer_Summit

tnorgaard · on March 12, 2017

We run with ZFS over LUKS encrypted volumes in production on AWS ephemeral disks and have done so in over two years on Ubuntu 14.04 and 16.04. The major issue for us has been getting the startup order right, as timing issues does occur once you have many instances. To solve this, we use upstart (14.04) and systemd (16.04) together with Puppet to control the ordering.

Performance wise it does fairly well, our benchmarks shows ~10-15% decrease on random 8kb IO (14.04).

We are definitely looking forward to ZFS native encryption!

newman314 · on March 13, 2017

What is the right order?

weitzj · on March 13, 2017

Since ZFS will run on blocklevel devices and you want to get the ZFS benefits of Snapshots/compression/(deduplication), in my opinion it makes sense to do the encryption at the blocklevel, i.e. LUKS has to provide decrypted block level devices before ZFS searches for its zpools. When ZFS native encryption is available on Linux this will be different, since you much finer control on what to encrypt and you can keep all ZFS features.

So:

First decrypt LUKS (we are doing this in GRUB) Then mount zpool(s)

drvdevd · on March 12, 2017

I use it all over with LUKS on Ubuntu. It works fine, but there's one little hitch:

when calling anything that ultimately calls grub-probe (e.g. apt-get upgrade), you have to symlink the decrypted device mapper volume up a layer into /dev because grub-probe can't seem to find the ZFS vdev(s) otherwise. ie: "ln -s /dev/mapper/encrypted-zfs-vdev /dev".

This is in fact the case on every Linux distribution I've run ZFS over dm-crypt on.

EDIT: IOW its a grub bug not an Ubuntu bug.

XorNot · on March 13, 2017

The other problem I've found is that grub's update-grub scripts do not handle mirrored ZFS volumes well at all - they wind up just spraying doubled up invalid commands everywhere even up to yakkety so far.

I've had it on my backlog to at some point go in and sort out my initramfs's insanity when it comes to handling crypt'd disks in general - it should be a lot less brittle then it is.

weitzj · on March 13, 2017

We are rolling our own update grub as a bash script, which does some sanity checks:

- is the crypto module still present in GRUB config?

- is grub running in text only mode? Otherwise we cannot see the LUKS prompt to decrypt the devices.

- does initramfs know about all mirror devices (instead of an early return after the first)

nisa · on March 12, 2017

You can always use zvols and just use ext4+luks on that. There is also work on ZFS native encryption that looks pretty promising, not sure if it's ready yet.

benley · on March 12, 2017

I haven't tried to do it with Ubuntu specifically, but I do know that ZFS-on-linux works fine atop LUKS full-disk encryption. My laptop is running NixOS with such a setup, and I'm pretty sure I followed the Ubuntu ZFS documentation while figuring out how to do it.

(summary: I don't know how to make it super easy, but what you want should totally work)

_vdrz · on March 12, 2017

bcachefs is working on it but needs support:

- https://bcache.evilpiepirate.org/Bcachefs/

- https://www.patreon.com/bcachefs/

(I'm not Kent but I love the work)

patrickg_zill · on March 12, 2017

Having used ZFS on Solaris/x86 for years and years, and now on Linux, I would say that your best best is to use ZFS.

While ZFS might not be supported oput of the box on "your-choice-distro" I would point out that Ansible or other automation tools should make it pretty easy to end up with a repeatable installed machine or VM image with ZFS.

cyphar · on March 13, 2017

> (...sure, you could call zfs or btrfs "mainstream", I suppose, but when I say "mainstream" I mean something along the lines of "officially supported by RedHat". zfs isn't, and RH considers btrfs to still be "experimental".)

SUSE (the other main enterprise Linux distribution vendor) has been supporting Btrfs in the enterprise (as the default filesystem) for several years.

[Disclosure: I work for SUSE.]

throwaway7767 · on March 15, 2017

No disrespect intended, but my experience is that SuSE has always played fast and loose with the filesystem defaults.

I recall they switched to reiserfs as default at one point. reiserfs was never a good choice for data consistency - the fact that storing a reiserfs image in a file on the host reiserfs filesystem and then doing an fsck on the host FS would corrupt the FS should be a clear signal that there are fundamental problems remaining to be solved.

That said, I'm playing with btrfs on some of my machines, and it seems quite nice. But no way would I risk using it on a production server at this time.

cyphar · on March 17, 2017

None taken. To be clear, we do provide enterprise support for ext4 and XFS as well (which a lot of people use for the reasons you mentioned). In my experience, btrfs still has some growing pains (especially when it comes to quotas, which will cause your machine to lag quite a bit when doing a balance) but is definitely serviceable as a daily driver (though for long-term storage I use XFS).

throwaway7767 · on March 17, 2017

Ah, that's good to hear. It's been a while since I've used SuSE.

Of course, someone has to go first and filesystems never truly get battle-hardened until distros start pushing them. I appreciate that SuSE does this from that perspective. It means when I switch over there will be less bugs. :)

I'm using btrfs as a daily driver on my workstations so I get some experience with the tooling, and also because features like consistent snapshots are really nice to have. Still haven't taken the plunge on the server side, I expect I'll give it a few years until it's considered "boring".

dom0 · on March 12, 2017

btrfs only uses CRC32c which is weakish. ZFS is great but not exactly portable. I started to use Borg now for archiving purposes as well, not just backup. For me (low access concurrency, i.e. single or at most "a few" users) that works very well. Portable + strong checksumming + strong crypto + mountable + reasonable speed (with prospect of more) is a good package. It doesn't solve error correction, though.

rleigh · on March 12, 2017

Not portable? In comparison to which filesytems? I can easily export a ZFS pool on Linux, physically transport the discs to a FreeBSD or IllumOS server and import the pool. Or to MacOS X, which is also supported (though I haven't tried it unlike the others where it worked perfectly).

That's already far ahead of ext, ufs, xfs, jfs, btrfs etc. The only ones offhand which are possibly more portable are fat, hfs, udf and ntfs, and you're not exactly going to want to use them for any serious purpose on a Unix system. ZFS is the most portable and featureful Unix filesystem around at present IMHO.

ak217 · on March 12, 2017

crc32c is not weakish, and was chosen for a reason: crc32c has widespread hardware acceleration support that remains faster than any hash, and crc32c can be computed in parallel (unlike a hash, it has no hidden state, so you can sum independently computed block checksums to get the overall blob checksum). Bitrot detection doesn't need a cryptographic hash. You may want a hash for other purposes (like if you somehow trust your metadata volume more than your data volume), but that's a separate and slower use case.

bascule · on March 13, 2017

Bitrot detection doesn't need a cryptographic hash.

Not only is a cryptographic hash unnecessary, under certain circumstances it will actually do a worse job.

Cryptographic hashes operate under a different set of constraints than error detecting code. With an error detecting code, it's desirable to guarantee a different checksum in the event of a bitflip.

With a cryptographic random oracle, this is not the case: we want all outcomes to have equal probability, even potentially producing the same digest in the event of a bitflip. As an example of a system which failed in this way: Enigma was specifically designed so the ciphertext of a given letter was always different from its plaintext. This left a statistical signature on the ciphertext which in turn was exploited by cryptanalysts to recover plaintexts. (Note: a comparison to block ciphers is apt as most hash functions are, at their core, similar to block ciphers)

Though the digests of cryptographic hash functions are so large it's statistically improbable for a single bitflip to result in the same digest as the original, it is not a guarantee the same way it is with the CRC family.

Cryptographic hash functions are not designed to be error detecting codes. They are designed to be random oracles. Outside a security context, using a CRC family function will not only be faster, but will actually provide guarantees cryptographic hash functions can't.

rwiggins · on March 13, 2017

In practice, the chance of a digest collision between two messages that differ in a single bit is exceedingly small for any secure cryptographic hash function. It's so small that it's practically not worth considering. Cryptographers are incredibly careful in building and ensuring proper diffusion in cryptographic hash functions.

mrb · on March 13, 2017

"Not only is a cryptographic hash unnecessary, under certain circumstances it will actually do a worse job."

A cryptographic hash is unnecessary, but as you point out in the next to last paragraph, it is statistically improbable that it will do a worse job. Because collisions are statistically improbable.

jepler · on March 13, 2017

Specifically, I wrote a program to search for single-bit-flip collisions in sha1 truncated to 16 bits. The program didn't need to search for long before finding two messages with the same 16-truncated sha1 with a single bit flip at bit 1 of byte 171 of a 256-byte message. 376 1 171 be44b935e7ecfc81d1fe2cddcd7c1d7e04338fd83fa994cd6a877732ca5d8db83346bd9ccbfc4c8770682bd307c782421a512a80a106be87825d5c13f3156e23ffaacdfc1651f88f775507d1175542def2ccf084271ebd4ead175c8a448be0d50b26f59d970301ebc5a7f672d3ea870d9a1e02f8f5fd01c38297b8aa264a3f07fec32f9a91aa359784d2d9ce0e4649465c705f50feed23dcbefc0a726cfadb5e47ee577ed45203f90d6e2e650d42ddb10cba49d06bd4cdad4e6eaf5cfcb062de2539fc847ce0c104f2e667369080eaaab5934ae5f7f1ba733c3d1bfbda87bfa72ef12475b9ff0edc4deb99e6a5cf387c7f6b9c71ea62b4db4bb67c92d36460dd be44b935e7ecfc81d1fe2cddcd7c1d7e04338fd83fa994cd6a877732ca5d8db83346bd9ccbfc4c8770682bd307c782421a512a80a106be87825d5c13f3156e23ffaacdfc1651f88f775507d1175542def2ccf084271ebd4ead175c8a448be0d50b26f59d970301ebc5a7f672d3ea870d9a1e02f8f5fd01c38297b8aa264a3f07fec32f9a91aa359784d2d9ce0e4649465c705f50feed23dcbefc0a726cfadb5e47ee577ed45203f90d6e2e640d42ddb10cba49d06bd4cdad4e6eaf5cfcb062de2539fc847ce0c104f2e667369080eaaab5934ae5f7f1ba733c3d1bfbda87bfa72ef12475b9ff0edc4deb99e6a5cf387c7f6b9c71ea62b4db4bb67c92d36460dd

https://gist.github.com/jepler/96d1e779dc95b8941b208887e10a8...

On the other hand, any CRC will detect all such errors; a well-chosen one such as CRC32C will detect all messages with up to 5 bits flipped at this message size.

This is quite appropriate for the error model in data transmission, of uncorrelated bit errors. https://users.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koop... is a pretty good paper, though there are probably better ones for readers without an existing background in how CRC works.

mrb · on March 13, 2017

You are not testing a crypto hash. "Crypto hash" means it is cryptographically strong, not truncated to 16 bits. For example ZFS with checksum=sha256 will use the full 256-bit hash for detecting data corruption.

jepler · on March 14, 2017

Yup, you're right. if you use a full size cryptographic hash then the number of undetected errors can be treated as 0 regardless of hamming distance. On the other hand, it has 8x the storage overhead of a 32-bit CRC.

jepler · on March 13, 2017

Less so if you assume the cryptographic hash will be truncated to 32bits so that its size matches crc32c. Furthermore, some of those collisions will be on message pairs with small hamming distance, probably including messages with a single bit flipped which CRC will always detect.

dom0 · on March 12, 2017

CRC32c -> I saw this many times fail to detect corruption on message lengths anywhere between a couple kB and a few MB. btrfs blocks are 16 kB iirc, so in range. The longer hashes of ZFS, Borg and so on mean that if it's corrupted I _definitely_ know. Not so confident with CRC32 from experience.

ak217 · on March 13, 2017

I'm curious about the setting in which you saw these failures, could you elaborate?

Unlike a plain checksum, CRC-32C is hardened against bias, which means its distribution is not far from that of an ideal checksum. This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of ((2 * * 32) * 16KB)=64TB of corrupted data to get a random failure. Modern hard drives corrupt data at a rate of once every few terabytes. TCP without SSL (because of its highly imperfect checksum) corrupts data at a rate of once every few gigabytes. Assuming an extremely bad scenario of a corrupt packet once every 1 GB, in theory you'd need to read more than a zettabyte of data to get a random CRC-32C failure. I'm not doubting that real world performance could be much worse, but I'd like to understand how.

vardump · on March 13, 2017

> This means if your bitrot is random and you're using 16KB blocks, you will need to see on the order of ((2 * * 32) * 16KB)=64TB of corrupted data to get a random failure.

No, that's what you need to generate one guaranteed failure, when enumerating different random corruption possibilities. Simply because 32-bit number can at most represent 2^32 different states.

In practice, you'd have 50% probability to have a collision for every 32 TB... assuming perfect distribution.

By the way, 32 TB takes just 4-15 hours to read from a modern SSD. Terabyte is just not so much data nowadays.

leni536 · on March 13, 2017

Just a nit: you don't get a guaranteed failure at 64TB, you get a failure with approx 1-1/e ~= 63% probability. At 32TB you get a failure with approx 1-1/sqrt(e) ~= 39% probability.

I do agree that tens of TBs are not too much data, but mind that this probability means that you need to feed your checksum 64TB worth of 16KB blocks, every one of them being corrupt, to let at least one of them go trough unnoticed with 63% probability. So you don't only need to calculate with the throughput of your SSD, but the throughput multiplied with the corruption rate.

marcosdumay · on March 13, 2017

For disk storage CRC-32C is still non-broken. You can't say the same about on-board communication protocols or even some LANs.

When people started using CRC-32 it was because with the technology of the time it was virtually impossible to see collisions. Nowadays we are discussing if it's reasonable to expect a data volume that gives you 40% or 60% of collision chance.

CRC32 end is way overdue. We should standardize on a CRC64 algorithm soon, or will have our hands forced and probably stick with a bad choice.

speleo_engr · on March 13, 2017

Posts like this are what keep me coming back to HN.

cmurf · on March 13, 2017

Not exactly, blocks are 4KiB so the vast majority of the CRC32C's apply to 4KiB block size; the metadata uses 16KiB nodes and those have their own checksum also. From what I see in the on disk format docs, it's a 20 byte checksum.

johnramsden · on March 13, 2017

ZFS is probably one of your better options if you want portability. It works on most Unix like operating systems, and due to its zfs send/recv capability you can send your data all over the place - and can trust that it will actually end up the same on the other end. If that's not portability I don't know what is.

If what you're saying is it's not portable to Windows then sure, but compared to most filesystems it's extremely portable.

X86BSD · on March 12, 2017

ZFS not portable? Have you ever used it?

"ZFS export" on the host system, remove drives, insert drives in new server, "ZFS import" on the new server, It really is that simple.

dpedu · on March 12, 2017

I think he means portability of the software. What I want to 'ZFS import' on a Windows or macOS machine?

X86BSD · on March 12, 2017

Oh, well it's ported from Solaris to illumos and FreeBSD and to a lesser extent OS X and Linux. So I'm still confused about the portability claim.

tux1968 · on March 12, 2017

The license incompatibility has kept it out of the mainline kernel for Linux, so it's not really a viable option there in many situations. Linux is definitely lacking in this department.

rleigh · on March 12, 2017

That's a minor concern and unrelated to the portability claim. That comes down to choice, and is not a technical consideration. I'm using it with Ubuntu 16.04 LTS and 16.10 where it works out of the box. It's most certainly portable to and from Linux and other systems; I've done it personally, and it works a treat.

tux1968 · on March 12, 2017

Minor for some, major for others. And it is technical too, because there are maintenance ramifications for it not being in the mainline kernel, for example how quickly a security patch can be applied.

And for others a constriction on which distribution they can move to etc. It just reduces the number of situations where it can be used, even if you find yourself in one where it can.

justincormack · on March 12, 2017

ZFS on OSX has been revived I believe.

drvdevd · on March 12, 2017

I wonder if the recent Linux syscall emulation on Windows would somehow make it possible or easier to port ZFS on Linux to Windows.

I know you have the SPL anyway, so maybe with the addition of the Linux POSIX-ish layer in there this could be the case...

chungy · on March 12, 2017

Short answer: No it wouldn't make it easier to port.

Longer answer: The Linux subsystem in Windows 10 only deals with userspace. It doesn't support kernel modules nor changes anything about making Windows drivers. Porting ZFS to Windows is certainly possible, but it will take quite a lot of effort, and the Linux subsystem is irrelevant in that situation.

drvdevd · on March 12, 2017

Yeah, I figured as much, but was hoping there might be something about the Linux subsystem that would be helpful in porting drivers around, beyond userspace.

Obviously, I've yet to actually use it myself.

gjjrfcbugxbhf · on March 12, 2017

Would a FUSE implementation of ZFS not be possible? (Just wondering I've no idea what's possible here)

Dylan16807 · on March 12, 2017

A FUSE implementation of ZFS exists and works well, and adding FUSE support to the Windows 10 Linux subsystem appears to be reasonably high up on the priority list.

That doesn't get you access from Windows programs, but there are some other ways to do FUSE or FUSE-like things on Windows..

problems · on March 13, 2017

It may be fairly straightforward to port the ZFS FUSE to Windows if you use things like Dokan or WinFsp which have the FUSE interface supported fairly well - these would give full access via standard Windows tools.

rincebrain · on March 12, 2017

Last I knew, the zfs-fuse codebase hadn't been updated since before feature flags were added to any of the OpenZFS targets, so it's not a particularly well-supported solution...

e12e · on March 12, 2017

Fwiw I've successfully shared my luks-encrypted usb3 zfs-formatted disk to Windows pro on my Surface 4 via a hyper-v vm running Ubuntu and samba. It won't work for all external drives - you need to be able to set the drive as "offline" in device manager under Windows in order to pass it through to the hyper-v vm (and sadly this doesn't appear possible with the sdcard - I had hoped to install Ubuntu on the sdcard and have the option to boot from the sdcard and also boot into the same filsystem under hyper-v).

s-macke · on March 12, 2017

A real benefit would be file system support for Reed-Solomon error correction. Some archival tools support such a feature. You would spent 1%-10% of the disk space for error correction.

fnord123 · on March 12, 2017

Sector failures on a single drive are not independent random events. So doing this on a single drive is not perceived to be a good idea.

The only choices, imo, are multiple disks, or automatic backup to offsite (dropbox, box, one drive, etc).

fnord123 · on March 13, 2017

Sorry, a better answer is that it's already done in the hardware/firmware layer. e.g. from Wikipedia[1]:

""" Modern HDDs present a consistent interface to the rest of the computer, no matter what data encoding scheme is used internally. Typically a DSP in the electronics inside the HDD takes the raw analog voltages from the read head and uses PRML and Reed–Solomon error correction[144] to decode the sector boundaries and sector data, then sends that data out the standard interface. That DSP also watches the error rate detected by error detection and correction, and performs bad sector remapping, data collection for Self-Monitoring, Analysis, and Reporting Technology, and other internal tasks. """

With the answer I gave above, and the fact that it's already being done in hardware, I don't think adding another layer of EC will be fruitful.

[1] https://en.wikipedia.org/wiki/Hard_disk_drive#Access_and_int...

pmarreck · on March 13, 2017

If this is the case, then why does bit-rot still occur?

fnord123 · on March 13, 2017

It's not magic. If you have enough bit flips for the same data then it's not recoverable. Sometimes it could even flip and resemble correct data. It might not happen to you, but if 500 people reading this thread each have access to 50T of data, then sure bit rot can happen in some of that 25 petabytes.

pmarreck · on March 13, 2017

Do you think that there is a need for some kind of ECC at the OS level? There are surely some applications where even 1 bit flip in 25 petabytes is bad. Amazon's 2008 site outage comes to mind: http://status.aws.amazon.com/s3-20080720.html

fnord123 · on March 13, 2017

Yes, that's why you have erasure coding across drives either using RAID groups or using an object store. Just not at the single drive file system level where blocks going bad within a disk tend to be correlated (i.e. the drive is getting old and will die).

Also, EC at the block level would probably spread the blocks around the drive. This means any read would need to seek all over the dang place trying to reassemble files and that's a bad access pattern for spinning disks. Real bad. Like, the worst. It might even reduce the effective lifetime of the drive. So it would be not only correlated with device failure, it could precipitate it.

Maybe it would be ok on an SSD.

zzzcpan · on March 12, 2017

What would you do with a failed checksum on a filesystem level? These errors could be and likely to be transient too.

elfchief · on March 12, 2017

If nothing else, you can log the error. If you have RAID1 you can recopy the block from a good copy. It is, honestly, probably situation-specific, but step #1 is always going to be "identify that you have a problem"

zzzcpan · on March 12, 2017

That's the thing, identifying the problem on the filesystem level is useless if you can only correct it on another level. Unless your filesystem is distributed and self-healing, it's not a place for checksums, it must remain a thin predictable layer on top of a disk.

ProblemFactory · on March 12, 2017

There absolutely is a practical use for checksums and error detection, even if you cannot do error correction.

Error detection allows you to:

* Discover and replace bad hardware (like the author of the article),

* Avoid unknowingly using invalid data for calculations or program code,

* Stop using the system and corrupting more data over time,

* Restore files from backup while still in the retention period of your backup service.

I once had a computer with a bad disk that somehow corrupted 1 bit out of about 150MB written, and probably only in some regions. I only found out after the GZip checksum of a very large downloaded file mismatched, and it took a while to figure out the real cause. By that time I had been using it for months, so it's unclear to this day what other files might have been corrupted.

floatboth · on March 12, 2017

ZFS is self-healing. It can automatically restore from mirror/RAID drives. And if you only have one drive, you can set a parity option so that information will be written several times, like a mirror but on one disk. (Obviously doesn't protect against a whole drive failure :D)

g0xA52A2A · on March 12, 2017

> That's the thing, identifying the problem on the filesystem level is useless if you can only correct it on another level.

Nope I'd take a system that can tell me "Hey sorry I don't have the right data" over "Here's some data, looks good to me" any day of the week.

Also as ProblemFactory points out, ZFS will self-heal whenever possible.

zzzcpan · on March 13, 2017

"Nope I'd take a system that can tell me "Hey sorry I don't have the right data" over "Here's some data, looks good to me" any day of the week."

Nope, it doesn't help with anything. I think you are making false assumptions about the safety of your data. Because to keep consistency such system must automatically shutdown on data corruption and only wake up once the problem is fixed. To keep availability it must automatically fix the problem. There is no third option, because filesystems are not aware of how applications use them. Returning EIO would also introduce inconsistency.

But if you actually care about your data, you shouldn't trust a single box, ever, no matter how fancy the filesystem is. There are just so many things that could go wrong, it's hard to believe someone would even take that risk.

Just a few weeks ago I saw data loss on one of the object storage nodes, because of a typical sata cable issue. It caused multiple kernel panics, affecting data on multiple disks, before I was able to figure it out. Not a problem for a distributed storage, but you wouldn't survive that with a local storage. This is also one of the reasons I consider zfs an over engineered toy, that doesn't target real world problems.

heinrich5991 · on March 12, 2017

What is wrong with returning an IO error (e.g. EIO) if the data read from the disk is corrupt?

adrianN · on March 12, 2017

Use an error correcting code, fix errors you can fix, report unrecoverable errors to the user so that they can restore from backup.

Mister_Snuggles · on March 12, 2017

OpenSuSE (and SLES) has considered btrfs stable for a while.

morecoffee · on March 12, 2017

Making data on individual drives is probably not a long term goal for integrity. It is more likely that multi homing data in geographically disjoint locations with some sort of syncing is a better long term goal. Fixing silent data corruption on a single drive doesn't solve any of the much more likely disasters, like fire, flooding, weather, etc. Not even datacenters can withstand lightning.

elfchief · on March 12, 2017

The problem with multiple geographic locations, in this context, is that you'd have to read from all of them, and compare the results, to know that you have file corruption. Which is, needless to say, not something that it makes sense to do.

The purpose of data block checksumming isn't to make your data more resilient (at least directly), it's to make sure you know you have a problem. Once you know you have a problem, then you can go read from your alternate datacenter or whatever.

db48x · on March 12, 2017

Agreed, and that's how zfs does it behind the scenes. When it detects a read error, it uses the pool redundancy (whether mirrors or RAIDZn, whichever you're using) to transparently recover from the error. Even if you set up a pool with only one disk you can still set it to keep multiple copies of all the data.

A super-ZFS that automatically did that using remote mirrors would be interesting, but would also stretch the definition of "transparent" a bit.

TheSpiceIsLife · on March 12, 2017

Not even datacenters can withstand lightning.

Data centers do have lighting protection. Lighting protection is very low-tech.

https://en.m.wikipedia.org/wiki/Lightning_rod

stubish · on March 13, 2017

zfs is supported in Ubuntu by Canonical, both in the 'out of the box and recommended' sense and the pay-for-support sense via Ubuntu Advantage.

Unfortunately, you still can't get a root zfs partition without hassles.

X86BSD · on March 12, 2017

If you actually care about your data you simply will have to switch platforms. Illumos or FreeBSD. Linux has no answer to the horrors of garbage Filesystems and garbage lying hardware, firmware etc that has plagued UNIX admins for ever. Btrfs is apparently the best they can do and its glaringly deficient architecturally.

Filesystems are HARD. And as bcantrill notes, if you can't get simple things like epoll right your dead in the water on harder things like kqueue and zfs, Dtrace, jails, etc.

This is proving true looking at the fact Linux land can't get a filesystem right and they've been trying for decades.

Many will argue and try and justify that simple fact but as the cold harsh hammer of reality slams into their skull eventually those folks will be forced to face the reality.

Linux honestly just needs to adopt zfs. Period. But because of the Linux license that may be all but impossible now. Which leaves Linux in this untenable position of being stuck in a tar pit. Unable to adopt the clear choice to move forward and unable to have the technical ability to implement their own solution equal to zfs. So wither now Linux?

stock_toaster · on March 12, 2017

  > Linux honestly just needs to adopt zfs. Period.
  > But because of the Linux license that may be all but impossible now.

Hammer2 might be an option, once it gets finished.

tytso · on March 12, 2017

The question is whether or not "the market" cares enough about "its data" enough that they are willing to pay for what would cost. And data checksumming and RAID 1 (so you can recover from the data corruption) does cost something --- even if it is a 100% expansion in the bytes needed to store the data. And COW file systems do cost something in terms of HDD overhead. Maybe you have enough slack in terms of how you are using your HDD that you don't notice, but if you are using all of your HDD's I/O capacity with an update-in-place file system, when you switch to a COW file system, you will notice the performance hit.

If your only copy of your children's baby pictures are on a solo 4TB drive in your house --- then you may be more than willing to pay that cost. But what if your house burns down? It may be that the right answer is do data backups in the cloud, and there you will want erasure coding, and perhaps more than 100% overhead --- you might erasure coding scheme that has a 150% to 300% blowup in space so you have better data availability, not just data loss avoidance.

I do agree that file systems are hard, but at the same time you need to have a business case for doing that level of investment. This is true whether it is for a proprietary or open source code base. Many years ago, before ZFS was announced, I participated in a company wide study at my employer at the time about whether it was worth it to invest in file systems. Many distinguished engineers and fellows, as well as business people and product managers participated. The question was studied not just in terms of the technical questions, but also from the business perspective --- "was the ROI on this positive; would customers actually pay more such that the company would actually gain market share, or otherwise profit from making this investment". And the answer was "No". This made me sad, because it meant that my company wasn't going to invest in further file system technologies. But from a stock holder's perspective, the company's capital was better spent investing in middleware and database products, because there the ROI was actually positive.

From everything that I've read and from listening to Bryan's presentations, I understanding is that at Sun they did _not_ do a careful business investigation before deciding to invest in ZFS. And so, "Good for Open Solaris!" Maybe it kinda sucked if you were a Sun shareholder, but maybe Sun was going to go belly-up anyway, so might as well get some cool technology developed before the ship went under. :-)

As far as Linux is concerned, at some level, the amount of volunteer manpower and corporate investment in btrfs speaks to, I suspect, a similar business decision being made across the Linux ecosystem about whether or not the ROI of investing in btrfs makes sense. The companies that have invested in ext4 because in no-journal mode, have done so because if what you want is a back-end, local disk file system, on top of which you put a cluster file system like HDFS or GFS or Colossus, and where the data integrity is done end-to-end at the cluster file system level, and not at the local disk file system, you'll want the lowest overhead file system layer you can get. That doesn't mean that you don't care about data integrity; you do! But you've made an architectural decision about where to place that functionality, and a business decision about where to invest their proprietary and open source development work.

Each company and each user should make their own decisions. If you don't need the software compatibility and other advantages of Linux, and if data integrity is hugely important, and you don't want to go down the path of using some userspace solution like Camilstore, or some cloud service like AWS S3 or Google Compute Storage, then perhaps switching to FreeBSD is the right solution for you. Or you can choose to contribute to make btrfs more stable. And in some cases that may mean being willing to accept a certain risk for data loss at one level, because you handle your integrity requirements at a different level. (e.g., why worry about your source tree on your laptop when it's backed up regularly via "git push" to servers around the internet?)

kabdib · on March 12, 2017

Oh, yes. Silent bit errors are tons of fun to track down.

I spent a day chasing what turned out to be a bad bit in the cache of a disk drive; bits would get set to zero in random sectors, but always at a specific sector offset. The drive firmware didn't bother doing any kind of memory test; even a simple stuck-at test would have found this and preserved the customer's data.

In another case, we had Merkle-tree integrity checking in a file system, to prevent attackers from tampering with data. The unasked-for feature was that it was a memory test, too, and we found a bunch of systems with bad RAM. ECC would have made this a non-issue, but this was consumer-level hardware with very small cost margins.

It's fun (well maybe "fun" isn't the right word) to watch the different ways that large populations of systems fail. Crash reports from 50M machines will shake your trust in anything more powerful than a pocket calculator.

blablabloe · on March 13, 2017

Enterprise disks do have ECC cache as opposed to consumer drives. Was it a consumer drive?

nisa · on March 12, 2017

ZFS is also crazy good on surviving disks with bad sectors (as long as they still respond fast). Check out this paper: https://research.cs.wisc.edu/wind/Publications/zfs-corruptio...

They even spread the metadata across the disk by default. I'm running on some old WD-Greens with 1500+ of bad sectors and it's cruising along with RAIDZ just fine.

There is also failmode=continue where ZFS doesn't hang when it can't read something. If you have a distributed layer above ZFS that also checksums (like HDFS) you can go pretty far even without RAID and quite broken disks. There is also copies=n. When ZFS broke, the disk usally stopped talking or died a few days later. btrs, ext4 just choke and remount ro quite fast (probably the best and correct course of action) but you can tell ZFS to just carry on! Great piece of engineering!

comboy · on March 13, 2017

Pretty fascinating. But just based on this comment, I reckon that these 1500+ bad sectors drives aren't worth your time. So, why? Is it just that you wanted to play with all these options and don't really care about the data on these drives, or do you actually believe it's reasonable bang for the buck?

nisa · on March 13, 2017

I forget the disclaimer that you should not do this, ever :)

We had a cluster for Hadoop experiments at uni and no ressources to replace all the faulty disks at that time (20-30% were faulty to some degree from the SMART values - more than 150 disks). So this was kind of an experiment. All used data was available and backup up ouside of that cluster. The problem was that with ext4 after running a job certain disks always switched to readonly and this was a major hassle as this node had to be touched by hand. HDFS ist 3x replicated and checksummed and the disks usally worked fine for quite a time after the first bad sector. So we switched to ZFS, ran weekly scrubs - only replaced disks that didn't survived the srub in reasonable time or with reasonable failure rates and bumped up the HDFS checksum reads that everything is control read once a week. The working directory for the layer above (MapReduce and stuff like that) got a dataset with copies=2 so that intermediate data is still fine within reasonable amounts. This was for learning or research purposes where top speed or 100% integrity didn't matter and uptime and usability was more important. Basically the metadata on disk had to be sound and the data on a single disk didn't matter that much. This was quite a ride and it's long been replaced since then.

Just thought it's interesting how far you can push that. In the end it worked but turned out there is no magic, disks die sooner or later and sometimes take the whole node with them.

Don't go to ebay and buy broken disks out of believing with ZFS these will work. Some survive a while, most die fast, some exhibit strange behavoir.

That RAIDZ is more or less for "let's see where this goes" purposes, backups are in place it's not a production system.

comboy · on March 13, 2017

Hah, thanks for the story.

It seems that limited resources often lead to some interesting solutions (and learning new things). A factor that is not very common in VC backed companies.

Malic · on March 12, 2017

It's articles like this that re-enforce my disappointment that Apple is choosing to NOT implement checksums in their new file system, APFS.

https://news.ycombinator.com/item?id=11934457

JumpCrisscross · on March 13, 2017

Can someone explain why one would checksum metadata but not user data? Is the assumption everything's backed up on iCloud? If so, are system files checksummed?

cmurf · on March 13, 2017

Metadata writes can be considered atomic, so an integrated checksum is written at the time the metadata block is written, or overwritten. Whereas with data, you can't do overwrites of either data or checksum, it's not atomic and any kind of crash or powerfailure will result in mismatching data to checksum. So unless you have something really clever to work around this, you need a copy on write file system to do data checksums.

rsync · on March 13, 2017

iDevices don't have multi-TB disks, nor do they have arrays of disks.

Therefore this is of limited relevance to Apple.

dredmorbius · on March 12, 2017

"Data tends to corrupt. Absolute data tends to corrupt absolutely."

In both sense of the word.

Many moons ago, in one of my first professional assignments, I was tasked with what was, for the organisation, myself, and the provisioned equipment, a stupidly large data processing task. One of the problems encountered was a failure of a critical hard drive -- this on a system with no concept of a filesystem integrity check (think a particularly culpable damned operating system, and yes, I said that everything about this was stupid). The process of both tracking down, and then demonstrating convincingly to management (I said ...) the nature of the problem was infuriating.

And that was with hardware which was reliably and replicably bad. Transient data corruption ... because cosmic rays ... gets to be one of those particularly annoying failure modes.

Yes, checksums and redundancy, please.

swinglock · on March 12, 2017

If I were to run ZFS on my laptop with a single disk and copies=1, and a file becomes corrupted, can I recover it (partially)?

My assumption is the read will fail and the error logged but there is no redundancy so it will stay unreadable.

Will ZFS attempt to read the file again, in case the error is transient? If not, can I make ZFS retry reading? Can I "unlock" the file and read it even though it is corrupted, or get a copy of the file? If I restore the file from backup, can ZFS make sure the backup is good using the checksum it expects the file to have?

Single disk users seem to be unusual so it's not obvious how to do this, all documentation assumes a highly available installation rather than laptop, but I think there's value in ZFS even with a single disk - if only I understood exactly how it fails and how to scavenge for pieces when it does.

mrb · on March 13, 2017

It depends. Metadata is always redundant with 2 copies (even when using copies=1). So if the file's metadata is corrupted, yes ZFS will fully recover and rewrite a 2nd good copy of the metadata. But if the data is corrupted, then ZFS can do nothing to recover (you may be able to partially read the file, but the rest of it will return I/O errors.)

h2hn · on March 13, 2017

Ok, so zfs for single drive users doesn't fix single data corruption.

Definitely I'm going to use my solution so. All the next generation FS stuff is cool (btrfs also indeed) but for the simplest use case people just need safe data and fix it it the disk goes bad.

mrb · on March 13, 2017

Why don't you use copies=2 with a single disk?

mrb · on March 12, 2017

The exact same silent data corruption issues just happened to my 6 x 5TB ZFS FreeBSD fileserver. But unlike what the poster concluded, mine were caused by bad (ECC!) RAM. I kept meticulous notes, so here is my story...

I scrub on a weekly basis. One day ZFS started reporting silent errors on disk ada3, just 4kB:

    pool: tank
   state: ONLINE
  status: One or more devices has experienced an unrecoverable error.  An
          attempt was made to correct the error.  Applications are unaffected.
  action: Determine if the device needs to be replaced, and clear the errors
          using 'zpool clear' or replace the device with 'zpool replace'.
     see: http://illumos.org/msg/ZFS-8000-9P
    scan: scrub repaired 4K in 21h05m with 0 errors on Mon Aug 29 20:52:45 2016
  config:
          NAME        STATE     READ WRITE CKSUM
          tank        ONLINE       0     0     0
            raidz2-0  ONLINE       0     0     0
              ada3    ONLINE       0     0     2  <---
              ada4    ONLINE       0     0     0
              ada6    ONLINE       0     0     0
              ada1    ONLINE       0     0     0
              ada2    ONLINE       0     0     0
              ada5    ONLINE       0     0     0

I monitored the situation. But every week, subsequent scrubs would continue to find errors on ada3, and on more data (100-5000kB):

  2016-09-05: 1.7MB silently corrupted on ada3 (ST5000DM000-1FK178)
  2016-09-12: 5.2MB silently corrupted on ada3 (ST5000DM000-1FK178)
  2016-09-19: 300kB silently corrupted on ada3 (ST5000DM000-1FK178)
  2016-09-26: 1.8MB silently corrupted on ada3 (ST5000DM000-1FK178)
  2016-10-03: 3.1MB silently corrupted on ada3 (ST5000DM000-1FK178)
  2016-10-10: 84kB silently corrupted on ada3 (ST5000DM000-1FK178)
  2016-10-17: 204kB silently corrupted on ada3 (ST5000DM000-1FK178)
  2016-10-24: 388kB silently corrupted on ada3 (ST5000DM000-1FK178)
  2016-11-07: 3.9MB silently corrupted on ada3 (ST5000DM000-1FK178)

The next week. The server became unreachable during a scrub. I attempted to access the console over IPMI but it just showed a blank screen and was unresponsive. I rebooted it.

The next week the server again became unreachable during a scrub. I could access the console over IPMI but the network seemed non-working even though the link was up. I checked the IPMI event logs and saw multiple correctable memory ECC errors:

  Correctable Memory ECC @ DIMM1A(CPU1) - Asserted

The kernel logs reported muliple Machine Check Architecture errors:

  MCA: Bank 4, Status 0xdc00400080080813
  MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
  MCA: Vendor "AuthenticAMD", ID 0x100f80, APIC ID 0
  MCA: CPU 0 COR OVER BUSLG Source RD Memory
  MCA: Address 0x5462930
  MCA: Misc 0xe00c0f2b01000000

At this point I could not even reboot remotely the server via IPMI. Also, I theorized that in addition to correctable memory ECC errors, maybe the DIMM experienced uncorrectable/undetected ones that were really messing up the OS but also IPMI. So I physically removed the module in "DIMM1A", and the server has been working perfectly well since then.

The reason these memory errors always happened on ada3 is not because of a bad drive or bad cables, but likely due to the way FreeBSD allocates buffer memory to cache drive data: the data for ada3 was probably located right on defective physical memory page(s), and the kernel never moves that data around. So it's always ada3 data that seems corrupted.

PS: the really nice combinatorial property of raidz2 with 6 drives is that when silent corruption occurs, the kernel has 15 different ways to attempt to rebuild the data ("6 choose 4 = 15").

blablabloe · on March 13, 2017

When a double-parity error is detected, the operating system should halt. Maybe that didn't happen properly. Tripple parity errors may go undetected, but how likely is that.

I wonder what 'really' happened.

mrb · on March 14, 2017

I saw dozens of ECC errors in the IPMI log, and dozens of MCAs in dmesg. So the memory was failing so badly that it is probably there were 3+ bit errors.

bestham · on March 14, 2017

Is it not "5 choose 4" since one of the 6 is a "known bad" in terms of ZFS knowledge?

mrb · on March 14, 2017

ZFS does not know which disk(s) caused the silent data corruption. It first assumes only 1 disk is bad and makes 4 attempts to rebuild it. It assumes successively each one of the 4 data stripes is bad, and rebuilds it from the parity stripes. But if it still cannot rebuild the correct data it means 2 disks are bad, so it tries 15 combinations of data/parity stripes to rebuild the 4 data stripes (well... minus the combination of "4 data stripes" which is already known bad, so the 14 other combinations are tested).

RX14 · on March 12, 2017

I know for sure that btrfs scrub found 8 correctable errors on my home server filesystem last July. This is obviously great news for me. Contrary to a lot of people here I've personally found btrfs to be really stable (as long as you don't use raid5/6 though).

kev009 · on March 13, 2017

People grossly under-intuit the channel error rate of the SATA. At datacenter scale it's alarmingly high http://www.enterprisestorageforum.com/imagesvr_ce/8069/sas-s...

platosrepublic · on March 12, 2017

I'm not a database expert, but this seems like something I should worry about, at least a bit. Is this a problem if you store all your persistent data in a database like MySQL?

Mister_Snuggles · on March 12, 2017

If your database does ZFS-like checksumming on all of its data, including the structures that it uses to find the data that you put in, and has the ability to correct errors, then no.

Realistically though, I don't know if MySQL has this. You'd probably be better off using a filesystem that gives these kind of guarantees and running your database on that.

zzzcpan · on March 12, 2017

Every ACID compliant database is supposed to calculate checksums.

copperx · on March 12, 2017

Good to know.

blablabloe · on March 13, 2017

If you run on proper enterprise gear, the likelyhood of encountering an issue like this is very, very rare.

HP/Dell/SM servers are all ECC memory top to bottom, are properly wired (snark) and SANs have also ECC everywhere. Even on the disk caches.

And in this particular instance, it was basically some private server build being messed up.

pmarreck · on March 13, 2017

Of course, my ZFS NAS backup is sound until a file that got bitrotted on my non-ZFS computer is touched and then backed up to it :/

It's kind of (literally?) like immutability. If you allow even a little mutability, it ruins it.

I think all filesystems should be able to add error-correction data to ensure data integrity.

lasermike026 · on March 12, 2017

Shouldn't RAID 1,5,6 protect against data corruption because of disk errors?

g0xA52A2A · on March 12, 2017

If you have corruption in RAID 1 how do you know which copy is good?

On a slight tangent ZFS will checksum data and store that checksum in the block pointer (i.e. not with the data itself) so it can tell which of the copies is correct. The same extends to RAID 5 and RAID 6, although with RAID 6 you can intelligently work out which block might be bad. However that is assuming the block devices are returning consistent data and you are the one talking to the block devices. If the disks were sat behind a hardware RAID controller and the controller was the one you'd be hard pressed to identify the source of the data corruption. The checksumming in ZFS comes to the rescue here again.

I recommend checking out this video [1] from Bryan Cantrill. It's about Joyent's object store Manta but features a fair bit of ZFS history. Also it features the usual rant level that one can come to expect from a Bryan Cantrill talk which I quite enjoy. There are plenty of other videos available on ZFS.

[1] https://www.youtube.com/watch?v=79fvDDPaIoY

rcthompson · on March 12, 2017

Some disk errors, yes, but something as simple as a power failure can easily corrupt your data:

http://www.raid-recovery-guide.com/raid5-write-hole.aspx

https://blogs.oracle.com/bonwick/entry/raid_z

rajachan · on March 12, 2017

Hardware RAID does not suffer from the write-hole like MD-RAID does (thanks to on-board supercap-backed non-volatile memory).

I can't remember if it was merged upstream, but some folks from Facebook worked on a write-back cache for MD-RAID (4/5/6 personalities) in Linux which essentially closes the write-hole too. It allows one to stage dirty RAID stripe data in a non-volatile medium (NVDIMMs/flash) before submitting block requests to the underlying array. On recovery, the cache is scanned for dirty stripes, which are restored before actually using the P/Q parity to rebuild user data. I worked on something similar in a prior project where we cached dirty stripes in an NVDIMM, and also mirrored it on its controller-pair (in a dual-controller server architecture) using NTB. It was a fun project, when neither the PMEM nor the NTB driver subsystems were in the mainline kernel.

qb45 · on March 12, 2017

RAID journaling.

https://lwn.net/Articles/665299/

Haven't tried it but it seems to already be merged, at least write-through. They now work on implementing writeback and this IIRC isn't merged yet.

rcthompson · on March 12, 2017

Wow, good to know that progress has been made on this front.

Last I checked (maybe a year or 2 ago), I read that btrfs also suffered from the write hole. Is that still the case?

blablabloe · on March 13, 2017

The story here is not how Silent Data Corruption is real. The story is that somebody did a bad home brew server build and fucked up.

So ZFS protects against end-user mistakes.

I was really hoping about a story on some large-scale study on silent data corruption, but no, just an ankedote.

Sad!

:D

jgoerzen · on March 13, 2017

I didn't go into detail in the article, but the server in question is running a Supermicro X10SLH-F-O motherboard, ECC RAM, and a Haswell CPU, in a Rosewill RSV-L4411 4U chassis. Is there a hardware problem here? For sure. But you can't write this off as being some dusty overclock mess bought at someone's garage sale.

I have, incidentally, seen this in corporate environments on traditionally-engineered sever-class hardware as well. This is just a much more easily-discussed case.

cmurf · on March 14, 2017

What was the flaw in the setup that makes this not silent data corruption? That any other file system would not have caught the problem means corruption would have silently propagated to the application layer. And yet it didn't, because the corruption was detected. I fail to understand the point of this comment.

meesterdude · on March 12, 2017

interesting find! I wonder what would be a good safeguard to this. I feel like just backing up your data would offer something - but a file could silently change and become corrupted in the backup too.

ralfd · on March 12, 2017

Hm. You could make two backups and checksum each file and safe the checksum. Then you could compare regularly file contents with the initial checksum, if there is a mismatch copy from the other backup.

icebraining · on March 12, 2017

Git-annex is nice for this; the fsck command will check the file against a checksum and request a new copy from other node automatically if the check fails.

heinrich5991 · on March 12, 2017

You could use ZFS, then the file cannot silently become corrupted.

dpedu · on March 12, 2017

Yes it can. ZFS will only notice next time you scrub or read the sector.

db48x · on March 12, 2017

Sure, but your ZFS pool will have redundancy, and ZFS will know which block was corrupted. This lets it recover from the error.

ianhowson · on March 12, 2017

If the corruption occurred on disk, yes. If it occurred in memory then it will write multiple incorrect copies to disk.

X86BSD · on March 12, 2017

This is why ECC is important. Many many people poo poo the idea that it's needed. But by not having it you have left a single vital part of the data path unprotected. And ram and disk is cheap, losing your data is not. The risk simply isn't worth it to save literally a few dollars.

db48x · on March 15, 2017

That's no argument against ZFS, or backups, or any other form of redundancy. Only the insane would buy computers without ECC.

ksec · on March 13, 2017

Yes, it is VERY real. Because no one gives a damn. Most people ( consumers ) would just ignore that corrupted Jpeg.

I am in the minority group that gets very frustrated and paranoid when my Video or Photos gets corrupted.

Synology has Btrfs on some range of their NAS. But most of them are expensive.

I really want a Consumer NAS, or preferably even Time Capsule ( with two 2.5" HDD instead of one drive ) with built in ZFS and ECC Memory, by default weekly scrub drive. And alert you when there is problem.

And lastly, do any of those consumer Cloud Storage, OneDrive, DropBox, Amazon, iCloud have these protection in place? Because I would much rather Data Corruption be someone else problem then complexity at my end.

hawski · on March 12, 2017

That gives me more reasons to experiment with DragonFly BSD by building a NAS using HAMMER file system.

greenshackle · on March 12, 2017

Uh. My ZFS-backed BSD NAS also has the hostname 'alexandria'.

danjoc · on March 12, 2017

Closed source firmware on drives contain bugs that corrupt data. Are there any drives, available anywhere, that have open source firmware?

weitzj · on March 12, 2017

I guess this is what ZFS is for. Don't trust the underlying storage.

But I am interested as well about your question.

dandelion_lover · on March 12, 2017

There is a corresponding project: http://www.openssd-project.org

duskwuff · on March 13, 2017

Note that it's essentially a dead project. The "NEW!!!" platform mentioned on their homepage is from 2014.

qb45 · on March 12, 2017

Buggy SSDs certainly are a thing, mainly due to the complexity of FTL, but if you have a story of FW corrupting data on spinning rust that would be news to me.

_RPM · on March 12, 2017

Just like data black markets.

h2hn · on March 12, 2017

I started a really simple and effective project the last month to be able to fix from bitrot in linux(MacOs/Unix?). It's "almost done" just need more real testing and make the systemd service. I've been pretty busy the last weeks so I've only been able to improve the bitrot performance.

https://github.com/liloman/heal-bitrots

Unfortunatly, btrfs is not stable and zfs needs a "super computer" or at least as much GBs of ECC RAM as you can buy. This solution is designed to any machine and any FS.

tnorgaard · on March 12, 2017

Please stop spreading this misinformed statement. I assume you are referring to the ZFS ARC (Adaptive Replacement Cache). It works in much the same way as a regular Linux page cache. It does not take much more memory (if you disable prefetch) and will only use what is available/idle. We use Linux with ZFS on production systems with as low as 1GB memory. We stopped counting the times it has saved the day. :-)

ECC is a nice to have, but ZFS does not have special requirement over say a regular page cache. The only difference is that ZFS will discovery bit-flips instead of just ignoring them as ext4 or xfs would do.

3legcat · on March 13, 2017

> ECC is nice to have.

Actually it seems ECC is important for ZFS filesystems see:

http://louwrentius.com/please-use-zfs-with-ecc-memory.html

solarengineer · on March 13, 2017

To be clear, it is not ZFS that requires or even mandates ECC. Since ZFS uses data as present in memory and has checks for everything post that, it is prudent to have memory checks at the hardware level.

Thus, if one is using ZFS for data reliability, one ought to use ECC memory as well.

g0xA52A2A · on March 13, 2017

> Actually it seems ECC is important for ZFS filesystems see:

The inflection made by the previous comment tends to lead people to think ECC RAM is needed for ZFS specifically. As the blog post you link to points out it's equally applicable to all filesystems.

blablabloe · on March 13, 2017

It's not required, but it doesn't make sense to use ZFS but not to use ECC memory. That's the point. It's like locking the backdoor but leaving the front door wide open.

h2hn · on March 12, 2017

Interesting.

That's rigth the kind of hardware I was referring to, 1 GB of plain RAM. Truly, I haven't tested ZFS yet for that reason I've always read that ZFS has big requirements so I refrained to try it. It seems I should give it a try. ;)

Btrfs is another story I've used it for years and I'd prefer not to have to use it anymore untill it'll become "stable" and "performance". :)

Laforet · on March 13, 2017

FreeNAS != ZFS. The former is a specialised storage system that has to meet a very different set of criteria than a lightweight server with 1GB ram.

h2hn · on March 13, 2017

Is zfs able to repair from single data (copy) corruption?

My main issue is to be able to repair a "silent" data corruption on a single drive machine. Am I able to use x% of my "partition" to data repair or do I need to use other partition/drive to mirror/raid it?

If I understand right zfs can detect bitrot ("not really" a big deal) but without any local copy It can't self heal.

My use case is an arm A20 SoC (lime2) to storage local backups among other things, so I need something that detects and repairs silent data corruption at rest by itself (using a single drive).

A poor man NAS/server. ;)

mdtancsa · on March 13, 2017

Not sure if it will fit your needs or not, but for long term storage on single HDs (and back in the day on DVD), I would create par files with about 5-10% redundancy to guard against data loss due to bad sectors. http://parchive.sourceforge.net/ total drive failure of course means loss of data, but the odd bad sector or corrupted bit would be correctable on a single disk. This was very popular back in the binary UseNet days....

solarengineer · on March 13, 2017

You can create a nested ZFS file system and set the number of copies of the various blocks to be two or more. This will take more space, but there'll be multiple copies of the same block of data.

Ideally, though, please add an additional disk and set it up as a mirror.

ZFS can detect the silent data corruption during data access or during a zpool scrub (which can be run on a live production server). If there happen to be multiple copies, then ZFS can use one of the working copies to repair the corrupted copy.

h2hn · on March 13, 2017

Got it but not for my use case then cause I don't want to halve my storage capacity.

Anyway I will try to use it for my main PC which has several disks and continue to use my solution for single disk machines (laptop, vps, SoC...). :)

rleigh · on March 13, 2017

Note it won't necessarily halve the capacity. Selectively enable it for the datasets requiring it, and avoid the overhead with the rest.

bestham · on March 14, 2017

No but parity archives solves a different problem, with only some percent of wasted storage you can survive bit-errors in your dataset. It's like reed-solomon for files.

In order to achive the same with ZFS you have to run RAID-Z2 on sparse files.