Shouldn't RAID 1, 5, and 6 protect against data corruption caused by disk errors?



If you have corruption in RAID 1, how do you know which copy is good?

On a slight tangent: ZFS checksums data and stores that checksum in the block pointer (i.e. not with the data itself), so it can tell which of the copies is correct. The same extends to RAID 5 and RAID 6 style layouts, and with RAID 6 you can even work out intelligently which block might be bad. However, that assumes the block devices are returning consistent data and that you are the one talking to them. If the disks sat behind a hardware RAID controller and the controller was the one talking to them, you'd be hard pressed to identify the source of the data corruption. The checksumming in ZFS comes to the rescue here again.
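To make that concrete, here's a minimal Python sketch (my own illustration, not ZFS code) of what storing the checksum in the parent buys you on a two-way mirror: the read path can tell which leg is good and heal the other one.

    # Minimal sketch of "checksum in the parent" self-healing on a
    # two-way mirror: the block pointer carries the expected checksum,
    # so a read can tell which copy is good and rewrite the bad one.
    import hashlib

    def checksum(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def mirror_read(block_ptr_checksum: bytes, copies: list) -> bytes:
        """copies is a list of mutable bytearrays, one per mirror leg."""
        good = None
        bad_legs = []
        for i, copy in enumerate(copies):
            if checksum(bytes(copy)) == block_ptr_checksum:
                good = bytes(copy)
            else:
                bad_legs.append(i)
        if good is None:
            raise IOError("no copy matches the block pointer checksum")
        # Self-heal: rewrite any leg that failed verification.
        for i in bad_legs:
            copies[i][:] = good
        return good

    # Example: leg 1 is silently corrupted; the read still returns good
    # data and repairs the bad leg in place.
    data = b"important block"
    legs = [bytearray(data), bytearray(b"bit-rotted junk")]
    assert mirror_read(checksum(data), legs) == data
    assert bytes(legs[1]) == data

A plain RAID 1 mirror has no such external checksum, so when the two legs disagree it can only guess.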

I recommend checking out this video [1] from Bryan Cantrill. It's about Joyent's object store Manta but covers a fair bit of ZFS history. It also features the usual rant level one has come to expect from a Bryan Cantrill talk, which I quite enjoy. There are plenty of other videos available on ZFS.

[1] https://www.youtube.com/watch?v=79fvDDPaIoY


Some disk errors, yes, but something as simple as a power failure can easily corrupt your data:

http://www.raid-recovery-guide.com/raid5-write-hole.aspx

https://blogs.oracle.com/bonwick/entry/raid_z
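For anyone who hasn't seen the write hole spelled out, here's a toy Python model of a 3-disk RAID 5 stripe (my own illustration, not from the linked articles). A crash between the data write and the parity update leaves stale parity on disk, and a later disk failure then reconstructs a block that was never even touched incorrectly.

    # Toy 3-disk RAID 5 stripe (2 data blocks + XOR parity) showing the
    # write hole: a crash between the data write and the parity update
    # leaves the stripe inconsistent, and a later disk loss reconstructs
    # the *untouched* block incorrectly.

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    d0 = b"AAAA"          # data block on disk 0
    d1 = b"BBBB"          # data block on disk 1
    parity = xor(d0, d1)  # parity block on disk 2

    # Update d0, but "crash" before the matching parity write lands.
    d0 = b"CCCC"
    # parity = xor(d0, d1)   <- never happens: power failure

    # Later, disk 1 dies. RAID reconstructs d1 from d0 and stale parity.
    reconstructed_d1 = xor(d0, parity)
    assert reconstructed_d1 != b"BBBB"  # silent corruption of d1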


Hardware RAID does not suffer from the write-hole like MD-RAID does (thanks to on-board supercap-backed non-volatile memory).

I can't remember if it was merged upstream, but some folks from Facebook worked on a write-back cache for MD-RAID (the 4/5/6 personalities) in Linux which essentially closes the write hole too. It allows one to stage dirty RAID stripe data in a non-volatile medium (NVDIMMs/flash) before submitting block requests to the underlying array. On recovery, the cache is scanned for dirty stripes, which are restored before actually using the P/Q parity to rebuild user data.

I worked on something similar in a prior project where we cached dirty stripes in an NVDIMM and also mirrored them to the partner controller (in a dual-controller server architecture) over NTB. It was a fun project, back when neither the PMEM nor the NTB driver subsystems were in the mainline kernel.
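Roughly, the idea looks like this (a simplified Python sketch of the general journal-then-write pattern, not the actual md or Facebook code; all names are made up):

    # Simplified sketch: the whole stripe update (data + parity) is made
    # durable in a journal before any block hits the array, so a crash
    # mid-stripe can be repaired by replaying the journal instead of
    # trusting possibly-stale parity.

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    journal = []  # stands in for the NVDIMM/flash journal device
    array = {"d0": b"AAAA", "d1": b"BBBB"}
    array["p"] = xor(array["d0"], array["d1"])

    def journaled_write(new_d0: bytes):
        new_parity = xor(new_d0, array["d1"])
        # 1. Persist the whole stripe update to the journal first.
        journal.append({"d0": new_d0, "p": new_parity})
        # 2. Only then issue the (interruptible) writes to the array.
        array["d0"] = new_d0
        # -- a crash here leaves the on-disk stripe inconsistent --
        array["p"] = new_parity
        # 3. Once both writes complete, the journal entry is discarded.
        journal.pop()

    def recover():
        # On boot, replay any dirty stripes still sitting in the journal,
        # then resume normal operation.
        for entry in journal:
            array["d0"], array["p"] = entry["d0"], entry["p"]
        journal.clear()

After a crash anywhere inside the staged write, replaying the journal leaves data and parity consistent with each other, which is exactly what closes the write hole.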


RAID journaling.

https://lwn.net/Articles/665299/

Haven't tried it, but it seems to already be merged, at least the write-through mode. They are now working on write-back, which IIRC isn't merged yet.


Wow, good to know that progress has been made on this front.

Last I checked (maybe a year or two ago), btrfs also suffered from the write hole. Is that still the case?



