This is great, there has been a demand for this since forever. Enterprise-y people generally didn't care much but the homelab/SMB users end up dealing with it a lot more than might be naively imagined.
Always reminds me of when NetApp used to do their arrays in RAID-4 because it made expansion super-fast: just add a new zeroed disk, and writes only had to update the new disk's blocks plus the parity drive. Used to blow our Netware admin's mind as almost nobody else ever used RAID-4 -- I had it as an interview question along with "what is virtual memory" because you'd get interesting answers :)
> the homelab/SMB users end up dealing with it a lot more than might be naively imagined
As a home server administrator, I've wanted this feature for so long. Before this, in order to expand an existing array I'd have to fail every single drive and replace them with new higher capacity ones.
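For reference, the old procedure was roughly this (a sketch; pool and device names are placeholders, and each replace has to finish resilvering before you start the next):

    zpool set autoexpand=on tank
    zpool replace tank old_disk new_bigger_disk   # repeat for every member of the vdev
    zpool online -e tank new_bigger_disk          # usually unnecessary once autoexpand is on

Only after the last disk has been replaced and resilvered does the extra capacity show up.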
The only question I have is whether it supports expansion with drives of different capacities.
> The only question I have is whether it supports expansion with drives of different capacities.
I’m also very interested in this - my main reason for sticking with btrfs is that I can use a variety of odd-sized drives, and expand it by adding a new oddly-sized drive...
Exactly. That way we can create a storage server out of any random drives we have lying around as well as slowly expand its capacity without having to rebuild the whole thing. I think this capability is vital for people who can't immediately spend thousands of dollars on equipment.
Is fragmentation a serious issue for you? COW filesystems in general aren't great for use-cases that rewrite blocks frequently (databases usually the poster child) but I've never had much problem with it for more general cases even when the free space fragmentation gets north of 70%. Then again most of the storage I care about performance on is NVME.
I can imagine a conceptual sort of super-scrub that rebalances a zpool and addresses all of that, but it's not on anyone's radar AFAIK.
Hmmmm, in a home-use setting I currently have a scenario where
a) random seek/many-small-files performance has been really bad since day 1; I initially suspected old/low-end hardware (i3, 1600MHz RAM), but given that I can do just south of 200MB/s (two-way mirror) I'm kinda staring at ZFS expectantly here
b) I've admittedly managed to net myself a fair few pathological way-too-many-files situations from projects and whatnot that I really do need to get to cleaning up
Fairly early on I noticed apt performance degraded pretty badly, and long before (b) became a substantial concern it got to the point where installing just about anything would take about 60 seconds to do the "Reading database ..." step.
I've been idly curious about tweaking different settings to try and improve performance, but it's mostly been an idle curiosity because I don't have a straightforward way to back out of "oh great now what" edge cases.
This has probably been going on for just around a year or two, and with absolutely no context I'd be confident saying write volume isn't a shadow of what you're doing :) so perhaps that particular tunable is... maybe not relevant? Or maybe it is. I'm curious.
That tunable is probably not an issue for you; the symptom there is painfully slow writes (even for bulk sequential writes).
A sibling mentioned making sure ashift is 12. I'll second that. In addition, make sure your ZFS partition is aligned; if you gave ZFS the whole disk it probably is. If you did not (e.g. because you needed an EFI boot partition) it might not be.
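If you want to double-check, something like this should tell you (a sketch; the device name is a placeholder):

    parted /dev/sdX unit B print   # the ZFS partition's start offset should divide evenly by 4096

If the start offset isn't a multiple of 4096 (ideally 1MiB), 4K writes from ZFS can straddle physical sectors and turn into read-modify-write on the drive.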
Lastly, for any given workload, ZFS seems to have roughly logistic performance curve with respect to the amount of RAM it has to work with. The ARC does a pretty good job of keeping important data in RAM to minimize seeks when there is "enough" RAM, but it does a progressively worse job as it gets RAM constrained. On a development machine where I'm dealing with multiple SVN and git checkouts on spinning metal the performance difference between 8GB and 12GB of RAM for ZFS is night and day. Good SSDs make this a lot less important because the penalty for a small number of read-misses is approximately zero compared to rotating drives.
ashift is definitely 12 (as I noted in my sibling reply)... but TIL about alignment (thanks). Parted says my ZFS partition starts at 8590983168 bytes (after an 8GB swap partition), which divides down by 4K cleanly. Is that what you mean?
Hmm, the RAM usage on this machine is generally low-ish, but with 8GB I suspect the smallest perturbations can make a big difference (even though I use it headlessly). I'll definitely keep more RAM in mind going forward, and yeah, SSD/NVME storage makes these kinds of considerations moot in high-performance contexts.
I honestly left the alignment vague because I couldn't remember what it was; I would believe it's 128k as that's the largest value ZFS ever uses, but I would also believe that 4k is fine.
Try running "zdb |grep ashift" and confirm that it's 12 (or even higher for SSDs). The default used to be 9 which killed IOPS on non-ancient HDDs that have 2^12 byte sectors and have to read-modify-write anything smaller than a sector.
“i3, 1600MHz RAM” sounds like a laptop. Are you doing anything funky like using USB HDD enclosures?
Also try comparing the number, size, and latency of IO operations submitted to ZFS vs the same stats for IO submitted to the disks with https://github.com/iovisor/bcc
Once you figure out what layer (application? VFS/cache? file system? IO elevator? HBA? disk firmware?) the performance drop is happening on, it should be trivial to fix.
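A rough sketch of that comparison with the stock bcc tools (install paths vary by distro; these are the usual locations):

    sudo /usr/share/bcc/tools/zfsdist 10 1      # latency histograms for ZFS-level read/write/open/fsync
    sudo /usr/share/bcc/tools/biolatency 10 1   # latency histogram for block IO actually hitting the disks
    sudo /usr/share/bcc/tools/biosnoop          # per-IO trace: process, device, size, latency

If the ZFS-level latencies are bad while the block-level ones look healthy, the time is being spent above the disks; if both are bad, look at the disks/HBA.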
One of the few things in the Debian (yes...) setup guide was emphasizing ashift, I remember explicitly setting it to 12.
It's not a laptop, it's a low-end motherboard currently serving as my primary workhorse :) (until I find the money to fix the issues preventing me from working... any day now... :'D). *Checks* It's an ASUS P8H61-M. And no, the disks are directly attached.
TIL ZFS can submit different IO sizes than what reaches the disks. I've just been dumbly staring at iotop and thinking that was the last word on the situation. Now to figure out how to get that info from ZFS (and figure out which bcc script to use). Thanks.
Thanks for the layer consideration. The application layer (an ncdu scan I'm currently doing has been reading 60 files/second for days) and VFS/cache layer (if I do two apt operations in relatively quick succession (seconds apart) with nothing else doing I/O, the second one completes the read step instantly) seem to be the effect/symptom. The file system (all ZFS, but obviously badly tuned) and IO elevator (oooooh, that's what that is, TIL. I might play with this! :D) seem the most interesting, while the HBA (onboard SATA3 port *hides*) and disk firmware (I've never upgraded a BIOS in case I irreparably break something lol) are beyond the horizon somewhat.
I don't understand why the title says "Goes Live"?
The code is here[0]. It still needs more testing and cleanup, and will then eventually be merged. After that it'll take some time to make it to all the distributions (FreeNAS, FreeBSD, etc.).
There is a very interesting podcast with Matt Ahrens, co-founder of the ZFS project, that covers RAID-Z expansion as well as the history of ZFS etc.
https://changelog.com/podcast/475
> But the interesting thing about this project is how did it come to be. So a long-requested feature - how did it get funded? So actually, it’s funded by the FreeBSD Foundation.
For those who like videos more than text, there is a YouTube video from last year [1] that explains the feature (unless it has changed since, but that doesn't seem to be the case).
One downside that I see of this approach, if I understand it correctly, is that the data already present on disk will not take advantage of the extra disk per slice. For example, if I have a raidz of 4 disks (so 25% of space "wasted") and add another disk, new data will be distributed over 5 disks (so 20% of space "wasted"), but the old data will keep using stripes of 4 blocks; they will just be reshuffled between the disks. Do I understand it correctly?
That is correct. There are planned solutions for rewriting all your data, though they won't play nicely with one of ZFS's other most important features - Snapshots. The current plan is basically to have a nice userspace utility that will rewrite all your data in-place, but that will cause you to rereplicate everything over your snapshots.
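In the meantime, the usual manual workaround (not the planned utility, just a sketch; dataset names are placeholders, and you need enough free space for a second copy) is to rewrite the data yourself with send/receive:

    zfs snapshot tank/data@move
    zfs send tank/data@move | zfs receive tank/data_new
    # verify, then destroy tank/data and zfs rename tank/data_new into its place

Anything rewritten this way gets the new stripe width, but it also gives up the space sharing with your existing snapshots, which is exactly the tradeoff described above.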
Thanks for your answer. The point about the snapshots is a very good one, for some reasons I didn't think about it.
Rewriting data in place can be tricky, if you have old enough snapshots the newly added space may not be enough. I hope they will find a good enough solution.
This is great, but an important and little-known caveat is that raidz is limited to the IOPS of one disk. So a grown raidz will at some point have lots of throughput but still suffer on small and random reads and writes. At that point, it will be better to grow the pool with additional, separate raidz vdevs.
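Growing the pool that way is just adding another top-level vdev (a sketch; pool and device names are placeholders, and note it's effectively permanent since raidz top-level vdevs can't currently be removed):

    zpool add tank raidz2 sde sdf sdg sdh

New writes then get striped across both raidz vdevs, so you add roughly one vdev's worth of IOPS for every vdev you add.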
To clarify, the sequential read bandwidth of a RAIDZN with M total disks will be M times the read bandwidth of a single disk. The sequential write bandwidth will be (M-N) times the write bandwidth of a single disk. These are optimal values. https://calomel.org/zfs_raid_speed_capacity.html
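To put numbers on that formula: a 6-disk RAIDZ2 (M=6, N=2) would top out at roughly 6x a single disk's sequential read bandwidth and roughly 4x its sequential write bandwidth, in the best case.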
It's a constraint on iops in particular, not bandwidth, because writing a record to disk involves writing something to every disk in the pool. So if you have a disk that's taking a long time to write, the other disks need to pause and let it catch up.
To be fair, I don't think this is little-known. When a write must be acknowledged by multiple devices (as is the case with RAID1 or RAIDZ), a write requires more IOPS.
For writes, the system is copy-on-write, so for each write, there is a full read of the containing record, often across all disks, so the updated record can be written out in the new location on the disks without disrupting the old record, which may still be referenced by snapshots, etc.
Adding additional disks gets you more throughput (you can split a single large write across more disks), but not more IOPS.
Okay, the follow-up post really explains it. It's yet another consequence of ZFS's inability to move blocks. So you more or less can't have multiple writes get lumped together because that would break garbage collection.
There are ways to work around elements of this and get much better performance, but not wanting more complexity makes enough sense.
Definitely a flaw in the ZFS data model though, rather than something inherent to the use of multiple disks.
It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ. The real world is a cruel place and you get nothing for free: in order to have distributed parity and consistency you have to suffer the lowest-common-denominator IOPS penalty.
That is just maths.
If you want to use ZFS and have more IOPS you will have to have more VDEVs (either more RAIDZs or several mirrors). Your storage efficiency will be slightly reduced (RAIDZ) or tank to 50% (mirrors) with less redundancy.
But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.
> It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ.
You could have both! That's not the tradeoff here. The problem is that if you wrote 4 independent pieces of data at the same time, sharing parity, then if you deleted some of them you wouldn't be able to recover any disk space.
> That is just maths.
I don't think so. What's your calculation here?
The math says you need to do N writes at a time. It doesn't say you need to turn 1 write into N writes.
If my block size is 128KB, then splitting that into 4+1 32KB pieces will mean I have the same IOPS as a single disk.
If my block size is 128KB, then doing 4 writes at once, 4+1 128KB pieces, means I could have much more IOPS than a single disk.
And nothing about that causes a write hole. Handle the metadata the same way.
ZFS can't do that, but a filesystem could safely do it.
> But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.
The flaw I'm talking about is that Block Pointer Rewrite™ never got added. Which prevents a lot of use cases. It has nothing to do with preserving consistency (except that more code means more bugs).
I am a beginner when it comes to ZFS, but isn’t “no moving data” as an axiom a good choice? Any error that happens during the move would destroy that data - while without moving it, the data will likely be recoverable even with a dead hard drive.
> I am a beginner when it comes to ZFS, but isn’t “no moving data” as an axiom a good choice?
It's a reasonable choice, but only because it makes certain kinds of bugs harder, not because it's safer when the code is correct.
> Any error that happens during the move would destroy that data - while without moving it, the data will likely be recoverable even with a dead hard drive.
That's not true. You make the new copy, then update every reference to the new copy, and only then remove the old one. If there's an error halfway through then there's two copies of the data.
100 data writes, assuming they're to different files and/or large enough to fill a slice, means 500 disk writes (5 disk writes per slice).
The 500 disk writes are parallelized over 5 disks so they only take the time taken for 100 writes (500 / 5).
So the IOPs is the same as a single disk.
HOWEVER, bandwidth is 4x. The above assumes that seek time dominates. If your writes are multiple slices in length, they will be written 4x faster because the amount of data per disk is divided across the disks. If you're reading and writing large contiguous files, then you do get a big I/O boost from raidz.
The slices depend on record size and sector size. For 4 data disks and default 128k record you have 32k per disk per record. With more disks, the proportion of a slice per disk decreases, not increases, and it's rounded up to the sector size, so there's usually some loss of space on larger parity schemas.
Each write is striped across all of the disks. For a stripe to be completely written, all disks must finish writing the data. Thus your max write IOPS equals that of the slowest drive in the vdev.
It's reasonable to expect that issuing a batch of 100 writes at the application layer followed by a fsync would not always require doing 100 writes to each underlying block device. The OS/FS should be able to combine writes when the IO pattern allows for it, and should be doing some buffering prior to the fsync in hopes of assembling full-stripe writes out of smaller application-layer writes.
vdev is logical. Purpose: Disk grouping and redundancy. Composition: One or more disks.
zpool is logical. Purpose: Higher-level management of one or more vdevs. Composition: It acts like a JBOD.
---
zpools can be thought of as "stripes of vdevs". This, in the narrow sense that the failure of any vdev in a zpool is a permanent loss of the entire zpool. All your redundancy in the ZFS ecosystem is via mirrored or RAID'ed vdevs.
---
The setup I have heard of that balances performance, redundancy and space is to do what you say: Have a zpool of multiple mirror-type vdevs.
You can also stripe at the vdev level, which I would assume has higher performance than having multiple single-disk vdevs in a pool - I'm unaware of the differences at a low level.
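A minimal sketch of that layout (pool and device names are placeholders):

    zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf

Each mirror is its own vdev, writes are striped across the three of them, and you can later grow the pool two disks at a time with something like "zpool add tank mirror sdg sdh".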
That's a common configuration for larger arrays where you probably don't want to make individual RAIDZ arrays too wide on their own. It's functionally similar to RAID 50 or 60. However when the budget is tight you might be incentivized to compromise on performance or reliability, so this new expansion feature really does help in hobbyist or shoestring budget situations where you just want one huge array to maximize usable space while still being able to tolerate some failures. Typically you would buy disks as you go along to try to stretch the budget further as utilization increases.
I certainly wanted this. I even heckled Bill Moore about it. Having gone through the expansion the old way (replace each drive one at a time with a larger one), this looks a lot simpler. Unfortunately it appears to not work with simple mirror and stripes (~ RAID10) so it will make no difference for me. (Drives are cheap but performance is not -> RAID10).
I love ZFS but this is something that just works in btrfs; mirror just means all blocks live in two physical locations. You certainly can do that even with an odd number of drives. However ZFS is more rigid and doesn’t allow flowing blocks like this, nor dynamic defragmentation.
What are you on about? The submission is about raidz.
Adding drives to a mirror has worked in zfs since prehistoric times. “zpool attach test_pool sda sdc” will mirror sda to sdc. If sda was already mirrored with sdb, you now have a triple-mirror with sda, sdb, and sdc.
I am aware of course. I was explaining why adding a single drive does make sense for RAID10 just fine and, as an example, btrfs has it (existence proof), but ZFS doesn't support it.
Your example doesn't expand the storage of the vdev, which is what this entire discussion is about, it merely adds a mirror.
You can definitely expand and shrink a zpool based on mirrors. You can remove top-level vdevs, provided the pool is fine, and expand existing pools by adding new mirrors.
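Roughly like this (a sketch; pool/device/vdev names are placeholders, and top-level removal only works when the pool has no raidz vdevs and has room to absorb the evacuated data):

    zpool add tank mirror sdg sdh   # grow the pool by another mirror vdev
    zpool remove tank mirror-1      # evacuate and remove a top-level mirror vdev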
You can now with this feature. If you're using RAID1 vdevs, you could always remove and add mirrors. Now you can at least add a device to a RAIDZn vdev.
> If you're using RAID1 vdevs, you could always remove and add mirrors.
This is still answering the wrong question, unless by "add mirrors" you mean "add another drive to a mirrored vdev and get more usable capacity", which I don't think is how ZFS works and would be strange to summarize as "add mirrors".
I'm trying to discuss an idea you don't seem to have any terminology for. Can you please try to meet me halfway and at least respond in a manner that makes it clear you understand there's a real distinction to be made?
You can now add devices to a vdev in ZFS with this feature, giving you the ability to resize a vdev, but only upwards; shrinking isn't possible yet.
In what situation would increasing the size of a vdev be better than increasing the size of a pool? The submission increases the size of a raidz vdev and that's incredibly useful, but when would that make sense for a mirror?
> In what situation would increasing the size of a vdev be better than increasing the size of a pool?
When you want to add a single drive to expand your available storage capacity, without reducing the degree of redundancy you're already using. That use case should have been obvious given how much discussion it's already received in this thread, so I think you have some kind of blind spot about anything small-scale. This is why it's good to actually understand the competition, even after deciding which tradeoffs are right for you.
> If sda was already mirrored with sdb, you now have a triple-mirror with sda, sdb, and sdc.
You probably already know this, but btrfs offers a different option. Its "RAID1" mode would not result in the three drives storing your data in triplicate, but rather still stores data in duplicate with a usable capacity that's 1.5x that of a single drive's capacity. To get the ZFS behavior you describe, btrfs offers a "RAID1c3" mode, but that's a much less interesting feature.
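A sketch of the difference at creation time (device names are placeholders; the raid1c3/raid1c4 profiles need a reasonably recent kernel, 5.5 or later):

    mkfs.btrfs -d raid1 -m raid1 /dev/sdX /dev/sdY /dev/sdZ       # two copies of data: ~1.5 drives of usable space on three equal drives
    mkfs.btrfs -d raid1c3 -m raid1c3 /dev/sdX /dev/sdY /dev/sdZ   # three copies of data: ~1 drive of usable space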
It's a neat feature for SOHO-type users, and takes advantage of the fact that the RAID layer knows about the FS. The first implementation that I know of is Drobo's filesystem.
Asymmetric RAID is probably the only config that I'd like to see in ZFS - being able to have (say) an 8TB drive and two 4TB drives, and have 8TB of mirrored space is quite nice when you're just buying commodity kit and don't want to have to retire/match drives all the time when growing mirrors.
Yep, SOHO use cases are where this kind of capability is really useful. ZFS is great when you buy drives by the dozen, but even with this feature it's still inconveniently inflexible if you only have the budget to be upgrading or expanding one drive at a time.
Ah man, this is perfect timing. I'm literally in the midst of building a new NAS and intending to use TrueNAS + RAIDZ2. Purposely over-speccing potential drive bay capacity so I can expand later on. I was worried how hard it will be to expand later, but it looks like it may end up being more possible than I thought! Sweeet :)
Lots of people have wanted this for ages. I managed to cope with spindle replace and resize into new space (larger spindles) but being able to add more discrete devices and get more parity coverage and more space (I may be incorrectly assuming you get better redundancy as well) is great.
It is an experimental feature on FreeBSD 14-CURRENT. It will be merged into OpenZFS eventually (and maybe backported to FreeBSD 13-STABLE and whatever new point releases happen).
Nah, ZFS is a pretty comfy FS, it has lots of nice features, it is reasonably fast, and it is stable. And as far as I know, it has been used for a fairly long time.
They "lost everything" because they failed to set up even the most basic and critical maintenance functions built into ZFS. I don't think the responsibility in this incident could fall any more squarely on the shoulders of the people who set it up.
I think there's a case to be made that a NAS-oriented OS/distro should probably default to having a scheduled scrub, not wait for you to set one up. I don't follow LTT, but a quick look around says he was using TrueNAS for at least one of the builds?
You do need a notification feature as well, which is tricky because a distro can't assume you'll set up mail properly, but maybe something like making Samba go read-only when the zpool is degraded could work.
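A minimal sketch of the kind of defaults I mean (file paths are what I'd expect on a typical Linux install; adjust to taste):

    # /etc/cron.d/zfs-scrub -- scrub weekly
    0 3 * * 0  root  /usr/sbin/zpool scrub tank

    # /etc/zfs/zed.d/zed.rc -- have the ZFS event daemon mail on errors and degraded vdevs
    ZED_EMAIL_ADDR="admin@example.com"

Some distros already ship a cron job or systemd timer for the scrub part; the notification part is what's usually left to the user.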
Not only that. A vdev of 15 disks, each around 16 (or 18?) TiB, in RAID-Z2 is just asking for trouble (https://www.zdnet.com/article/why-raid-6-stops-working-in-20...). Even with 1% annual disk failures (based on Backblaze, for the better disks), we are looking at around a 1% chance of at least two of the 15 disks failing at the same time. With 10 such vdevs, that is about a 10% chance of any of them having such a failure.
They should really move to RAID-Z3 or 8-disk groups. With RAID-Z3, we are looking at around a 0.04% chance of at least three of the 15 disks failing at the same time. With 8-disk groups, we are looking at around a 0.2% chance.
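For anyone checking the arithmetic: with an independent annual failure probability p per disk, the chance of at least k out of n disks failing in the same window is roughly C(n,k) * p^k, so for 2 of 15 at p = 1% that's 105 * 0.0001, about 1%. That ignores rebuild windows and correlated failures (same batch, same backplane, same PSU), which only make the real number worse.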
I've read this article and others like it, and what I don't get is: if the URE rates are really that high, why have I never seen a URE in the bi-weekly scrubs of my pool (184 TB of raw disk), aside from when a single disk was literally going bad?
There are many contributing factors. Power supplies degrade over time, and their ability to maintain correct voltages can be impaired in worst-case scenarios, like running all disks flat out. Drives (and even more so older drives) can be vibration sensitive, especially in worst-case scenarios involving all drives running flat out. With modern manufacturing tolerances being so tight, whatever triggered the first disk in your pool to die is likely to get a second one within a fairly small window.
So sure, the failures are not equally distributed and you've done well with your 184TB pool.
Are you tracking the device errors, or only those that are visible to the OS? Multiple disk failures are not particularly uncommon, though. My experience in this space was two 16-disk servers on which I set up six 5-disk RAID5s (with one global spare per server). Within one month I had 11 of 32 disks die, and barely managed not to lose any user files, and this was not during the first month in production.
Scary. I've since moved to pairs of servers cross connected to pairs of 60 disk chassis (16 x 12 gbit connections per chassis) with ten 11-disk RAIDz3, 10 global spares, and 6x3.2TB of NVMe cache per server.
Oh there are all kinds of reasons drives can cause errors, and you have the bathtub curve. So there's lots to take into account when designing your pool.
But the article is using the spec-sheet URE rate, which I'd assume looks only at the drive itself and doesn't take into account problems with the computer around the drive, or the end-of-life period after the drive warranty has expired; I'd assume it's the "baseline" error rate.
> Are you tracking the device errors, or only those that are visible to the OS?
If we're talking URE like the article, that's data-loss on a disk, and the OS would always figure it out, since it would cause a ZFS checksum failure on scrub.
In this case it's not my data and not my money, so my preference is 6-drive RAIDZ2 vdevs. We've only had one disk with errors (and that one was migrated from a PC where Windows never reported any errors... of course...). The oldest 2 disks (3.5 years power-on time) have single-digit reallocated sectors in SMART so those are on course to be replaced.
I'm just curious since the argument in the article doesn't add up in my eyes.
> Within one month I had 11 of 32 disks die, and barely managed not to lose any user files, and this was not during the 1st month in production
They even admit this in the video -- "No one is to blame here except us" which means "Holy shit! We really messed this up."
Yes, they should have been scrubbing their pools, but I don't think this was bit rot. Millions of data errors is not what bit rot looks like. This looks exactly like bad hardware.
ServeTheHome actually reported that the hard drives LTT are using had a higher failure rate than others, even from a consumer hardware perspective. And they were supposed to be enterprise drives.
It is just bad hardware, bad setup, and bad everything all crammed together.
I've never seen a HDD fail like that, but I'm willing to believe it's possible. I have seen read/write errors because of a buggy implementation of ALPM.
But, agreed, whatever it was, if they were paying attention, they could have diagnosed and remediated well before they had any data loss.
Btrfs lets you keep metadata at a stronger bitrot-protection level fairly easily, and as long as that is still OK, you'll only lose the data blocks that rotted, so at LTT's scale just a few seconds of footage in total.
Even if that’s true, btrfs would be entirely impractical for them because btrfs-raid6 is broken. Their petabyte server would become a 1/3 petabyte server without raidz2.
But that's more because no one dared to try and actually fix Btrfs parity raid.
Misaligned incentives causing lack of funding for something as flexible and generally resilient as Btrfs.
Because there are some write holes around power failures (IIRC) and, more importantly, restore basically doesn't work (the code just doesn't exist in a working form).
This isn't easy code to write, and it's Linux-kernel C, far from "comfy" code to write. I assume it's that kind of effect that continues to keep all competent people from fixing Btrfs's RAID 5/6 code. (If I could gather the motivation, I could probably do it, but it's far from fun and I'm already short on motivation to code... it needs to be someone who likes that work, ideally sponsored so they don't lose money from working on the project.)
BTRFS stores duplicate copies of the metadata. Good for BTRFS. I don't understand how this would have prevented what LTT experienced.
First, as someone else said -- no one would use a BTRFS raid5/6 array in production. No one. Second, I don't think what LTT experienced was bit rot. There were far, far too many data errors for that be simple bit rot. My guess is bad cable, bad HBA which maybe doesn't support something like ASPM, etc. Third, are we even sure the problem was corrupted metadata? Fourth, ZFS keeps redundant copies of metadata too (as do many filesystems.)
That you could have dialed up the redundancy for metadata to 3x or even 4x copies being stored, and just because a block got bitrot doesn't mean other blocks fail to read.
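For an existing filesystem that's a straightforward online conversion (a sketch; the mount point is a placeholder, and raid1c3/raid1c4 need kernel 5.5 or later):

    btrfs balance start -mconvert=raid1c3 /mnt/pool

It only rewrites the metadata block groups, so it's quick compared to converting the data.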
I know no one would use Btrfs raid5/6 in production, but that's because no one wanted to enough to get the required development done.
If it was just file blocks that got corrupted, there would be just very minor damage in total and a scrub that logs all damaged blocks typically proceeds at sequential read speeds from what I've observed in the past.
ZFS already stores two or three redundant copies of metadata (ditto blocks), regardless of the copies parameter or vdev redundancy. The failures LTT experienced went way beyond what ZFS, or btrfs, could repair.
They made severe errors in judgment when setting up the system and failing to monitor and administer it. No system will ever protect against this. Not ZFS. Not btrfs.
Since the Linux crowd moved in, ZFS development seems to have gone from stability to feature, feature, feature. I'm starting to get a bit concerned that this isn't going to end well. I really hope I'm wrong.
Nothing really changed in development; the ZFSonLinux team was actually one of the more conservative in terms of data safety. What changed is that a bunch of things that had long been in the works happened to reach maturity around the same time.
If you want "feature chasing", FreeBSD's ZFS TRIM is the ur-example. I've read that code end to end... and I'll leave it at that.
The ZoL/OpenZFS-implemented encryption features shipped with a critical flaw that could permanently corrupt the receiving dataset on a subsequent raw send. The same bug also hit raw send to a pool with a different ashift, and was also only recently fixed.
I'm sorry to say it, but this article is not entirely true: the illustration "how does traditional raid 4/5/6 do it?" shows ONLY RAID 4. There is a big difference between RAID 4 and RAID 5/6, and the former was abandoned years (decades?) ago in favor of RAID 5 and, later, 6.
Of course, it gives "better publicity" for RAID-Z, but that is a marketing trick rather than engineering.
Note that the article talks about the way the array is expanded, not how the specific level works.
In other words, what they are saying is that the traditional way to expand an array is essentially to rewrite the whole array from scratch, so if the old array has three stripes, with blocks [1,2,3,p1] [4,5,6,p2] and [7,8,9,p3] (with p1, p2 and p3 being the parity blocks), the new array will have stripes [1,2,3,4,p1'], [5,6,7,8,p2'] and [9,x,x,x,p3'], i.e. it not only has to move the blocks around, but also recompute essentially all the parity blocks.
IF I understand the ZFS approach correctly, the existing blocks are not restructured but only reshuffled, so the new layout will be logically still
[1,2,3,p1] [4,5,6,p2] and [7,8,9,p3] but distributed on five disks so
[1,2,3,p1,4] [5,6,p2,7,8], [9,p3,x,x,x]
It seems that this means less work while expanding, but some space lost unless one manually copies old data in a new place.
IF I got it right, I am not sure who the intended audience for this feature is: enterprise users will probably not use it, and power users would probably benefit from getting all the space they could get from the extra disk
Power users would like to get all the space, but when the choice is either to buy just one HDD and get some extra space, or to buy 4+ HDDs to replace the old array with a completely new one and then be left with the unused old one, most would pick the first option.