This is great; there has been demand for this since forever. Enterprise-y people generally didn't care much but the homelab/SMB users end up dealing with it a lot more than might be naively imagined.
Always reminds me of when NetApp used to do their arrays in RAID-4 because it made expansion super-fast: just add a new zeroed disk, and writes only had to touch the new disk's blocks plus the parity drive. Used to blow our Netware admin's mind, as almost nobody else ever used RAID-4 -- I had it as an interview question along with "what is virtual memory" because you'd get interesting answers :)
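(For anyone wondering why a pre-zeroed disk can just slip into the stripe: RAID-4 parity is the XOR of the data blocks, and XORing in zeros changes nothing, so the existing parity stays valid. A toy illustration with made-up byte values:)

    $ echo $(( 0xA7 ^ 0x3C ))        # parity of two existing data blocks
    155
    $ echo $(( 0xA7 ^ 0x3C ^ 0x00 )) # XOR in a zeroed "disk": parity is unchanged
    155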
> the homelab/SMB users end up dealing with it a lot more than might be naively imagined
As a home server administrator, I've wanted this feature for so long. Before this, in order to expand an existing array I'd have to fail every single drive in turn and replace it with a new, higher-capacity one.
The only question I have is whether it supports expansion with drives of different capacities.
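(For context, my understanding is that the expansion works by attaching one disk at a time to an existing raidz vdev; the pool and device names below are made up:)

    $ zpool attach tank raidz1-0 /dev/sdX
    $ zpool status tank    # shows the expansion progress, then the extra capacity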
> The only question I have is whether it supports expansion with drives of different capacities.
I’m also very interested in this - my main reason for sticking with btrfs is that I can use a variety of odd-sized drives, and expand it by adding a new oddly-sized drive...
Exactly. That way we can create a storage server out of any random drives we have lying around as well as slowly expand its capacity without having to rebuild the whole thing. I think this capability is vital for people who can't immediately spend thousands of dollars on equipment.
Is fragmentation a serious issue for you? COW filesystems in general aren't great for use-cases that rewrite blocks frequently (databases usually the poster child) but I've never had much problem with it for more general cases even when the free space fragmentation gets north of 70%. Then again most of the storage I care about performance on is NVME.
I can imagine a conceptual sort of super-scrub that rebalances a zpool and addresses all of that, but AFAIK it's not on anyone's radar.
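(If anyone wants to check their own pool, ZFS reports free-space fragmentation in plain zpool list output; "tank" is a placeholder name:)

    $ zpool list -v tank    # the FRAG column is free-space fragmentation, per vdev with -v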
Hmmmm, in a home-use setting I currently have a scenario where
a) random seek/many-small-files performance has been really bad since day 1. I initially suspected old/low-end hardware (i3, 1600MHz RAM), but given that I can do just south of 200MB/s (two-way mirror) I'm kinda staring at ZFS expectantly here
b) I've admittedly managed to land myself in a fair few pathological way-too-many-files situations from projects and whatnot, which I really do need to get around to cleaning up
Fairly early on I noticed apt performance degraded pretty badly, and long before (b) became a substantial concern it got to the point where installing just about anything would take about 60 seconds to do the "Reading database ..." step.
I've been idly curious about tweaking different settings to try and improve performance, but it's mostly been an idle curiosity because I don't have a straightforward way to back out of "oh great now what" edge cases.
This has probably been going on for just around a year or two, and even with absolutely no context I'd be confident saying my write volume is nowhere near what you're doing :) so perhaps that particular tunable is... maybe not relevant? Or maybe it is. I'm curious.
That tunable is probably not an issue for you; the symptom there is painfully slow writes (even for bulk sequential writes).
A sibling mentioned making sure ashift is 12. I'll second that. In addition, make sure your ZFS partition is aligned; if you gave ZFS the whole disk it probably is. If you did not (e.g. because you needed an EFI boot partition) it might not be.
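(A quick way to sanity-check alignment, assuming the disk is /dev/sda: the Start of the ZFS partition should divide evenly by the physical sector size, ideally by 1MiB:)

    $ parted /dev/sda unit B print                     # check the Start column of the ZFS partition
    $ cat /sys/block/sda/queue/physical_block_size     # usually 4096 on modern HDDs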
Lastly, for any given workload, ZFS seems to have roughly logistic performance curve with respect to the amount of RAM it has to work with. The ARC does a pretty good job of keeping important data in RAM to minimize seeks when there is "enough" RAM, but it does a progressively worse job as it gets RAM constrained. On a development machine where I'm dealing with multiple SVN and git checkouts on spinning metal the performance difference between 8GB and 12GB of RAM for ZFS is night and day. Good SSDs make this a lot less important because the penalty for a small number of read-misses is approximately zero compared to rotating drives.
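(One way to see whether the ARC is actually RAM-starved is OpenZFS's bundled stats tools; names can vary slightly by distro packaging, e.g. arc_summary.py on older setups:)

    $ arc_summary | less    # look at ARC size vs target and the hit ratio
    $ arcstat 1 5           # live hit/miss counters: 1s interval, 5 samples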
ashift is definitely 12 (as I noted in my sibling reply)... but TIL about alignment (thanks). Parted says my ZFS partition starts at 8590983168 bytes (after an 8GB swap partition), which divides down by 4K cleanly. Is that what you mean?
Hmm, the RAM usage on this machine is generally low-ish, but with 8GB I suspect the smallest perturbations can make a big difference (even though I use it headlessly). I'll definitely keep more RAM in mind going forward, and yeah, SSD/NVME storage makes these kinds of considerations moot in high-performance contexts.
I honestly left the alignment vague because I couldn't remember what it was; I would believe it's 128k as that's the largest value ZFS ever uses, but I would also believe that 4k is fine.
Try running "zdb |grep ashift" and confirm that it's 12 (or even higher for SSDs). The default used to be 9 which killed IOPS on non-ancient HDDs that have 2^12 byte sectors and have to read-modify-write anything smaller than a sector.
“i3, 1600MHz RAM” sounds like a laptop. Are you doing anything funky like using USB HDD enclosures?
Also try comparing the number, size, and latency of IO operations submitted to ZFS vs the same stats for IO submitted to the disks with https://github.com/iovisor/bcc
Once you figure out what layer (application? VFS/cache? file system? IO elevator? HBA? disk firmware?) the performance drop is happening on, it should be trivial to fix.
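(bcc ships ready-made tools for exactly that comparison; on many distros they land under /usr/share/bcc/tools, though the path and names vary, e.g. Ubuntu's bpfcc-tools suffixes them with -bpfcc:)

    $ sudo /usr/share/bcc/tools/zfsdist 10 1      # latency histogram at the ZFS layer
    $ sudo /usr/share/bcc/tools/biolatency 10 1   # latency histogram at the block layer
    $ sudo /usr/share/bcc/tools/biosnoop          # per-IO latency/size as requests hit the disks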
One of the few things the Debian (yes...) setup guide emphasized was ashift; I remember explicitly setting it to 12.
It's not a laptop, it's a low-end motherboard currently serving as my primary workhorse :) (until I find the money to fix the issues preventing me from working... any day now... :'D). *Checks* It's an ASUS P8H61-M. And no, the disks are directly attached.
TIL ZFS can submit different IO sizes than what reaches the disks. I've just been dumbly staring at iotop and thinking that was the last word on the situation. Now to figure out how to get that info from ZFS (and figure out which bcc script to use). Thanks.
Thanks for the layer consideration. The application layer (an ncdu scan I'm currently running has been reading 60 files/second for days) and the VFS/cache layer (if I do two apt operations within seconds of each other with nothing else doing I/O, the second one completes the read step instantly) seem to be the effect/symptom. The file system (all ZFS, but obviously badly tuned) and the IO elevator (oooooh, that's what that is, TIL, I might play with this! :D) seem the most interesting, while the HBA (onboard SATA3 port *hides*) and disk firmware (I've never upgraded a BIOS in case I irreparably break something lol) are somewhat beyond the horizon.