This is great; there has been demand for this since forever. Enterprise-y people generally didn't care much but the homelab/SMB users end up dealing with it a lot more than might be naively imagined.
Always reminds me of when NetApp used to do their arrays in RAID-4 because it made expansion super-fast: just add a new zeroed disk, and writes only had to touch the new disk's blocks plus the parity drive. Used to blow our Netware admin's mind, as almost nobody else ever used RAID-4 -- I had it as an interview question along with "what is virtual memory" because you'd get interesting answers :)
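(For anyone wondering why a pre-zeroed disk can just slip into the stripe: RAID-4 parity is the XOR of the data blocks, and XORing in zeros changes nothing, so the existing parity stays valid. A toy illustration with made-up byte values:)

    $ echo $(( 0xA7 ^ 0x3C ))        # parity of two existing data blocks
    155
    $ echo $(( 0xA7 ^ 0x3C ^ 0x00 )) # XOR in a zeroed "disk": parity is unchanged
    155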
> the homelab/SMB users end up dealing with it a lot more than might be naively imagined
As a home server administrator, I've wanted this feature for so long. Before this, in order to expand an existing array I'd have to fail every single drive in turn and replace it with a new, higher-capacity one.
The only question I have is whether it supports expansion with drives of different capacities.
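(For context, my understanding is that the expansion works by attaching one disk at a time to an existing raidz vdev; the pool and device names below are made up:)

    $ zpool attach tank raidz1-0 /dev/sdX
    $ zpool status tank    # shows the expansion progress, then the extra capacity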
> The only question I have is whether it supports expansion with drives of different capacities.
I’m also very interested in this - my main reason for sticking with btrfs is that I can use a variety of odd-sized drives, and expand it by adding a new oddly-sized drive...
Exactly. That way we can create a storage server out of any random drives we have lying around as well as slowly expand its capacity without having to rebuild the whole thing. I think this capability is vital for people who can't immediately spend thousands of dollars on equipment.
Is fragmentation a serious issue for you? COW filesystems in general aren't great for use-cases that rewrite blocks frequently (databases usually the poster child) but I've never had much problem with it for more general cases even when the free space fragmentation gets north of 70%. Then again most of the storage I care about performance on is NVME.
I can imagine a conceptual sort of super-scrub that rebalances a zpool and addresses all of that, but AFAIK it's not on anyone's radar.
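(If anyone wants to check their own pool, ZFS reports free-space fragmentation in plain zpool list output; "tank" is a placeholder name:)

    $ zpool list -v tank    # the FRAG column is free-space fragmentation, per vdev with -v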
Hmmmm, in a home-use setting I currently have a scenario where
a) random seek/many-small-files performance has been really bad since day 1. I initially suspected old/low-end hardware (i3, 1600MHz RAM), but given that I can do just south of 200MB/s (two-way mirror) I'm kinda staring at ZFS expectantly here
b) I've admittedly managed to land myself in a fair few pathological way-too-many-files situations from projects and whatnot, which I really do need to get around to cleaning up
Fairly early on I noticed apt performance degraded pretty badly, and long before (b) became a substantial concern it got to the point where installing just about anything would take about 60 seconds to do the "Reading database ..." step.
I've been idly curious about tweaking different settings to try and improve performance, but it's mostly been an idle curiosity because I don't have a straightforward way to back out of "oh great now what" edge cases.
This has probably been going on for just around a year or two, and even with absolutely no context I'd be confident saying my write volume is nowhere near what you're doing :) so perhaps that particular tunable is... maybe not relevant? Or maybe it is. I'm curious.
That tunable is probably not an issue for you; the symptom there is painfully slow writes (even for bulk sequential writes).
A sibling mentioned making sure ashift is 12. I'll second that. In addition, make sure your ZFS partition is aligned; if you gave ZFS the whole disk it probably is. If you did not (e.g. because you needed an EFI boot partition) it might not be.
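(A quick way to sanity-check alignment, assuming the disk is /dev/sda: the Start of the ZFS partition should divide evenly by the physical sector size, ideally by 1MiB:)

    $ parted /dev/sda unit B print                     # check the Start column of the ZFS partition
    $ cat /sys/block/sda/queue/physical_block_size     # usually 4096 on modern HDDs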
Lastly, for any given workload, ZFS seems to have roughly logistic performance curve with respect to the amount of RAM it has to work with. The ARC does a pretty good job of keeping important data in RAM to minimize seeks when there is "enough" RAM, but it does a progressively worse job as it gets RAM constrained. On a development machine where I'm dealing with multiple SVN and git checkouts on spinning metal the performance difference between 8GB and 12GB of RAM for ZFS is night and day. Good SSDs make this a lot less important because the penalty for a small number of read-misses is approximately zero compared to rotating drives.
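(One way to see whether the ARC is actually RAM-starved is OpenZFS's bundled stats tools; names can vary slightly by distro packaging, e.g. arc_summary.py on older setups:)

    $ arc_summary | less    # look at ARC size vs target and the hit ratio
    $ arcstat 1 5           # live hit/miss counters: 1s interval, 5 samples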
ashift is definitely 12 (as I noted in my sibling reply)... but TIL about alignment (thanks). Parted says my ZFS partition starts at 8590983168 bytes (after an 8GB swap partition), which divides down by 4K cleanly. Is that what you mean?
Hmm, the RAM usage on this machine is generally low-ish, but with 8GB I suspect the smallest perturbations can make a big difference (even though I use it headlessly). I'll definitely keep more RAM in mind going forward, and yeah, SSD/NVME storage makes these kinds of considerations moot in high-performance contexts.
I honestly left the alignment vague because I couldn't remember what it was; I would believe it's 128k as that's the largest value ZFS ever uses, but I would also believe that 4k is fine.
Try running "zdb |grep ashift" and confirm that it's 12 (or even higher for SSDs). The default used to be 9 which killed IOPS on non-ancient HDDs that have 2^12 byte sectors and have to read-modify-write anything smaller than a sector.
“i3, 1600MHz RAM” sounds like a laptop. Are you doing anything funky like using USB HDD enclosures?
Also try comparing the number, size, and latency of IO operations submitted to ZFS vs the same stats for IO submitted to the disks with https://github.com/iovisor/bcc
Once you figure out what layer (application? VFS/cache? file system? IO elevator? HBA? disk firmware?) the performance drop is happening on, it should be trivial to fix.
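(bcc ships ready-made tools for exactly that comparison; on many distros they land under /usr/share/bcc/tools, though the path and names vary, e.g. Ubuntu's bpfcc-tools suffixes them with -bpfcc:)

    $ sudo /usr/share/bcc/tools/zfsdist 10 1      # latency histogram at the ZFS layer
    $ sudo /usr/share/bcc/tools/biolatency 10 1   # latency histogram at the block layer
    $ sudo /usr/share/bcc/tools/biosnoop          # per-IO latency/size as requests hit the disks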
One of the few things the Debian (yes...) setup guide emphasized was ashift; I remember explicitly setting it to 12.
It's not a laptop, it's a low-end motherboard currently serving as my primary workhorse :) (until I find the money to fix the issues preventing me from working... any day now... :'D). *Checks* It's an ASUS P8H61-M. And no, the disks are directly attached.
TIL ZFS can submit different IO sizes than what reaches the disks. I've just been dumbly staring at iotop and thinking that was the last word on the situation. Now to figure out how to get that info from ZFS (and figure out which bcc script to use). Thanks.
Thanks for the layer consideration. The application layer (an ncdu scan I'm currently running has been reading 60 files/second for days) and the VFS/cache layer (if I do two apt operations within seconds of each other with nothing else doing I/O, the second one completes the read step instantly) seem to be the effect/symptom. The file system (all ZFS, but obviously badly tuned) and the IO elevator (oooooh, that's what that is, TIL, I might play with this! :D) seem the most interesting, while the HBA (onboard SATA3 port *hides*) and disk firmware (I've never upgraded a BIOS in case I irreparably break something lol) are somewhat beyond the horizon.