This is great, there has been a demand for this since forever. Enterprise-y people generally didn't care much but the homelab/SMB users end up dealing with it a lot more than might be naively imagined.
Always reminds me of when NetApp used to do their arrays in RAID-4 because it made expansion super-fast: just add a new zeroed disk, and writes only had to update the new disk's blocks plus the parity drive. Used to blow our Netware admin's mind as almost nobody else ever used RAID-4 -- I had it as an interview question along with "what is virtual memory" because you'd get interesting answers :)
> the homelab/SMB users end up dealing with it a lot more than might be naively imagined
As a home server administrator, I've wanted this feature for so long. Before this, in order to expand an existing array I'd have to fail every single drive and replace them with new higher capacity ones.
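For reference, the old procedure was roughly this (a sketch; pool and device names are placeholders, and each replace has to finish resilvering before you start the next):

    zpool set autoexpand=on tank
    zpool replace tank old_disk new_bigger_disk   # repeat for every member of the vdev
    zpool online -e tank new_bigger_disk          # usually unnecessary once autoexpand is on

Only after the last disk has been replaced and resilvered does the extra capacity show up.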
The only question I have is whether it supports expansion with drives of different capacities.
> The only question I have is whether it supports expansion with drives of different capacities.
I’m also very interested in this - my main reason for sticking with btrfs is that I can use a variety of odd-sized drives, and expand it by adding a new oddly-sized drive...
Exactly. That way we can create a storage server out of any random drives we have lying around as well as slowly expand its capacity without having to rebuild the whole thing. I think this capability is vital for people who can't immediately spend thousands of dollars on equipment.
Is fragmentation a serious issue for you? COW filesystems in general aren't great for use-cases that rewrite blocks frequently (databases usually the poster child) but I've never had much problem with it for more general cases even when the free space fragmentation gets north of 70%. Then again most of the storage I care about performance on is NVME.
I can imagine a conceptual sort of super-scrub that rebalances a zpool and addresses all of that, but it's not on anyone's radar AFAIK.
Hmmmm, in a home-use setting I currently have a scenario where
a) random seek/many-small-files performance has been really bad since day 1; I initially suspected old/low-end hardware (i3, 1600MHz RAM), but given that I can do just south of 200MB/s (two-way mirror) I'm kinda staring at ZFS expectantly here
b) I've admittedly managed to net myself a fair few pathological way-too-many-files situations from projects and whatnot that I really do need to get to cleaning up
Fairly early on I noticed apt performance degraded pretty badly, and long before (b) became a substantial concern it got to the point where installing just about anything would take about 60 seconds to do the "Reading database ..." step.
I've been idly curious about tweaking different settings to try and improve performance, but it's mostly been an idle curiosity because I don't have a straightforward way to back out of "oh great now what" edge cases.
This has probably been going on for just around a year or two, and with absolutely no context I'd be confident saying write volume isn't a shadow of what you're doing :) so perhaps that particular tunable is... maybe not relevant? Or maybe it is. I'm curious.
That tunable is probably not an issue for you; the symptom there is painfully slow writes (even for bulk sequential writes).
A sibling mentioned making sure ashift is 12. I'll second that. In addition, make sure your ZFS partition is aligned; if you gave ZFS the whole disk it probably is. If you did not (e.g. because you needed an EFI boot partition) it might not be.
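If you want to double-check, something like this should tell you (a sketch; the device name is a placeholder):

    parted /dev/sdX unit B print   # the ZFS partition's start offset should divide evenly by 4096

If the start offset isn't a multiple of 4096 (ideally 1MiB), 4K writes from ZFS can straddle physical sectors and turn into read-modify-write on the drive.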
Lastly, for any given workload, ZFS seems to have roughly logistic performance curve with respect to the amount of RAM it has to work with. The ARC does a pretty good job of keeping important data in RAM to minimize seeks when there is "enough" RAM, but it does a progressively worse job as it gets RAM constrained. On a development machine where I'm dealing with multiple SVN and git checkouts on spinning metal the performance difference between 8GB and 12GB of RAM for ZFS is night and day. Good SSDs make this a lot less important because the penalty for a small number of read-misses is approximately zero compared to rotating drives.
ashift is definitely 12 (as I noted in my sibling reply)... but TIL about alignment (thanks). Parted says my ZFS partition starts at 8590983168 bytes (after an 8GB swap partition), which divides down by 4K cleanly. Is that what you mean?
Hmm, the RAM usage on this machine is generally low-ish, but with 8GB I suspect the smallest perturbations can make a big difference (even though I use it headlessly). I'll definitely keep more RAM in mind going forward, and yeah, SSD/NVME storage makes these kinds of considerations moot in high-performance contexts.
I honestly left the alignment vague because I couldn't remember what it was; I would believe it's 128k as that's the largest value ZFS ever uses, but I would also believe that 4k is fine.
Try running "zdb |grep ashift" and confirm that it's 12 (or even higher for SSDs). The default used to be 9 which killed IOPS on non-ancient HDDs that have 2^12 byte sectors and have to read-modify-write anything smaller than a sector.
“i3, 1600MHz RAM” sounds like a laptop. Are you doing anything funky like using USB HDD enclosures?
Also try comparing the number, size, and latency of IO operations submitted to ZFS vs the same stats for IO submitted to the disks with https://github.com/iovisor/bcc
Once you figure out what layer (application? VFS/cache? file system? IO elevator? HBA? disk firmware?) the performance drop is happening on, it should be trivial to fix.
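A rough sketch of that comparison with the stock bcc tools (install paths vary by distro; these are the usual locations):

    sudo /usr/share/bcc/tools/zfsdist 10 1      # latency histograms for ZFS-level read/write/open/fsync
    sudo /usr/share/bcc/tools/biolatency 10 1   # latency histogram for block IO actually hitting the disks
    sudo /usr/share/bcc/tools/biosnoop          # per-IO trace: process, device, size, latency

If the ZFS-level latencies are bad while the block-level ones look healthy, the time is being spent above the disks; if both are bad, look at the disks/HBA.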
One of the few things in the Debian (yes...) setup guide was emphasizing ashift, I remember explicitly setting it to 12.
It's not a laptop, it's a low-end motherboard currently serving as my primary workhorse :) (until I find the money to fix the issues preventing me from working... any day now... :'D). *Checks* It's an ASUS P8H61-M. And no, the disks are directly attached.
TIL ZFS can submit different IO sizes than what reaches the disks. I've just been dumbly staring at iotop and thinking that was the last word on the situation. Now to figure out how to get that info from ZFS (and figure out which bcc script to use). Thanks.
Thanks for the layer consideration. The application layer (an ncdu scan I'm currently doing has been reading 60 files/second for days) and VFS/cache layer (if I do two apt operations in relatively quick succession (seconds apart) with nothing else doing I/O, the second one completes the read step instantly) seem to be the effect/symptom. The file system (all ZFS, but obviously badly tuned) and IO elevator (oooooh, that's what that is, TIL. I might play with this! :D) seem the most interesting, while the HBA (onboard SATA3 port *hides*) and disk firmware (I've never upgraded a BIOS in case I irreparably break something lol) are beyond the horizon somewhat.
I don't understand why the title says "Goes Live"?
The code is here[0]. It still needs more testing and cleanup, and will then eventually be merged. After that it'll take some time to make it to all the distributions (FreeNAS, FreeBSD, etc.).
There is a very interesting podcast with Matt Ahrens, co-founder of the ZFS project, that covers RAID-Z expansion as well as the history of ZFS etc.
https://changelog.com/podcast/475
> But the interesting thing about this project is how did it come to be. So a long-requested feature - how did it get funded? So actually, it’s funded by the FreeBSD Foundation.
For those who like videos more than text, there is a YouTube video from last year [1] that explains the feature (unless it has changed since, but that doesn't seem to be the case).
One downside that I see of this approach, if I understand it correctly, is that the data already present on disk will not take advantage of the extra disk per slice. For example, if I have a raidz of 4 disks (so 25% of space "wasted") and add another disk, new data will be distributed over 5 disks (so 20% of space "wasted"), but the old data will keep using stripes of 4 blocks; they will just be reshuffled between the disks. Do I understand it correctly?
That is correct. There are planned solutions for rewriting all your data, though they won't play nicely with one of ZFS's other most important features - Snapshots. The current plan is basically to have a nice userspace utility that will rewrite all your data in-place, but that will cause you to rereplicate everything over your snapshots.
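In the meantime, the usual manual workaround (not the planned utility, just a sketch; dataset names are placeholders, and you need enough free space for a second copy) is to rewrite the data yourself with send/receive:

    zfs snapshot tank/data@move
    zfs send tank/data@move | zfs receive tank/data_new
    # verify, then destroy tank/data and zfs rename tank/data_new into its place

Anything rewritten this way gets the new stripe width, but it also gives up the space sharing with your existing snapshots, which is exactly the tradeoff described above.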
Thanks for your answer. The point about the snapshots is a very good one, for some reasons I didn't think about it.
Rewriting data in place can be tricky, if you have old enough snapshots the newly added space may not be enough. I hope they will find a good enough solution.
This is great, but an important and little-known caveat is that raidz is limited to the IOPS of one disk. So a grown raidz will at some point have lots of throughput but still suffer on small and random reads and writes. At that point, it will be better to grow the pool with additional, separate raidz vdevs.
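Growing the pool that way is just adding another top-level vdev (a sketch; pool and device names are placeholders, and note it's effectively permanent since raidz top-level vdevs can't currently be removed):

    zpool add tank raidz2 sde sdf sdg sdh

New writes then get striped across both raidz vdevs, so you add roughly one vdev's worth of IOPS for every vdev you add.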
To clarify, the sequential read bandwidth of a RAIDZN with M total disks will be M times the read bandwidth of a single disk. The sequential write bandwidth will be (M-N) times the write bandwidth of a single disk. These are optimal values. https://calomel.org/zfs_raid_speed_capacity.html
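To put numbers on that formula: a 6-disk RAIDZ2 (M=6, N=2) would top out at roughly 6x a single disk's sequential read bandwidth and roughly 4x its sequential write bandwidth, in the best case.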
It's a constraint on iops in particular, not bandwidth, because writing a record to disk involves writing something to every disk in the pool. So if you have a disk that's taking a long time to write, the other disks need to pause and let it catch up.
To be fair, I don't think this is little-known. When a write must be acknowledged by multiple devices (as is the case with RAID1 or RAIDZ), a write requires more IOPS.
For writes, the system is copy-on-write, so for each write, there is a full read of the containing record, often across all disks, so the updated record can be written out in the new location on the disks without disrupting the old record, which may still be referenced by snapshots, etc.
Adding additional disks gets you more throughput (you can split a single large write across more disks), but not more IOPS.
Okay, the follow-up post really explains it. It's yet another consequence of ZFS's inability to move blocks. So you more or less can't have multiple writes get lumped together because that would break garbage collection.
There are ways to work around elements of this and get much better performance, but not wanting more complexity makes enough sense.
Definitely a flaw in the ZFS data model though, rather than something inherent to the use of multiple disks.
It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ. The real world is a cruel place and you get nothing for free: in order to have distributed parity and consistency you have to suffer the lowest-common-denominator IOPS penalty.
That is just maths.
If you want to use ZFS and have more IOPS you will have to have more VDEVs (either more RAIDZs or several mirrors). Your storage efficiency will be slightly reduced (RAIDZ) or tank to 50% (mirrors) with less redundancy.
But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.
> It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ.
You could have both! That's not the tradeoff here. The problem is that if you wrote 4 independent pieces of data at the same time, sharing parity, then if you deleted some of them you wouldn't be able to recover any disk space.
> That is just maths.
I don't think so. What's your calculation here?
The math says you need to do N writes at a time. It doesn't say you need to turn 1 write into N writes.
If my block size is 128KB, then splitting that into 4+1 32KB pieces will mean I have the same IOPS as a single disk.
If my block size is 128KB, then doing 4 writes at once, 4+1 128KB pieces, means I could have much more IOPS than a single disk.
And nothing about that causes a write hole. Handle the metadata the same way.
ZFS can't do that, but a filesystem could safely do it.
> But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.
The flaw I'm talking about is that Block Pointer Rewrite™ never got added. Which prevents a lot of use cases. It has nothing to do with preserving consistency (except that more code means more bugs).
I am a beginner when it comes to ZFS, but isn’t “no moving data” as an axiom a good choice? Any error that happens during the move would destroy that data - while without moving it, the data will likely be recoverable even with a dead hard drive.
> I am a beginner when it comes to ZFS, but isn’t “no moving data” as an axiom a good choice?
It's a reasonable choice, but only because it makes certain kinds of bugs harder, not because it's safer when the code is correct.
> Any error that happens during the move would destroy that data - while without moving it, the data will likely be recoverable even with a dead hard drive.
That's not true. You make the new copy, then update every reference to the new copy, and only then remove the old one. If there's an error halfway through then there's two copies of the data.
100 data writes, assuming they're to different files and/or large enough to fill a slice, means 500 disk writes (5 disk writes per slice).
The 500 disk writes are parallelized over 5 disks so they only take the time taken for 100 writes (500 / 5).
So the IOPs is the same as a single disk.
HOWEVER, bandwidth is 4x. The above assumes that seek time dominates. If your writes are multiple slices in length, they will be written 4x faster because the amount of data per disk is divided across the disks. If you're reading and writing large contiguous files, then you do get a big I/O boost from raidz.
The slices depend on record size and sector size. For 4 data disks and default 128k record you have 32k per disk per record. With more disks, the proportion of a slice per disk decreases, not increases, and it's rounded up to the sector size, so there's usually some loss of space on larger parity schemas.
Each write is striped across all of the disks. For a stripe to be completely written, all disks must finish writing the data. Thus your max write IOPS equals that of the slowest drive in the vdev.
It's reasonable to expect that issuing a batch of 100 writes at the application layer followed by a fsync would not always require doing 100 writes to each underlying block device. The OS/FS should be able to combine writes when the IO pattern allows for it, and should be doing some buffering prior to the fsync in hopes of assembling full-stripe writes out of smaller application-layer writes.
vdev is logical. Purpose: Disk grouping and redundancy. Composition: One or more disks.
zpool is logical. Purpose: Higher-level management of one or more vdevs. Composition: It acts like a JBOD.
---
zpools can be thought of as "stripes of vdevs". This, in the narrow sense that the failure of any vdev in a zpool is a permanent loss of the entire zpool. All your redundancy in the ZFS ecosystem is via mirrored or RAID'ed vdevs.
---
The setup I have heard of that balances performance, redundancy and space is to do what you say: Have a zpool of multiple mirror-type vdevs.
You can also stripe at the vdev level, which I would assume has higher performance than having multiple single-disk vdevs in a pool - I'm unaware of the differences at a low level.
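A minimal sketch of that layout (pool and device names are placeholders):

    zpool create tank mirror sda sdb mirror sdc sdd mirror sde sdf

Each mirror is its own vdev, writes are striped across the three of them, and you can later grow the pool two disks at a time with something like "zpool add tank mirror sdg sdh".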
That's a common configuration for larger arrays where you probably don't want to make individual RAIDZ arrays too wide on their own. It's functionally similar to RAID 50 or 60. However when the budget is tight you might be incentivized to compromise on performance or reliability, so this new expansion feature really does help in hobbyist or shoestring budget situations where you just want one huge array to maximize usable space while still being able to tolerate some failures. Typically you would buy disks as you go along to try to stretch the budget further as utilization increases.
I certainly wanted this. I even heckled Bill Moore about it. Having gone through the expansion the old way (replace each drive one at a time with a larger one), this looks a lot simpler. Unfortunately it appears to not work with simple mirror and stripes (~ RAID10) so it will make no difference for me. (Drives are cheap but performance is not -> RAID10).
I love ZFS but this is something that just works in btrfs; mirror just means all blocks live in two physical locations. You certainly can do that even with an odd number of drives. However ZFS is more rigid and doesn’t allow flowing blocks like this, nor dynamic defragmentation.
What are you on about? The submission is about raidz.
Adding drives to a mirror has worked in zfs since prehistoric times. “zpool attach test_pool sda sdc” will mirror sda to sdc. If sda was already mirrored with sdb, you now have a triple-mirror with sda, sdb, and sdc.
I am aware of course. I was explaining why adding a single drive does make sense for RAID10 just fine and, as an example, btrfs has it (existence proof), but ZFS doesn't support it.
Your example doesn't expand the storage of the vdev, which is what this entire discussion is about, it merely adds a mirror.
You can definitely expand and shrink a zpool based on mirrors. You can remove top-level vdevs, provided the pool is fine, and expand existing pools by adding new mirrors.
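Roughly like this (a sketch; pool/device/vdev names are placeholders, and top-level removal only works when the pool has no raidz vdevs and has room to absorb the evacuated data):

    zpool add tank mirror sdg sdh   # grow the pool by another mirror vdev
    zpool remove tank mirror-1      # evacuate and remove a top-level mirror vdev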
You can now with this feature. If you're using RAID1 vdevs, you could always remove and add mirrors. Now you can at least add a device to a RAIDZn vdev.
> If you're using RAID1 vdevs, you could always remove and add mirrors.
This is still answering the wrong question, unless by "add mirrors" you mean "add another drive to a mirrored vdev and get more usable capacity", which I don't think is how ZFS works and would be strange to summarize as "add mirrors".
I'm trying to discuss an idea you don't seem to have any terminology for. Can you please try to meet me halfway and at least respond in a manner that makes it clear you understand there's a real distinction to be made?
You can now add devices to a vdev in ZFS with this feature, giving you the ability to resize a vdev, but only upwards; shrinking isn't possible yet.
In what situation would increasing the size of a vdev be better than increasing the size of a pool? The submission increases the size of a raidz vdev and that's incredibly useful, but when would that make sense for a mirror?
> In what situation would increasing the size of a vdev be better than increasing the size of a pool?
When you want to add a single drive to expand your available storage capacity, without reducing the degree of redundancy you're already using. That use case should have been obvious given how much discussion it's already received in this thread, so I think you have some kind of blind spot about anything small-scale. This is why it's good to actually understand the competition, even after deciding which tradeoffs are right for you.
> If sda was already mirrored with sdb, you now have a triple-mirror with sda, sdb, and sdc.
You probably already know this, but btrfs offers a different option. Its "RAID1" mode would not result in the three drives storing your data in triplicate, but rather still stores data in duplicate with a usable capacity that's 1.5x that of a single drive's capacity. To get the ZFS behavior you describe, btrfs offers a "RAID1c3" mode, but that's a much less interesting feature.
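A sketch of the difference at creation time (device names are placeholders; the raid1c3/raid1c4 profiles need a reasonably recent kernel, 5.5 or later):

    mkfs.btrfs -d raid1 -m raid1 /dev/sdX /dev/sdY /dev/sdZ       # two copies of data: ~1.5 drives of usable space on three equal drives
    mkfs.btrfs -d raid1c3 -m raid1c3 /dev/sdX /dev/sdY /dev/sdZ   # three copies of data: ~1 drive of usable space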
It's a neat feature for SOHO-type users, and takes advantage of the fact that the RAID layer knows about the FS. The first implementation that I know of is Drobo's filesystem.
Asymmetric RAID is probably the only config that I'd like to see in ZFS - being able to have (say) an 8TB drive and two 4TB drives, and have 8TB of mirrored space is quite nice when you're just buying commodity kit and don't want to have to retire/match drives all the time when growing mirrors.
Yep, SOHO use cases are where this kind of capability is really useful. ZFS is great when you buy drives by the dozen, but even with this feature it's still inconveniently inflexible if you only have the budget to be upgrading or expanding one drive at a time.
Ah man, this is perfect timing. I'm literally in the midst of building a new NAS and intending to use TrueNAS + RAIDZ2. Purposely over-speccing potential drive bay capacity so I can expand later on. I was worried how hard it will be to expand later, but it looks like it may end up being more possible than I thought! Sweeet :)
Lots of people have wanted this for ages. I managed to cope with spindle replace and resize into new space (larger spindles) but being able to add more discrete devices and get more parity coverage and more space (I may be incorrectly assuming you get better redundancy as well) is great.
It is an experimental feature on FreeBSD 14-CURRENT. It will be merged into OpenZFS eventually (and maybe backported to FreeBSD 13-STABLE and whatever new point releases happen).
Nah, ZFS is a pretty comfy FS, it has lots of nice features, it is reasonably fast, and it is stable. And as far as I know, it has been used for a fairly long time.
They "lost everything" because they failed to set up even the most basic and critical maintenance functions built into ZFS. I don't think the responsibility in this incident could fall any more squarely on the shoulders of the people who set it up.
I think there's a case to be made that a NAS-oriented OS/distro should probably default to having a scheduled scrub, not wait for you to set one up. I don't follow LTT, but a quick look around says he was using TrueNAS for at least one of the builds?
You do need a notification feature as well, which is tricky because a distro can't assume you'll set up mail properly, but maybe something like making Samba go read-only when the zpool is degraded could work.
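A minimal sketch of the kind of defaults I mean (file paths are what I'd expect on a typical Linux install; adjust to taste):

    # /etc/cron.d/zfs-scrub -- scrub weekly
    0 3 * * 0  root  /usr/sbin/zpool scrub tank

    # /etc/zfs/zed.d/zed.rc -- have the ZFS event daemon mail on errors and degraded vdevs
    ZED_EMAIL_ADDR="admin@example.com"

Some distros already ship a cron job or systemd timer for the scrub part; the notification part is what's usually left to the user.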
Not only that. A vdev of 15 disks, each around 16 (or 18?) TiB, in RAID-Z2 is just asking for trouble (https://www.zdnet.com/article/why-raid-6-stops-working-in-20...). Even with 1% annual disk failures (based on Backblaze, for the better disks), we are looking at around a 1% chance of at least two of the 15 disks failing at the same time. With 10 such vdevs, that is about a 10% chance of any of them having such a failure.
They should really move to RAID-Z3 or 8-disk groups. With RAID-Z3, we are looking at around a 0.04% chance of at least three of the 15 disks failing at the same time. With 8-disk groups, we are looking at around a 0.2% chance.
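For anyone checking the arithmetic: with an independent annual failure probability p per disk, the chance of at least k out of n disks failing in the same window is roughly C(n,k) * p^k, so for 2 of 15 at p = 1% that's 105 * 0.0001, about 1%. That ignores rebuild windows and correlated failures (same batch, same backplane, same PSU), which only make the real number worse.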
I've read this article and others like it, and what I don't get is: if the URE rates are really that high, why have I never seen a URE in the bi-weekly scrubs of my pool (184 TB of raw disk), aside from when a single disk was literally going bad?
There are many contributing factors. Power supplies degrade over time, and their ability to maintain correct voltages can be impaired in worst-case scenarios, like running all disks flat out. Drives (and even more so older drives) can be vibration sensitive, especially in worst-case scenarios involving all drives running flat out. With modern manufacturing tolerances being so tight, whatever triggered the first disk in your pool to die is likely to get a second one within a fairly small window.
So sure, the failures are not equally distributed and you've done well with your 184TB pool.
Are you tracking the device errors, or only those that are visible to the OS? Multiple disk failures are not particularly uncommon, though. My experience in this space was two 16-disk servers on which I set up six 5-disk RAID5s (with one global spare per server). Within one month I had 11 of 32 disks die, and barely managed not to lose any user files, and this was not during the first month in production.
Scary. I've since moved to pairs of servers cross connected to pairs of 60 disk chassis (16 x 12 gbit connections per chassis) with ten 11-disk RAIDz3, 10 global spares, and 6x3.2TB of NVMe cache per server.
Oh there are all kinds of reasons drives can cause errors, and you have the bathtub curve. So there's lots to take into account when designing your pool.
But the article is using the spec-sheet URE rate, which I'd assume looks only at the drive itself and doesn't take into account problems with the computer around the drive, or the end-of-life period after the drive warranty has expired; I'd assume it's the "baseline" error rate.
> Are you tracking the device errors, or only those that are visible to the OS?
If we're talking URE like the article, that's data-loss on a disk, and the OS would always figure it out, since it would cause a ZFS checksum failure on scrub.
In this case it's not my data and not my money, so my preference is 6-drive RAIDZ2 vdevs. We've only had one disk with errors (and that one was migrated from a PC where Windows never reported any errors... of course...). The oldest 2 disks (3.5 years power-on time) have single-digit reallocated sectors in SMART so those are on course to be replaced.
I'm just curious since the argument in the article doesn't add up in my eyes.
> Within one month I had 11 of 32 disks die, and barely managed not to lose any user files, and this was not during the 1st month in production
They even admit this in the video -- "No one is to blame here except us" which means "Holy shit! We really messed this up."
Yes, they should have been scrubbing their pools, but I don't think this was bit rot. Millions of data errors is not what bit rot looks like. This looks exactly like bad hardware.
ServeTheHome actually reported that the hard drives LTT are using had a higher failure rate than others, even from a consumer hardware perspective. And they were supposed to be enterprise drives.
It is just bad hardware, bad setup, and bad everything all crammed together.
I've never seen a HDD fail like that, but I'm willing to believe it's possible. I have seen read/write errors because of a buggy implementation of ALPM.
But, agreed, whatever it was, if they were paying attention, they could have diagnosed and remediated well before they had any data loss.
Btrfs lets you keep metadata at a stronger bitrot-protection level fairly easily, and as long as that is still OK, you'll only lose the data blocks that rotted, so at LTT's scale just a few seconds of footage in total.
Even if that’s true, btrfs would be entirely impractical for them because btrfs-raid6 is broken. Their petabyte server would become a 1/3 petabyte server without raidz2.
But that's more because no one dared to try and actually fix Btrfs parity raid.
Misaligned incentives causing lack of funding for something as flexible and generally resilient as Btrfs.
Because there are some write holes around power failures (IIRC) and, more importantly, restore basically doesn't work (the code just doesn't exist in a working form).
This isn't easy code to write, and it's Linux-kernel C, far from "comfy" code to write. I assume it's that kind of effect that continues to keep all competent people from fixing Btrfs's RAID 5/6 code. (If I could gather the motivation, I could probably do it, but it's far from fun and I'm already short on motivation to code... it needs to be someone who likes that work, ideally sponsored so they don't lose money from working on the project.)
BTRFS stores duplicate copies of the metadata. Good for BTRFS. I don't understand how this would have prevented what LTT experienced.
First, as someone else said -- no one would use a BTRFS raid5/6 array in production. No one. Second, I don't think what LTT experienced was bit rot. There were far, far too many data errors for that be simple bit rot. My guess is bad cable, bad HBA which maybe doesn't support something like ASPM, etc. Third, are we even sure the problem was corrupted metadata? Fourth, ZFS keeps redundant copies of metadata too (as do many filesystems.)
That you could have dialed up the redundancy for metadata to 3x or even 4x copies being stored, and just because a block got bitrot doesn't mean other blocks fail to read.
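For an existing filesystem that's a straightforward online conversion (a sketch; the mount point is a placeholder, and raid1c3/raid1c4 need kernel 5.5 or later):

    btrfs balance start -mconvert=raid1c3 /mnt/pool

It only rewrites the metadata block groups, so it's quick compared to converting the data.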
I know no one would use Btrfs raid5/6 in production, but that's because no one wanted to enough to get the required development done.
If it was just file blocks that got corrupted, there would be just very minor damage in total and a scrub that logs all damaged blocks typically proceeds at sequential read speeds from what I've observed in the past.
ZFS already stores two or three redundant copies of metadata (ditto blocks), regardless of the copies parameter or vdev redundancy. The failures LTT experienced went way beyond what ZFS, or btrfs, could repair.
They made severe errors in judgment when setting up the system and failing to monitor and administer it. No system will ever protect against this. Not ZFS. Not btrfs.
Since the Linux crowd moved in, ZFS development seems to have gone from stability to feature, feature, feature. I'm starting to get a bit concerned that this isn't going to end well. I really hope I'm wrong.
Nothing really changed in development; the ZFSonLinux team was actually one of the more conservative in terms of data safety. What changed is that a bunch of things that had long been in the works happened to reach maturity around the same time.
If you want "feature chasing", FreeBSD's ZFS TRIM is the ur-example. I've read that code end to end... and I'll leave it at that.
The ZoL/OpenZFS-implemented encryption features shipped with a critical flaw that could permanently corrupt the receiving dataset on a subsequent raw send. The same bug also hit raw send to a pool with a different ashift, and was also only recently fixed.
I'm sorry to say it, but this article is not entirely true: the illustration "how does traditional raid 4/5/6 do it?" shows ONLY RAID 4. There is a big difference between RAID 4 and RAID 5/6, and the former was abandoned years (decades?) ago in favor of RAID 5 and, later, 6.
Of course, it gives "better publicity" for RAID-Z, but that is a marketing trick rather than engineering.
Note that the article talks about the way the array is expanded, not how the specific level works.
In other words, what they are saying is that the traditional way to expand an array is essentially to rewrite the whole array from scratch, so if the old array has three stripes, with blocks [1,2,3,p1] [4,5,6,p2] and [7,8,9,p3] (with p1, p2 and p3 being the parity blocks), the new array will have stripes [1,2,3,4,p1'], [5,6,7,8,p2'] and [9,x,x,x,p3'], i.e. it not only has to move the blocks around, but also recompute essentially all the parity blocks.
IF I understand the ZFS approach correctly, the existing blocks are not restructured but only reshuffled, so the new layout will be logically still
[1,2,3,p1] [4,5,6,p2] and [7,8,9,p3] but distributed on five disks so
[1,2,3,p1,4] [5,6,p2,7,8], [9,p3,x,x,x]
It seems that this means less work while expanding, but some space lost unless one manually copies old data in a new place.
IF I got it right, I am not sure who the intended audience for this feature is: enterprise users will probably not use it, and power users would probably benefit from getting all the space they could get from the extra disk
Power users would like to get all the space, but when the choice is either to buy just one HDD and get some extra space, or to buy 4+ HDDs to replace the old array with a completely new one and then be left with the unused old one, most would pick the first option.