Let's say I have a 4+1 disk layout, and I do 100 data writes followed by an fsync.

Naively I would expect that to become 25 data writes per disk, and then the fsync would go to all disks.

That might be about 30 writes total, so I'd expect it to be at least 3x as fast as a single disk.
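
Written out, the arithmetic I have in mind (naive, which is exactly what I'm asking about):

    # My naive model: the 100 data writes spread evenly over the 4 data
    # disks, then the fsync hits every disk once.
    data_writes, data_disks, parity_disks = 100, 4, 1
    writes_per_disk = data_writes / data_disks      # 25
    fsync_flushes = data_disks + parity_disks       # 5
    print(writes_per_disk + fsync_flushes)          # ~30, vs ~101 on one disk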

Where does that expectation break down?



This summarizes it well for reads: https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSRaidzRea...

For writes, the system is copy-on-write: each write to an existing record requires a full read of the containing record, often across all disks, so that the updated record can be written out to a new location without disturbing the old record, which may still be referenced by snapshots, etc.

Adding additional disks gets you more throughput (you can split a single large write across more disks), but not more IOPS.
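
Roughly, the cost of a small update looks like this (an illustrative cost model with assumed numbers, not actual ZFS accounting):

    # Illustrative cost of updating a few KB inside an existing 128 KiB
    # record on a 4+1 raidz vdev (assumed recordsize; not real ZFS internals).
    DATA_DISKS, PARITY_DISKS = 4, 1

    # Copy-on-write: read back the whole containing record (one read per
    # data disk), rebuild it with the new bytes, then write the full record
    # plus parity out to a fresh location.
    reads_per_update = DATA_DISKS                     # 4
    writes_per_update = DATA_DISKS + PARITY_DISKS     # 5
    print(reads_per_update + writes_per_update)       # ~9 disk ops to change 4 KiB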


Okay, the follow-up post really explains it. It's yet another consequence of ZFS's inability to move blocks. So you more or less can't have multiple writes get lumped together because that would break garbage collection.

There are ways to work around elements of this and get much better performance, but not wanting more complexity makes enough sense.

Definitely a flaw in the ZFS data model though, rather than something inherent to the use of multiple disks.


It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ. The real world is a cruel place and you get nothing for free: in order to have distributed parity and consistency, you have to suffer the lowest-common-denominator IOPS penalty.

That is just maths.

If you want to use ZFS and have more IOPS you will have to have more VDEVs (either more RAIDZs or several mirrors). Your storage efficiency will be slightly reduced (RAIDZ) or tank to 50% (mirrors) with less redundancy.
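
As a rough comparison of the options (assumed 10-disk pool, illustrative numbers only):

    # Rough comparison for an assumed 10-disk pool. Random-write IOPS scales
    # with the number of vdevs; usable capacity is what survives redundancy.
    def layout(name, vdevs, disks_per_vdev, parity_per_vdev):
        usable = (disks_per_vdev - parity_per_vdev) / disks_per_vdev
        print(f"{name:15s} usable={usable:.0%}  random-write IOPS ~ {vdevs}x one disk")

    layout("1 x raidz1(10)", 1, 10, 1)   # 90% usable, ~1x IOPS
    layout("2 x raidz1(5)",  2,  5, 1)   # 80% usable, ~2x IOPS
    layout("5 x mirror(2)",  5,  2, 1)   # 50% usable, ~5x IOPS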

But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.


> It is not a flaw in the ZFS data model, it is a carefully considered tradeoff between consistency and IOPS for RAIDZ.

You could have both! That's not the tradeoff here. The problem is that if you wrote 4 independent pieces of data at the same time, sharing parity, and later deleted some of them, you wouldn't be able to reclaim any of that disk space.

> That is just maths.

I don't think so. What's your calculation here?

The math says you need to do N writes at a time. It doesn't say you need to turn 1 write into N writes.

If my block size is 128KB, then splitting that into 4+1 32KB pieces will mean I have the same IOPS as a single disk.

If my block size is 128KB, then doing 4 writes at once as 4+1 pieces of 128KB each means I could have much more IOPS than a single disk.
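
Putting numbers on those two cases (a toy model that assumes seek-bound disks, not any actual allocator):

    # Toy model: 100 independent 128 KB writes on 4 data + 1 parity disks,
    # assuming IOPS is limited by operations per disk, not bytes.
    writes, data_disks = 100, 4

    # (a) Split each 128 KB record into 4 x 32 KB pieces plus parity:
    #     every write touches every disk, so each disk does 100 ops.
    ops_per_disk_split = writes                 # 100 -> same IOPS as one disk

    # (b) Hypothetical: pack 4 whole 128 KB records side by side in a stripe
    #     and share one parity column across them.
    ops_per_disk_batched = writes / data_disks  # 25 -> ~4x the IOPS

    print(ops_per_disk_split, ops_per_disk_batched)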

And nothing about that causes a write hole. Handle the metadata the same way.

ZFS can't do that, but a filesystem could safely do it.

> But to call it a flaw in the data model goes a long way to show that you do not appreciate (or understand) the tradeoffs in the design.

The flaw I'm talking about is that Block Pointer Rewrite™ never got added, which rules out a lot of use cases. It has nothing to do with preserving consistency (except that more code means more bugs).


I am a beginner when it comes to ZFS, but isn’t “no moving data” a good choice as an axiom? Any error that happens during a move could destroy that data, while without moving it will likely be recoverable even with a dead hard drive.


> I am a beginner when it comes to ZFS, but isn’t “no moving data” as an axiom a good choice?

It's a reasonable choice, but only because it makes certain kinds of bugs harder, not because it's safer when the code is correct.

> Any error that may happen during would destroy that data - while without moving it will likely be recoverable even with a dead harddrive.

That's not true. You make the new copy, then update every reference to point at the new copy, and only then remove the old one. If there's an error halfway through, there are two copies of the data.
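
In sketch form (toy code to show the ordering, nothing to do with how ZFS actually stores references):

    # Toy model of a crash-safe block move: copy first, repoint references,
    # free last. At no point is there fewer than one valid copy.
    storage = {"blk0": b"data"}
    references = {"file_a": "blk0", "snapshot_1": "blk0"}

    def move_block(old, new):
        storage[new] = storage[old]              # 1. write the new copy
        for owner, blk in list(references.items()):
            if blk == old:
                references[owner] = new          # 2. update every reference
        del storage[old]                         # 3. only then free the old copy

    move_block("blk0", "blk1")
    print(references)  # both references now point at 'blk1'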


100 data writes, assuming they're to different files and/or large enough to fill a slice, mean 500 disk writes (5 disk writes per slice).

The 500 disk writes are parallelized over 5 disks, so they take only as long as 100 writes on one disk (500 / 5).

So the IOPS is the same as a single disk's.

HOWEVER, bandwidth is 4x. The above assumes that seek time dominates. If your writes are multiple slices in length, they will be written 4x faster because the data is divided across the disks, so each disk carries only a quarter of it. If you're reading and writing large contiguous files, then you do get a big I/O boost from raidz.
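
In numbers (using the figures above; the seek-bound vs. bandwidth-bound split is an assumption):

    # 100 full-record writes on a 4+1 raidz vdev.
    data_disks, parity_disks = 4, 1
    records, record_kb = 100, 128

    disk_writes = records * (data_disks + parity_disks)       # 500 physical writes
    ops_per_disk = disk_writes / (data_disks + parity_disks)  # 100 -> single-disk IOPS

    # Bandwidth is different: each disk only carries a quarter of the data,
    # so large sequential transfers finish ~4x sooner than on one disk.
    kb_per_disk = records * record_kb / data_disks            # 3200 vs 12800 on one disk
    print(ops_per_disk, kb_per_disk)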


> 100 data writes, assuming they're to different files and/or large enough to fill a slice, means 500 disk writes (5 disk writes per slice).

And if I wrote 100 slices to a single drive, is that 100 writes or 400 writes?

If it's 100, then it sounds like the slices are sized wrong: they should get bigger when I add more disks.

If it's 400 writes, then 100 writes per disk should be much faster.


The slices depend on record size and sector size. For 4 data disks and the default 128k record, you have 32k per disk per record. With more disks, each disk's share of a slice decreases, not increases, and it's rounded up to the sector size, so there's usually some space lost to padding on wider layouts.
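
For example (assuming the default 128k record and 4k sectors):

    import math

    # Each disk's share of one record, rounded up to whole sectors.
    def per_disk_kb(record_kb=128, data_disks=4, sector_kb=4):
        share = record_kb / data_disks
        return math.ceil(share / sector_kb) * sector_kb

    print(per_disk_kb(data_disks=4))    # 32k per disk, no padding
    print(per_disk_kb(data_disks=6))    # 128/6 = 21.3 -> rounded up to 24k
    print(per_disk_kb(data_disks=12))   # 128/12 = 10.7 -> rounded up to 12k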


Each write is striped across all of the disks. For a stripe to be completely written, all disks must finish writing the data. Thus your max write IOPS equals that of the slowest drive in the vdev.


> Each write is striped across all of the disks.

It's reasonable to expect that issuing a batch of 100 writes at the application layer followed by an fsync would not always require doing 100 writes to each underlying block device. The OS/FS should be able to combine writes when the IO pattern allows for it, and should be doing some buffering prior to the fsync in the hope of assembling full-stripe writes out of smaller application-layer writes.
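
Something along these lines (generic coalescing logic, not a claim about what ZFS actually does):

    # Coalesce small, contiguous application writes into full-stripe chunks
    # before submitting them, so 100 tiny writes don't become 100 stripes.
    STRIPE = 128 * 1024   # assumed full-stripe payload

    def coalesce(writes):
        """writes: list of (offset, data) pairs, sorted and contiguous."""
        out, buf, start = [], b"", None
        for off, data in writes:
            if start is None:
                start = off
            buf += data
            while len(buf) >= STRIPE:              # emit full stripes as they fill
                out.append((start, buf[:STRIPE]))
                start += STRIPE
                buf = buf[STRIPE:]
        if buf:
            out.append((start, buf))               # leftover partial stripe
        return out

    small = [(i * 4096, b"x" * 4096) for i in range(100)]   # 100 x 4 KiB writes
    print(len(coalesce(small)))  # 4 stripe-sized submissions instead of 100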


> Each write is striped across all of the disks.

It sure shouldn't be. This isn't dumb RAID.



