But, of course, there is no defrag for ZFS and so we do it the trailer-park way - we add free space to a zpool by adding a vdev, and then we 'zfs send' datasets to ourselves on that same zpool, allowing ZFS to lay down those bytes in an efficient and orderly fashion ... as opposed to the inefficient way they were laid down over time.
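In case it helps anyone, the recipe is roughly this (pool, vdev, and dataset names are made up, and the final cutover involves a rename, so it's not transparent to consumers):

    # grow the pool so there's a large contiguous run of free space
    zpool add tank mirror /dev/ada4 /dev/ada5

    # copy the fragmented dataset to itself via send/recv; the receive
    # lays the blocks down sequentially into the newly added space
    zfs snapshot tank/data@defrag
    zfs send tank/data@defrag | zfs recv tank/data.new

    # then rename tank/data.new into place and destroy the old dataset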
Because ZFS is not a content-addressed store (CAS), block pointer (BP) rewrite is extremely expensive and has not been implemented. And defragmentation requires BP rewrite, or what GP says they do: zfs send, then swap the old and new datasets -- but that can't be done atomically and transparently with a script, so that kinda sucks.
The fundamental problem is that ZFS is not content addressed. It's almost CAS, but not quite. In ZFS, block pointers contain physical addresses, which means physical block locations are inextricably part of the ZFS Merkle hash tree. And that means any change to the location of any block requires every pointer to it to be rewritten. Now you have to find all those pointers -- and even if there can be only one pointer, you still have to change it, and when you change it, you change the Merkle hash tree of the dataset.
The solution, IMO, is to split all znodes and any interior nodes that have block pointers into two halves. One half should have only logical block pointers free of any physical location pointers, thus the pointers in this half should be some metadata + hashes of the pointed-to blocks. The second half should be a "cache" of physical locations corresponding to the logical block pointers in the first half, plus a hash of just those physical locations. The key is that the Merkle hash tree should not bind physical locations, so that changing those locations does not alter the Merkle hash tree. That leaves only the task of updating those second block halves that carry physical locations.

So when traversing the tree, the filesystem could optimistically read pointed-to blocks from the cached locations, and if the hash of the block read does not match the logical pointer, go looking in a log of recent block moves. This way a BP rewrite system could simply traverse a dataset looking for blocks to move, copy the blocks to new locations, update the cached locations where the block pointer was found, log the move, then add the old location to the freelist.
But ZFS is not like that. ZFS is not a content-addressed store, but it's a Merkle hash tree all the same. That makes BP rewrite insanely difficult.
I do wonder how much effort it would take to make a ZZFS that is based on ZFS but with the CAS redesign sketched above. This is not the first time I've written about this. I think I first proposed this back in the early days of Illumos, and one or two people thought that not hashing the physical locations into the Merkle hash tree would be folly -- I think they're wrong, because either you trust your hash function or you don't. Granted, to make ZFS into a proper CAS does require a strong cryptographic hash function, but ZFS uses one so...
How does btrfs do it? AFAIK it doesn't use content-addressing either, but it does have defrag, on-demand dedup, rebalance, and all the other things that need BPR.
My understanding is that btrfs uses a B+tree to locate the data, so the block being hashed would contain keys not physical locations. That is, there's an indirection.
That would make btrfs a CAS filesystem. The indirection hurts performance, which is why I recommend having a cache of physical locations, excluded from the Merkle tree, right where the pointers are. The indirection gets you dedup automatically, too.
The btrfs key isn't necessarily a hash of the content, just a unique ID, so it's not really a CAS. Btrfs can allow in-place overwrite of data, which would be impossible in a true CAS store.
I'm not aware of a great write-up of this, but btrfs uses a "chunk tree" to map extents of logical addresses to >=1 physical stripes.
I'm very partial to Merkle hash trees and CAS, because you can get some very good security properties out of them (e.g., having a single, small hash identify and secure enormous amounts of information, which then lends itself very well to things like measurement in TPMs for securing the boot process).
But it's true that it has some bad performance properties. The ZIL is essentially a way to amortize what would otherwise be very expensive copy-on-write tree transactions -- expensive because every interior node on the path to the block you're trying to write also needs a new write, so you get O(depth) write magnification, which means write performance becomes 1/depth of the raw storage write performance, which is awful. But the ZIL properly amortizes all those interior node writes, making it possible to do just one write of each of them for any number of leaf node writes that fit in the span of time between full transactions.
So a ZIL-like log is essential and makes CAS write performance tolerable.
My dream is to be able to use a TPM to hold a key for the whole zpool that can't be recovered unless you boot into a blessed dataset snapshot whose root hash is part of the TPM key unlock policy. Combine that with other bits of secure boot technology and remote attestation (the latter for enterprises, not individuals) and you'd get a pretty secure-against-physical-theft setup.
Because making such a change seems ETOOHARD, I don't expect anyone to be interested in making it in the OpenZFS community. It would take a fork, and no one will want that either.
Instead I think the OpenZFS community could maybe develop an automatic and transparent feature like what you do: internally "zfs send" a dataset, and then apply any transactions that took place at the origin while doing that, then lock the origin dataset, do one more round of transaction copying to the new dataset, then atomically swap the old and new, unlock, then destroy the old.
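A rough sketch of what that could look like scripted today, building on the same send/recv trick (names invented; the "swap" here is two renames, so it's not actually atomic):

    # initial bulk copy while the origin stays live
    zfs snapshot tank/data@sw1
    zfs send tank/data@sw1 | zfs recv tank/data.new

    # catch up on whatever changed during the bulk copy
    zfs snapshot tank/data@sw2
    zfs send -i @sw1 tank/data@sw2 | zfs recv -F tank/data.new

    # quiesce the origin, do a final catch-up, then swap and clean up
    zfs set readonly=on tank/data
    zfs snapshot tank/data@sw3
    zfs send -i @sw2 tank/data@sw3 | zfs recv -F tank/data.new
    zfs rename tank/data tank/data.old
    zfs rename tank/data.new tank/data
    zfs set readonly=off tank/data
    zfs destroy -r tank/data.old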
But I'm very curious what filesystem devs think of the CAS idea. It's not original, mind you. CAS is a well-researched topic. I'm dead certain there's nothing wrong with a CAS design for a filesystem -- the key is to make the logical -> physical block pointer mapping fast, which is why I'd have a cache right next to every interior block, with the cache excluded from the Merkle hash tree.
I wonder if a company like rsync.net would fund work to make that "zfs send and swap" thing automatic and transparent. Or fund a ZZFS, if the defragmentation need is critical for you... Not that I'm offering to do any of that work, mind you, but I am curious how much it would cost, and whether it'd be worth anyone's while.
I understand making this change in ZFS is an on-disk format-breaking (concept-breaking?) change, so not likely to happen. Hopefully newer filesystems (bcachefs) will have considered this in the design phase.
The on-disk format can evolve, and new features can be added. It being a breaking change is not a big deal as long as you don't mean to roll back to an earlier version of ZFS.
The problem with this change is not that it's breaking, but that it essentially forks too many code paths in ZFS to be worth doing.
Subsets of BPR have gone in...that's more or less what vdev evacuation does, for example.
I would be curious to see how practical it would be for someone to write something like this without requiring an entirely new pool or an offline conversion...
vdev evac is NOT BP rewrite. vdev evac simply plays a trick of a) copying a vdev to another, then b) making the new vdev's ID the same as the old so that all the physical block pointers that referred to the old vdev remain valid after the move. But this cannot support defragmentation because the block addresses within the vdev (old vdev and new vdev) cannot change.
The fundamental problem I described above is just too hard to overcome in ZFS. Therefore the only thing that can be done about problems like this is to find alternative solutions. vdev evac is one such solution to one of the problems BP rewrite was meant to solve. In a reply to a sibling comment of yours I outline a way in which ZFS could solve the defragmentation problem without a proper BP rewrite. Enough such solutions might make BP rewrite not necessary at all.
Okay, you're correct: it's not rewriting BPs on disk, it's adding an indirection table for them. But that's usually what's meant when discussing the topic, given the immutability constraint -- at least that's my understanding of how it worked when it was implemented (and not integrated) at Sun. And if you check the vdev evac talk, they talk about "implementing bits of BPR as needed".
I'm aware, but it's not BP rewrite. It's a workaround for not having BP rewrite. It's a fine workaround for not having BP rewrite. Enough such workarounds can make BP rewrite unnecessary -- that is what was expressed in that talk, but it's not that vdev evac == BP rewrite or that vdev evac == a bit of BP rewrite.
BP rewrite is still needed for defragmentation, and for other reasons too (like if you wanted to change the configuration of a zpool to have more mirrors or stripes, or if you wanted to change compression algorithms, etc).
(A proper CAS would not be able to handle things like changing compression algorithms, unfortunately, not unless the filesystem hashed the decompressed block rather than the compressed block, but that has security issues if you don't trust the storage, so it's not ideal. Of course, if you don't trust the storage then you need to be signing/MACing Merkle hash tree roots, but that's another story.)
I'm not a ZFS contributor or anything, but there are a lot of 'missing' features that would be nice but require moving or altering data that's stored on disk, so I think there's a general apprehension about that. ZFS is trying to be a lot of things, one of which is kind of archival-quality storage; fiddling with things that were stored just fine on disk makes that harder to do.
A list of 'missing' features that need data moving off the top of my head: defrag (since we're talking about it), shrinking partitions (has some overlap with defrag), compression of existing data, dedup of existing data, moving data between filesystems on the same pool without having to read the data and write it a second time. I think there's maybe one other gotcha I can't remember.
All these things might be nice, but are tricky to get right, and users can figure out other ways to do them, so there you go. Limited resources and all.
That one's only a little bad if you do it - you break uncompressed ARC with compressed L2ARC, nopwrite, and dedup, but all three of those are uncommon cases, and you can add helpful glue to make it work without breaking those with only a little work.
For example, we've had three different zlibs compressing gzip streams for a while, and nobody really complained about it. (Linux ships 1.1.x; FBSD and everyone else ships 1.2.x; Intel QAT produces identical output to neither.)
For a while, zstd was outputting different results on BE and LE systems and nobody noticed.
Most of the reason we haven't updated compressors nowadays is that nobody's convinced anyone it's got enough benefit to be worth the added complexity.
This requires you to have at least 50% free space, right?
At which point fragmentation usually isn't that big of a deal. But sure, if you add enough new storage it works.
It's more of a problem for individuals. That, and the fact that you can't increase the size of a vdev (it's in the works, but with lots of caveats), really dampens my enthusiasm as a hobbyist.
... we've not discussed this item yet but I am finalizing the most recent "notes" and will include it.
If you don't have busy filesystems filling beyond ~90% (used to be 80%...) then I wouldn't ever worry about this.
OTOH, if you have very busy filesystems with lots of different consumers and hundreds of millions (or billions) of inodes then a recent dataset might be scattered all over your zpool. There are absolutely performance ramifications to this.
So you expand that zpool (you have to expand it for this to work) by adding a vdev, and then you 'zfs send' that dataset onto the same zpool and it will be laid down nice and orderly onto the new free space that was added to the zpool.
That dataset will then be more performant. This is analogous to "defrag" as we think of it but, again, in a duct-taped-engineering kind of way.
Is there a way to subscribe to these technical notes postings? I (and I assume others) would be quite interested in being alerted when a new one is posted!
I think the best method would be to follow us (@rsyncnet) on twitter - the publication of these tech notes is one of the very few things we post on that platform ...
I think the author is mistaken about what's going on here.
It's not that ZFS takes a long time to update free space - it's that it actually doesn't necessarily free the space immediately.
Deletes over a certain size (by default 20480 blocks, so 10/80 MB for 512b/4k sectors, respectively, I believe) dump the thing to be deleted (assuming it's to be freed, dedup refcounting/snapshots/etc might mean it's not actually freed) on an async queue that gets worked through in the background, rather than blocking rm on it.
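On Linux that threshold is exposed as a module parameter, if I'm remembering the name right:

    # files larger than this many blocks get freed asynchronously (default 20480)
    cat /sys/module/zfs/parameters/zfs_delete_blocks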
(Otherwise, if the files are smaller than that, I think it's as the author says, with bundling things into a txg and only updating once they're flushed out. But the example was ~40 files totalling several GB, so I think it's as I describe above.)
Yes, there is this, and also the fact that it's hard to predict how much space you'll get back by deleting files because of snapshots and compression. Combine the two and managing free space is hard for users.
There's zfs wait, which can wait on the delete queue. OTOH, files that are deleted while still open will sit on the delete queue until they're closed, so that queue has no guarantee of progress while the system remains running.
A process to delete less important files to maintain free space that waited on the delete queue to drain before reassessing the situation would need something more sophisticated.
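Still, for the straightforward case it at least beats guessing at a sleep interval (dataset names invented):

    # delete candidates, then wait for the async frees to finish before re-checking
    rm -f /tank/logs/archive/*.log
    zfs wait -t deleteq tank/logs
    zfs get -Hp -o value available tank/logs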
I'm not aware of one, but don't think it would be terribly hard to implement. I don't know of anyone who has, probably because it's not really a thing that's broken for anyone in a way where that data point would be interesting...that I know of.
Could be wrong. Maybe I'll remember and go try it in a day or two, when I'm not at an event[1]...
I mean, 5-10 seconds for the free space to be updated doesn't sound extraordinary? With all the accounting ZFS has to do, doesn't this sound like something you would want running asynchronously in the background? The blog makes a point (with references) of ZFS being transactional, but, as a related matter, it's also COW; one would assume finding/guaranteeing "free" space is much harder than it would be on a traditional filesystem.
It's clearly not extraordinary, it's normal for ZFS. UFS2 on FreeBSD does this too; it also has snapshots, but isn't generally COW.
The first time you delete something big and it doesn't show up in df right away, it might be surprising though. Or this case, where automation deleted more than expected because the author wasn't aware of the need to wait.
I have seen the same behavior with BTRFS. We create nightly snapshots on our servers, then prune older snapshots. Watching 'top' output shows a 5-8 second delay until the BTRFS cleanup agent starts, and the agent can take many seconds (20?) to completely remove the old snapshots.
> I mean, 5-10 seconds for the free space to be updated doesn't sound extraordinary? With all the accounting ZFS has to do, doesn't this sound like something you would want running asynchronously in the background?
If the system is busy, sure, batching requests helps performance, especially on spinning rust.
In one that isn't? Why wait? Then again, checking space after every removed file (which I assume was the problem here) instead of "generate a list of files to remove, remove them, then go to sleep for an hour" is a bit suboptimal in the first place.
Using --track-bytes-deleted on ZFS won't really track the bytes deleted, just the size of the file that was deleted -- as the author points out, that may not be the number of bytes actually freed.
For my money, "sleep 10 before checking free-space" is a more robust solution.
The "sleep 10 seconds" solution isn't robust enough. Consider if you have created a snapshot on the filesystem. That snapshot has to keep hold of all the data, so when you delete files, the amount of free space doesn't change. A loop with "sleep 10 seconds" is likely to delete every single file in the filesystem in an attempt to get that desired amount of free space.
A very common thing to do with ZFS is to have automatically rotating snapshots, so that you can go and retrieve a file that you accidentally deleted or changed without having to go to the backups. In this case, the space is freed when the last snapshot containing those files is rotated away, which could be a day or a week later, depending on how it is set up.
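Before a free-space exercise it's worth checking how much is pinned by snapshots in the first place (dataset name invented):

    # space held by snapshots vs. the live filesystem
    zfs get usedbysnapshots,usedbydataset,available tank/logs

    # snapshots of that dataset, oldest first, with the space unique to each
    zfs list -t snapshot -o name,used,creation -s creation tank/logs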
It is a hack, but given the complexity and non-timeliness of what's going on behind the scenes, and the lack of control that the user has, perhaps a good hack is the right solution.
My choice would be a cron job that runs every few hours (or every day), calculates the space required, and deletes log files to that size. As long as the desired free space leaves an adequate margin, this would be robust and work in the presence of compressed files, snapshots, etc.
"Sleep 10" should be fine, but it's also a hack, and the author would need to consider how long it takes for cleared space to become available on a heavily loaded system. I would expect that under severe load, the clear-up lag could be very long.
> For my money, "sleep 10 before checking free-space" is a more robust solution.
And running more often. 50GB(?) at a time can back up any filesystem.
Not to mention, there is another option which perhaps better matches this use case.
And the author seems to be using the zfs quota property when the refquota property might better account for descendants and make an initial calculation of available free space easier?
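For example, something along these lines (dataset name invented) shows the difference at a glance:

    # quota counts the dataset plus descendants and snapshots;
    # refquota counts only the space referenced by the dataset itself
    zfs get quota,refquota,used,usedbysnapshots,available tank/logs
    zfs set refquota=500G tank/logs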
ZFS is great but absolutely has tradeoffs. It makes really good choices. Maybe this is one?
You can do the math yourself every loop instead of asking the OS. Ask how much free space is available once at the start, track how big each file is, delete. You can sleep and check free space again at the end as a sanity check; if the free space isn't going down nearly as much as you expected, report it to the user and inhibit future script executions, because it means there's something referencing the extents and your script is about to go on a rampage.
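A sketch of that, assuming a dataset named tank/logs, a 50G target, and GNU stat:

    #!/bin/sh
    TARGET=$((50 * 1024 * 1024 * 1024))   # bytes of free space we want
    FS=tank/logs
    [ -e /var/tmp/log-pruner.disabled ] && exit 0   # a previous run found something amiss

    avail=$(zfs get -Hp -o value available "$FS")
    for f in $(ls -tr "/$FS/"*.log); do    # oldest logs first
        [ "$avail" -ge "$TARGET" ] && break
        sz=$(stat -c %s "$f")              # file size; assumes nothing else references its blocks
        rm -f "$f"
        avail=$((avail + sz))
    done

    # sanity check: did reality keep up with our arithmetic?
    sleep 10
    real=$(zfs get -Hp -o value available "$FS")
    if [ "$real" -lt $((avail * 9 / 10)) ]; then
        echo "freed far less than expected (snapshots?); disabling further runs" >&2
        touch /var/tmp/log-pruner.disabled
        exit 1
    fi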
A bit more digging turns up `zdb -e mainpool -ul` (or maybe without the -e?), which includes a bunch of txg ids along with tons of other stuff. I'd think there ought to be a less noisy way to find that, but I don't know where to start looking.
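On Linux there's also a kstat that's much less noisy if all you want is txg numbers and timings (paths and names from memory, so treat as approximate):

    # per-pool txg history: txg id, birth time, state, sync duration, etc.
    cat /proc/spl/kstat/zfs/mainpool/txgs

    # or just the active uberblock, which carries the current txg
    zdb -u mainpool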
It seems to me that #3 at the bottom, snapshots holding onto a file, is the biggest or most common culprit of free space not getting freed on delete. Having a good snapshot management strategy is critical if you use the feature at all. And if you're using it lazily, just know that you should delete old snapshots when going through a free-space exercise.
It's not just keeping fragmentation in check; above a threshold ZFS switches allocation strategy. The "nearly full" allocator is typically slower, IIRC because it tries to find better-fitting free-space holes than the default allocator.
I can't recall if the threshold is 80% or 90% right now, but around there.
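Either way, it's easy to check where a pool stands relative to that threshold (column names as in recent-ish OpenZFS):

    # allocation, free space, percent-full and free-space fragmentation per pool
    zpool list -o name,size,allocated,free,capacity,fragmentation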
In theory, the dedicated log metaslab[1] feature should help avoid that sort of problem in the same way, I think, and is in 2.1, with no slog required.
I'd be curious if you're suggesting you're running something based on 2.1 and still finding you can run it fuller more reasonably.
Isn't the concept of deletes not being instantaneous a normal aspect of Unix file systems?
i.e., unlink a 1TB file that is in use on Linux on ext4. The file disappears immediately from the namespace, but the space isn't freed up. Close the program keeping that file open and the space will slowly be reclaimed. Similarly, due to unlink/ext4 semantics, if one unlinked that file without anyone keeping it open, the unlink() would block until the file was deleted, but I'm not sure that's really required by POSIX (i.e., the file could be opened by someone). Therefore I think one has to believe that the expectation on any file system is that unlink() returning doesn't mean the space has been freed up yet.
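Easy to demo on pretty much any Unix box (sizes and paths arbitrary):

    dd if=/dev/zero of=/var/tmp/big bs=1M count=1024   # make a 1 GiB file
    tail -f /var/tmp/big > /dev/null &                  # hold it open
    rm /var/tmp/big                                     # gone from the namespace...
    df -h /var/tmp                                      # ...but the space is still counted as used
    kill %1                                             # close the last open descriptor
    sleep 1; df -h /var/tmp                             # now the space comes back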
> But, of course, there is no defrag for ZFS and so we do it the trailer-park way - we add free space to a zpool by adding a vdev, and then we 'zfs send' datasets to ourselves on that same zpool, allowing ZFS to lay down those bytes in an efficient and orderly fashion ... as opposed to the inefficient way they were laid down over time.
This works.