How does btrfs do it? AFAIK it doesn't use content-addressing either but it does...

yencabulator · on Oct 24, 2022

My understanding is that btrfs uses a B+tree to locate the data, so the block being hashed would contain keys not physical locations. That is, there's an indirection.

cryptonector · on Oct 24, 2022

That would make btrfs a CAS filesystem. The indirection hurts performance, which is why I recommend having an ex-Merkle cache of physical locations right where the pointers are. The indirection gets you dedup automatically, too.

yencabulator · on Oct 24, 2022

The btrfs key isn't necessarily a hash of the content, just a unique ID, so it's not really a CAS. Btrfs can allow in-place overwrite of data, which would be impossible in a true CAS store.

I'm not aware of a great write-up of this, but btrfs uses a "chunk tree" to map extents of logical addresses to one or more physical stripes.

https://github.com/btrfs/btrfs-dev-docs/blob/master/tree-ite...

https://btrfs.wiki.kernel.org/index.php/On-disk_Format

https://btrfs.wiki.kernel.org/index.php/Data_Structures
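To make the indirection concrete, here is a minimal Python sketch of a logical-to-physical lookup of the kind described above. It is not btrfs's actual on-disk format or code: `Chunk`, `ChunkMap`, and `resolve` are hypothetical names, a sorted list stands in for the B-tree, and each "stripe" is treated as a plain mirrored copy for simplicity.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    logical_start: int  # first logical byte covered by this chunk
    length: int         # number of logical bytes covered
    stripes: tuple      # (device_id, physical_start) pairs, one per copy


class ChunkMap:
    """Toy stand-in for a chunk tree: a sorted list searched with bisect."""

    def __init__(self, chunks):
        self._chunks = sorted(chunks, key=lambda c: c.logical_start)
        self._starts = [c.logical_start for c in self._chunks]

    def resolve(self, logical):
        """Return every (device_id, physical_offset) backing a logical address."""
        i = bisect.bisect_right(self._starts, logical) - 1
        if i < 0:
            raise KeyError(logical)
        chunk = self._chunks[i]
        offset = logical - chunk.logical_start
        if offset >= chunk.length:
            raise KeyError(logical)  # address falls in a hole between chunks
        return [(dev, phys + offset) for dev, phys in chunk.stripes]


chunk_map = ChunkMap([
    Chunk(0, 1024, ((1, 4096),)),               # single copy on device 1
    Chunk(1024, 1024, ((1, 8192), (2, 8192))),  # mirrored on devices 1 and 2
])
print(chunk_map.resolve(1030))
```

Anything that stores only the logical address (a checksummed parent block, say) keeps working if the data is moved on disk, because only the map entry has to change.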

cryptonector · on Oct 24, 2022

I'm very partial to Merkle hash trees and CAS. The reason is that you get some very good security properties out of them (e.g., a single, small hash can identify and secure enormous amounts of information, which lends itself very well to things like measurement in TPMs for securing the boot process).
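As a rough illustration of the "single, small hash" point, here is a sketch of a binary Merkle root in Python. It is not how ZFS or any particular filesystem lays out its tree: the function name `merkle_root`, the choice of SHA-256, the leaf/interior prefixes (borrowed from the RFC 6962 style), and the handling of an odd node are all assumptions made for the example.

```python
import hashlib


def merkle_root(blocks):
    """Return a 32-byte root hash covering every block in order."""
    if not blocks:
        raise ValueError("need at least one block")
    # Prefix leaves and interior nodes differently so that a leaf can
    # never be passed off as an interior node.
    level = [hashlib.sha256(b"\x00" + b).digest() for b in blocks]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2:
            nxt.append(level[-1])  # odd node out is promoted unchanged
        level = nxt
    return level[0]


blocks = [b"block-%d" % i for i in range(5)]
root = merkle_root(blocks)
tampered = merkle_root(blocks[:2] + [b"evil"] + blocks[3:])
print(root.hex())
print(root != tampered)
```

Changing any one block changes the root, so the root is the only value that has to be stored somewhere trusted.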

But it's true that it has some bad performance properties. The ZIL is essentially a way to amortize what would otherwise be very expensive B-tree transactions. They are expensive because every interior node on the path to the block you're writing also has to be rewritten, so you get O(depth) write amplification, which means write throughput drops to roughly 1/depth of what the storage can do, which is awful. The ZIL amortizes all those interior node writes: each of them is written just once for any number of leaf node writes that fit in the time between full transactions.

So a ZIL-like log is essential and makes CAS write performance tolerable.
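The arithmetic can be sketched with a toy model in Python. This is not ZFS code: it assumes a complete tree with a fixed fanout, counts one write per node, and the names `path_to_root`, `writes_unbatched`, and `writes_batched` are made up for the example. The batched count covers only tree nodes; the log appends (one per update, sequential) come on top of it.

```python
def path_to_root(leaf_index, fanout, depth):
    """Yield (level, index) for a leaf and each ancestor; level 0 is the leaf."""
    idx = leaf_index
    for level in range(depth):
        yield (level, idx)
        idx //= fanout


def writes_unbatched(leaves, fanout, depth):
    """Every update rewrites its whole leaf-to-root path immediately."""
    return sum(1 for leaf in leaves for _ in path_to_root(leaf, fanout, depth))


def writes_batched(leaves, fanout, depth):
    """Updates are logged first; each dirty tree node is written once per batch."""
    dirty = set()
    for leaf in leaves:
        dirty.update(path_to_root(leaf, fanout, depth))
    return len(dirty)


updates = list(range(16))  # 16 adjacent leaves, tree of 4 levels, fanout 4
print(writes_unbatched(updates, fanout=4, depth=4))  # 16 paths * 4 nodes = 64
print(writes_batched(updates, fanout=4, depth=4))    # 16 + 4 + 1 + 1 = 22
```

The gap widens when the same leaves are rewritten several times within one batch, since the unbatched count grows with every update while the dirty set does not.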

My dream is to use a TPM to hold a key for the whole zpool that can't be recovered unless you boot into a blessed dataset snapshot whose root hash is part of the TPM key unlock policy. Combine that with other bits of secure boot technology and remote attestation (the latter for enterprises, not individuals) and you'd get a setup that's pretty secure against physical theft.

cryptonector · on Oct 24, 2022

I don't know. I've not looked at it.