
How does btrfs do it? Afaik it doesn't use content-addressing either but it does have defrag, on-demand dedup, rebalance and all the other things that need BPR.



My understanding is that btrfs uses a B+tree to locate the data, so the block being hashed would contain keys not physical locations. That is, there's an indirection.


That would make btrfs a CAS filesystem. The indirection hurts performance, which is why I recommend having an ex-Merkle cache of physical locations right where the pointers are. The indirection gets you dedup automatically, too.


The btrfs key isn't necessarily a hash of the content, just a unique ID, so it's not really a CAS. Btrfs can allow in-place overwrite of data, which would be impossible in a true CAS store.

I'm not aware of a great write-up of this, but btrfs uses a "chunk tree" to map extents of logical addresses to >=1 physical stripes.

https://github.com/btrfs/btrfs-dev-docs/blob/master/tree-ite...

https://btrfs.wiki.kernel.org/index.php/On-disk_Format

https://btrfs.wiki.kernel.org/index.php/Data_Structures
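
To make the indirection concrete, here is a toy sketch of a logical-to-physical map. This is not btrfs code: the names (`Chunk`, `ChunkMap`, `resolve`, `relocate`) are made up for illustration, and each "stripe" is treated as a full mirror of the chunk, which is a simplification. The point is only that anything holding a logical address keeps working after the data moves, because only the map entry changes.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    logical_start: int
    length: int
    stripes: tuple  # hypothetical: (device_id, physical_start) pairs, one per copy


class ChunkMap:
    """Toy stand-in for a chunk tree: a sorted list instead of a B-tree."""

    def __init__(self):
        self._starts = []   # sorted logical start addresses
        self._chunks = []   # chunks, kept in the same order as _starts

    def add(self, chunk):
        i = bisect_right(self._starts, chunk.logical_start)
        self._starts.insert(i, chunk.logical_start)
        self._chunks.insert(i, chunk)

    def resolve(self, logical):
        """Return every (device_id, physical_address) holding this logical address."""
        i = bisect_right(self._starts, logical) - 1
        if i < 0:
            raise KeyError(logical)
        chunk = self._chunks[i]
        offset = logical - chunk.logical_start
        if offset >= chunk.length:
            raise KeyError(logical)
        return [(dev, phys + offset) for dev, phys in chunk.stripes]

    def relocate(self, logical_start, new_stripes):
        """Move a chunk (as defrag or rebalance would); logical addresses stay valid."""
        i = self._starts.index(logical_start)
        old = self._chunks[i]
        self._chunks[i] = Chunk(old.logical_start, old.length, tuple(new_stripes))


cmap = ChunkMap()
cmap.add(Chunk(0, 100, ((1, 5000),)))
cmap.add(Chunk(100, 50, ((1, 9000), (2, 9000))))
before = cmap.resolve(10)
cmap.relocate(0, [(3, 0)])
after = cmap.resolve(10)
```

Here `before` and `after` differ, but the logical address 10 that a parent block would store (and checksum) never changed.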


I'm very partial to Merkle hash trees and CAS because you can get some very good security properties out of them (e.g., a single, small hash can identify and secure enormous amounts of information, which lends itself very well to things like measurement in TPMs for securing the boot process).
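
A minimal sketch of the "single, small hash" point, assuming SHA-256 and a simple binary tree. The `merkle_root` function and the leaf/interior prefixes are my own illustrative choices, not any filesystem's on-disk format.

```python
import hashlib


def merkle_root(blocks):
    """Return a 32-byte root hash covering every block in the list."""
    if not blocks:
        return hashlib.sha256(b"").digest()
    # Prefix leaves and interior nodes differently so one can't pass for the other.
    level = [hashlib.sha256(b"\x00" + block).digest() for block in blocks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                pair = b"\x01" + level[i] + level[i + 1]
                next_level.append(hashlib.sha256(pair).digest())
            else:
                # Odd node out: promote it unchanged to the next level.
                next_level.append(level[i])
        level = next_level
    return level[0]


blocks = [bytes([i]) * 4096 for i in range(5)]
root = merkle_root(blocks)

tampered = list(blocks)
tampered[3] = b"x" + tampered[3][1:]   # change one byte in one block
tampered_root = merkle_root(tampered)
```

`root` stays 32 bytes no matter how many blocks there are, and `tampered_root` differs from it, which is the property you'd want to measure into a TPM.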

But it's true that it has some bad performance properties. The ZIL is essentially a way to amortize what would otherwise be very expensive b-tree transactions. They are expensive because every interior node on the path to the block you're trying to write also needs a new write, so you get O(depth) write magnification, which means write performance drops to roughly 1/depth of raw storage write performance, which is awful. But the ZIL properly amortizes all those interior node writes, making it possible to do just one write of each of those for any number of leaf node writes that fit in the time between full transactions.

So a ZIL-like log is essential and makes CAS write performance tolerable.
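
The amortization argument above can be checked with a toy count. This is only a model, not ZFS: the tree shape (`fanout`, `depth`) and the function names are assumptions, and the cost of writing the log records themselves is not counted.

```python
def ancestors(leaf, fanout, depth):
    """Interior nodes on the path from a leaf to the root, as (level, index) pairs."""
    return [(level, leaf // fanout**level) for level in range(1, depth + 1)]


def writes_unbatched(leaves, fanout, depth):
    """Every leaf write is its own transaction and rewrites its whole path."""
    return sum(1 + len(ancestors(leaf, fanout, depth)) for leaf in leaves)


def writes_batched(leaves, fanout, depth):
    """All leaf writes land in one transaction; each dirty interior node is written once."""
    dirty = set()
    for leaf in leaves:
        dirty.update(ancestors(leaf, fanout, depth))
    return len(set(leaves)) + len(dirty)


leaves = range(16)
unbatched = writes_unbatched(leaves, fanout=4, depth=2)   # 16 leaves * (1 + 2) = 48
batched = writes_batched(leaves, fanout=4, depth=2)       # 16 leaves + 4 + 1 = 21
```

With more leaves per transaction and a deeper tree the gap widens, since the interior-node cost is shared across every leaf under the same ancestor.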

My dream is to be able to use a TPM to hold a key for the whole zpool that can't be recovered unless you boot into a blessed dataset snapshot whose root hash is part of the TPM key unlock policy. Combined with other bits of secure boot technology and remote attestation (the latter for enterprises, not individuals), you'd get a setup that's pretty secure against physical theft.


I don't know. I've not looked at it.



