
This is OT, but since you seem to know a fair bit about filesystem types and their trade-offs and have an opinion on sane patching behavior, I have a question you might be able to answer:

Why aren't modern filesystems based on a content-addressable store where the content is 100% separated from the organization of the filesystem itself? It seems to me like it would make more sense to have only one copy of a file ever saved, and when modifications are made you get a new pointer to the changed file. Obviously it wouldn't make sense to do a full copy on modification for files over a certain size, but in that situation the filesystem could offer an abstraction over a patch plus a pointer to the original file until the system is idle and a full copy can be made.

The reason I ask about this is because it would make updating anything and rolling things back trivial since it would simply be a pointer change in the file hierarchy hash map from the new file to the old file and back again. Furthermore, such a system would give you dedupe for free and files could be marked for deletion only once the last pointer to the file in the CAS has been deleted.



Plan 9 was designed this way. Except it's designed to never delete anything, so you don't have to worry about the garbage collection.

Camlistore and TahoeLAFS are both designed this way. But they are more of a content store than a file system.

In fact, BTRFS is implemented this way. The content and the metadata can be treated separately, with mirrors vs RAID. You can balance just the metadata or both data and metadata. When you convert an ext2/3/4 file system to BTRFS, it just sets up its own metadata pointing to the same blocks that ext used. You get a free snapshot of your data pre-conversion and it's all COW from then on. I believe the B-tree of BTRFS is the metadata, and the content never changes on disk until no more pointers to it exist. If you decide to go back, it just restores the superblock and the old data is still in the same spot. You only regain that space after you delete the snapshot and the last pointers to the data are gone.


Plan 9 had Venti, which was indeed a content-addressable file system. But it was not the main file system for day-to-day use; it was intended to store backups only.

Btrfs is not content-addressable, although everything else you say about it is correct.


ZFS, to take one example, does do approximately what you want - the file system (or rather, the underlying block storage) has a tree structure, and changes to files are propagated via copy-on-write rewrites up the tree, so that earlier snapshots still get to see the original file.

This does make rolling things back trivial.

It doesn't give you dedupe for free, though. Think about what would have to happen: every modification would mean rehashing the modified block (not a problem, that should happen anyway to verify integrity). Let's say we dedupe at the block level rather than the file level, to avoid the need for more expensive hashing operations and to increase the likelihood of actually sharing stuff. Now, to determine whether we can share an existing block instead of keeping the new one, we need to look up the hash. So we need an index of every block on the file system by hash. That necessarily involves either a big chunk of memory or a bunch of random I/O. Both are at a premium for a filesystem - the former for cache, the latter for throughput.
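
To make the cost concrete, here's a minimal in-memory sketch of that block-level write path in Python (dict stand-ins for the on-disk structures, names made up). The `index` map is the thing that needs one entry per unique block, which is exactly what eats the RAM or the random I/O:

    import hashlib

    BLOCK_SIZE = 128 * 1024            # dedupe at the block level, as above

    class BlockStore:
        def __init__(self):
            self.blocks = {}           # address -> data (stand-in for the disk)
            self.index = {}            # sha256 digest -> address: one entry per
                                       #   unique block; this is the costly table
            self.refcount = {}         # address -> number of pointers to it
            self.next_addr = 0

        def write_block(self, data):
            digest = hashlib.sha256(data).digest()
            addr = self.index.get(digest)
            if addr is not None:       # identical block already stored: share it
                self.refcount[addr] += 1
                return addr
            addr = self.next_addr      # new content: allocate a fresh block
            self.next_addr += 1
            self.blocks[addr] = data
            self.index[digest] = addr
            self.refcount[addr] = 1
            return addr

        def write_file(self, data):
            """Return the list of block addresses making up the file."""
            return [self.write_block(data[i:i + BLOCK_SIZE])
                    for i in range(0, len(data), BLOCK_SIZE)]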

If you just want file dedupe, it's a smaller problem, but it's less likely to create gains - most people don't store many copies of the same file, unless they're in the third-party file storage business. So it isn't really suited to a general file system. If this is something you want, you could periodically go through your file system, hash all files with a link count of one in the inode, and hard link them, using the hash as a file name, into a set of directories fanning out by hash prefix. There may be some wrinkles with permissions; I believe btrfs has a different kind of copy with copy-on-write semantics that might be useful here.
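
If you wanted to experiment with that offline approach, it's only a few lines of Python. A rough sketch - STORE is a made-up fan-out directory that has to live on the same filesystem as the tree, and it ignores the permission/ownership wrinkles mentioned above:

    import hashlib
    import os

    STORE = "/srv/dedup-store"    # hypothetical fan-out of hash-named hard links

    def file_hash(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def dedup_tree(root):
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                st = os.lstat(path)
                # only regular files that nothing else links to yet
                if not os.path.isfile(path) or os.path.islink(path) or st.st_nlink != 1:
                    continue
                digest = file_hash(path)
                target = os.path.join(STORE, digest[:2], digest[2:4], digest)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                if not os.path.exists(target):
                    os.link(path, target)             # first copy seen: register it
                elif not os.path.samefile(path, target):
                    tmp = path + ".dedup-tmp"         # duplicate: relink to stored copy
                    os.link(target, tmp)
                    os.replace(tmp, path)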


Deleting is not a time-sensitive operation anyway.

But I don't think you'd have this problem. Just like we have an inode listing every block of a file, a filesystem like that would need a similar structure listing every hash of the file's blocks. When you delete a file, you look at this structure the same way you would look at an inode.

The only thing missing is that you'll need a reference counter on the blocks. And this counter will create some synchronization problems that may turn out to be more significant than the disk space you save.
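
Roughly, something like this toy model (in-memory Python, made-up names): the per-file list of block hashes plays the role of the inode's block list, and the counter is what lets the last delete reclaim the block.

    import hashlib

    class CasToy:
        def __init__(self):
            self.blocks = {}     # hash -> block data (the "content" side)
            self.refcount = {}   # hash -> how many files still point at it
            self.files = {}      # path -> list of block hashes (inode-like)

        def write(self, path, chunks):
            if path in self.files:
                self.delete(path)
            hashes = []
            for chunk in chunks:
                h = hashlib.sha256(chunk).digest()
                if h not in self.blocks:
                    self.blocks[h] = chunk
                self.refcount[h] = self.refcount.get(h, 0) + 1
                hashes.append(h)
            self.files[path] = hashes

        def delete(self, path):
            for h in self.files.pop(path):
                self.refcount[h] -= 1
                if self.refcount[h] == 0:   # last pointer gone: reclaim it
                    del self.blocks[h]
                    del self.refcount[h]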


I haven't used FreeBSD for 5-6 years, so I've never tried to use it with root on ZFS. Wondering if it suffers the same problems as btrfs when used as root: last time I tried that on Ubuntu, the fsync() performance made a lot of stuff horribly slow (e.g. dpkg operations were two orders of magnitude slower).


I run FreeBSD with ZFS as root and it doesn't suffer from those issues. But then ZFS is quite significantly more mature than Btrfs anyway.

It's also worth noting that ZFS was available as a root file system long before FreeBSD added ZFS to RELEASE. I remember running OpenSolaris (and some of its forks, e.g. Nexenta) with ZFS root about 6 years ago. Possibly longer, actually.

The issue with ZFS as root was more of a problem with the boot menu than the OS. OpenSolaris used GRUB whereas FreeBSD obviously doesn't, so FreeBSD needs to either port their ZFS drivers to their bootloader, or employ a hacky method of having a UFS boot volume that then points to a ZFS root partition (which, sadly, is how FreeBSD currently works).

Interestingly, since GRUB is GPL, it meant that technically there were GPL ZFS drivers even before Btrfs started life (never mind the various Linux ports of CDDL-licenced ZFS drivers that have appeared since), albeit those GPL ZFS drivers were read-only.


> FreeBSD needs to either port their ZFS drivers to their bootloader, or employ a hacky method of having a UFS boot volume that then points to a ZFS root partition (which, sadly, is how FreeBSD currently works).

A ZFS-aware loader hit CURRENT in late 2008, and a dedicated zfsloader for use from (gpt)zfsboot hit the stable branches in late 2009.


AFAIK it's not in RELEASE yet though.


They were in 7.3-RELEASE, March 2010.

https://www.freebsd.org/releases/7.3R/announce.html


Just looked at the config on my file server and it turns out I'm booting FreeBSD in this way. I have no idea why I thought I was bootstrapping from UFS (possibly because I still had to manually create a boot GPT partition and have since forgotten why?)

Anyhow, thank you for the correction :)


I don't know when this was fixed, but I no longer have any problems with btrfs on root with Ubuntu. I don't remember when it started working well, but I do remember those bad old days. I don't know if dpkg changed or if btrfs changed.


cp --reflink


One of the few features btrfs has which ZFS doesn't.


At this point, I don't think it's fair to imply that btrfs is lagging behind ZFS. Yes, there are quite a few things that ZFS does better than btrfs, but btrfs isn't following in ZFS's footsteps and has some killer features that ZFS will never have, like on-the-fly changing between RAID modes and resizing arrays in either direction.


Btrfs has a broken-by-design on-disk format. Btrfs went with self-describing checksums inside blocks instead of a Merkle DAG with a round-robin root. In doing so they ignored existing research at the start of the btrfs project, and the only fix is to change the on-disk format to add the children's checksums to the parent nodes.


That doesn't sound "broken by design" to me, just less thoroughly safeguarded than ZFS but still more than almost any other filesystem. It's completely consistent with my claim that btrfs isn't trying to follow exactly in the footsteps of ZFS. (And it's not like Merkle trees don't have any tradeoffs.)


ZFS would need block pointer rewrites to implement those features, right? I don't think ZFS developers are opposed to that, but progress is just stalled. So I think "will never have" is a bit strongly worded.


https://www.youtube.com/watch?v=G2vIdPmsnTI#t=44m53s

The above video is an explanation of a bunch of the barriers to implementing block pointer rewrites. The conclusion is that it would make the code a lot more complicated and break a lot of the layering, and probably make addition of other new features a lot harder. Even a standalone offline rewriting tool wouldn't necessarily be accepted into the OpenZFS codebase because of the maintenance burden. Their advice is that if you think you need that feature to solve a particular problem, you should be looking for a workaround to solve that problem without requiring the huge block pointer rewrite project (which nobody's working on), even if the workarounds have a significant and permanent performance impact.

When you take into account how long the feature's been in demand and been on the roadmaps under "eventually", it's clearly not going to happen anytime soon and won't happen without a major change to how ZFS development is being done. It's not definitely impossible, but it's perpetually several years away from happening. With btrfs already having its equivalent to that feature and stealing an ever-growing slice of the users who need it, it's probably never going to happen for ZFS.


A feature similar to this was discussed at the OpenZFS Developers Summit earlier this month:

http://open-zfs.org/w/images/7/71/Fast_File_Cloning-Pavel_Za...


That's called copy-on-write (CoW) and is supported by a few modern filesystems (including ZFS). In fact ZFS does do some very basic deduping in the way you suggest (i.e. you copy a file instead of moving it, and ZFS will just issue a pointer).

However, full deduplication could never be free, simply because of the overhead of keeping a table of all the duplicated data and scanning new content for duplicates.


And ZFS deduplication consumes an absolutely frightening amount of RAM: 5GB of RAM per 1TB of storage is recommended.
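
That number is essentially the size of the dedup table. The usual back-of-the-envelope (assumed figures: roughly 320 bytes of RAM per in-core DDT entry, 64K average block size) works out to exactly that:

    ddt_entry = 320                    # assumed bytes of RAM per unique block
    avg_block = 64 * 1024              # assumed average block size
    pool      = 1 * 2**40              # 1 TiB of unique data
    blocks    = pool // avg_block      # ~16.8 million blocks
    print(blocks * ddt_entry / 2**30)  # -> 5.0 (GiB of dedup table)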


Oh totally. ZFS gets a lot right but even ZFS can't make deduplication practical for all bar a few fringe cases.


That's not much. Our baseline config HP DL380s come with 64GiB RAM (!!)


File servers should really have looooots of RAM. After all, it's cheap cache (compared to a Fusion-io card). Our file servers have 384 gigs of RAM. The next gen will probably have much more.


Copy-on-write is not the same thing as content-addressable.

ZFS has deduplication as an optional feature, which is implemented as a content-addressable store of filesystem blocks. In contrast to e.g. git, the content-addressable aspect is an implementation detail that is not exposed to users.

http://blogs.sun.com/bonwick/entry/zfs_dedup


> Copy-on-write is not the same thing as content-addressable.

I know that. But CoW does cover a few points raised by the previous poster.

> ZFS has deduplication as an optional feature, which is implemented as a content-addressable store of filesystem blocks. In contrast to e.g. git the content-addressable aspect is an implementation detail that is not exposed to users.

I know what dedup is and how ZFS utilises it (I've been running ZFS for about 8 years now - I'm quite familiar with it).

If you read my post again, you'll see I was discussing two separate points: 1) that CoW file systems do provide the pointer-like methods the former commenter raised. And 2) deduping isn't free.

What you're arguing with me about is semantics, and if you read the earlier post again, you'll understand why I chose the language I did.


I'm not an FS expert and others have answered your question much better, but I don't want you to think I'm ignoring you, so I'll talk about this part instead:

> the filesystem could offer an abstraction over a patch and pointer to the original file

I've implemented a delta-based patching system before: the idea is that, given two binary buffers (ostensibly files), you encode the differences between them. I have no idea how Xdelta (VCDIFF-based) and bsdiff manage to do this at reasonable speeds (their code is too much for me to understand), but the best I could manage was O(n^2). It's a very complicated and difficult thing to do efficiently, but indeed it can result in massive space savings. You could cheat a bit by intercepting fwrite calls, but that won't catch insertions or deletions, nor programs that overwrite files instead of updating them in place (e.g. most of them). As you can guess, to really make something like this efficient would require program authors to rethink how they write to files, which is unlikely to ever gain traction.
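
For what it's worth, a toy version of the idea fits in a few lines of Python using difflib - which is also roughly quadratic in the worst case, so nowhere near Xdelta/bsdiff, but it shows the shape of the thing:

    import difflib

    def make_delta(old, new):
        """Encode `new` as copy/insert ops against `old` (naive and slow)."""
        ops = []
        sm = difflib.SequenceMatcher(None, old, new, autojunk=False)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == 'equal':
                ops.append(('copy', i1, i2 - i1))   # reuse a run of old bytes
            elif j2 > j1:                           # replace or insert
                ops.append(('insert', new[j1:j2]))  # literal new bytes
        return ops

    def apply_delta(old, ops):
        out = bytearray()
        for op in ops:
            if op[0] == 'copy':
                _, offset, length = op
                out += old[offset:offset + length]
            else:
                out += op[1]
        return bytes(out)

    old = b"the quick brown fox jumps over the lazy dog"
    new = b"the quick red fox jumps over the lazy dogs"
    assert apply_delta(old, make_delta(old, new)) == new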

What I'd really like to see, but know we'll never get thanks to xkcd.com/927, is a portable metadata system for files. Instead of relying on file extensions or unreliable magic-byte header detection, you'd have the MIME type included in this metadata. Along with that, the displayed file name, which could contain any characters (sans maybe the path separator), the file attributes (read/write/exec/user/group/owner/hidden), creation+modification+access times, etc. And then when you send a file through your e-mail client or FTP it up to some web server, it would copy the metadata over along with it. So when you moved your file from extfs over to NTFS, and then copied that over to ZFS, you wouldn't lose all of that metadata.

But good luck getting all file system authors and browser vendors to agree to a common format to transparently wrap files with.
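
Even something as small as a JSON sidecar generated at send time would cover most of what I mean. A made-up sketch - nothing standard about the field names:

    import json, mimetypes, os, stat

    def sidecar_for(path):
        """Hypothetical portable-metadata wrapper for a file."""
        st = os.stat(path)
        meta = {
            "display_name": os.path.basename(path),
            "mime_type": mimetypes.guess_type(path)[0] or "application/octet-stream",
            "mode": stat.filemode(st.st_mode),   # e.g. "-rw-r--r--"
            "uid": st.st_uid,
            "gid": st.st_gid,
            "mtime": st.st_mtime,
            "atime": st.st_atime,
        }
        return json.dumps(meta, indent=2)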


The basis for transferring file metadata is already there in zip in the form of -V for VMS file system metadata. This works well and can be used to transfer files via intermediate platforms which don't support that metadata.

It's not as seamless as one might wish; you do have to use zip, do the transfer, and then unzip on the target, and of course remember to use -V in the first place.

FTP implementations also exist in the VMS world for transferring the file metadata, but I haven't used these myself so can't comment on their usability in practice.


Indeed, ZIP gets us part of the way there, and also handles the issue of transferring more than one file in one download request.

The downside to ZIP is that the information is cast away once you decompress it. If your file system doesn't maintain that metadata (e.g. Windows and file permissions), then it's just gone.

You also can't really just keep your files in ZIP, because operating systems don't natively integrate transparent ZIP support, and you probably don't want the file compressed in most cases (your app may want to read it often, and not want to pay the cost of decompressing the whole thing).

But that is very close to what I am seeking, yes.


openSUSE uses btrfs's copy-on-write behavior to allow rollbacks. Any time you use the package management system to change system state on a machine using btrfs, the package manager snapshots the filesystem before (and I believe after) the install or remove. Due to COW this effectively only costs the space for new versions of old files, and snapshots can be cleared at will. It's a little weird to get used to, but it's actually a far more powerful "Oh Shit" button than package-level rollback - a bad post-install script in a package that trashes /usr can usually be recovered from easily if you read the docs and still have snapper (or can get a copy of it). Unfortunately RDBMSes don't usually play nice with copy-on-write filesystems, so it's much harder to do rollback there.


Thanks for everyone's answers. I now have a bunch of reading/learning for the weekend.



