
I don't think the conventional wisdom of building on top of filesystems exists. In distributed systems you naturally gravitate towards using raw storage devices instead of filesystems; it becomes obvious very early on that filesystems suck too much and only create problems. And it's the same with all the embedded database libraries: you really want to write your own, because none of the existing ones were made to address the performance and operational problems that arise even in small distributed systems. But early on you don't yet know most of the problems and don't want to invest time implementing something you don't yet understand well enough, so you end up building on top of filesystems and embedded databases, making plenty of poor choices, and learning from your mistakes.


As I pointed out to the authors, this wheel has turned a couple of times. In the late 90s, many distributed filesystems (and most cluster filesystems) used raw disks and their own format. This was a burden, both for the developers who had to maintain an entire low-level I/O stack in addition to the distributed parts, and for the users who had to learn new tools to deal with these "alien" disks in their system (which also limited deployment flexibility). Thus, when the current crop - e.g. Ceph, Gluster, PVFS2 - came around, they went toward a more local-FS-based approach. All of the issues mentioned in the paper were still real, but on the hardware of the time (both disks and networks) those weren't the bottlenecks anyway, so the convenience was worth it. Now the tradeoffs have shifted again, and so have the solutions.

Context: I've been an originator/maintainer for multiple projects in this space, and currently work on a storage system where we add space in bigger increments than Ceph's entire worldwide installed base (according to numbers in the paper).


The filesystem as an abstraction has been getting long in the tooth for a long time -- it's taken the industry a long time to recognize this, but that's the way it is with filesystems, because most people prefer a boring filesystem that never munches their data to an interesting one.

Some evidence:

* The proliferation of Linux filesystems such as ext4, XFS, ZFS, btrfs, reiserfs, JFS, JFFS, bcachefs, etc. If any of those filesystems were truly adequate there wouldn't have to be so many.

* Microsoft's failure to replace NTFS with ReFS. (ReFS interacts with Storage Spaces in such a way that it will never be reliable.)

* Microsoft giving up on the old WSL and replacing it with a virtualized Linux kernel, partly because metadata operations on NTFS are terribly slow compared to Linux filesystems; people don't usually notice until they try to use an NTFS volume as if it were an ext4 volume.

* The popularity of object stores, systems like Ceph, S3, etc.

* Numerous filesystems promising transaction support and then backing away from it (btrfs as mentioned in the article, NTFS, exFAT, etc.)

* Proliferation of APIs to access filesystems more efficiently. On one hand there are async I/O interfaces for files which aren't nearly as solid as async I/O for networks (a sketch of one follows this list); on the other hand there is mmap, which can cause your thread to block not only when you make an I/O call but later, when you access the mapped memory.

* Recently, some filesystem-based APIs that address the real issues (e.g. pwrite, pread), though astonishingly late in the game.
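
To make the async-I/O point concrete, here's a minimal sketch of reading a file through io_uring via liburing. The file name and buffer size are made up for illustration, and a real program would submit the read and keep doing other work instead of waiting immediately.

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        struct io_uring ring;
        if (io_uring_queue_init(8, &ring, 0) < 0)          /* 8-entry submission queue */
            return 1;

        int fd = open("data.bin", O_RDONLY);               /* hypothetical file */
        if (fd < 0)
            return 1;

        char buf[4096];
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);   /* 4 KiB at offset 0 */
        io_uring_submit(&ring);                            /* hand the request to the kernel */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);                    /* a real program would do useful work here */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }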


I don't quite disagree with your conclusion, but some of your supporting points don't really support it.

> If any of those filesystems were truly adequate there wouldn't have to be so many.

Diversity is not bad. Different filesystems can have different design points, e.g. read-heavy vs. write-heavy workloads, optimized for different kinds of media, realtime requirements (traditionally an XFS strength) etc. That's a good thing. What's a problem is that there are too many filesystems trying to be all things to all users of a too-general API, and too many FS developers competing with each other on long-irrelevant microbenchmarks instead of making useful progress.

> Microsoft's failure

...is Microsoft's.

> Recently ... pwrite, pread

Those have been standardized since 1998, and I'm pretty sure I remember them existing on some systems before that. The fact that they're not recent is IMO the real problem. There have been attempts to create useful new abstractions such as the setstream API from CCFS[1], but they're too few and rarely gain traction.
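
For anyone who hasn't used them, a minimal sketch of what pread buys you: reads at explicit offsets, with no shared file position and no lseek, so concurrent readers on one descriptor don't race. The file name and offsets are made up.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("data.bin", O_RDONLY);           /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        char a[512], b[512];
        ssize_t ra = pread(fd, a, sizeof a, 0);        /* read 512 bytes at offset 0 */
        ssize_t rb = pread(fd, b, sizeof b, 1 << 20);  /* read 512 bytes at offset 1 MiB */
        printf("got %zd and %zd bytes\n", ra, rb);

        close(fd);
        return 0;
    }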

Personally, I think local filesystems need to adapt in a bunch of ways to be a better substrate for higher-level applications - including but not limited to distributed filesystems. OTOH, I don't think putting complex transactions at that level is a good choice. There's enough innate complexity there already. Adding transactions there just increases inertia, and if filesystems offered better control over ordering/durability the transaction parts could be implemented in a layer above.

[1] https://www.usenix.org/system/files/conference/fast17/fast17...
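
To illustrate the ordering/durability point above: today the only portable way to order writes is to flush between them, which is exactly what a transaction layer built above the filesystem ends up doing. A minimal sketch of the write-then-commit-record pattern, with hypothetical file names:

    #include <fcntl.h>
    #include <unistd.h>

    /* Write the payload, make it durable, and only then write and flush a small
     * commit record. A crash before the commit record is flushed leaves no
     * commit record, so recovery ignores the partial payload. */
    int commit_update(const char *payload, size_t len) {
        int data_fd = open("journal.data", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (data_fd < 0) return -1;
        if (write(data_fd, payload, len) != (ssize_t)len) { close(data_fd); return -1; }
        if (fdatasync(data_fd) != 0) { close(data_fd); return -1; }     /* payload durable first */
        close(data_fd);

        int commit_fd = open("journal.commit", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (commit_fd < 0) return -1;
        if (write(commit_fd, "COMMIT\n", 7) != 7) { close(commit_fd); return -1; }
        if (fdatasync(commit_fd) != 0) { close(commit_fd); return -1; } /* commit point */
        close(commit_fd);
        return 0;
    }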


> The proliferation of Linux filesystems such as ext4, XFS, ZFS, btrfs, reiserfs, JFS, JFFS, bcachefs, etc. If any of those filesystems were truly adequate there wouldn't have to be so many.

First of all, ZFS is not a “Linux” filesystem. Second, the available choice of filesystems is a strange thing to use as justification that previously existing filesystems are “inadequate”. By what criteria are you establishing adequacy? The success of a technology often depends not solely on its technical excellence but on a variety of factors that may have nothing to do with the technology itself. (Betamax vs. VHS, etc.)


In theory people would pick the filesystem which is ideal for their application.

In practice it is a lot of work to research the choices, and there's a high risk that you'll discover something wrong with your filesystem only when it is too late.

It's one thing to pontificate for and against particular filesystems; it's another to use one for years, across terabytes, etc. ZFS might scrub your data to protect against bitrot, but I remember reading harrowing tales from ZFS enthusiasts who were recovering (or not recovering) from wrecks every week and seemed to think it was a lot of fun, or conferred status on them, or was otherwise a good thing.

I stuck with ext4 for a long time before finally building a server that uses ZFS on HDDs for media storage (i.e. not a lot of random access).

I remember when a project I was involved with chose reiserfs because they thought it was "better", and then they were shocked when, once in a while, the system would crash and we found that a file that had just been created was full of junk.

That's a trade-off the reiserfs developers made: they decided it was important to journal filesystem metadata (so the length of the file is right) but not to protect the content. If the project had read the docs, really thought about them, and understood them, they would have known, but they didn't.
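
For what it's worth, the usual application-level defense against that failure mode is the write-temp-then-rename dance: flush the new contents before atomically renaming over the old file, so after a crash you see either the old file or the complete new one, never a correctly-sized file full of junk. A minimal sketch, with hypothetical names:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *path, const char *data, size_t len) {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);     /* temp name next to the target */

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, data, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) != 0) { close(fd); return -1; }  /* contents durable before rename */
        close(fd);

        return rename(tmp, path);                      /* atomic replace of the old file */
    }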

This book points out that in cases where there is too much competition, you can switch all you like between flawed alternatives, but have no way to communicate what you really want:

https://en.wikipedia.org/wiki/Exit,_Voice,_and_Loyalty

And when it comes to the filesystem API, probably anybody who has special needs for filesystem performance would find that a different API than the standard one would be a boon.


> In practice it is a lot of work to research the choices ...

More is not always better:

* https://en.wikipedia.org/wiki/The_Paradox_of_Choice


I don't disagree with the general premise that a single filesystem might not be applicable or appropriate to all cases, or that existing filesystem APIs are generally deficient.

My primary issue was with the two specific assertions I addressed: the one about a given filesystem's origins, and the one about choice being used as a proxy for evaluating adequacy.

As for ZFS enthusiasts "recovering from wrecks every week", I suspect you're specifically referring to ZFS on Linux or one of the BSDs -- which is not the same as ZFS when used in its original environment -- Solaris.


No, it was on Solaris back when ZFS was new.

It seemed like these people enjoyed having wrecks, like the Plebe who enjoyed getting hazed in A Sense of Honor:

https://www.amazon.com/Sense-Honor-Bluejacket-Books/dp/15575...


So because a filesystem had issues 20 years ago when it was new... Do you still drive a car with a carburetor and drum brakes?

ZFS is now incredibly stable and durable, with the exception of some of the early non-production ZFS on Linux work that is now fixed (and was specifically billed as not for production use). It has seen me through issues that other filesystems would have failed on, including drive failures, hard shutdowns, a bad RAM module, a SAS card being fried by a CPU water cooler, etc. Years and terabytes just on my systems, zero issues.

In fact, one of the tests that Sun did back in the day was to write huge amounts of data to a NAS and pull the power cord mid-write, then repeat that a few thousand times. It never corrupted the filesystem.


Needs citation. I was in the very earliest crop of ZFS users (ca. 2003), and then went on to build a storage product based on ZFS (ca. 2006) and then a cloud based on ZFS (ca. 2010) and an object storage system based on ZFS (ca. 2012) -- and ran it all up until essentially present day. I have plenty of ZFS scars -- but none of them involve lost data, even from the earliest days...


> * The proliferation of Linux filesystems such as ext4, XFS, ZFS, btrfs, reiserfs, JFS, JFFS, bcachefs, etc. If any of those filesystems were truly adequate there wouldn't have to be so many.

While I get your point, I would like to point out that ZFS was developed by Sun (now Oracle). I've used ZFS for years from a data-integrity and array-mirroring perspective and love it. None of the other filesystems you mentioned give me the confidence that ZFS does (maturity, stability, etc.).


Another way to look at it is that we keep reinventing the wheel.

Why are there so many filesystems? Because everybody starts from scratch.

Then everybody writes code first and later a description of the filesystem.

We need a way to learn from previous mistakes and then find a way to fix those problems.


I think this isn't quite true.

People who develop new filesystems generally have a problem with existing filesystems and have some goal they want to accomplish.

There is also the issue that somebody else's "existing" filesystem becomes your "new" filesystem when support comes to your OS. For instance, Linux has support for many obscure filesystems, such as the Amiga and old Mac filesystems, because somebody might want to mount an old disk. I don't think anybody really wants to run a volume like that for their ordinary work on Linux.

XFS, ZFS, and JFS, to name a few, are foreign filesystems that claim to be good enough that you might want to use them on a Linux system not for compatibility but for performance.


What do you mean by "embedded database libraries"? Libraries like SQLite or RocksDB? What kind of issue are you thinking of?


Yes, libraries like those. Think, for example, of sharding, where you run lots of database instances per disk; that requires strictly bounded memory usage and no background operations. Or running on a disk full of bad blocks with constant retries.


But SQLite provides an API to set memory usage limits [1] and doesn't use background operations as far as I know.

I'm not sure about RocksDB (I think the "levels" compaction runs in the background, but you can control it).

[1] https://www.sqlite.org/malloc.html#_setting_memory_usage_lim...
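
For reference, a minimal sketch of what bounding SQLite's memory looks like; the 64 MiB heap cap, the ~2 MiB page cache, and the database name are made-up values for illustration:

    #include <sqlite3.h>

    int open_shard(sqlite3 **db) {
        /* Advisory process-wide cap on SQLite's heap usage. */
        sqlite3_soft_heap_limit64(64 * 1024 * 1024);

        if (sqlite3_open("shard.db", db) != SQLITE_OK)   /* hypothetical per-shard DB */
            return -1;

        /* Negative cache_size is in KiB, so this bounds the page cache at ~2 MiB. */
        return sqlite3_exec(*db, "PRAGMA cache_size = -2048;", 0, 0, 0);
    }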


Well, it was just an example; it wouldn't really work either way. It would be too slow for HDDs and would still require handling all the disk reliability and performance issues.

But if I were to do it today and store metadata in an SQLite database, I would use a single database for all the shards and use its VFS API, where I would add I/O scheduling, remapping of blocks (with redundancy) to different blocks with automatic recovery, marking of bad and slow blocks, and maybe even scrubbing. That already looks like half of a storage engine, and it would still be somewhat slow.
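
As a rough illustration of the shape that takes, here's a minimal sketch of registering a shim VFS that delegates to the platform default; the name "shard-shim" is made up, and the I/O scheduling / remapping logic would live in a wrapped sqlite3_file's io_methods, which this sketch only points at:

    #include <sqlite3.h>

    static sqlite3_vfs shim_vfs;     /* our VFS, filled in at registration time */
    static sqlite3_vfs *real_vfs;    /* the platform default VFS we delegate to */

    static int shim_open(sqlite3_vfs *vfs, const char *name,
                         sqlite3_file *file, int flags, int *out_flags) {
        /* This is where a real implementation would install its own
         * sqlite3_io_methods to intercept xRead/xWrite for scheduling,
         * block remapping, recovery, etc. The sketch just forwards. */
        (void)vfs;
        return real_vfs->xOpen(real_vfs, name, file, flags, out_flags);
    }

    int register_shim_vfs(void) {
        real_vfs = sqlite3_vfs_find(0);          /* default VFS */
        if (!real_vfs) return SQLITE_ERROR;
        shim_vfs = *real_vfs;                    /* start from a copy of the default */
        shim_vfs.zName = "shard-shim";           /* hypothetical VFS name */
        shim_vfs.xOpen = shim_open;
        return sqlite3_vfs_register(&shim_vfs, 0 /* don't make it the default */);
    }

A database would then be opened against it with sqlite3_open_v2("meta.db", &db, SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE, "shard-shim"), where "meta.db" is likewise a hypothetical name.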



