> We looked at this issue earlier. Fundamentally the tension here is that copy-on-write semantics don’t fit with the emerging zone interface semantics.
While the paper writes:
> It is not surprising that attempts to modify production file systems, such as XFS and ext4, to work with the zone interface have so far been unsuccessful [19, 68], primarily because these are overwrite file systems, whereas the zone interface requires a copy-on-write approach to data management
This seems to be a contradiction, and I'd side with the original paper.
What the paper says is that the zone interface is specifically a good fit for log-structured, copy-on-write designs. Later they talk about how they tried using a write-ahead log, but that it does not work well for managing metadata in a distributed filesystem.
I ran a ~0.5 PB Ceph cluster for a few years, on quite old spinning-disk hardware (bought second-hand). It was great: it just worked, coped very well with hardware failures, and told the operator what was happening. An extremely solid, well-engineered system. My thanks to the Ceph team :)
Yup, adding to this: For about four years we ran a similar size Ceph cluster that supported a burgeoning public cloud platform with users making extensive use of block and object. We did this on a range of hardware from creaky second-hand SuperMicro boxes to newer all-flash Quanta-based hardware.
Through questionable hardware selection as well as standard operational challenges such as upgrades and scaling, Ceph never let us down.
The only time we had a major problem turned out to be our fault. The creaky machines we were using at the start 'lost' half of their memory during a round of power-failure testing. We didn't have monitoring in place to spot this, and unfortunately it manifested after we lost a node, which triggered a significant rebalancing operation across the cluster. With several machines missing 50% of their RAM, this quickly descended into a horrendous disk-thrashing exercise. Again, all credit due to Ceph - we were able to coax the cluster back into life with no data loss.
I was just discussing with a colleague how technology accretes and how no one reevaluates high-level design decisions even after every single factor leading to those decisions has changed.
It's weird that basic filesystems today are so out of touch with modern realities that we are universally forced to resort to using complex databases even in cases when the logical model of files and directories fits the storage needs really well.
It's weird that hierarchical storage is the only universal model available on all OSes and in all languages.
The more I think about it, the more I realize that we live in a bizarro world where software runs everything, yet makes little to no sense from either human, or modern hardware or system design perspectives.
Programming languages are in the same boat - the modern CPU works dramatically differently from a PDP-11, but most programming models still assume an accumulator machine; flat memory hierarchy; effectively unbounded LIFO program-control stacks; uniform machine word sizes; sequential in-order execution; and byte streams for I/O. This is despite large register files, multiple levels of caching, coroutines/promises, SIMD, multicore/GPU, and page-level I/O being things. In many cases even assembly language encodes assumptions from the 1970s, and is then internally translated by the processor into how the hardware actually works.
Designing file systems for newer storage devices is an engineering problem that requires a lot of effort, but not a great amount of new insight.
Designing a new programming language that somehow exposes the concepts you mentioned more transparently while still being actually useful would require major, major breakthroughs and insights in PL design.
I wonder what your thoughts are on the Mill Architecture? It's throwing out every assumption on how to build a CPU, which mandates that the compiler needs to be rewritten to generate code for it.
I hadn't heard of it before. I just looked it up and it looks interesting, but I don't have enough hardware engineering experience to judge its feasibility.
I think a larger problem with new architectures is that consumer adoption follows price/performance/power, not any inherent architectural quality. The architecture we're stuck with is the one that most hardware devices get sold with (x86/ARM right now). That gets determined by hardware OEMs, which in turn make their decisions based on what'll help them sell the fastest devices with the lowest power consumption for the least money. So something like RISC V is fascinating and quite elegant, but until there are RISC V chips that are cheaper and faster than Intel ones, it remains an academic curiosity. Then if you're a compiler writer, you gotta work with what you've got for an installed base, and you can't really get adoption for a new language unless it lets startup founders unlock new markets because your combination of development velocity + execution speed lets them do things they wouldn't otherwise be able to.
Ah yeah, I'm just as cynical in all the ways you are about the prospects of a 'new architecture' becoming relevant. Just thought you might have known Mills + had some personal insights on the project.
That's the subject of a research project I don't have time for, but some ideas:
1.) automatically defining SoA/AoS transformations for data types in the language (a hand-written sketch follows this list), and
2.) automatic profiling of code execution counts & array lengths to determine when inserting such a transformation would save more time than the transformation costs.
3.) native SIMD primitive types
4.) GPU backend a la Futhark
5.) native syntax for parallel execution, where the compiler enforces no data sharing (or read-only shared data) between executions.
6.) [more speculative] pluggable backend for this syntax so that you could run these concurrent executions sequentially, using SIMD instructions on the same thread, in multiple threads, on the GPU, or spread across multiple machines in a cluster. That'd let you scale up from a few K to a few T without altering your code.
7.) I/O mechanisms that default to mmap, 0-copy vectored I/O, or RDMA as appropriate.
8.) An unfold primitive in either the language or standard library that lets you compute multiple values with a single traversal over a list, ideally while still keeping each individual computation isolated and modular. I find this is a really common pattern in my Kotlin code, and the biggest culprit for still writing manual accumulator loops.
9.) Support for multi-pass initialization in data structures, where you can hold some fields as uninitialized for a later computation, but mark the whole structure as immutable once every field has been computed.
10.) A standard library that uses these parallelization techniques under the hood, so you get really good speed for common operations and it just works. For example, Boyer-Moore can be done on the GPU. Many common regexps can be reduced to string search on a common left prefix + a DFA on potential matches, where the prefix search can be done with Boyer-Moore. That opens the possibility of blazingly fast search over gigabytes of data for common regexps like /<a href=["']?([^"'\s]+)["']?>(.+?)<\/a>/. Many times the algorithms you'd use when your haystack is ~1G are very different from the algorithms you'd use when your haystack is ~1K, which goes back to the utility of an integrated profiler.
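To make item 1 concrete, here is roughly what an automatic AoS-to-SoA transformation would have to produce, written out by hand in Kotlin (Particle, Particles, and toSoA are hypothetical names, not anything from an existing library):

// AoS: one object per element; the fields of a single particle sit together in memory,
// so a loop that only reads x still drags y and mass through the cache.
data class Particle(val x: Float, val y: Float, val mass: Float)

// SoA: one flat array per field; all the x values are contiguous, which is what
// SIMD units and GPU memory coalescing want.
class Particles(val x: FloatArray, val y: FloatArray, val mass: FloatArray)

// The conversion the compiler would insert automatically, here written by hand.
fun toSoA(aos: List<Particle>) = Particles(
    FloatArray(aos.size) { aos[it].x },
    FloatArray(aos.size) { aos[it].y },
    FloatArray(aos.size) { aos[it].mass }
)

The profiling in item 2 is what would tell the compiler whether the one-time cost of toSoA gets paid back by the hot loops that run over the columns afterwards.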
It occurs to me that a lot of these boil down to the inefficiency of doing lots of traversals of arrays in a world where touching memory is usually the critical path. So for example, this is slow but modular:
val high = oneBillionFloats.max()
val low = oneBillionFloats.min()
val total = oneBillionFloats.sum()
val average = oneBillionFloats.sum() / oneBillionFloats.size
This is fast but coupled:
var high = Float.NEGATIVE_INFINITY
var low = Float.POSITIVE_INFINITY
var total = 0f
var count = 0
oneBillionFloats.forEach {
    high = maxOf(high, it)
    low = minOf(low, it)
    total += it
    ++count
}
val average = total / count
Wouldn't it be nice if you could write your code like the first example but it would generate a single loop that's parallelized onto the GPU? Even more so if oneBillionFloats isn't an array of a billion primitives, but rather a gig full of deeply nested structs, where each loop pulls out a particular field and does some arithmetic operations to it?
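In the meantime, item 8 can be approximated at the library level. Here's a minimal Kotlin sketch, with hypothetical Fold/zip/runOn names (not from any existing library): each statistic is still defined in isolation, but zip fuses them so the list is only walked once.

// A fold described as data: an initial accumulator plus a step function.
class Fold<A, R>(val initial: R, val step: (R, A) -> R) {
    fun runOn(items: Iterable<A>): R {
        var acc = initial
        for (item in items) acc = step(acc, item)
        return acc
    }
}

// Combine two folds so that a single traversal produces both results.
fun <A, R1, R2> zip(f1: Fold<A, R1>, f2: Fold<A, R2>): Fold<A, Pair<R1, R2>> =
    Fold<A, Pair<R1, R2>>(Pair(f1.initial, f2.initial)) { (a1, a2), x ->
        Pair(f1.step(a1, x), f2.step(a2, x))
    }

// Each statistic stays modular...
val maxF = Fold(Float.NEGATIVE_INFINITY) { acc, x: Float -> maxOf(acc, x) }
val minF = Fold(Float.POSITIVE_INFINITY) { acc, x: Float -> minOf(acc, x) }
val sumF = Fold(0f) { acc, x: Float -> acc + x }

// ...but the billion floats are traversed exactly once.
fun summarize(oneBillionFloats: List<Float>) {
    val (high, rest) = zip(maxF, zip(minF, sumF)).runOn(oneBillionFloats)
    val (low, total) = rest
    println("high=$high low=$low average=${total / oneBillionFloats.size}")
}

This only fuses the traversals on one thread; the point of items 5 and 6 is that a compiler which understood the pattern could also vectorize or parallelize that single fused loop for you.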
It's not just file systems. Take a look at the NVMe specifications and then answer this question: "Do you think it's efficient to use this as a block device?"
The whole thing reminds me of the disputed article "Are U.S. Railroad Gauges Based on Roman Chariots?" and, by extension, the claim that the Space Shuttle was limited by the width of two horses walking side by side. I don't think software is all that different from other standards, like switching to the metric system, or Southern California switching its power standard from 50 Hz to 60 Hz in 1948 (Japan never unified its power grid). It's a huge investment, and a lot of it is political appetite for adopting the change.
It seems pretty obvious that the average person doesn't grok/like using folders (everything goes on the desktop). I was excited when Microsoft and Apple seemed experimental in the late 90s and early 2000s with WinFS and Smart Folders. That kind of fizzled out. I think reliability requirements prevent adoption, and it's a bit of a chicken-and-egg problem between architecture and presentation. Gmail supporting tags instead of folders seemed to drive everyone nuts for a while (it still seems to come up when exposing it to other clients).
That's really not the case. Plenty of storage vendors will sell you storage exposed as an FS that scales wonderfully, provided you follow some sane best practices, e.g. sharding your directories. The same goes for storing things in a database and not running into certain pathological cases with indexes.
For anyone looking for more information and benchmarks on the performance improvements in recent versions of Ceph (and with Bluestore in particular), here's a write-up that was done as part of testing for infrastructure to support the Human Brain Project: https://www.stackhpc.com/ceph-on-the-brain-a-year-with-the-h...
Unfortunately Julia was just a prototype system that they are going to turn off. KNL made it into one of their main production systems (Jureca) as a "booster" module, but they use GPFS on the main storage system (IIRC).
The end-to-end principle strikes again: lowest common denominator abstractions like filesystems are often incorrect, inefficient, or both for complex applications and ultimately must be bypassed by custom abstractions tailored for the application.
It's kinda why I think something like an exokernel would make more sense than the way OSes try to abstract things to a certain level. We should be trying to build abstractions that can be peeled layer by layer like an onion, not a potato.
Unix files aren't quite an exokernel-style "just securely multiplex the disk" but in some ways they come closer than some other alternatives. No file types, just randomly accessible bytes (well, and an execute bit); no multiple streams; no ISAM, just bytes (except that you do have directories); no insertion (but you do have append and truncate).
You could make them more disk-like by making them fixed-size, with the size specified at creation time, and accessing them in blocks rather than in bytes. Would those be an improvement? I tend to think not. Certainly if those were the semantics provided by the kernel you would want userland filesystem processes to provide appendable files.
Copy-on-write file versioning and cross-file, cross-process transactions, on the other hand, could be real pluses. I'd be okay with those being provided by userland processes rather than the kernel, but I'd sure like to have them.
We're actually facing an issue with our Ceph infrastructure in the 'upgrade' from FileStore to BlueStore: the loss of use of our SSDs.
We created our infrastructure with a bunch of hardware that had HDDs for bulk storage and an SSD for async I/O and intent log stuff.
The problem is that BlueStore does not seem to have any use for off-to-the-side SSDs AFA(we)CT. So we're left with a bunch of hardware that may not be as performant under the new BlueStore world order.
The Ceph mailing list consensus seems to be "don't buy SSDs, but rather buy more spindles for more independent OSDs". That's fine for future purchases, but we have a whole bunch of gear designed for the Old Way of doing things. We could leave things be and continue using FileStore, but it seems the Path Forward is BlueStore.
Some of us do not need the speed of an all-SSD setup, but perhaps want something a little faster than only-HDDs. We're playing with benchmarks now to see how much worse the latency is with BlueStore+no-SSD, and whether the latency is good enough for us as-is.
Any new storage design that cannot handle a 'hybrid' configuration combining HDDs and SSDs is silly IMHO.
I joked that we could tie the HDDs together using ZFS zvol, with the SSD as the ZIL, and point the OSD(s) there.
Yes, the confounding factor was/is we are using ceph-ansible with ceph-disk. If we want to upgrade we have to make a whole bunch of inter-related changes.
(I'm not the lead on the project, so no doubt I have forgotten some of the exact complications involved. Though it does seem slightly strange (IMHO) that you're getting rid of the file system, but still keeping the LVM layer.)
I have sympathy with, and am open-minded to, the conclusions of this article - even as a die-hard, true believer in the filesystem (esp. ZFS) as a useful foundational building block.
However, I hope that these conclusions do not lead to the intentional deprecation of support for filesystems in projects like Ceph. If a non-filesystem backing store is superior, then by all means do it, but I hope the ability to deploy a filesystem-backed endpoint will be retained.
In a pinch, it's very flexible and there are a lot of them lying around ...
If nothing else, most filesystems can expose a big file that can be used, with decent efficiency, as though it was a disk. For COW filesystems, turning off COW on the file may improve performance, and, if the FS is on RAID, the resulting performance and correctness properties will be odd.
> “For its next-generation backend, the Ceph community is exploring techniques that reduce the CPU consumption, such as minimizing data serialization-deserialization, and using the SeaStar framework with a shared-nothing model…”
Seastar HTTPD throughput, as mentioned on their site: with between 5 and 10 CPUs it can achieve 2,000,000 HTTP requests/sec. Just wow. But if you look at the HTTP performance data at the URL below, running a similar configuration in the cloud (AWS etc.) looks costly.
Ceph has a lot of small ops, where lock and cache contention become very significant. (Basically a small piece of data/request comes in from the network, and the OSD [object storage daemon] network thread has to pass it to the I/O worker thread and then forget it. The I/O thread similarly just needs to take the request, issue a read/write, and let the kernel work.)
Since the whole Ceph I/O model is async the less waiting, scheduling, contention, etc. happens the better.
Currently Ceph is CPU bound, that's why they are trying to improve CPU perf.
DPDK runs on Linux and FreeBSD officially. All of the momentum is on Linux though*
*I was recently part of an effort to add FreeBSD testing to DPDK because SPDK's FreeBSD test agent kept failing when we updated DPDK. I'm a core maintainer for SPDK, which is basically DPDK for storage.
Dumb HTTP-response benchmarks can be achieved everywhere. Seastar excels in complex (CPU, disk, RAM, network) scenarios (a big disk-based DB) with OLTP/OLAP workloads while keeping latencies within SLA.
This isn't surprising, but I guess the results need to be put into perspective with the use case for distributed file systems and NFS, e.g. reasonable scaling for static asset serving with excellent modularity, in particular when paired with node-local caches. Of course Ceph etc. won't scale to Google Search and Facebook levels, but it's still damn practical if you're scaling out from a single HTTP server to a load-balanced cluster of those without having to bring in whole new I/O infrastructures. And they help avoid cloud vendor lock-in as well; for example you can use CephFS on DO, OVH, and other providers.
Slightly off topic, but I love Adrian Colyer's blog. Since I never pursued graduate studies in CS I never really got into reading research papers, but I would love to start reading some on my commute to work.
Does anyone have any recommendations for finding interesting papers? Do I need to buy subscriptions? Is there a list of “recommended” papers to read like we have with programming literature e.g. The Pragmatic Programmer?
ACM subscription helps, since their "Digital Library" has pretty good coverage of the CS literature. This is something that your employer might want to sponsor.
A strategy for finding interesting papers that are worth reading as a novice in the field is to pick any current paper that interests you and look for heavily cited references.
The following websites have citation data for papers:
Don't most of their problems go away when you fallocate a pile of space and use AIO+O_DIRECT like a database to get the buffer cache and most of the filesystem out of the way?
CoW filesystems like BTRFS provide ioctls to disable CoW as well, which would be useful here when you've grown your own.
XFS has supported shutting off metadata updates like ctime/mtime for ages.
If you jump through some hoops, with a fully allocated file, you can get a file on a filesystem to behave very closely to a bare block store.
No. Yes, you can fallocate. Yes, you can use AIO, or even better io_uring. Yes, you can use O_DIRECT ... if you want to give up caching and have to deal with alignment restrictions, and it turns out that O_DIRECT turns into O_SYNC in some unexpected edge cases. That gets you something that kind of sort of behaves like a block device, but that doesn't solve any of the problems the authors identify.
* No transactional semantics. Roll your own. (I'm actually OK with this one BTW, but others feel differently.)
* No help with slow metadata operations. Still slow if you're still using multiple files, or roll your own within a single file.
* No improvement in support for new media types (e.g. shingled/zoned).
In an ideal world, local filesystems would do a decent job supporting distributed filesystems (and other data stores). Instead, we're in a world where local filesystems fall short in many ways, and the solution to every deficiency is to avoid the local filesystem as much as possible. That way leads to silos and lock-in, so I don't think it's a good answer. Local filesystems need to be better, or someone needs to create an equally standard abstraction and set of tools to do what local filesystems can't.
An older thread where people discussed this backs up my words right up top. I don't have a single issue with KirbyCMS + NTFS; I have distributed back-end stuff as I desire, and it just works and has mature documentation.
I don't think the conventional wisdom of building on top of filesystems exists. In distributed systems you always naturally gravitate towards using raw storage devices instead of filesystems, it becomes obvious very early on that filesystems suck too much and only create problems. And it's the same with all the embedded database libraries, you really want to write your own, because none of the existing ones were made to address performance and operational problems that arise even in small distributed systems. But at the same time early on you don't yet know most of the problems and don't want to invest time implementing something you don't yet understand well enough, so you end up building on top of filesystems and embedded databases and making plenty of poor choices and learning on your mistakes.
As I pointed out to the authors, this wheel has turned a couple of times. In the late 90s, many distributed filesystems (and most cluster filesystems) used raw disks and their own format. This was a burden, both for the developers who had to maintain an entire low-level I/O stack in addition to the distributed parts and to the users who had to learn new tools to deal with these "alien" disks in their system (which also limited deployment flexibility). Thus, when the current crop - e.g. Ceph, Gluster, PVFS2 - came around, they went toward a more local-FS-based approach. All of the issues mentioned in the paper were still real, but on the hardware of the time (both disks and networks) those weren't the bottlenecks anyway so the convenience was worth it. Now the tradeoffs have shifted again, and so have the solutions.
Context: I've been an originator/maintainer for multiple projects in this space, and currently work on a storage system where we add space in bigger increments than Ceph's entire worldwide installed base (according to numbers in the paper).
The filesystem as an abstraction has been getting long in the tooth for a long time -- it's taken a long time for the industry to recognize this, but that's the way it is with filesystems because a boring filesystem that never munches your data is preferable to most people to an interesting filesystem.
Some evidence:
* The proliferation of Linux filesystems such as ext4, XFS, ZFS, btrfs, reiserfs, JFS, JFFS, bcachefs, etc. If any of those filesystems were truly adequate there wouldn't have to be so many.
* Microsoft's failure to replace NTFS with ReFS. (Interacts with storage spaces in such a way that it will never be reliable)
* Microsoft's giving up on the old WSL and replacing it with a virtualized linux kernel because metadata operations on NTFS are terribly slow compared to Linux but people don't usually notice until they try to use an NTFS volume as if it was an ext4 volume.
* The popularity of object stores, systems like Ceph, S3, etc.
* Numerous filesystems promising transaction support and then backing away from it (btrfs as mentioned in the article, NTFS, exFAT, etc.)
* Proliferation of APIs to access filesystems more efficiently. On one hand there are asyncio access methods for filesystems which aren't quite as solid as asyncio for networks, on the other hand there is mmap which can cause your thread to block not only when you make an I/O call but later when you access memory.
* Recently some filesystem based APIs that address the real issues (e.g. pwrite, pread) but astonishingly late in the game.
I don't quite disagree with your conclusion, but some of your supporting points don't really support it.
> If any of those filesystems were truly adequate there wouldn't have to be so many.
Diversity is not bad. Different filesystems can have different design points, e.g. read-heavy vs. write-heavy workloads, optimized for different kinds of media, realtime requirements (traditionally an XFS strength) etc. That's a good thing. What's a problem is that there are too many filesystems trying to be all things to all users of a too-general API, and too many FS developers competing with each other on long-irrelevant microbenchmarks instead of making useful progress.
> Microsoft's failure
...is Microsoft's.
> Recently ... pwrite, pread
Those have been standardized since 1998, and I'm pretty sure I remember them existing on some systems before that. The fact that they're not recent is IMO the real problem. There have been attempts to create useful new abstractions such as the setstream API from CCFS[1], but they're too few and rarely gain traction.
Personally, I think local filesystems need to adapt in a bunch of ways to be a better substrate for higher-level applications - including but not limited to distributed filesystems. OTOH, I don't think putting complex transactions at that level is a good choice. There's enough innate complexity there already. Adding transactions there just increases inertia, and if filesystems offered better control over ordering/durability the transaction parts could be implemented in a layer above.
> The proliferation of Linux filesystems such as ext4, XFS, ZFS, btrfs, reiserfs, JFS, JFFS, bcachefs, etc. If any of those filesystems were truly adequate there wouldn't have to be so many.
First of all, ZFS is not a “Linux” filesystem. Second, the available choice of filesystems is a strange thing to use as justification that previously existing filesystems are “inadequate”. By what criteria are you establishing adequacy? The success of technologies is often not dependent solely upon their technical excellence but on a variety of factors that may have nothing to do with the technology itself. (Betamax vs VHS, etc.)
In theory people would pick the filesystem which is ideal for their application.
In practice it is a lot of work to research the choices, and there's a high risk that you'll discover something wrong with your filesystem only when it is too late.
It's one thing to pontificate for and against particular file systems, it's another to use one for years, terabytes, etc. ZFS might scrub your data to protect against bitrot, but I remember reading harrowing tales from ZFS enthusiasts who were recovering (or not recovering) from wrecks every week but seemed to think it was a lot of fun, or conferred them status, or was otherwise a good thing.
I stuck with ext4 for a long time before finally building a server that uses ZFS/HDD for media storage (e.g. not a lot of random access)
I remember the time when a project I was involved with chose reiserfs because they thought it was "better", and then they were shocked when, once in a while, the system crashed and we found that a file that had just been created was full of junk.
That's a trade-off they made: they decided it was important to journal filesystem metadata (so the length of the file is right) but not to protect the content. If they had read the docs, really thought about it, and understood it, they would have known, but they didn't.
This book points out that in cases where there is too much competition, you can switch all you like between flawed alternatives, but have no way to communicate what you really want:
And when it comes to the "Filesystem API" probably anybody who has special needs for filesystem performance would find that a different API than the standard filesystem API would be a boon.
I don't disagree with the general premise that a single filesystem might not be applicable or appropriate to all cases or that existing filesystems apis are generally deficient.
My primary issue was with the two specific assertions I addressed: one about a given filesystem's origins, and the other about using the range of choices as a proxy for evaluating adequacy.
As for ZFS enthusiasts "recovering from wrecks every week", I suspect you're specifically referring to ZFS on Linux or one of the BSDs -- which is not the same as ZFS when used in its original environment -- Solaris.
So because a file system had issues 20 years ago when it was new.... Do you still drive a car with a carburetor and drum brakes?
ZFS is now incredibly stable and durable, with the exception of some of the early non production ZFS on Linux work that is now fixed (and was specifically billed as non prod use). It has seen me through issues that other file systems would have failed on, including drive failures, hard shutdowns, a bad RAM module, SAS card being fried by a CPU water cooler, etc. Years and terabytes just on my systems, zero issues.
In fact one of the tests that Sun did back in the day was to write huge amounts of data to a NAS and pull the power cord mid writing. Then repeated that a few thousand times. It never corrupted the file system.
Needs citation. I was in the very earliest crop of ZFS users (ca. 2003), and then went on to build a storage product based on ZFS (ca. 2006) and then a cloud based on ZFS (ca. 2010) and an object storage system based on ZFS (ca. 2012) -- and ran it all up until essentially present day. I have plenty of ZFS scars -- but none of them involve lost data, even from the earliest days...
> * The proliferation of Linux filesystems such as ext4, XFS, ZFS, btrfs, reiserfs, JFS, JFFS, bcachefs, etc. If any of those filesystems were truly adequate there wouldn't have to be so many.
While I get your point, I would like to point out that ZFS was developed by Sun (now Oracle). I've used ZFS for years from a data-integrity and array-mirroring perspective and love it. None of the other filesystems you mentioned alongside it give me the confidence that ZFS does (maturity, stability, etc.).
People who develop new filesystems generally have a problem with existing filesystems and have some goal they want to accomplish.
There is also the issue that somebody else's "existing" filesystem becomes your "new" filesystem when support comes to your OS. For instance, Linux has support for many obscure filesystems, such as the Amiga and old Mac filesystems, because somebody might want to mount an old disk. I don't think anybody really wants to run a volume like that for their ordinary work on Linux.
XFS, ZFS, JFS to name a few are foreign filesystems that claim to be good enough that you might want to use them on a Linux system not for compatibility but because of performance.
Yes, like these libraries. Think, for example, of sharding, where you run lots of database instances per disk; this requires strictly bounded memory usage and no background operations. Or running on a disk full of bad blocks with constant retries.
Well, it was just an example; it wouldn't really work either way: it would be too slow for HDDs and would still require handling all the disk reliability and performance issues.
But if I were to do it today and store metadata in an SQLite database, I would use a single database for all the shards and use its VFS API, where I would add I/O scheduling, remapping of blocks (with redundancy) into different blocks, automatic recovery, marking of bad and slow blocks, and maybe even scrubbing. That looks like half of a storage engine already, and it would still be somewhat slow.