I've wanted somewhat different file system semantics, but close to the POSIX model:
- Unit files. The unit of consistency is the entire file. Files are opened for writing, written, and closed. They then become openable by other programs. Overwriting a file replaces the file as a unit; other readers see the old file until the writer closes and the reader reopens. This is the default and the case for most files. On a crash, the file reverts to the previous good file, if any. Can be memory-mapped as read-only. File replacement can be done now through non-portable renaming gyrations (see the sketch after this list). It should just work.
- Log files. Append-only mode, enforced. Can't seek back and overwrite. On a crash, the file recovers to some recent good write, with a correct end of file position. It does not tail off into junk.
- Temp files. Not persistent over restarts. Not backed up. Can be memory-mapped as read/write. On a crash, disappears.
- Managed files. These are for databases. Async I/O available. May have additional locking functions. Separate completion returns for "accepted data" (caller can reuse write buffer) and "committed data" (safely stored), so the database knows when the data has been safely stored. Can be memory-mapped as read/write. Intended for use by programs which are very aware of the file system semantics. On a crash, "committed data" should be intact but data not yet committed may be lost.
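To make the gyrations concrete, here is a minimal sketch of the usual POSIX dance as it's done today (file names are illustrative and error handling is trimmed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Replace "foo" as a unit: write a sibling temp file, force the
       data down, then atomically swap the name into place. */
    int replace_file(const char *data, size_t len)
    {
        int fd = open("foo.tmp", O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0)
            return -1;
        write(fd, data, len);   /* real code must also handle short writes */
        fdatasync(fd);          /* data reaches disk before the rename */
        close(fd);
        /* Readers see the old file or the new one, never a mix.  For the
           rename itself to be durable, also fsync() the directory. */
        return rename("foo.tmp", "foo");
    }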
> Overwriting a file replaces the file as a unit; other readers see the old file until the writer closes and the reader reopens.
This is the semantics rename provides, and the strength of that semantics is a common complaint from many in the anti-POSIX crowd. I suspect you knew that already, but the number of syscalls would be similar in either case, except that the latter file would need to be visible before the rename. Linux does provide O_TMPFILE + AT_SYMLINK_FOLLOW, which provides nearly ideal behavior. (Nearly, because it still requires /proc/self/fd access, AFAIU.)
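A sketch of that Linux-only route, for reference (the directory and final name are placeholders, and O_TMPFILE needs filesystem support):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Create an anonymous file inside the target directory. */
        int fd = open("/some/dir", O_TMPFILE | O_WRONLY, 0644);
        if (fd < 0)
            return 1;
        /* ... write the complete contents through fd ... */
        char path[64];
        snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
        /* Give the anonymous file a name; this is the /proc access the
           parent mentions.  linkat() fails with EEXIST if the name is
           taken, so replacing an existing file still ends in rename(). */
        return linkat(AT_FDCWD, path, AT_FDCWD, "/some/dir/name",
                      AT_SYMLINK_FOLLOW) < 0;
    }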
> Log files. Append-only mode, enforced
Both chflags (BSD) and chattr (Linux) provide append-only modes attached to the file/inode (instead of the file descriptor or open file table entry)[1][2]. Neither command nor their options are defined by POSIX, but adding more requirements to POSIX filesystem conformance goes against the grain of prevailing sentiments.
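On Linux the same inode flag chattr manipulates can also be set programmatically; a sketch (setting it requires privilege, as noted downthread):

    #include <linux/fs.h>     /* FS_IOC_GETFLAGS, FS_APPEND_FL */
    #include <sys/ioctl.h>

    /* Mark an open file append-only at the inode level, the same bit
       `chattr +a` sets.  Needs CAP_LINUX_IMMUTABLE. */
    int make_append_only(int fd)
    {
        int attr;
        if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0)
            return -1;
        attr |= FS_APPEND_FL;
        return ioctl(fd, FS_IOC_SETFLAGS, &attr);
    }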
> Temp files. Not persistent over restarts. Not backed up. Can be memory-mapped as read/write. On a crash, disappears.
Again, the semantics of unlinking deliberately provide this, though it does make it invisible in the normal namespace. But there's a tension between leveraging the namespace for implicit semantics (Plan 9) vs adding a multitude of options and modes attached to each particular file (Windows).
It's also not uncommon for temporary filesystems like /tmp to be reformatted on boot if the backing store is persistent. That's one reason to use multiple filesystems for a Unix install.
[1] Note that O_APPEND mode is attached to the open file table entry, not the file descriptor, so the mode is inherited across dup and interprocess descriptor passing. Interestingly, on Linux /proc/self/fd (also /dev/fd, which is a symlink) has the semantics of open, not dup, so file table entry state like O_APPEND is lost; whereas on BSD /dev/fd has the semantics of dup so O_APPEND is preserved/enforced.
[2] Solaris lacks chattr and chflags, but chmod supports an append-only option.
> Both chflags (BSD) and chattr (Linux) provide append-only modes attached to the file/inode (instead of the file descriptor or open file table entry)
On Linux, by default only the superuser can set the append-only flag. That severely limits its usefulness for many applications.
> adding more requirements to POSIX filesystem conformance goes against the grain of prevailing sentiments
POSIX is of declining relevance. What really matters nowadays is can you get Linux and *BSD (including macOS) to agree on implementing something new. Get one to add it, and convince the others to copy it. If you can do that, then getting it added to the POSIX standard is likely to be easy.
Vendors hardly care about POSIX certification anymore. The latest version of POSIX/SUS, UNIX V7, currently has zero certified implementations. (Oracle Solaris did achieve V7 certification, but has since lost it; I don't know exactly what happened, but I suspect Oracle refused to pay the certification renewal fees.)
> On Linux, by default only the superuser can set the append-only flag. That severely limits its usefulness for many applications
Ah, good point.
> Get one to add it, and convince the others to copy it. If you can do that, then getting it added to the POSIX standard is likely to be easy.
In actuality it usually works in reverse: Red Hat (now IBM), which has the most dominant presence on the committee, convinces POSIX to add or modify something, and then everybody else adopts it. Examples: asprintf, fmemopen, stpcpy, vdprintf. I can't really think of an example that worked the other way around, though there may be some in the next specification.
The fact that nobody is certified to V7 is beside the point. Nobody is certified to the latest HTML5 standard, either, AFAIK. The point is to provide a shared target. Few people want to copy Linux or glibc because the semantics are invariably underspecified and in part accidental, and people would prefer to avoid those aspects even if they have no other choice but to nominally adopt the interface. Standardization provides a chance to clarify behavior and to fix the bounds of what a portable program can expect long term. If I have to support epoll + kqueue in perpetuity (and I assume I do), I'll structure my program differently.[1] However, if I want to use pselect (or the forthcoming ppoll), I'll target the POSIX-defined semantics, provide a best effort wrapper (or none at all), and tell users on slower evolving platforms to complain to their vendor.
The Linux kernel strives for strong ABI backward compatibility, but it hardly has a perfect track record in that regard (e.g. sysctl(2)), and the future is even more unclear given the various directions it's being pulled. Even POSIX can change course, but it does so more methodically than with a Github code search. And while Linux doesn't make promises regarding POSIX compliance, it's less likely to break POSIX semantics than its own semantics, ceteris paribus. Whether that's because kernel developers value POSIX compliance, or merely because POSIX-vetted semantics are narrower and less accidental is irrelevant.
[1] Also, I'll take the bet that epoll is never standardized, precisely because of its accidental and sometimes very undesirable semantics as compared to kqueue. At best we'll get a greatest common denominator interface that can be built around both--or possibly taking into consideration Solaris ports.
> In actuality it usually works in reverse: Red Hat (now IBM), which has the most dominate presence on the committee, convinces POSIX to add or modify something, and then everybody else adopts it. Examples: asprintf, fmemopen, stpcpy, vdprintf.
I'm not convinced your timeframe is right. stpcpy was added to FreeBSD in 2001 but didn't officially become part of POSIX until 2008. Red Hat didn't invent stpcpy either; it was invented on the Amiga in the 1980s, the GNU project adopted it from there, and Linux acquired it from the GNU project. I think if we studied the history of the other functions you mention, we'd also find that their formal addition to POSIX wasn't the primary cause of their spread, just a formal recognition of an existing de facto reality.
> This function was added to POSIX.1-2008. Before that, it was not part of the C or POSIX.1 standards, nor customary on UNIX systems. It first appeared at least as early as 1986, in the Lattice C AmigaDOS compiler, then in the GNU fileutils and GNU textutils in 1989, and in the GNU C library by 1992. It is also present on the BSDs.
OpenBSD didn't adopt stpcpy until 5.1, circa 2012 (https://man.openbsd.org/stpcpy) and NetBSD until 6.0, also circa 2012 (https://man.netbsd.org/stpcpy.3). In that time frame there was a flurry of activity in both projects regarding POSIX compliance.
My experience with the others was similar, though not uniformly so. Support on OpenBSD, NetBSD, macOS, and AIX typically postdated their addition to POSIX. Whether they would have been added independent of POSIX I can't say, but there are plenty of GNU extensions that remain unsupported, and some, like strerror_r (specifically the return type and behavior on error), that likely will never be.
While not specifically on point, I think it's noteworthy that the proposal to add stpcpy to C2X was made by Martin Sebor, a Red Hat employee: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2352.htm Red Hat doesn't share the seemingly pervasive cynicism regarding standardization. Precisely what motivates that, I'm hesitant to speculate. Like most things I'm sure there are mixed motives.
macOS's man page says "The stpcpy() function first appeared in FreeBSD 4.4". I believe FreeBSD 4.4 was released in September 2001. So that's approximately 7 years before it was added to POSIX. Given that timeframe, it seems unlikely that POSIX has much to do with its presence in FreeBSD. A much more likely explanation is to ease porting of GNU projects (and software developed on Linux) to FreeBSD.
From checking opensource.apple.com, I conclude that OS X added it in 10.3, released in 2003, so that's around 5 years before POSIX 2008. Again, given the timeframe, it seems hard to argue that POSIX triggered Apple's action here.
Given that Linux, FreeBSD and Darwin all already supported it, I wonder to what extent NetBSD/OpenBSD's decision was motivated by formal POSIX conformance versus improving compatibility with Linux/FreeBSD/Apple. It is hard to say, but I think the latter was likely just as important as the former, and given the importance of the latter, they might still have added it even if it had never formally made it into POSIX.
> Given that Linux, FreeBSD and Darwin all already supported it, I wonder to what extent NetBSD/OpenBSD's decision was motivated by formal POSIX conformance versus improving compatibility with Linux/FreeBSD/Apple. It is hard to say, but I think the latter was likely just as important as the former, and given the importance of the latter, they might still have added it even if it had never formally made it into POSIX.
One of the benefits of standardization is, "here is a definitive list of things that need to be added". With rare exception (e.g. async I/O), there's no hand-wringing regarding whether a POSIX interface should be added. Search for the keyword "POSIX" among the OpenBSD 5.1 (circa 2012) changes at https://www.openbsd.org/plus51.html Now search for Linux. For whatever reason stpcpy isn't listed (a tweak is listed in plus52.html), but you'll see where the emphasis lay.
Linux POSIX conformance is incredibly good, particularly at the libc level notwithstanding lack of formal certification[1], so invariably it's tough to say, absent direct confirmation, what was on people's minds. But at least on the OpenBSD mailing-lists, more often than not the explicit reason is POSIX compliance. And I get the same impression from reading NetBSD changelogs. FreeBSD is unique because they've been far more proactive with not only adding POSIX interfaces, but also adding their own extensions. And historically they were a bigger player and in many ways the heir to the "BSD" mantle; certainly more so than NetBSD. But pre-existing support doesn't negate my point (see plus51.html for evidence concerning the immediate motive behind adding stpcpy), nor does the fact that an interface supported by multiple vendors is more likely to be adopted by POSIX. I could likewise dredge up some SysV and other extensions that weren't supported in Linux or glibc until they were adopted by POSIX.
Which isn't to say platforms don't adopt Linux interfaces (whether or not they originated with Linux) with an eye toward Linux compatibility; of course they do, and do so increasingly. For example, I know that OpenBSD refactored their getpeereid interface as a wrapper around newly added getsockopt+SO_PEERCRED support, which I've always assumed was a nod to Linux compatibility, or at least an admission of the obscurity of getpeereid. OpenBSD also adopted MSG_NOSIGNAL (POSIX, SysV'ish, Linux) instead of the SO_NOSIGPIPE that already existed on FreeBSD, macOS, and NetBSD (likewise for F_SETNOSIGPIPE and O_NOSIGPIPE supported by macOS and NetBSD), which goes to show that OpenBSD doesn't merely add whatever interface might marginally aid portability.[2] MSG_NOSIGNAL was a no-brainer as compared to the alternatives, and that would have been true regardless of whether Linux supported MSG_NOSIGNAL.
If you want to add an interface, the default answer is "yes" if it's POSIX; if it's not POSIX the default answer is "tell me more". POSIX is self-justifying. That difference in friction might seem de minimis, but ask anyone who has maintained or tried to contribute to a large open source project. The ability to appeal to a standard model which already has buy-in from the project is a major convenience in terms of everybody being on the same page.[3] In that way it can drive behavior unintentionally, which is no accident as it pertains to the salience of a formal standard. Of course, Linux is a de facto target for most projects, and one of far more concern than nominal POSIX conformance, but that doesn't diminish the value of POSIX targeting, and IME neither has it diminished the interest in POSIX conformance among the people for whom it actually matters: those interested in portability, who tend to be disjoint from the set of people for whom POSIX is a dirty word.
[1] IME it's better than macOS even though macOS is UNIX03 certified. glibc and musl take POSIX very seriously, both the letter and the spirit. Developers from both projects actively file tickets on https://austingroupbugs.net to remediate errors and omissions in the text, and problems with actual semantics, explicit and accidental. Various BSD developers are also active there, but I get the sense of a pecking order, if only because of Red Hat's large presence (literally and figuratively).
[2] Contrast that with kqueue. FreeBSD, macOS, NetBSD, and OpenBSD kernels have diverged significantly since kqueue was adopted. You can't copy+paste kqueue-related kernel code across them. But they still look to each other for prior art when it comes to filling gaps or extending the behavior of kqueue, and uniformity is given significant weight. (Especially across *BSD. macOS extensions don't always make the most sense, like with macOS's poorly considered EV_OOBAND. See https://sandstorm.io/news/2015-04-08-osx-security-bug).
[3] "Running a successful open source project is just Good Will Hunting in reverse, where you start out as a respected genius and end up being a janitor who gets into fights." https://diff.substack.com/p/working-in-public-and-the-econom... Fights over whether or not to support POSIX are relatively rare. The ones that happen are infamous because they're the exception.
My feeling is POSIX is still pretty important in the safety critical embedded world. Green Hills Integrity, VxWorks, QNX, RTEMS all have some market presence and aim for POSIX compliance.
Out of those, it appears that only Green Hills Integrity and VxWorks have any formal POSIX certification – http://get.posixcertified.ieee.org/register.html – and only the first is actually certified against the full 1003.1 standard; VxWorks is only certified for the PSE52 realtime subset.
To "aim for POSIX compliance" without formal certification doesn't mean much, since to "aim for POSIX compliance" is compatible with supporting an arbitrary subset of the POSIX standard.
I work for a vendor of an operating system that has PSE54 certified conformance to the latest POSIX 1003.1 2018 Edition Specification. Our customers demand this certification as part of their ISO 26262 and IEC 61508 safety story.
POSIX certification is real and current and there are certified conformant systems out there.
Do note that the last certified Unix specification was UNIX V7. Unix is not POSIX. POSIX is not Unix.
Renaming is commonly proposed for rewriting files but does not provide the guarantees people usually think it does. POSIX renames are only atomic during normal operation, and there "atomic" merely means that other syscalls will never observe a partially-renamed state. The operation does not need to be atomic in the case of a crash, where "atomic" would mean "you can't read the file system in some partially renamed state after a crash" (a different definition, but the one people commonly assume when they hear that rename is atomic). I'm not aware of any normal Linux file system where renames are always atomic over a crash.
Yeah, the way you're 'supposed' (as near as I can tell) to do this is:
    read 'foo'
    write 'foo~'
    hardlink 'foo~' onto 'foo'
    unlink 'foo~'

    on crash:
        if 'foo' exists:
            it's the correct state for the file
            unlink 'foo~'
            continue
        otherwise:
            hardlink 'foo~' onto 'foo'
            unlink 'foo~'
I'm not sure POSIX actually requires link and unlink to be atomic enough on crash for this to work, but it seems to be fine for any sane filesystem (where the fs itself doesn't fall apart in the presence of crashes).
This doesn't work because the link() syscall fails if the target exists.
You use a hardlink only for atomic creation of the initial data:

    write 'foo~'
    hardlink 'foo~' onto 'foo'

To atomically update you simply use rename:

    read 'foo'
    write 'foo~'
    rename 'foo~' into 'foo'
I don't know what dan-robertson's claim is based on, but this surely is atomic even on a crash, as the filesystem will never be in a state where 'foo' contains only partially updated data. That would require some severe fs corruption.
It doesn't require severe FS corruption.
In reality "write 'foo~'" just starts an asynchronous write; it's possible for rename to happen before that async write is complete.
So in case of a system crash you might end up seeing partial data in 'foo'.
fsync() may or may not protect against that depending on what system we're talking about.
Sure, you are still dependent on the durability and atomicity constraints of the filesystem's write of the new data. But this is handled properly in most modern filesystems (ext3, ext4, ZFS) by default, as data=ordered ensures data writes precede metadata writes.
The point of rename() is to prevent a version from ever existing that mixes the old and the new contents, and it achieves that regardless of the filesystem.
ZFS is really well behaved. It never reorders metadata in an observable way, so renames should truly be atomic. In the event of a crash you'll get the old or the new name.
What I'd prefer over atomicity is automatic file versioning, such that a file descriptor opened at time t0 always points to the same contents unless you refresh it. You'd also need CoW and/or writing patches rather than rewriting the whole file, and a GC/compactor for cleaning up transient versions of a file.
The workflow this would enable is that, when vim wants syntax checking, it writes a transient version of the file to the fs without updating the “HEAD” pointer and then hands the syntax checker an fd that points to that new transient version.
How can a rename not be atomic over a crash? What does a partially renamed state even mean?
Surely, if I do "write foo~", "rename foo~ foo", regardless of any crash, foo is either pointing to the old data or the new data. There simply isn't any inode with any "partially renamed state" mixing the two.
Whenever I bring this up, I get a list of complicated platform-specific workarounds. My point here is that unit file behavior, as outlined, should be the default. You should never have an incomplete unit file around and readable.
Agreed. There are so many infuriating software bugs I've hit of the form "something was watching file X, tried to consume it mid-write, and pooped its pants because there was invalid data".
Not small things, either - I mean like "IIS crashed all our apps are down we have minutes before SLA problems".
Along those lines, one thing that has bothered me is that we can't prepend to files, even with modern SSDs with fancy controllers. Those controllers are already jumbling the contents of a nominally contiguous disk onto different chips, and different sections of those chips.
For things like MP4 metadata that usually need to be near the beginning of a file for the MP4 to be considered streamable, it can be quite cumbersome. Being unable to expand a file from the beginning means that fancy MP4 writers write a bunch of blank space (perhaps guessing the necessary size of the metadata portion), then after the stream data is written, go back and hope that they allocated enough blank space to fit the metadata within. If the writer was wrong, either it has to put the metadata at the end of the file (preventing streaming), or has to rewrite the entire file with a bigger buffer.
Perhaps it is a wish that requires too many layers to be changed to be possible, but I like to dream.
This is already possible on Linux with certain filesystems (notably xfs and ext4), as long as the size of the data you want to prepend is an integer multiple of the filesystem block size.
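Presumably that refers to fallocate's insert-range operation (a sketch, assuming Linux 4.1+ on ext4/xfs; blksz stands in for the filesystem block size):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>   /* FALLOC_FL_INSERT_RANGE */

    /* Insert blksz bytes of space at offset 0; existing contents shift
       toward higher offsets without being rewritten.  Both offset and
       length must be multiples of the filesystem block size. */
    int prepend_space(int fd, off_t blksz)
    {
        return fallocate(fd, FALLOC_FL_INSERT_RANGE, 0, blksz);
    }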
> Log files. Append-only mode, enforced. Can't seek back and overwrite. On a crash, the file recovers to some recent good write, with a correct end of file position. It does not tail off into junk.
As wahern says, append-only is available in some OSes. I am not aware of any that guarantees the "recovers to some recent good write" part, though. UNIX-like OSes won't support it, as they don't have a way to write more than one byte atomically (https://en.wikipedia.org/wiki/Write_(system_call): "The write function returns the number of bytes successfully written into the array, which may at times be less than the specified nbytes").
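To make the partial-write point concrete: the usual userspace answer is a retry loop, and it's exactly the loop that can't be made crash-atomic (a generic sketch):

    #include <errno.h>
    #include <unistd.h>

    /* write() may stop short, so a "single" append is really a loop; a
       crash between iterations leaves a partial record behind. */
    ssize_t write_all(int fd, const char *p, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(fd, p + done, len - done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            done += n;
        }
        return (ssize_t)done;
    }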
> - Log files. Append-only mode, enforced. Can't seek back and overwrite. On a crash, the file recovers to some recent good write, with a correct end of file position. It does not tail off into junk.
Linux already semi-supports this with chattr +a, which is filesystem-specific, plus filesystem-specific mount options (e.g. data=journal). Unfortunately, implementing the latter is non-trivial to do on a per-file basis.
> - Temp files. Not persistent over restarts. Not backed up. Can be memory-mapped as read/write. On a crash, disappears.
O_TMPFILE exists on Linux, and I don't know of any reason why it couldn't be added to POSIX.
> - Managed files. These are for databases. Async I/O available. May have additional locking functions. Separate completion returns for "accepted data" (caller can reuse write buffer) and "committed data" (safely stored), so the database knows when the data has been safely stored. Can be memory-mapped as read/write. Intended for use by programs which are very aware of the file system semantics. On a crash, "committed data" should be intact but data not yet committed may be lost.
io_uring exists now.
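For a flavor of how io_uring maps onto the parent's two-phase completion, a minimal sketch using liburing (assumes a recent kernel and linking with -luring; fd/buf/len are placeholders):

    #include <liburing.h>

    /* The write CQE means the kernel is done with buf ("accepted");
       the fsync CQE means the data is durable ("committed"). */
    int write_committed(int fd, const void *buf, unsigned len)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return -1;
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_write(sqe, fd, buf, len, 0);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);    /* "accepted": buf reusable */
        io_uring_cqe_seen(&ring, cqe);

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, 0);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);    /* "committed": safely stored */
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }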
These features are Linux-specific, but my point is that they do exist now already; there is no need to wish for them.
> File replacement can be done now through non-portable renaming gyrations.
I was under the impression that POSIX requires renames to be atomic, and have relied on that in embedded Linux systems with UBIFS. Are there exceptions among truly compliant filesystems and OSes, and/or is the atomicity guarantee not as strong as I thought?
It's atomic but it might not be persistent in the event of a crash. If the metadata wasn't committed then upon remounting the old file might reappear, and its likelihood of reappearing is independent of any other operations on that filesystem. Though traditionally rename was not only atomic but also preferentially ordered ahead of other metadata operations, so a later rename of another file wouldn't become visible unless the earlier rename did too.
EDIT: Actually, I think the issue I had in mind was that writes to the new file might not be committed before the rename, so if you do open + write + rename + close, on a crash the rename might persist even though the write didn't. You technically should use fdatasync if you want the write + rename to be ordered. See, e.g., https://lwn.net/Articles/322823/ But this is such unintuitive behavior that I think even for ext4 the more reliable behavior is still the default.
Yes. But traditionally Unix filesystems provided stronger consistency guarantees than required by POSIX, and applications have come to rely on them. Actually, in the event of a crash I think POSIX specifies implementation-defined behavior even for fsync.
Making fsync a no-op and not having any durability at all is perfectly POSIX compliant. POSIX only concerns itself with the visibility of I/O in the live system (e.g. a process either sees a write fully realized, or not at all; this didn't use to be true for larger writes until fairly recently, btw). POSIX specifies nothing about what happens when a system is restarted or loses power.
The dance can get even more complicated[0] if you also want to ensure that the rename itself is persisted. But you only really have to encode this in an atomic-file-write-with-callback library once.
I agree with you, for me the main limitation of this system is that you have to be sure that the temporary file exists in the same fs as the target, which in general involves creating the temporary file in the same directory. It works well, but I always feel icky creating temporary files in the middle of the filesystem (which may end up lingering if my program crashes before I could issue the rename syscall).
But beyond that I'm not aware of any portability issue, except maybe with weird filesystems like NFS in some non-standard configuration that doesn't implement strict locking.
It can be re-ordered w.r.t. a hard-link and any derived written block data (e.g. indexes) and I've seen this first-hand with early maildir implementations across a crash, and that was even without throwing the spanner of NFS into the gears.
There have been attempts at "phase tree" filesystems with essentially transactional write semantics for data and metadata simultaneously. Tux2 was one such nascent attempt that ended (rightly or wrongly) due to patent conflict with NetApp's WAFL [1].
The interesting thing is, I just took a look at what I believe were the patents in conflict, and they appear to have expired in 2015 [2,3]. Combining the phase tree concept with, say, Merkle trees might open up very interesting possibilities for large-scale reliable file storage.
Windows has transactional NTFS. It was never widely adopted, in part because it requires use of a whole different filesystem API (CreateFileTransacted instead of CreateFile, etc). I think, if it had been more seamless, it may have been more successful. Given it is rarely used, Microsoft has marked it as deprecated for potential removal in future Windows versions.
IIRC it was implemented with a pretty seamless API at the undocumented ntdll layer, which let you set a current transaction on a handle and then do API calls as usual.
The issue is that while NT mostly reasons about files as handles, Win32 deals primarily with the name for many operations. So they duplicated every api that took a filename.
Doesn't surprise me to learn that the NT-level API is nicer.
The NT API is in general much better thought out than the Win32 API. It is a real pity that Microsoft chose to leave the former largely undocumented and tell people to use the latter instead.
Anyone know of any file system which allows partial initial blocks? I.e., a file can appear less than a full block long by adjusting its length, and there's an offset into the first block where the file starts?
Then you could easily do size-constrained rolling logs by simply adjusting the starting point of the file.
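Linux gets partway there, though at block granularity only: fallocate's collapse-range operation (ext4/xfs) drops leading blocks and shifts the rest down without a rewrite. A sketch; offset and length must be block-aligned:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>   /* FALLOC_FL_COLLAPSE_RANGE */

    /* Trim blksz bytes from the front of a rolling log; the remaining
       contents shift down to offset 0 in place. */
    int trim_head(int fd, off_t blksz)
    {
        return fallocate(fd, FALLOC_FL_COLLAPSE_RANGE, 0, blksz);
    }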
The characterization in the headlined article that append-only has been "lost" since Multics isn't strictly true. It has been poorly implemented, but write and append rights are two separate things in NFSv4 ACLs.
Sadly, even though a system might have one or more of those features, few programs can assume a specific file system (and many not even a specific OS!). So you'd have code branch into two parts: one where this is managed by the file system, and the other where it is managed in user code (e.g. write to temp file, rename, etc.). This is the curse of the lowest common denominator. Even innovation makes it worse (more code, not less).
Sounds similar to a version control system to me - except that the version control system stores deltas. But there is nothing preventing it from periodically storing snapshots.
Snapshots in a journaled file system would serve the same purpose. Of course, journaled file systems could also handle the meta data snapshots
Article describes Multics’ ability to have a file's directory entry on disk but its contents on tape, so trying to access its contents will cause it to be retrieved from tape, and then asks "Is something like this offered on any POSIX-compatible file system?"
What the article is describing is basically just HSM (Hierarchical Storage Management), which is a commercially available technology – e.g. Sun/Oracle SAM-QFS on Solaris, IBM Tivoli Storage Manager on AIX, DFSMShsm on z/OS.
Windows NTFS also supports HSM, although the core NTFS itself only provides features necessary to implement HSM (such as FILE_ATTRIBUTE_OFFLINE and reparse points), and you need an add-on to Windows to actually use those features to produce a full HSM solution. (Actually Windows itself used to include such a solution, Remote Storage Service, but Microsoft removed it in Windows Server 2008 onwards; but the underlying functionality is still there in NTFS, and available for third party HSM implementations to exploit.)
Dropbox Smart Sync [1] is HSM using reparse points on Windows and kauth on MacOS. We prototyped using fanotify on Linux, but there were a number of edge cases around moving files and permissions that we weren't comfortable with (if I recall correctly).
After we shipped HSM, Microsoft rebuilt the functionality into their Cloud Sync APIs which are used by OneDrive and others [2]. On MacOS, the File Provider APIs [3] provided similar functionality for Cocoa apps using Cocoa APIs but not for POSIX (which made it a no-go for us).
You could also implement this pretty easily in FUSE: if the dirent is present on the underlying (ext or whatever) filesystem, just forward the operations, otherwise leave the syscall blocked and hunt down the relevant backup. I don't know that anyone's actually written that, though.
Nothing about this is “easily” once you work through the edge cases for performance and reliability. The only places which need HSM have enough data volume and range of applications to stress any simple solution (e.g. with the approach you outlined: what happens when someone runs find on that volume?). One of the more interesting challenges is how to deal with not having quite enough fast storage to batch the slow storage. Your system can appear to work well with one test workload and then fail miserably when two people start running different tasks at the same time.
After a couple decades of this, I generally think this class of software is a mistake. Any time you misrepresent one class of storage as another it inevitably leads to very complex software which is still pretty fragile and confuses its users on a regular basis, and the cost savings never deliver to the hoped-for degree.
Well obviously the performance is going to be terrible, but it's probably better than the zero performance you get if you block everything while waiting for the system to fully restore from backup.
> what happens when someone runs find on that volume?
It stalls until all the directories are restored? And hopefully pushes those directories to the front of the to-be-restored-from-backup queue, but even without that it's still better than not being able to run any operations on that volume.
It can be worse than zero: your tape drive get hit with lots of small file requests, running much slower than it would be to stream a restore of a large batch containing all of the files you need, and causing increased failure rates on the hardware and media because tape drives are designed to stream, not seek. I’ve had to explain this to multiple HSM admin teams who were trying to save a few bucks on staging HDD capacity and surprised to see it taking over a month to restore a terabyte of data (not joking - and that was with multiple drives!) and hardware failing at like 5x the manufacturer’s estimates.
What you’re trying to do is akin to saying you can write an interface layer to make a railroad look like Uber: at some point the fundamental differences between the architectures are too much to paper over. The situation has improved now that the major operating systems have offline file support so you can make it more obvious that some files are not instantly available but you still need all of your client software to handle that gracefully.
Except on Multics it was supposed to be not at the file level but at the segment (think block) level. In theory it was one set of segments not all accessible at the same speed, with “filesystem” just a thin layer grouping some sets of segments together.
I’m not sure that design actually survived to the real world; files seemed more coherent to me, but that could have been me projecting: I was pretty young back then.
My understanding is that on Multics, segment = file.
Multics had a segmented memory model, much like segmented memory models on the 286 and 386 – indeed, Multics was one of the influences on the designers of the 286 and 386 – although newer x86 operating systems moved to flat memory model instead, so segmented memory only ever saw significant use on 16-bit versions of Windows and OS/2.
What made Multics unique was that all files were mmaped – opening a file gave you the ID of a memory segment, which you'd then use much as you'd use a segment selector on x86.
segment
User-visible subdivision of a process's address space, mapped onto a storage system file. Each segment is 1MB long, and has a zero address. The term "segment" is used interchangeably with "file" -- except not really: the things that are files in other systems are implemented as segments; also, the term "file" includes multi-segment files, and when talking in terms of COBOL, PL/I, or FORTRAN language runtime objects, one speaks of files. Programs are spoken of as stored in (procedure) segments. Correct use of the terms "file" and "segment" is a sure sign of a Multician.
Surprised no mention of plan9's FS. It always seemed like the core innovation in plan9.
Basically, everything is a file. But to a ridiculous extent, far beyond what you'd normally think of as files. It's been years since I looked into it though, so maybe I'm misremembering.
Designing it that way has lots of advantages. For example, you can connect computers together via networks using the equivalent of `cat`. (And yes, we have netcat, but it's not quite the same thing as having the abstraction built into the OS.)
I am not entirely intimate with plan9, so take this with a grain of salt.
IIRC, everything was a file, down to the graphical UI. This allowed you to interact with the GUI through the file system, script things, etc. Network connections and computers are also available through files.
But even more impressive is that this filesystem builds on a relatively simple protocol[1]. Programs and computers can implement this to expose a filesystem view of their internals. The protocol is still used quite a bit, to share files across VMs for instance [2], and more generally when serializing filesystem data over a network link is needed.
I think this allows you to do fancy stuff such as remotely controlling a window on another device, or displaying it on an arbitrary number of machines.
I wish Linux used the filesystem that extensively, to replace things such as d-bus, pulseaudio, gobject, gstreamer, wayland, etc. GNU Hurd has a similar, very interesting mechanism, translators [3] (basically first-class FUSE objects).
Essentially yes. Here's how you get a graphical session on a remote machine: export your local /dev/draw, /dev/mouse, etc. via 9P to the remote machine, then run the GUI on that remote machine. It'll write to /dev/draw, read from /dev/mouse, and so on with no idea that all those file operations are actually traversing the network.
Thinking about it a bit more, indeed. Unless you compress the data of course, which could theoretically be done transparently by the graphical session that owns the framebuffer device.
But I think that /dev/draw (as the name implies) was likely a drawing API? A bit like X11, when the only thing GUI apps did was to draw primitives like circles, lines, text...
If you look at the trend on "mainstream" OSs, it seems to be going the other way. Google and friends have been working hard to move the networking stack upwards into the userland, because they control the userland and not the kernel. I also blame Microsoft who sucks at providing timely and non-intrusive updates to their kernel, and since it's closed source nobody else can really do anything.
I always thought, for instance, that SSL/TLS ought to be implemented in the kernel, with a userland daemon dealing with certificate verification. That's how Linux implements the spanning tree protocol, for instance: the actual bridging is fully done in-kernel and there's a userland daemon tasked with the higher level details of the protocol.
This way certificate stores and critical updates would be centralized in one location, instead of having dozens of programs linking their own OpenSSL and using various certificate registries.
> I always thought for instance that SSL/TLS ought to be implemented in the kernel
Linux provides TLS sockets. Only the bulk transmission is handled by the kernel; handshaking and renegotiation have to be handled by userspace. That allows offloading to hardware accelerators. In theory you can sendfile() from an NVMe drive through a network card, with crypto/compression handled along the way, without the data ever touching main memory via the normal I/O syscalls, i.e. with no userspace modification other than opening a TLS socket instead of a TCP one.
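Roughly how such a kernel TLS socket is set up, following the kernel documentation (a sketch; the crypto parameters come from a handshake done in userspace, e.g. by OpenSSL, and are left unfilled here):

    #include <linux/tls.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    int enable_ktls_tx(int sock)
    {
        /* Attach the TLS upper-layer protocol to an established TCP socket. */
        if (setsockopt(sock, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        struct tls12_crypto_info_aes_gcm_128 ci;
        memset(&ci, 0, sizeof ci);
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        /* ci.key, ci.iv, ci.salt, ci.rec_seq: taken from the handshake. */

        /* After this, plain write()/sendfile() on sock is encrypted by
           the kernel (or offloaded to a capable NIC). */
        return setsockopt(sock, SOL_TLS, TLS_TX, &ci, sizeof ci);
    }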
> If you look at the trend on "mainstream" OSs, it seems to be going the other way. Google and friends have been working hard to move the networking stack upwards into the userland, because they control the userland and not the kernel.
Any links on what they're doing with this? Regardless of intention, I'm curious to see.
AIDL means Android IDL, which is the mechanism for Android IPC, also used by Treble for communication with modern driver model (classical Linux drivers are "legacy" drivers from Project Treble point of view).
If you implement, for example, the network stack as a filesystem, and you have generic facilities which let you bind "files" from one system to another, you can forward your traffic through another host as easily as:
    import remote-host /net
Now your new network connections will go through remote-host, because a network connection is just reads and writes to files in /net.
Lots of neat applications when you take a few simple ideas and expand them as far as you can: represent everything as files in a namespace, provide advanced tools to manipulate that namespace, and do it all via a networked protocol.
Nope, because the connection to that remote host was established when we were using the local host's /net. This is one of the other key features of Plan 9: per-process namespaces (inherited by children). Only programs inheriting the current process's namespace will use the imported network stack: you open a shell, you run "import remote /net" to bring in the remote network stack, then you run your network applications within that shell. Applications in any other shell will use the regular local network stack. After importing /net from the remote host, you could also run `rio` in that shell, which will start a new instance of the GUI nested within that window, in which all applications will use the imported stack.
How did Plan 9 allow its APIs to evolve? For example, reading /dev/mouse returns records that are exactly 49 bytes, in text format [1]; how did they add new fields without breaking every app?
As a sibling notes, they could have just added a new file while supporting the old format, but they had some leeway in the format itself: the buttons value only uses three bits.
Had they been a little more clever and used a delimited format like S-expressions, they could have simply specified that each mouse event would generate one S-expression, and clients could have read one expression at a time.
I’ve always wondered how Plan9’s “everything is a file” didn’t open up a billion race conditions. Did they solve this problem, or was this just something that was papered over?
If this is what it sounds like, would it be like making a /path directory and defining it as a union of all the directories containing executables? Almost like an efficient and idiomatic way to create a single directory containing symlinks to all of your executables. How would it handle things like path precedence?
That's exactly what it is, Plan 9 does not have $PATH. You just mount directories with binaries under /bin. When multiple binaries have the same name, the one from the last mounted directory is used, masking the others. You could have something like:
    mount /usr/bin /bin
    mount /usr/local/bin /bin
    mount ~/.local/bin /bin
Since binaries in Plan 9 are usually under additional subdirectories, conflicts are rare. For example instead of 'ping' you have 'ip/ping' which would be expanded to '/bin/ip/ping'. The same way binaries from different system architectures can live in the same filesystem, you just mount the directory for relevant architecture under '/bin'.
> If this is what it sounds like would it be like making a /path directory and defining it as a union of all the directories containing executables. almost like an efficient and idiomatic way to create a single directory containing symlinks to all of your executables.
That's basically it. Although 'defining' in this case really means 'binding directories into it.'
> How would it handle things like path precedence?
By ordering: first bind the main x86 binary directory into /bin, then bind your custom stuff ahead of it (or after it); as I recall, you could bind to the head or tail.
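In Plan 9's C API that's the bind(2) call (a sketch; the directory is hypothetical):

    #include <u.h>
    #include <libc.h>

    void
    main(void)
    {
        /* Union a personal bin directory into /bin.  MBEFORE puts it at
           the head (it wins name conflicts); MAFTER appends at the tail. */
        bind("/usr/glenda/bin/rc", "/bin", MBEFORE);
        exits(nil);
    }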
This is incidentally why the Common Lisp file I/O functions are a bit more complicated than you might expect. You can ignore most of it if you assume something vaguely POSIX-like, but if you want to be portable to e.g. both Unix and VMS (as was once desirable), there are functions like make-pathname that build a representation of a file location from up to 6 parameters (host, device, directory, name, type, and version). The 3 examples in the docs are interesting: http://clhs.lisp.se/Body/f_mk_pn.htm
This is a good read. I believe that Steve Bourne (of Bourne Shell fame) implemented the file system in UNIX at Bell Labs but I may be mis-remembering that. He was also a big Multics fan.
What is unsaid in this article is that nearly all file systems that are in widespread use today, started when "disk space" was a constrained resource. It is a very reasonable thing to ask, "now that stable storage space is much more plentiful, how might we design systems that are better than the current ones?"
The ability to scavenge blocks to re-create state is a good example of that.
One of the cool things about WAFL (the Write Anywhere File Layout) system that NetApp used (uses?) was that its very design made snapshots 'trivial', since every write to disk was to un-allocated blocks. What that meant in practice was that the file system on disk was always sane. This was what let you pull the power from the box at any time, and, assuming its non-volatile RAM was still available, it could always recover. Something that you could do with Intel 3D XPoint memory and a bunch of disks.
Microsoft research built a number of interesting file systems, some more successful than others, which incorporated many of the ideas from Multics and other OSes.
I was expecting another article about the current crop of less than POSIX filesystems, but I was pleasantly surprised to find it was about older systems with features absent in POSIX. Very interesting stuff. Kudos. I wonder if I can find some information about MTS's filesystem, which also had some neat features especially with respect to access control (PKEYs). Might be a worthwhile addition.
The whole point of the Unix FS was that the developers felt the mainframe approach (~"files are actually sorta like databases") was too complex and heavyweight.
What the OP should teach us is that there wasn't any one mainframe approach. There were many approaches, each involving many components. The process of standardizing what we now know as POSIX involved pruning a lot of unnecessary pieces, but it also inevitably involved leaving out some features that might actually have been useful. Some of them (ahem ACLs) even had to be added back in later versions of POSIX. Just as we should never stop looking for new ideas that can make our lives as programmers and users easier, we should also never forget old ideas whose time might well have come back around. It has happened too many times for the possibility to be ignored except by fools.
The MTS file system was non hierarchical, and the data model was not stream-of-bytes but rather line-numbered-record file. That said, it had some nice features, including append-only access, and program keys.
As I recall, fairly complete MTS documentation can be found at Bitsavers, and one can download a complete MTS distribution to run under emulation, see the Wikipedia article for a link.
I was fortunate enough to be working for Basho and attended the RICON where mrb gave his distributed systems archaeology talk. It really highlighted for me how so many important ideas are captured in historical whitepapers and often forgotten today.
Acorn's "ADFS", as used in RISC OS, uses "." as the directory separator. Fully qualified paths look like this:
fs::drive_id.$.directory.directory.filename
Where "$" means "root directory". (On network filesystems, you can also use "&" which means "home directory".)
The top level identifier is the filesystem type, usually "adfs", which is a slightly unusual way of doing it.
Bringing in files from other systems, which invariably have filename extensions, involves converting the . to a /, so you end up with filenames like "readme/txt". ADFS stores the file type not as a filename extension, but as a three-character hex ID in the filesystem instead. (Text, for example, is FFF.)
On the BBC Micro-era pre-hierarchical DFS, there was a single level of single-character directories, so files were called things like “d.file”. $ was the default directory rather than the root.
The Acorn/Norcroft ARM C compiler used c and h directories which effectively swapped the file extension around, so sources were named like c.main, and #include would try swapping name and extension for compatibility. It was about as Unix-flavoured as was possible on the Archimedes...
Having used both Multics and the Alto, I can add a few points:
Multics as implemented used a slightly different syntax, such as > for path separation.
Segments were not file descriptors but were intended to be what might today be called blocks. The original idea was that the entire memory would essentially be a single address space, with segments being blocks of memory that might be in core, on disk, or on tape (called very slow storage if I remember correctly but it’s been decades). Security was at the segment level, so you could for example have (in posix parlance) an suid shared library that an ordinary program could, with appropriate authentication and permissions, call into in a controlled manner. Multics’ permission structure was more fine grained than the binary of suid. You can see how the backup system kind of falls out of this automatically. I trust some multician will step up and correct any memory-corruption-based thinkos in the above.
Of course reality didn't quite match the dream, and Multics was cancelled before some of that research could be completed. But if you look at the x86 segment registers, they could implement something like that. I think this was also in the Intel iAPX 432, but fortunately even the name of that ancient dead processor is hazy in my mind.
As for the alto filesystem and its descendants: the labels don’t have to be next to the blocks they describe. After all Unix filesystems have multiple copies of their basic breadcrumbs at least. There’s certainly plenty of room for these in modern storage systems; every modern drive, whether spinning or ssd, does a small amount of this in block remapping and wear leveling.
I would guess there are quite a few filesystems lost in history. My personal favorite will be the Newton's Soups. A modern version with replication would be amazing.
A VMS programmer in the late 80's told me the current direction the industry was moving towards made him sick to his stomach. Microsoft's FAT was one thing. But POSIX was the real trash.
VMS had four different file types out of the box: serial with carriage-return carriage control, serial with FORTRAN carriage control, random-access, and ISAM (indexed sequential access method). Three of those worked as well from a bank of open-reel TM03 devices or TU-58 carts as they did from a cluster of DECNET drives.
You could, of course, extend those models by plugging in to the rab$, fab$, and xab$ layers or write your own from scratch using the sys$ layer.
Unix, in contrast, had only the random-access file model. You could use just that file model to build any of the other types in userspace, but to talk to a tape drive you had to use the tape archive (tar) utility.
It's just the usual tradeoff about where you draw the line between a solution and a tool.
The headlined article is primarily about things that have been forgotten. ODS hasn't been forgotten. Much of it exists pretty much as-is in NTFS, with just some names changed.
FUSE effectively provides a POSIX-like interface to arbitrary code, whereas the author is lamenting that these features weren't built into Unix/POSIX-like systems to make them widely available in the first place.
The "problem" with modern POSIXish systems is that the definition of what is "POSIX" seems to be set in stone. All the various Unix-likes offer POSIX compatibility (or aim toward it) while new features that extend capability end up being implemented in completely different ways across different systems.
So for example while Linux has inotify (arguably an implementation of the filesystem traps that Multics had), SGI had FAM, FreeBSD has kqueue, MacOS has FSEvents, and they're all incompatible with each other... a nightmare for developers of portable applications.
>"The very first hierarchical file system was developed for Multics. It is described in the paper A General-Purpose File System For Secondary Storage (1965) by Robert C. Daley and Peter G. Neumann. There are several things that I find astounding about this paper:
There were apparently no hierarchical file systems before Multics. The references do no cite any previous work on this and I haven’t found any."
(PDS: As an amateur computer historian, I'd be interested in this myself, if there existed any hierarchical file system before Multics, or if one was conceived in any academic paper before the one cited...)
[...]
>"Directory entries that point to secondary storage. This is a game changer for file system management. More on this below."
(PDS: Which, to this day, is one of the great abstractions, one of the great ideas, in computing!)
For anybody who has the desire and opportunity to interact with something currently in production and truly alien-feeling, I warmly suggest OS/400's filesystem on IBM's AS/400 (now iSeries). Libraries versus folders, logical versus physical files, and the integrated DB2 database are just a few of the many, many head-scratchers you'll encounter until you suddenly “_get it_”.
Portability was important and so more complex capabilities were not sufficiently used to survive and spread. I wonder how the similar struggle ends for advanced cloud features.
Does anyone remember Novell NSS? Even when it got ported to Linux as part of Open Enterprise Server it lacked a lot of POSIX features. The ACLs were the weirdest part, since they had to be configured in eDirectory's LDAP console.
What I really want that isn't there in POSIX is not directly a file system feature but something more general: I really want some sort of "middleware-system" to intercept all sorts of events (like file system access, binary execution, network or device access etc.). There should be multiple intercepting programs that handle one or more events each and can decide to block or pass the event to the next interceptor. They could also log or modify parameters of the event (like redirect a file read or wrap a binary that's about to be executed in another program like script or torsocks).
You could even use this system to implement some Unix features as interceptors: Shebangs and even file system permissions could be handled this way. You could also implement containers with this or provide some kind of "switchboard" UI akin to uMatrix for letting the user decide on permissions.
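Linux's fanotify covers a slice of this for file system events specifically: a privileged interceptor is consulted before each open and can allow or deny it. A sketch (needs CAP_SYS_ADMIN; the path is a placeholder):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/fanotify.h>
    #include <unistd.h>

    int main(void)
    {
        int fan = fanotify_init(FAN_CLASS_CONTENT, O_RDONLY);
        if (fan < 0)
            return 1;
        /* Ask to approve every open() on the mount containing /home. */
        fanotify_mark(fan, FAN_MARK_ADD | FAN_MARK_MOUNT,
                      FAN_OPEN_PERM, AT_FDCWD, "/home");
        for (;;) {
            struct fanotify_event_metadata ev;
            if (read(fan, &ev, sizeof ev) <= 0)
                break;
            /* A real interceptor would inspect ev.pid and the file
               behind ev.fd before deciding. */
            struct fanotify_response r = { .fd = ev.fd, .response = FAN_ALLOW };
            write(fan, &r, sizeof r);
            close(ev.fd);
        }
        return 0;
    }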
It would be interesting if a "file type" existed to provide these more complex operations, and if those could also be bound to a daemon running with provided parameters.
This would allow for a filesystem name linked to a database API / any other daemon or program.
I also love the TRAP (related to above) and both directory operations. Quotas can sort of fulfill the space limits thing (but I've never bothered using such a system and haven't been on a system that did). Append Only directories are a far harder beast to slay, but would very nicely fulfill many queue submission systems.
No mention of the immutable content addressable hash file system presented in the Artifacts System[]. One of the benefits of a system like this is that you can have multiple versions of a library coexisting without conflict with your requirements system. Each program can reference a different hash, duplicates are easily made unnecessary because they resolve to the same hash, etc.
Sounds like having your libraries stored in git repos and then having your programs point to the commit they want to use. Something like this would solve headaches that there are innumerable tools for. A much more elegant solution for the problems things like pyenv or nix are attempting to solve.
> Why can your browser run sudo? ...Suppose only your SSH client had the operation required to use your SSH keys.
Perhaps I’m missing something, but why could a malicious application (say, Chrome in the context of a browser running sudo) not also include the capability to access the file system with impunity?
It seems that Hydra simply turns the technical problem of managing filesystem protections into a political one of auditing the applications we use, which may have been practical in the '80s but is now demonstrably a failed method of protecting user data.
Interesting POSIX history. It used to come down to Windows/Mac and POSIX, but for the past 20 or so years we've been seeing new filesystem styles emerge, which blend cloud and on-device storage. Navigating the iOS file save menu is a good example. 3rd party apps like Google Drive and DropBox hook into the OS and show up in the list. I think this is probably the way things are headed, and POSIX and the "Windows/DOS/Classic Mac" style will both fade away.
POSIX will last forever because it's simple, standard, solid, portable and fairly flexible (see FUSE). DOS will last forever because inertia is the only reason Windows still exists, and MS knows this and prioritizes backwards compatibility; e.g. it's still basically impossible to name a file CON on Windows 10 because of backwards compatibility.
That's a shortsighted and very programmer-centric view of what a filesystem is.
Increasingly we are seeing a bifurcation between the kind of filesystems developers use for programming (POSIX style) and the kind of "filesystems" being exposed to end users. I put the word in quotes because they are not thought of as filesystems by its traditional definition. But it's a reality that many users now think of these siloed-by-app systems as the real filesystem.
The iOS menu mentioned by GP is a great example because that's what's user facing, not the developer-facing one like /private/var/mobile/Containers/blah-blah-blah. Today even many developers won't have to deal with POSIX-style filesystems that often, since they generally write code to store structured data in a database (sqlite on mobile, more sophisticated ones on the server side) while large blobs are stored in services like S3, making the idea of a POSIX filesystem quaint.
My own prediction is that POSIX filesystems in another decade or so will be like assembly language today: it's still there, still being taught and learned, but users and a majority of developers won't need to know about it.
Casual users' confusion about the presentation of technical details is a different conversation. This one is about actual filesystems.
It may well be that end users are abstracted away from their filesystems, but you seem to assume both that 'The Cloud' is the natural end-state of systems evolution and that everyone comes along for the ride.
I think it’s very typical of very casual and exploitable users to have product names deeply connected with a certain activity — eg. using Google or Zoom as verbs — and “cloud” storage companies would like them to think of their brand name as the main storage area on their devices (and certainly not as just rented space on normal drives in a big room full of someone else’s computers), and I think they are and will be successful, and these users won’t think of them with generic terms like “filesystem”.
> There are reserved device names in DOS that cannot be used as filenames regardless of extension as they are occupied by built-in character devices....
That's my bad (I thought it looked weird when I typed it). Still, the same syntax works with CON and any other reserved DOS devices:

    echo "hey DOS" > \\?\C:\temp\CON
    del \\?\C:\temp\CON
While it can be "trivially" done, it's a hit or miss if any application actually supports it. You effectively need to deal with the NT path prefix \\?\ when dealing with the Win32 APIs to be able to open a handle to the file. The prefix essentially tells Win32 to back off and get the handle directly from the NT object manager.
Windows still exists because no FOSS UNIX clone can get its act together on what it means to provide a proper desktop experience, both to users and to developers.
KDE and GNOME are quite close to it, but keep being ostracized, so one is left with Android and ChromeOS where the Linux kernel is an implementation detail.
Fascinating to see other possibilities that were out there. Were the capabilities in Hydra the inspiration for what Fuchsia OS has, or are they a long standing concept?
Hydra and Fuchsia are both operating systems that use Capability-based security. Hydra would have been one of the very earliest ones, while Fuchsia is a very new one. There have been many others in between, like KeyKOS or EROS, as well as use of capabilities in "normal" operating systems, like FreeBSD's Capsicum, or arguably even just passing file descriptors over Unix sockets with SCM_RIGHTS on any POSIX system. There have also been capability-based programming languages like E and Pony, and capability-based network protocols like CapTP or (my own) Cap'n Proto. So it's unlikely that the designers of Fuchsia directly based it on Hydra, but there was probably indirect inspiration, yes.
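Since SCM_RIGHTS is arguably the most accessible of these, a minimal sketch of the sending side (sock must be a Unix-domain socket; receiving mirrors this with recvmsg):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Pass an open descriptor -- a capability, in effect -- to the
       process on the other end of a Unix-domain socket. */
    int send_fd(int sock, int fd)
    {
        char dummy = 'x';   /* must transfer at least one byte of data */
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
        union {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } ctrl;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl.buf, .msg_controllen = sizeof ctrl.buf,
        };
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd, sizeof(int));
        return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
    }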
> When you use a file system through a library instead of going through the operating system there are some extra possibilities. You are no longer required to obey the host operating system’s semantics for filenames. You get to decide if you use / or \ to separate directory components (or something else altogether)
It is really about time, in 2020, that someone (Apple, MS, Linux, FreeBSD, Plan 9, I'm looking at you) implemented this!
This is a great article. For anyone wondering about the current data storage stack on Unix-like systems today, I wrote a simple article that goes over it [1].
Maybe another filesystem worth mentioning would be WinFS.
It should have been one of the major features of Windows Longhorn, which later became Vista.
Simply put, it was a relational database, with features you typically find in an RDBMS like Postgres or Microsoft's own SQL Server (which later took advantage of some of the work done on WinFS).
Comments usually trail upvotes, and placement is an algorithm of upvotes to age (plus the occasional editorial tweak from admins). My guess is that 5 upvotes in 5 minutes is enough to get a front page placement during certain times, and from there as long as the title seems interesting regular effects of popularity take over.