If you click through to the actual paper at the bottom of the article, the methodology here is just astonishing: they start by, without evidence, asserting that the main thing fork() does is to insert and remove processes from a list of processes. Then they write a program that creates a table of processes in a distributed database and adds and removes rows from it. Then they compare the speed of that with the speed of actually running fork() on the same cluster of machines. That’s the entire paper. You can’t make this stuff up.
“I am not able rightly to apprehend the confusion of ideas that could provoke such a question.” - Charles Babbage
> they start by, without evidence, asserting that the main thing fork() does is to insert and remove processes from a list of processes
Since fork(), as far as I understand, is effectively copy-on-write on modern Linux, that's probably not a terrible model for the initial step at least, but it does seem kinda over-simplified once you think about the rest of the process's lifetime and the different execution models of processes on a single computer vs. a supercomputing cluster.
What looks like it happened is that they had a cool enough idea: with supercomputers, the scheduler is kinda like the operating system, and in some workloads the scheduler becomes the bottleneck. So why not replace the core data structures and operations with a DBMS, which is kinda built for exactly this sort of highly distributed, high-contention state management situation? Also, the larger general idea of comparing it to an OS sounds curiously close to the Mesos / Kubernetes worldview in some ways: "the operating system of your datacenter".
Then some annoying but probably correct person was like "hey we need a basis for comparison and a benchmark because this is an academic paper and that's how systems papers must be because academia in 2018" so they shoehorned in some apples and oranges comparison and we got this.
> Then some annoying but probably correct person was like "hey we need a basis for comparison and a benchmark because this is an academic paper and that's how systems papers must be because academia in 2018"
That person was correct that a benchmark is needed, if only to validate their mathematical model. But they hardly benchmark the correct thing. Adding a process to the process tree is only one of the things that happens when you fork() a process, and frankly the most trivial one.
Quite unsurprisingly, sparse matrix associative multiplication in a distributed system (which is how they simulate forking) is faster than a local fork() implementation that actually implements process creation. There's no mention of how well this DBMS-backed scheduler deals with process and resource management, process group management, core- or node-pinning, process tracing and so on.
Edit: to clarify this point with a trivial example. In Linux, fork() relies heavily on copy-on-write, so that process creation can be as cheap as possible. When a child is created, it doesn't get new memory pages -- by default, it just shares the parent's. A long time ago, on systems that didn't have an MMU, the system would copy the entire memory space of the parent (!!!); nowadays it just copies a couple of system structures (which is not trivial either, because there's potential for race conditions there). Only when a process wants to write to a memory page does the operating system actually create a copy of that page for the process doing the modification.
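To make the copy-on-write part concrete, here's a toy userspace illustration (my own sketch, not kernel code; Linux-specific, since it reads /proc/self/smaps_rollup, which needs a 4.14+ kernel). Right after fork() the child owns almost no private dirty pages even though the parent touched 256 MiB; private copies only appear as the child writes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* Private_Dirty from /proc/self/smaps_rollup: pages this process has its
     * own dirty copy of. COW pages still shared with the parent are counted
     * as Shared_Dirty instead, so this number starts near zero in the child. */
    static long private_dirty_kb(void) {
        FILE *f = fopen("/proc/self/smaps_rollup", "r");
        char line[256];
        long kb = -1;
        while (f && fgets(line, sizeof line, f))
            if (sscanf(line, "Private_Dirty: %ld kB", &kb) == 1)
                break;
        if (f) fclose(f);
        return kb;
    }

    int main(void) {
        size_t len = 256u * 1024 * 1024;   /* 256 MiB of anonymous memory */
        char *buf = malloc(len);
        if (!buf) return 1;
        memset(buf, 1, len);               /* parent dirties every page */

        pid_t pid = fork();                /* cheap: the child shares the pages */
        if (pid == 0) {
            printf("child private dirty after fork:    %ld kB\n", private_dirty_kb());
            memset(buf, 2, len / 2);       /* each write faults in a private copy */
            printf("child private dirty after writing: %ld kB\n", private_dirty_kb());
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
    }

The exact numbers will vary, but the first figure should be tiny and the second should be roughly half the buffer size: that's the page-fault work that the "add a row to a table" model leaves out entirely.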
Managing memory pages isn't all that easy, because the memory pages aren't a software-only construct; they're backed by an actual hardware device (the MMU) which has a sort of cache (the TLB) because, as the authors like to remind us, this ain't a PDP-11 anymore. fork()-ing 2^32 processes at the same time is not a trivial problem, but it's a piece of cake compared to 2^31 processes saying "hey yeah I kinda need a new copy of this page" at seemingly random times, some of them in very particular patterns that need to be serviced really quickly -- especially when some processes request pages one at a time, while others request their whole space because they do exec().
It's anyone's guess how well a DBMS would handle this sort of stuff. Frankly, I don't know enough about DBMSes to do more than speculate -- but I don't see this article shedding light upon this problem, either.
Edit done :).
The conclusion that "... the simulations show that TabulaROSA has the *potential* to perform operating system functions on a massively parallel scale." (emphasis mine) is greatly inflated. The simulations show that it can efficiently perform a small part of one operating system function; if you squint and you're optimistic, you can maybe conclude that they show it has the potential to implement process creation on a massively parallel scale.
Even later edit: I feel like this is one of those upsetting cases of re-discovering something we already knew. There's a lot of literature dating from the 1990s about how one could manage processes running at a massively-distributed scale, and while I recall most of it too dimly for actual details, I think there was a widespread feeling that whatever orchestration solution we finally figure out is going to look a lot like a DBMS. This paper doesn't seem to cite any of it, though, and so it unsurprisingly doesn't try to solve any of the many problems that have been shown to exist.
I read the paper completely differently than you. I think you were influenced by the title of the news article. The paper seems to me very reasonable:
Given the goal of implementing an architecture for part of the tasks of the OS of a 32,000+ core supercomputer, can we use mathematics to model such an architecture? The answer is yes: the mathematical model is defined, and its simulation is compared with the equivalent implementation provided by current Linux.
The way I see it, it was never about "replacing the whole Linux" but about "how to better implement some piece of an OS that has to manage e.g. 32,000+ cores and simultaneous forkings on them."
Bizarre. Linux must already have some kind of "database" system for keeping track of system resources. At best, the claim would be that Linux is using an inefficient database, which seems unlikely.
You would absolutely be surprised how far you can get using a process’s memory as its official database and just dumping it out to disk periodically for persistence. I kind of want to write an article, but an article can’t do it justice: it’s something you need to experience yourself to fully appreciate and be wowed by, like I was. I’m sure Linux uses files for its persistence layer in most cases.
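A minimal sketch of that idea, with made-up record and file names (toy C, nothing from a real system): keep everything in plain memory, and periodically write it to a temp file and rename() it into place so a crash mid-dump can't clobber the previous snapshot.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct record { long id; char name[32]; double balance; };

    static struct record table[1024];   /* the "official database": plain memory */
    static size_t nrecords;

    /* Write the whole table to a temp file, then rename() it into place;
     * the rename is an atomic replace on POSIX filesystems, so readers only
     * ever see the old snapshot or the complete new one. */
    static int snapshot(const char *path) {
        char tmp[256];
        snprintf(tmp, sizeof tmp, "%s.tmp", path);
        FILE *f = fopen(tmp, "wb");
        if (!f) return -1;
        if (fwrite(&nrecords, sizeof nrecords, 1, f) != 1 ||
            fwrite(table, sizeof table[0], nrecords, f) != nrecords) {
            fclose(f);
            return -1;
        }
        fclose(f);
        return rename(tmp, path);
    }

    int main(void) {
        table[nrecords++] = (struct record){1, "alice", 10.0};
        table[nrecords++] = (struct record){2, "bob", 20.0};
        /* In a real service this would run from a timer or a forked child;
         * anything written since the last snapshot is lost on a crash. */
        return snapshot("state.db") == 0 ? 0 : 1;
    }

The obvious trade-off is that last comment: everything since the previous dump is gone if you crash, which is where append-only logging comes in.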
That’s Redis’ initial persistence design, which is still available, but nowadays most people use it with an append-only log because it provides more resistance to random crashes (e.g. the power going off).
Yup, I believe the kernel pretty much just uses linked lists (at least, all of the running processes are kept as a linked list). But when most of your resources only number in the tens of thousands, if that, a linked list works perfectly well. And I’m not a kernel hacker so I’m sure there are optimizations I’m not aware of.
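For the curious, the shape of that is roughly the following (a userspace toy, not actual kernel code, and it ignores the extra structures the real kernel keeps for fast pid lookup): insertion and removal are O(1), lookup is a linear scan, which is perfectly tolerable at tens of thousands of entries.

    #include <stdio.h>
    #include <stdlib.h>

    struct task {
        int pid;
        struct task *prev, *next;   /* doubly linked so removal is O(1) */
    };

    /* circular list with a sentinel head node */
    static struct task head = { .pid = -1, .prev = &head, .next = &head };

    static struct task *task_add(int pid) {
        struct task *t = malloc(sizeof *t);
        t->pid = pid;
        t->prev = &head;
        t->next = head.next;
        head.next->prev = t;
        head.next = t;
        return t;
    }

    static void task_remove(struct task *t) {
        t->prev->next = t->next;
        t->next->prev = t->prev;
        free(t);
    }

    static struct task *task_find(int pid) {      /* linear scan: O(n) */
        for (struct task *t = head.next; t != &head; t = t->next)
            if (t->pid == pid)
                return t;
        return NULL;
    }

    int main(void) {
        task_add(100);
        struct task *t = task_add(200);
        task_remove(t);
        printf("pid 100 %s\n", task_find(100) ? "found" : "missing");
        return 0;
    }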
I know the kernel uses trees for a good number of different things. I'd imagine that processes would be in a BST of some sort, and I remember Linus mentioning RB-trees. It would definitely make the most sense, as RB-trees have better access and insertion than a sorted linked list.
z/TPF [1] is basically a database system OS that runs almost all of VISA, AMEX, Citibank (authorizations); Holiday Inn, Amtrak, Marriott, Air Canada, Delta, Japan Airlines, Travelport, Trenitalia (reservations). But it's IBM proprietary and runs only on IBM licensed hardware.
It's always been one of my programming dreams to be involved in building a database OS with I/O capabilities, ever since I worked with TPF and MUMPS.
To be honest, if your database has decent queuing facilities, it isn’t a bad programming strategy to use the database as a tuple space for piranha processes. You don’t run afoul of corporate architecture teams for using ‘new’ software and you can devolve a lot of business processes into small, independent programs. It also provides an easy way to upgrade parts of your software by freezing a queue table for reads but not writes.
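A minimal sketch of the claim-a-row half of that pattern, using SQLite only to keep the example self-contained (the table, column, and worker names are made up; a server database would more likely use SELECT ... FOR UPDATE SKIP LOCKED plus an index on the claim column):

    /* build: cc queue.c -lsqlite3 */
    #include <stdio.h>
    #include <sqlite3.h>

    int main(void) {
        sqlite3 *db;
        if (sqlite3_open("queue.db", &db) != SQLITE_OK) return 1;

        /* toy schema plus a couple of demo jobs (re-running adds more) */
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS jobs ("
            " id INTEGER PRIMARY KEY, payload TEXT, claimed_by TEXT);"
            "INSERT INTO jobs (payload) VALUES ('send-invoice'), ('resize-image');",
            0, 0, 0);

        /* BEGIN IMMEDIATE takes the write lock up front, so two competing
         * workers can't claim the same row; the loser gets SQLITE_BUSY and
         * should retry. */
        sqlite3_exec(db, "BEGIN IMMEDIATE", 0, 0, 0);

        sqlite3_stmt *st;
        sqlite3_prepare_v2(db,
            "SELECT id, payload FROM jobs WHERE claimed_by IS NULL LIMIT 1",
            -1, &st, 0);
        if (sqlite3_step(st) == SQLITE_ROW) {
            int id = sqlite3_column_int(st, 0);
            printf("worker-1 claimed job %d: %s\n",
                   id, (const char *)sqlite3_column_text(st, 1));
            sqlite3_finalize(st);

            sqlite3_prepare_v2(db,
                "UPDATE jobs SET claimed_by = 'worker-1' WHERE id = ?", -1, &st, 0);
            sqlite3_bind_int(st, 1, id);
            sqlite3_step(st);
        }
        sqlite3_finalize(st);
        sqlite3_exec(db, "COMMIT", 0, 0, 0);
        sqlite3_close(db);
        return 0;
    }

Each "piranha" is just a small program running this claim loop; freezing the queue table stops new claims without stopping producers, which is the upgrade trick mentioned above.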
This makes sense. Once you have something with a lot of compute nodes, you have to orchestrate them all somehow. That's the real "operating system" for such systems. Each node just needs a loader and setup program to load work containers. Supercomputers, big cloud data centers, and things built from lots of GPUs tend to be organized like this.
Have things reached the point yet where containers run as Xen guests, and the real OS is Kubernetes? Then you don't need Linux under your container, and with the right library, you don't need Linux in your container. Less OS state to get screwed up.
I feel like collectively the tech industry has forgotten what an OS is. We’re so used to Linux being batteries-included now that we assume we can supplant it.
Linux itself is not going away, simply because replacing all those device drivers and ABIs is going to be incredibly difficult. Now, if you’re suggesting something new will sit atop Linux as the userland, acting as a sandwich filling between the Linux kernel and a multitude of userlands spanning many systems, I might agree.
>So when using POSIX, or any other programming model rich enough not to mess with OS specific APIs, the OS becomes irrelevant.
Except that things like systemd and a lot of Red Hat-related software rely on Linux-specific features like cgroups, so they are not portable. Good luck replacing Linux even if something that much better comes along.
Not everyone makes use of those features, and the last time I was using a Red Hat system, it was for Java deployments, which aren't Linux-related in any form.
Same applies to the C, C++, Fortran and now Chapel code that usually runs on HPC clusters.
Everyone uses systemd... I hope people are smart enough not to fall for it and not use all its scope-creeped features, but if you add all the little things up, it turns out that we are getting more and more stuck with Linux.
Leveraging Linux's driver support and the resulting hypervisor independence was a major part of the rationale behind Xen, and a differentiator from VMware, which has a strict HCL.
Linux is not the minimum viable OS. We have a lot of different situations where an OS can be much simpler, offering a limited set of functions or offloading multiple functionalities to different systems.
As pointed out before, mainframes have OSs that are exceedingly simple from the program's point of view and that offer a limited set of functions in exchange for extreme performance and stability.
Linux doesn't seem to be the bottleneck for either performance or stability in a lot of contemporary systems, though, particularly not if we're comparing it against a new, hypothetical operating system.
Consider unikernels on VMs with paravirtualized hardware and thin hypervisors. When I boot my servers, hardware that's never used gets initialized. Filesystem code for filesystems that are never mounted is loaded. A whole lot of code that's specific to the CPU and chipset is patched into important execution flows. A unikernel has nothing but the code I want to run.
The fact we got it right most of the time, for most of the hardware combinations we use, does not make it any simpler, faster, or more reliable.
When I mentioned mainframe OSs, I was not talking about hypothetical new OSs but technology that has consistently excelled for decades now.
I think a more apt model to replace Linux is the Erlang VM. It is an OS/kernel/scheduler in its own right. All that is required is a bootable microkernel Erlang VM with the same process monitoring as the mainstream one, and you have a highly scalable, multi-processor, network-transparent computing device.
I don't recall where I read it, but I've come across a concept where the computer, or rather the kernel, simply loads your application and everything runs in this user/application space, which takes all memory and resources to itself.
I mean, every "rescue disc" that loads its own shell instead of an OS works this way. Plenty of software in the DOS days loaded right off a floppy. Not to mention that DOS software worked pretty close to the metal in the first place.
I think I've also seen rescue discs that use DOS for initialization but then the software takes over so that when you exit it just reboots the computer instead of dropping you to the shell.