If you click through to the actual paper at the bottom of the article, the methodology here is just astonishing: they start by, without evidence, asserting that the main thing fork() does is to insert and remove processes from a list of processes. Then they write a program that creates a table of processes in a distributed database and adds and removes rows from it. Then they compare the speed of that with the speed of actually running fork() on the same cluster of machines. That’s the entire paper. You can’t make this stuff up.
“I am not able rightly to apprehend the confusion of ideas that could provoke such a question.” - Charles Babbage
> they start by, without evidence, asserting that the main thing fork() does is to insert and remove processes from a list of processes
Since fork(), as far as I understand, is effectively copy-on-write on modern Linux, that's probably not a terrible model for the initial step at least, but it does seem kinda over-simplified once you think about the rest of the process's lifetime and the different execution models for processes on a single computer vs. a supercomputing cluster.
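For concreteness, here's roughly what a minimal fork()/exec()/wait() round trip looks like from userspace (just an illustrative sketch of mine, nothing from the paper). The child inherits a copy-on-write view of the parent's address space plus its file descriptors, signal dispositions, credentials and so on, which is already a lot more state than a row in a list:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();              /* duplicate the calling process (CoW on Linux) */
    if (pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (pid == 0) {
        /* Child: gets its own (copy-on-write) address space, file descriptor
           table, signal dispositions, credentials, ... then replaces itself. */
        execlp("echo", "echo", "hello from the child", (char *)NULL);
        perror("execlp");            /* only reached if exec fails */
        _exit(127);
    }
    /* Parent: reap the child and report its exit status. */
    int status;
    waitpid(pid, &status, 0);
    printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    return 0;
}
```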
What looks like it happened is that they had a cool enough idea: on supercomputers, the scheduler is kinda like the operating system, and in some workloads the scheduler becomes the bottleneck. So why not replace the core data structures and operations with a DBMS, which is kinda built for this sort of highly distributed, high-contention state-management situation? Also, the larger general idea of comparing it to an OS sounds curiously close to the mesos / kubernetes worldview of things in some ways: "the operating system of your datacenter".
Then some annoying but probably correct person was like "hey we need a basis for comparison and a benchmark because this is an academic paper and that's how systems papers must be because academia in 2018" so they shoehorned in some apples and oranges comparison and we got this.
> Then some annoying but probably correct person was like "hey we need a basis for comparison and a benchmark because this is an academic paper and that's how systems papers must be because academia in 2018"
That person was right that a benchmark is needed, if only to validate their mathematical model. But they hardly benchmark the correct thing. Adding a process to the process tree is only one of the things that happens when you fork() a process, and frankly the most trivial one.
Quite unsurprisingly, sparse matrix associative multiplication in a distributed system (which is how they simulate forking) is faster than a local fork() implementation that actually implements process creation. There's no mention of how well this DBMS-backed scheduler deals with process and resource management, process group management, core- or node-pinning, process tracing and so on.
Edit: to clarify this point with a trivial example. In Linux, fork() relies heavily on copy-on-write, so that process creation can be as cheap as possible. When a child is created, it doesn't get new memory pages -- by default, it just shares the parent's. A long time ago, on systems that didn't have an MMU, the system would copy the entire memory space of the parent (!!!); nowadays it just copies a couple of system structures (which is not trivial, either, because there's potential for race conditions there). Only when the process wants to write to a memory page does the operating system actually create a copy of that page for the process doing the modification.
Managing memory pages isn't all that easy, because the memory pages aren't a software-only construct, they're backed by an actual hardware device (the MMU) which has a sort of cache (the TLB) because, as the authors like to remind us, this ain't a PDP-11 anymore. fork()-ing 2^32 processes at the same time is not a trivial problem, but it's a piece of cake compared to 2^31 processes saying "hey yeah I kinda need a new copy of this page" at seemingly random times, some of them in very particular patterns that need to be serviced really quickly -- especially when some processes request pages one at a time, while others just request their whole space because they do exec().
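To make the copy-on-write point concrete, here's a tiny toy demo (my own sketch, not anything from the paper): the fork() stays cheap even with 64 MiB of dirty pages mapped, because the child's writes only fault in private copies of the pages it actually touches, and the parent's data stays intact:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BUF_SIZE (64 * 1024 * 1024)   /* 64 MiB, spans many pages */

int main(void) {
    char *buf = malloc(BUF_SIZE);
    if (!buf) { perror("malloc"); return EXIT_FAILURE; }
    memset(buf, 'A', BUF_SIZE);       /* touch every page so they're actually allocated */

    pid_t pid = fork();               /* cheap: pages are shared and marked copy-on-write */
    if (pid < 0) { perror("fork"); return EXIT_FAILURE; }

    if (pid == 0) {
        /* Child: this write faults in a private copy of just one page;
           the parent's view of the buffer is untouched. */
        memset(buf, 'B', 4096);
        _exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent still sees '%c' at buf[0]\n", buf[0]);   /* prints 'A' */
    free(buf);
    return 0;
}
```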
It's anyone's guess how well a DBMS would handle this sort of stuff. Frankly, I don't know enough about DBMSes to do more than speculate -- but I don't see this article shedding light upon this problem, either.
Edit done :).
The conclusion that "... the simulations show that TabulaROSA has the potential to perform operating system functions on a massively parallel scale." (emphasis mine) is greatly inflated. The simulations show that it can efficiently perform a small part of one operating system function; if you squint and you're optimistic, you can maybe conclude that they show it has the potential to implement process creation on a massively parallel scale.
Even later edit: I feel like this is one of those upsetting cases of re-discovering something that we knew already. There's a lot of literature dating from 1990s about how one could manage processes running on a massively-distributed scale, and while I recall most of it too dimly for actual details, I think there was a widespread feeling that whatever orchestration solution we're finally gonna figure out is going to look a lot like a DBMS system. This paper doesn't seem to cite any of it, though, and so it unsurprisingly doesn't try to solve any of the many problems that have been shown to exist.
I read the paper completely differently than you. I think you were influenced by the title of the news article. The paper seems to me very reasonable:
Given the goal of implementing an architecture for part of the tasks of the OS for a 32,000+ core supercomputer, can we use mathematics to model such an architecture? The answer is yes: the mathematical model is defined, and its simulation is compared with the equivalent implementation provided in current Linux.
The way I see it, it was never about "replacing the whole Linux" but about "how to better implement some piece of an OS which has to manage e.g. 32,000+ cores and simultaneous forkings on them."
Bizarre. Linux must already have some kind of "database" system for keeping track of system resources. At best, the claim would be that Linux is using an inefficient database, which seems unlikely.
You would absolutely be surprised how far you can get using a process’s memory as its official database and just dumping it out to disk periodically for persistence. I kind of want to write an article, but an article can’t do it justice: it’s something you need to experience yourself to fully appreciate and be wowed by, like I was. I’m sure Linux uses files for its persistence layer in most cases.
That’s Redis’ initial persistence design, which is still available, but nowadays most people use it with an append-only log because it provides more resilience to sudden crashes (e.g. the power going off).
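Just to make the contrast concrete, here's a toy sketch of the two persistence styles being discussed, periodic snapshotting vs an append-only log. This is my own illustration, not Redis's actual code, and the file names toy.rdb / toy.aof are made up:

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy in-memory "database": an array of counters. */
#define NKEYS 16
static long db[NKEYS];

/* Snapshot-style persistence (what the parent comment describes): every
   so often, dump the whole state to a temp file and atomically rename it
   into place, so a crash leaves either the old or the new snapshot. */
static void snapshot(const char *path) {
    char tmp[256];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);
    FILE *f = fopen(tmp, "wb");
    if (!f) { perror("fopen"); return; }
    fwrite(db, sizeof db, 1, f);
    fclose(f);
    rename(tmp, path);
}

/* Append-only-log style: record every mutation as it happens, so a crash
   loses at most the unflushed tail of the log. */
static void set_key(FILE *log, int key, long value) {
    db[key] = value;
    fprintf(log, "SET %d %ld\n", key, value);
    fflush(log);   /* real systems tune how often they sync vs how much they can lose */
}

int main(void) {
    FILE *log = fopen("toy.aof", "a");
    if (!log) { perror("fopen"); return EXIT_FAILURE; }
    set_key(log, 3, 42);
    snapshot("toy.rdb");
    fclose(log);
    return 0;
}
```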
Yup, I believe the kernel pretty much just uses linked lists (at least, all of the running processes are kept as a linked list). But when most of your resources only number in the tens of thousands, if that, a linked list works perfectly well. And I’m not a kernel hacker so I’m sure there are optimizations I’m not aware of.
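Something like this toy circular doubly linked list shows why that's fine at that scale: insertion and removal are O(1) no matter how many tasks exist. (Purely illustrative; the real kernel embeds struct list_head nodes in task_struct and keeps other lookup structures besides.)

```c
/* Toy analogue of keeping all tasks on a circular doubly linked list:
   adding and removing an entry is constant-time regardless of how many exist. */
struct task {
    int pid;
    struct task *prev, *next;
};

static struct task task_list = { 0, &task_list, &task_list };  /* circular sentinel */

static void task_add(struct task *t) {        /* "fork": link in after the head */
    t->next = task_list.next;
    t->prev = &task_list;
    task_list.next->prev = t;
    task_list.next = t;
}

static void task_remove(struct task *t) {     /* "exit": unlink in constant time */
    t->prev->next = t->next;
    t->next->prev = t->prev;
}

int main(void) {
    struct task t = { .pid = 42 };
    task_add(&t);
    task_remove(&t);
    return 0;
}
```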
I know the kernel uses trees for a good number of different things. I'd imagine that processes would be in a BST of some kind, and I remember Linus mentioning RB-trees. That would definitely make the most sense, as RB-trees have better lookup and insertion than a sorted linked list.