Seriously cool. That also reminds me of DragonFlyBSD's process checkpointing feature that offers suspend to disk. In the Linux world there have been many attempts, but AFAIK nothing simple and complete enough. To be fair, I don't know if DragonFly's implementation is either.
This article[0] on LWN seems to suggest that Linux has no kernel support for checkpoint/restore because it's hard to implement (no arguments there). But hypervisors support checkpoint/restore for virtual machines, e.g. ESXi VMotion and KVM live migration, so it seems like these technical problems are solvable. Indeed all the benefits of VM migration seem to also apply to process migration (load balancing, service uptime, etc).
IMO, a big reason why Linux doesn't support it is that the big tech companies, who drive a lot of the development and funding, found that it's just not worth it.
Tech moved away from the mainframe model of keeping single processes up for as long as possible and isolating software from failures.
Instead they embrace failure at the software layer, which (when executed correctly) gives you the same high level of availability, plus layers of protection against other issues (a node going down in an expected way looks the same no matter what the root cause is), and makes systems more maintainable and upgradable overall.
Basically: why deal with complicated checkpoints at the kernel level, which doesn't understand your software, when you can instead deal with it at the software layer, with full control over it?
In the old days swap was a necessity to keep the system up, but nowadays you only need it if your dataset is really big (aside from some kernel quirks: it actually likes to have a small amount of swap available for cleanup and maintenance).
And if you do have that problem, you have custom solutions that implement a memory hierarchy for you and give much better results, because they can work across all the layers (cache/HDD/SSD/network/etc.) far more effectively than a generic, software-agnostic solution focused on just one dimension.
How commonly do people need to use swap in the first place? On my current computer, I don't even have a swap partition because I've never gotten close to filling my 16GB, even with a big virtual machine and hundreds of browser tabs. In my experience on desktop systems, it just makes your computer painfully slow when you write a program that allocates far too much memory. Without swap, the OOM killer gets involved much earlier and you don't have to do a hard power cut to stop the disk from thrashing.
If you don't have enough memory for a single app, then your life is completely miserable - in that case, swapping is almost indistinguishable from a crash. But swap can be the difference between being able to run one app at a time vs. three or four, as long as you only have one in the foreground at a time.
Swap doesn't play well with GC, or pointerful programming languages generally. Swap is a latency hit every time you touch an evicted page, and the more pointers your structures have, the more frequently that occurs.
In older times, when memory needed to be conserved, GC was less frequently used, and as a consequence of the accounting overheads of manual memory allocation, pointerful data structures weren't quite as common. Instead, you'd have pre-allocated arrays which supported up to N items of a particular type (and tough luck if you wanted more), or in more sophisticated applications, arena allocators. In such systems, the working set of memory is more likely to be contiguous; paging works better because every page load is more likely to bring in relevant data.
Switching between two open applications was the moment you'd typically hit swap; that, and every now and then when you touched an area of the app you hadn't in a while, or jumped to a distant area of the document you were working on. The HDD light would come on, and you'd have to wait a few seconds while the computer ground through it (HDDs being audible, some much louder than others) before the switch completed.
At some point in the mid 2000s, GC languages tipped the balance, especially JS. The browser became the main application, and wouldn't tolerate swap well. It became a mini OS unto itself, and had to balance the competing needs of different tabs. Then mobile exploded, and didn't use swap at all in order to compete with the iPhone, remarkable for its smooth animations. If there was to be latency, the application had to hide it; OS-level introduced latency on a page fault wasn't good enough, because if it happened on the UI thread, you dropped frames.
And of course swap never worked well on servers with frequent requests. Swapping on a server is a great way to back up all your queues.
When done well, swap can be pretty useful. I have a Fall 2019 MacBook Pro maxed out on CPU (8 cores) and RAM (64 GB) - macOS still regularly compresses (highest I've seen so far: 12 GB) and/or swaps (highest I've seen so far: 16 GB) relatively unused memory to make more space for other applications and hot caches (which are still useful even with a PCIe SSD).
Keep in mind that everything feels fully snappy and responsive while this is all happening.
By disabling swap, you are taking away your operating system's ability to trade less-frequently-used parts of memory for extra disk cache, which could hinder overall performance even if you have an excess of physical memory for your use case.
The problem is, the system cannot know when you're going to use it. If an application is swapped out and you come back to it after several hours, there's an excruciatingly long delay while it's restored from a spinning HDD. If I have a reasonable amount of RAM I definitely prefer not to have swap.
Even if that means lower throughput from all your other apps over that several hour period? That is worth not having to wait a few extra seconds to go back to the less-used app?
Depends, but usually not. It's harder to notice marginally better performance than to be annoyed by a 2-second lag on application switch. ("this computer has 32GB RAM and it's thrashing the disk! grr")
There is no difference in terms of how much memory apps will be able to allocate before getting killed. But there will be a significant performance difference once apps start using more memory than the 8GB that is physically available (of course).
I just opened the site of the biggest retailer in my country, and looked at one entry-level laptop: it has 2GB of RAM. Not everybody has a high-end computer.
What's hilarious to me, growing up in the era of swap, is that now you can have so much RAM that you can start caching your disk to RAM for a little extra performance (well, we've kind of moved past that, too, since SSDs are so much faster than spinny disks).
In fact, it's often worth moving some stuff in RAM to swap in order to have more space for disk cache (if you ever max out the RAM on a Linux machine without swap you'll find it's a miserable experience, because of the lack of cache and the fact that code from executables is just memory mapped into the process address space).
You were faster than me. Yes, CRIU does checkpointing in userspace on Linux.
They contributed a lot of patches to the Linux kernel to support this feature. Still, there isn't a single checkpointing facility in the mainline kernel; rather, there are many kernel features that allow building one in userspace. For instance, thanks to CRIU you can set the PID of the next program you spawn on Linux, because CRIU needs to be able to restore processes with the same PIDs they had when they were checkpointed [1].
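If you're curious how that PID trick works, my understanding is that it boils down to the ns_last_pid knob that went in for CRIU (newer kernels also have clone3() with set_tid). A rough sketch, assuming you have the needed privileges and ignoring the inherent race with anything else that forks in between:

    /* Sketch: pick the PID of the next child by bumping ns_last_pid.
     * Needs CAP_SYS_ADMIN (or your own PID namespace) and is racy if
     * anything else on the system forks before we do. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t want = 12345;                 /* the PID we'd like the child to get */

        FILE *f = fopen("/proc/sys/kernel/ns_last_pid", "w");
        if (!f) { perror("ns_last_pid"); return 1; }
        fprintf(f, "%d", want - 1);         /* kernel hands out last_pid + 1 next */
        fclose(f);

        pid_t child = fork();
        if (child == 0) {
            printf("child got pid %d\n", getpid());
            _exit(0);
        }
        waitpid(child, NULL, 0);
        return 0;
    }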
CRIU is used by OpenVZ, which, if I remember correctly, are moving or moved away from a kernel-based approach.
(in this run it restored a previous VM snapshot to skip the setup, and took a snapshot after the webserver started to give a lazy-loaded staging server)
They seem to be different problems though, as that LWN article suggests. It's probably easier to checkpoint/restore an entire image's state than an individual process' state.
But even calls like gettimeofday() work differently when running on hypervisors than they do when running on bare metal.
Same reaction here. MOSIX was basically this idea developed for real. Turns out there are a bunch of secondary problems to do with namespaces and security, scheduling and resource use, topology, performance, and so on. In the end it never turned out to be all that practical, and I know of several efforts (e.g. LOCUS) that went quite far to reach the same conclusion. There's a reason that non-transparent "shared nothing" distributed computing came to dominate the computing landscape.
Whoa, I was just talking about OpenMOSIX the other day - at my college job, when we retired workstations, they would just sit in the back closet for years until facilities would get around to putting them up for auction. We set up a TFTP boot setup and had a few dozen nodes at any given time. It wasn't high performance by any stretch, but it worked pretty transparently and was always fun to throw heavy synthetic workloads at it and watch the cluster rebalance and hear the fans spin up on the clunky old Pentiums.
I recall OpenMosix too, I remember playing with it on a couple of old machines a while ago. I thought it seemed quite a cool idea. I think I was trying to do mp3 encoding from CDs with it in a distributed fashion.
Funny, almost exactly one year ago I was desperately trying to learn about and demo a Linux cluster.
I happened upon ClusterKnoppix, which uses OpenMOSIX, and used that LiveCD as my demo. MPI as well, but I was having a lot of trouble getting it all straight in my head while having to explain it during a presentation.
Came here to write about ClusterKnoppix! It was amazing and I had 7 small form factor PCs running a mini cluster. It was great for moving things around. Problem was it fell behind in releases and it became hard to get applications to run on it. Knoppix was how I got into Linux; it still ranks as my favorite distro. Thanks Klaus Knopper for your work in setting that up!
It would be nice if it worked on Raspberry Pi, or if there was a simple way to set OpenMOSIX up on Pi's
Came here for the MOSIX, not disappointed. I worked at a facility that ran it on a small science cluster, and it worked extremely well.
A fun wrinkle: when the Intel C compilers first came out, they had warnings about process migration -- there's an executable mode where the executable checks what kind of CPU it's on at launch time, and then branches to CPU-specific code blocks during the run. This can be dangerous: the check is only done once, so if the CPU type changes mid-run due to migration to different hardware, Bad Things can happen.
This sounds like the same feature that Intel's C compiler used to "cripple AMD" by switching to a slower code path when it detected that the program was running on a non-Intel CPU.
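To make the failure mode concrete, here's a toy version of that dispatch pattern in C (nothing to do with Intel's actual runtime, just the general shape): the implementation is chosen once at startup via a function pointer, so a later migration to a CPU without the detected feature can land you in illegal-instruction territory.

    /* Toy CPU dispatch: resolve the implementation once, at startup.
     * If the process later migrates to a CPU without AVX2, avx2_sum()
     * keeps being called and can die with SIGILL. (GCC/clang builtins.) */
    #include <stdio.h>

    static long scalar_sum(const int *v, long n) {
        long s = 0;
        for (long i = 0; i < n; i++) s += v[i];
        return s;
    }

    /* stand-in for a hand-vectorized AVX2 version */
    static long avx2_sum(const int *v, long n) { return scalar_sum(v, n); }

    static long (*sum_impl)(const int *, long);

    __attribute__((constructor))
    static void pick_impl(void) {
        __builtin_cpu_init();
        sum_impl = __builtin_cpu_supports("avx2") ? avx2_sum : scalar_sum;
    }

    int main(void) {
        int v[4] = {1, 2, 3, 4};
        printf("%ld\n", sum_impl(v, 4));   /* dispatch was decided long ago */
        return 0;
    }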
For people who want to try OpenMOSIX out, take a look at this site: http://dirk.eddelbuettel.com/quantian.html. He has a distro called Quantian, with a big collection of science tools added. Shame it's Sunday; I'll need to wait a week to pull it down and see how well it flies.
Kerrighed was a similar project from INRIA in France. Process migration was more transparent than with OpenMOSIX (no special launch command like mosrun). It even supported thread migration!
MPI / MPICH were the first things I thought of when I saw the headline. I was like "that's not new... HPC has been doing message passing for decades this way." :)
What's old is new again -- I'm pretty sure QNX could do this in the 1990s.
QNX had a really cool way of doing inter-process communication over the LAN that worked as if it were local. Used it in my first lab job in 2001. You might not find it on the web, though. The API references were all (thick!) dead trees.
Edit: Looks like QNX4 couldn't fork over the LAN. It had a separate "spawn()" call that could operate across nodes.
Any time I see something about remote processes, I immediately think, "Erlang could probably do this."
I think Erlang would have been the programming language of the 21st century... If only the syntax wasn't like line noise and a printer error code had a baby, and raised it to think like Lisp.
Indeed, weird to only find it that low as a sub-comment:
spawn(Node, Module, Function, Args) -> pid()
Returns the process identifier (pid) of a new process started by the application of Module:Function to Args on Node. If Node does not exist, a useless pid is returned. Otherwise works like spawn/3.
It's nice to see people re-discover old school tech. In cluster computing this was generally called "application checkpointing"[1] and it's still in use in many different systems today. If you want to build this into your app for parallel computing you'd typically use PVM[2]/MPI[3]. SSI[4] clusters tried to simplify all this by making any process "telefork" and run on any node (based on a load balancing algorithm), but the most persistent and difficult challenge was getting shared memory and threading to work reliably.
It looks like CRIU support is bundled in kernels since 3.11[5], and works for me in Ubuntu 18.04, so you can basically do this now without custom apps.
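For anyone who hasn't seen the MPI flavour of this, a minimal sketch in C of the explicit message-passing approach (not tied to any particular checkpointing setup; compile with mpicc and launch with something like mpirun -np 4):

    /* The same binary runs on every node; rank 0 collects results
     * from the worker ranks over the network. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            for (int src = 1; src < size; src++) {
                double result;
                MPI_Recv(&result, 1, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("rank %d returned %f\n", src, result);
            }
        } else {
            double result = rank * 42.0;   /* placeholder computation */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }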
Really cool idea! Thanks for providing so much detail in the post. I enjoyed it.
A somewhat related project is the PIOS operating system, written 10 years ago but still used today to teach an operating systems class. The OS has different goals than your project, but it does support forking processes to different machines and then deterministically merging their results back into the parent process. Your post reminded me of it. There's a handful of papers that talk about the different things they did with the OS, as well as their best paper award at OSDI 2010.
Condor, a distributed computing environment, has done IO remoting (where all calls to IO on the target machine get sent back to the source) for several decades. The origin of Linux Containers was process migration.
I believe people have found other ways to do this. Personally I think the ECS model (like k8s, but the cloud provider hosts the k8s environment), where the user packages up all the dependencies and clearly specifies the IO mechanisms through late binding, makes a lot more sense for distributed computing.
I clicked through to mention Condor too... I first came across it in the 90's, and it seems like one of those obvious hacks that keeps being reinvented.
I was actually channeling the creator of Condor, Miron Livny, who has a history of going to talks about distributed computing and pointing out that "Condor already does that" for nearly everything that people try to tout as new and cool.
Few people outside academia, maybe? But inside it still seems to dominate in areas like physics computation. CERN uses a world wide Condor grid. LIGO too. It’s excellent for sharing cycles for those slow-burn, highly parallel, massive data scale problems.
I spent more than a decade bringing Condor to semi-conductor, financial and biomedical institutions. It was always a fight to show them there was a better way to utilize their massive server farms that didn’t require paying the LSF Tax. Without a shiny sales or marketing department, Condor was hard to pitch to IT departments.
Still, to this day, I see people doing things with “modern” platforms like Kubernetes and such and I chuckle. Had that in Condor 15 years ago in many cases. :)
> Still, to this day, I see people doing things with “modern” platforms like Kubernetes and such and I chuckle. Had that in Condor 15 years ago in many cases. :)
I'm reading the docs and it seems to be used mostly for solving long-running math problems like protein folding or SETI@home?
Can it be used for scaling a website too? I think that's k8s "killer" feature heh.
Most systems like Condor have a concept that a job or task is something that "comes up, runs for a while, writes to log files, and then exits on its own, or after a deadline". I've talked to the various batch queue providers and I don't think they really consider "services" (like a webserver, app server, rpc server, or whatever) in their purview.
In fact, that was what struck me the most when I started at Google (a looong time ago): at first I thought of borg like a batch queue system, but really it's more of a "Service-keeper-upper" that does resource management, and batch jobs are just sort of one type of job laid on top of the RM system (webserving, etc. are examples of "Service jobs").
Over time I've really come to prefer the google approach, for example when a batch job is running, it still listens on a web port, serves up a status page, and numerous other introspection pages that are great for debugging.
TBH I haven't read the Condor, PBS, LSF manuals in a while so it's very well possible they handle service jobs and the associated problems like dynamic port management, task discovery, RPC balancing, etc.
But in a world where you're continuously deploying on a cadence that's incredibly quick, how do things differ? I contend the batch and online worlds start to get pretty blurry at this stage. We're not in a world where bragging about uptime on our webserver in years is a thing any more.
I was routinely using Condor in semi-conductor R&D and running batches of jobs where each job was running for many days -- that's probably far longer than any single instance of a service exists at Google in this day and age, right?
None of the batch stuff does the networking management though. No port mapping, no service discovery registration, no load balancer integration, etc. That's Kubernetes sugar they lack. But...has never struck me as overly hard to add, especially if you use Condor's Docker runner facilities.
Edit: I should say that I don't _really_ think you could swap out Kubernetes for Condor. Not easily. But it's always been on my long list of weekend projects to see what running a cluster of online services would be like on Condor. I don't think it'd be awful or all that hard.
The other killer Condor tech is their database technology. The double-evaluation approach of ClassAds is so fantastic for non-homogeneous environments, where loads have needs and computational nodes have needs and everyone can end up mostly happy.
Yes, it can scale based on metrics. And metrics can be anything.
What it’s missing is all the discovery and network plumbing to tie running instances together with load balancing and inter-service comms.
Google's old Borg paper mentions Condor as a thing they considered and cribbed features from.
Honestly, serving a website is not as different from batch processing problems as you’d think. There are differences but they’re subtle, not mountainous.
I've used condor for ~5 years now, mostly for running simulations and processing data. Everything I've done with it has been trivially parallelizable (divide data into chunks based on time, etc), and in those applications it has been a superb tool that just works.
It should be possible to run a scalable website with it, but then you don't get "infinite" scalability like cloud services offer, since you're limited by the size of the compute pool. It would probably have its pitfalls.
That being said, coming in without knowledge of either, I found it much easier to learn and get started doing things with condor than kubernetes. I had all kinds of issues just getting simple things like LaTeX compilation as part of gitlab CI to work reliably. Clearly the experts know how to make things go, but condor has a lower barrier to entry. For use cases where condor CAN work, especially data processing, I always recommend it.
If you want to dig deep into computer science history, you’ll note that LSF was once a slightly modified Condor. There’s a wild history between the two and U.Wisc and U of T.
We use Condor. Oh, and did you know it can run docker containers too? And it's constantly being updated and improved. What's lacking, in my opinion, is a cool GUI to monitor and spawn things.
That goes back to the 1980s, with UCLA Locus. This was a distributed UNIX-like system. You could launch a process on another machine and keep I/O and pipes connected. Even on a machine with a different CPU architecture. They even shared file position between tasks across the network. Locus was eventually part of an IBM product.
A big part of the problem is "fork", which is a primitive designed to work on a PDP-11 with very limited memory. The way "fork" originally worked was to swap out the process, and instead of discarding the in-memory copy, duplicate the process table entry for it, making the swapped-out version and the in-memory version separate processes. This copied code, data, and the process header with the file info.
This is a strange way to launch a new process, but it was really easy to implement in early Unix.
Most other systems had some variant on "run" - launch and run the indicated image. That distributes much better.
I worked at Locus back in the 90s, when this technology was part of the AIX 1.2/1.3 family. The basic architecture allowed for heterogeneous clusters (i386 and i370 -- PC's and IBM Mainframes all running on the same global filesystem.) Pretty sure you couldn't migrate processes to a machine with a different architecture though. It was awesome to be able to "kill -MIGRATE" a long-running make job to somebody else's idle workstation, or use one of the built-in shell primitives to launch a new job on the fastest machine in the cluster "fast make -j 10 gcc".
There's also an ergonomics to process creation APIs - rather than needing separate APIs for manipulating your child process vs manipulating your own process, fork() lets you use one to implement the other: fork(), configure the resulting process, then exec().
CreateProcess* on Windows is a relative monstrosity of complexity compared to fork/exec.
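For anyone who hasn't written it out in a while, the fork/configure/exec shape in plain C looks roughly like this; between fork() and exec() the child configures itself with the ordinary single-process APIs (dup2, chdir, rlimits, ...) before replacing its image:

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t pid = fork();
        if (pid < 0) { perror("fork"); return 1; }

        if (pid == 0) {
            /* child: redirect stdout to a file, change directory, then exec */
            int fd = open("out.log", O_WRONLY | O_CREAT | O_TRUNC, 0644);
            dup2(fd, STDOUT_FILENO);
            close(fd);
            chdir("/tmp");
            execlp("ls", "ls", "-l", (char *)NULL);
            perror("execlp");          /* only reached if exec failed */
            _exit(127);
        }

        int status;
        waitpid(pid, &status, 0);      /* parent waits for the configured child */
        return 0;
    }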
This can let you stream in new pages of memory only as they are accessed by the program, allowing you to teleport processes with lower latency since they can start running basically right away.
I guess this post is a little bit different, because VMs are designed to be portable across different hosts. Even hypervisor software without live migration still lets you freeze the VM’s state to a file which can be copied to a new host. However, an already running process is not designed to be portable in the same way.
Telescript [0] is based on this idea, although at a higher level. I wish we could just build Actor-based operating systems and then we wouldn't need to keep reinventing flexible distributed computation, but alas...[1]
We'll keep poorly reinventing distributed computing features until we have a real distributed operating system. We're actually not that far off from it, but good luck convincing a mainline kernel dev to accept your patches.
I think the problem with a lot of these ideas is that the value of fork() is only marginally higher than the value of starting a fresh process with arguments on a remote machine. The complexity of moving a full process to another machine is 10 times higher than just starting a new process on the remote machine with all the binaries present already.
Quite frankly, fork only exists and gets used because it's so damned cheap to copy the pagetable entries and use copy-on-write, to save RAM. Take away the cheapness by copying the whole address space over a network, adding slowness, and nobody will be interested any more.
And both techniques are inferior to having a standing service on the remote machine that can accept an RPC and begin doing useful work in under 10 microseconds.
RPC is how we launch mapshards at Google: the worker process is a long-running server that just receives a job spec over the network and can execute against it right away.
Also of interest might be Sprite, a Berkeley research OS developed “back in the day” by Ken Shirriff and others.
It boasted a lot of innovations like a logging filesystem (not just metadata) and a distributed process model and filesystem allowing for live migrations between nodes.
https://www2.eecs.berkeley.edu/Research/Projects/CS/sprite/s...
In essence this is manually implementing forking — spawning a new process and copying the bytes over without getting the kernel to help you, except over a network too.
It reminds me a bit of when I wanted to parallelise the PHP test suite but didn't want to (couldn't?) use fork(), yet I didn't want to substantially rewrite the code to be amenable to cleanly re-initialising the state in the right way. But conveniently, this program used mostly global variables, and you can access global variables in PHP as one magic big associative array called $GLOBALS. So I moved most of the program's code into two functions (mostly just adding the enclosing function declaration syntax and indentation, plus `global` imports), made the program re-invoke itself NPROCS times mid-way, sending its children `serialize($GLOBALS)` over a loopback TCP connection, then had the spawned children detect an environment variable to receive the serialized array over TCP, unserialize() it and copy it into `$GLOBALS`, and call the second function… lo and behold, it worked perfectly. :D (Of course I needed to make some other changes to make it useful, but they were also similar small incisions that tried to avoid refactoring the code as much as possible.)
PHP's test suite uses this horrible hack to this day. It's… easier than rewriting the legacy code…
Indeed! I was talking to someone about their attempt to fork a Graal Java process and recreate all the compiler and GC threads, and I said if I had that task I'd be tempted to just use my new knowledge to implement a fork that also recreated those threads rather than trying to understand how to shut them down and restore them properly.
I did something similar using the 'at' daemon's 'now' time specification to "fork" off background tasks from a web request using the same .php file. It actually worked well for what I needed at the time!
This reminds me a little bit of the idea of 'Single System Image'[1] computing.
The idea, in abstract, is that you login to an environment where you can list running processes, perform filesystem I/O, list and create network connections, etc -- and any and all of these are in fact running across a cluster of distributed machines.
(in a trivial case that cluster might be a single machine, in which case it's essentially no different to logging in to a standalone server)
The wikipedia page referenced has a good description and a list of implementations; sadly the set of {has-recent-release && is-open-source && supports-process-migration} seems empty.
That was the original concept that led Ian Murdock and John Hartman to found Progeny. The idea was that overnight, while no one was working at their desks, companies could reboot their Windows boxes into a SSI network of Linux nodes to run parallel compute tasks.
Roughly, anyway, I got the sales pitch 20 years ago so my memories are fuzzy. I wasn’t remotely sold on it but was so anxious to work for a Linux R&D company in Indianapolis of all places that I accepted the job anyway.
Sadly we didn’t get far on the concept before the dot-com crash. Absent more venture capital we pivoted to focus on something we could sell, Progeny Linux, and tried to turn that into a managed platform for companies who wanted to run Linux on their appliances.
Bonus points if you can effectively implement the "copy on write" ability of the linux kernel to only send over pages to the remote machine that are changed either in the local or remote fork, or read in the remote fork.
A rsync-like diff algorithm might also substantially reduce copied pages if the same or a similar process is teleforked multiple times.
Many processes have a lot of memory which is never read or written, and there's no reason that should be moved, or at least no reason it should be moved quickly.
Using that, you ought to be able to resume the remote fork in milliseconds rather than seconds.
userfaultfd() or mapping everything to files on a FUSE filesystem both look like promising implementation options.
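For the userfaultfd route, a stripped-down sketch of what the fault-serving side might look like (error handling omitted; fetch_page_from_parent() is a made-up stand-in for whatever pulls pages over the network, and this loop would normally run on its own thread):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    extern void fetch_page_from_parent(unsigned long addr, void *buf, size_t len);

    void serve_region(void *region, size_t len) {
        size_t page = sysconf(_SC_PAGESIZE);

        int uffd = syscall(SYS_userfaultfd, O_CLOEXEC);
        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        /* get notified about missing-page faults in this range */
        struct uffdio_register reg = {
            .range = { .start = (unsigned long)region, .len = len },
            .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        for (;;) {
            struct uffd_msg msg;
            if (read(uffd, &msg, sizeof msg) <= 0) break;
            if (msg.event != UFFD_EVENT_PAGEFAULT) continue;

            unsigned long addr = msg.arg.pagefault.address & ~(page - 1);
            fetch_page_from_parent(addr, buf, page);   /* one round trip per fault */

            struct uffdio_copy copy = {
                .dst = addr, .src = (unsigned long)buf, .len = page,
            };
            ioctl(uffd, UFFDIO_COPY, &copy);   /* fills the page and wakes the faulter */
        }
    }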
If you just pull things on demand, you're going to get a lot of round-trip-time penalties to page things in.
I think you should still be pushing the memory as fast as you can, but maybe you start the child while this is still in progress, and prioritize sending stuff the child asks for (reorder to send that stuff "next"), if you've not already sent it.
Yah, that is indeed a super important optimization for avoiding round trips. CRIU does this and calls it "pre-paging"; their wiki also mentions that they adapt their page streaming to try to pre-stream pages around the pages that have been faulted: https://en.wikipedia.org/wiki/Live_migration#Post-copy_memor...
edit: lol, I didn't realize that isn't CRIU's wiki, since they just linked to a Wikipedia page and both run MediaWiki software. This is the actual CRIU wiki page, and it's way harder to tell if they do this, although I suspect they do and it's in the "copy images" step of the diagram https://criu.org/Userfaultfd
That’s a great idea. One of my thoughts was to “pre-heat” the process by executing a bit locally with side effects disabled to see what would get immediately accessed and send that first.
If your systems strictly match somehow (machine image with auto update disabled? or regularly hash and timestamp files on both systems) you can also cheat by mapping some of the files locally on the other side.
I do in fact mention this idea in the article. In fact userfaultfd was added to the kernel so that CRIU and KVM live migration could implement exactly this.
Another cool project that does something like this is https://github.com/gamozolabs/chocolate_milk which is a fuzzing hypervisor kernel which can back a VM snapshot memory mapping over the network to only pull down the pages that the VM actually reads during the fuzz case.
If you ever needed to bring the process back, you could use soft-dirty-bit[1] to determine which pages were modified since forking and only transfer those. CRIU uses it for incremental snapshots (in fact, they wrote the kernel patch afaik)
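Roughly, that mechanism means writing "4" to /proc/PID/clear_refs to reset the bits and later scanning /proc/PID/pagemap, where bit 55 of each entry marks a page written since the reset. A small sketch of how you might check it (my reading of the kernel docs, not CRIU's actual code):

    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Reset soft-dirty tracking for the current process. */
    void clear_soft_dirty(void) {
        int fd = open("/proc/self/clear_refs", O_WRONLY);
        write(fd, "4", 1);              /* "4" == clear soft-dirty bits */
        close(fd);
    }

    /* Has the page containing addr been written since the last clear? */
    int page_is_soft_dirty(void *addr) {
        uint64_t entry = 0;
        long page = sysconf(_SC_PAGESIZE);
        int fd = open("/proc/self/pagemap", O_RDONLY);
        off_t off = ((uintptr_t)addr / page) * sizeof(entry);
        pread(fd, &entry, sizeof(entry), off);
        close(fd);
        return (entry >> 55) & 1;       /* bit 55: soft-dirty */
    }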
Don't get me wrong, this is great hacking and great fun. And this is a good point:
> I think this stuff is really cool because it’s an instance of one of my favourite techniques, which is diving in to find a lesser-known layer of abstraction that makes something that seems nigh-impossible actually not that much work. Teleporting a computation may seem impossible, or like it would require techniques like serializing all your state, copying a binary executable to the remote machine, and running it there with special command line flags to reload the state.
There was a lot of work on mobile agents 20 years ago: Java programs that could jump from machine to machine over the network and continue executing wherever they landed. The field stagnated because there were some really difficult security problems (how can you trust the code to execute on your machine? How can the code trust whatever machine it lands on and use its services?). I think later work resolved the security issues but the field has not re-surged. Might be a good place to start to see what the issues and risks of mobile task execution are.
It’s touched on at the very end, but this kind of work is somewhat similar to what the kernel needs to do on a fork or context switch, so you can really figure out what state you need to keep track of from there. Once you have that, scheduling one of these network processes isn’t really all that different from scheduling a normal process, except that of course syscalls on the remote machine will possibly go to a kernel that doesn’t know what to do with them.
Wow, this is really interesting. I bet that there's a way of doing this robustly by streaming wasm modules instead of full executables to every server in the cluster.
It's a really nice idea. But when reading it I came to the conclusion that web workers are a really genius idea which could work equally well for C-like software:
Everything that should run on a different server is an extra executable, so this executable would be shipped to the destination and started, and then the two processes can talk by message passing. This concept is so generic there could be dozens of "schedulers" to start a process on a remote location: an ssh connection, starting a cloud VM, ...
The page states that CRIU requires kernel patches, but other sources say that the kernel code for CRIU is already in the mainline kernel. What's up with that?
CRIU can use two mechanisms to detect page changes. One is the soft-dirty kernel feature which is mainlined and can be accessed via /proc/PID/pagemap [1]. The other is userfaultfd which is only partially merged in the newest kernel. userfaultfd lacks write detection which the article mentioned. My understanding is that using pagemap requires the entire process to be frozen while it is scanned for memory changes and the memory is copied while uffd allows for a more streaming/on-demand approach that doesn't require stopping the entire process.
CRIU does work on mainline kernels and has for several years. However, it is still being actively developed (userfaultfd is the example the article uses -- a mainline feature which is still seeing lots of development).
It doesn't usually require additional patches to work, though they do have a fairly big job (every new kernel feature needs to be added to CRIU so it can both checkpoint and restore it -- and not all features are trivial to checkpoint). It's entirely possible that programs using very new kernel features may struggle to get CRIU to work.
I didn't say it requires kernel patches, just that it requires a build flag and new enough versions. It does happen that most distros do enable that build flag though.
There is a patch set for write protection support in userfaultfd that was just merged recently and I'm not sure if it's made it into any actual releases yet.
I could imagine having a build system which produces a process as an artifact and then just forks it in the cloud without distributing those pesky archives!
that... is a really fucking cool idea. i mean, it's of limited utility in the sense that builds really should generally work most of the time & be reproducible, so even though there is a bit of wasted effort, bandwidth, etc. in running the same build in multiple places at once, it's not like the worst thing in the world... but it just feels like there have to be cool / "useful" scenarios where you want to distribute a thing, but not the environment, etc. required to build the thing. oh wait, that already exists: it's called "a binary"... lol. iono.
i mean i guess if you have a process byte-for-byte frozen in time, you're a bit more certain (provably so? maybe?) maybe that when it is resurrected on some remote host that remote process and local process (as of t_freeze) are in ~"exactly" the same state?
or if you continuously snapshotted a process while it was running, and then when it crashes you (hopefully) have a snapshot from right before when it crashed that can be replayed, modified, etc. at will? but that's a bit of a stretch, and there are simpler ways of accomplishing that for the most part...
the only other thing i could think of is that this could allow you to run completely diskless servers, because you can just beam all the processes in, without ever having to "install" anything, beyond copying all the shared libs, etc. which i guess could get the "I have one of these and I want 100 of them, in the cloud" latency down by an order of magnitude or two maybe.
What I'd love is being able to easily bind remote directories as local. Not NFS, but a braindead 9p. If I don't have a tool, I'd love to have a bind-mount of a directory from a stranger, and run a binary from within it (or pipe to it) without them being able to trace the I/O.
If the remote FS is on a different arch, I should be able to run the same binary remotely as a fallback option, seamlessly.
I wonder whether the effort by the syzkaller people (@dyukov) could help with the actual description of all the syscalls (which the author says people gave up on for now because it's too complex), since they need them to be able to fuzz efficiently...
This is similar to what Incredibuild does. It distributes compile and compute jobs across a network, effectively sandboxing the remote process and forwarding all filesystem calls back to the initiating agent.
https://www.dragonflybsd.org/cgi/web-man?command=sys_checkpo...
https://www.dragonflybsd.org/cgi/web-man?command=checkpoint&...