This vaguely reminds me of Jefferson's "Virtual Time" paper from 1985[1]. The un...

This vaguely reminds me of Jefferson's "Virtual Time" paper from 1985[1]. The underlying idea at the time didn't really take off because it required, like Zookeeper, a greenfield project: except that it kinda doesn't and today you could imagine instrumenting an entire Linux syscall table and letting any Linux container become a virtual time system -- but Linux didn't exist in 1985 and wouldn't be standard until much later.

So Jefferson just says, let's take your I/O-ful process, split it a message-passing actor model, and monitor all the messages going in and coming out. The messages coming out, they won't necessarily do what they're supposed to do yet, they'll just be recorded with a plus sign and a virtual timestamp, and by assumption eventually you'll block on some response. So we have a bunch of recorded message timestamps coming in, we have your recorded messages going out.

Well, there's a problem here, which is that if we have multiple actors we may discover that their timestamps have traveled out-of-order. You sent some message at t=532 but someone actually sent you a message at t=231 that you might have selected instead of whatever you actually selected to send the t=532 message. (For instance in the OS case, they might have literally sent a SIGKILL to your process and you might not have sent anything after that.) That's what the plus sign is for, indirectly: we can restart your process from either a known synchronization state or else from the very beginning, we know all of its inputs during its first run so we have "determinized" it up past t=231 to see what it does now. Now, it sends a new message at say t=373. So we use the opposite of +, the minus sign, to send to all the other processes the "undo" message for their t=532 message, this removes it from their message buffer: that will never be sent to them. And if they haven't hit that timestamp in their personal processing yet, no further action is needed, otherwise we need to roll them back too. Doing so you determinize the whole networked cluster.

The only other really modern implementation of these older ideas that I remember seeing was Haxl[2], a Haskell library which does something similar but rather than using a virtual time coordinate, it just uses a process-local cache: when you request any I/O, it first fetches from the cache if possible and then if that's not possible it goes out, fetches the data, and then caches it. As a result you can just offer someone a pre-populated cache which, with these recorded inputs, will regenerate the offending stack trace deterministically.

1: https://dl.acm.org/doi/10.1145/3916.3988

2: https://github.com/facebook/Haxl