I remember we were working on this exact topic at my university chair ~8-10 years ago. I think it never fully took off; several Master's students worked on it for a while. I like that it's now in QEMU!
QEMU is great, and the devs and everyone who worked on it deserve applause, except for the documentation. It's like someone building a huge Japanese titanium indestructible fighting robot, but then using aluminum in the feet/heels.
So much of my QEMU work has been spent randomly changing options with no change documentation, discovering features with no documentation, options with no reason or indication why they exist, manpages out of date, READMEs not updated, changelogs missing, etc.
Documentation is not great, I admit. The problem is that we don't have anyone on the team who is a capable tech writer. It's not something that you can improvise.
However, all incompatible changes are documented and also announced at least 8 months in advance.
It may seem like there are many, but in practice they are in very old, mostly unused or very badly designed corners. For example, audio configuration was overhauled last year and now works the same way as basically all other backends (e.g. -audio pa,model=sb16; compare with -nic user,model=e1000 for a network card).
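Concretely, the change looked roughly like this (the old syntax is from memory and varied a bit across versions):

    # old style: an environment variable for the host backend plus -soundhw for the guest device
    QEMU_AUDIO_DRV=pa qemu-system-x86_64 -soundhw sb16 ...

    # new style, consistent with the other backends
    qemu-system-x86_64 -audio pa,model=sb16 ...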
As a general thought, would it be possible to put out a "call for tech writers" post or similar on the front of the QEMU website, or even a prominent blog post in the blog section?
If there are specific parts you'd want better documentation for, please let me know here.
Generally we've been moving the command line towards a scheme where each option describes an aspect of either the guest (a device, the board type, the CPU model) or the interface to the host (a file holding the contents of the disk, the network bridge to attach to, how to display graphical output), with some options providing both as a shortcut (for example -nic, -audio, -serial).
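Roughly, the same network setup can be written in either form: the long form spells out the guest device and the host backend separately, while -nic is the shortcut covering both.

    # explicit: guest device (-device) plus host backend (-netdev)
    qemu-system-x86_64 -device e1000,netdev=net0 -netdev user,id=net0 ...

    # shortcut: one option describing both sides
    qemu-system-x86_64 -nic user,model=e1000 ...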
Yeah, QEMU's story for this sort of thing is pretty rough around the edges for OS dev. I wish there were something like Unicorn-but-with-devices for building osdev tooling.
There's also PANDA [1], but I never got it working myself. I share your frustration, as it would help greatly with debugging, especially with nondeterministic bugs. I likewise never got QEMU's record/replay to work.
This is a big deal. With some tooling around it, this can be amazing.
I can think of using this for testing, and as a vehicle to change the programming paradigm of existing/legacy software (run a thing, then roll it back aggressively from outside the VM).
Indeed, the tooling is the problem. And I wouldn't hold my breath waiting for that tooling to be implemented, as the feature has been around for quite a while.
IMHO, PANDA [1] remains a better/more practical choice for whole-system record/replay analysis. It already offers quite a bit of tooling (including a python interface), as well as hooks to build your own. It does have its own shortcomings (speed and not being in-sync with the latest QEMU), but at least you're not limited to gdb-based debugging.
While it might seem like a small feature, it opens a huge door. It's similar to what reproducible build infrastructure has done for finding bugs, attestation that binary matches source, immutability, etc.
I can imagine this being useful for finding bugs in hardware designs too.
> Record/replay system is based on saving and replaying non-deterministic events
> The following non-deterministic data from peripheral devices is saved into the log: mouse and keyboard input, network packets, audio controller input, serial port input, and hardware clocks (they are non-deterministic too, because their values are taken from the host machine). Inputs from simulated hardware, memory of VM, software interrupts, and execution of instructions are not saved into the log, because they are deterministic and can be replayed by simulating the behavior of virtual machine starting from initial state.
So it's probably not much; you can probably comfortably record minutes-long QEMU sessions.
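If anyone wants to try it, the invocation is roughly this (going from the replay docs; the blkreplay wrapping is there so block I/O goes through the record/replay machinery too):

    # record a session into replay.bin
    qemu-system-x86_64 \
        -icount shift=auto,rr=record,rrfile=replay.bin \
        -drive file=disk.qcow2,if=none,snapshot,id=img-direct \
        -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
        -device ide-hd,drive=img-blkreplay

    # replay it: same command line with rr=replay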
Also note the existence of the rr debugger [1], which lets you reverse-debug applications with a ~10% performance hit while recording. To achieve this, it records the results of syscalls (only). It serializes thread events, which has the effect of running applications as if on a single-core CPU.
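Usage is roughly:

    rr record ./myapp arg1 arg2   # run and record the application
    rr replay                     # replay the latest recording under gdb
    # inside the gdb session: reverse-continue, reverse-step, etc.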
I'm gonna guess not that much - easily doable on a typical desktop.
If it were a problem, you could skip recording your emulated machine's bootup process and simply take a snapshot when you're about to start your Qt application. That snapshot probably only takes about 10% extra RAM, because most of the RAM contents won't change between the snapshot and the live system.
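As a sketch, with a qcow2 disk (internal snapshots need one) that's just the monitor commands; if I remember the docs right, tying snapshots into record/replay itself goes through the rrsnapshot suboption of -icount.

    # in the QEMU monitor, once the guest has booted and you're about to start the app
    (qemu) savevm before-app

    # later, restore that state instead of booting again
    (qemu) loadvm before-app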
It is very cool, but I think some version of this feature has been around for years? This commit is from 7 years ago, and it looks like the code goes back to 2010.
Seems to me like one of the highest and best uses of this right now would be adding verifiable builds to … literally anything. You no longer need a verifiable-build-capable compiler or language; you can just run the compile and packaging steps through a deterministic QEMU session.
Does this sound right? I’m trying to figure out where uncontrollable randomness would come in during a compile phase, and coming up blank.
Thanks for the link, I’m aware of reproducible-builds.org.
Both the causes you mention seem trivially fixable here: the QEMU builds could start with a standard system clock time, and an ‘unsorted’ file listing made in a deterministic OS environment will keep the same file order, no?
By comparison, the reproducible-builds.org site says you need to start by stripping all that stuff out of your build process, for the reasons you refer to.
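A sketch of the clock part, assuming you drive the guest clock from the virtual CPU rather than the host (both options are standard QEMU; the exact date is of course arbitrary):

    # start every build VM with the same wall-clock time, decoupled from the host
    qemu-system-x86_64 \
        -rtc base=2020-01-01T00:00:00,clock=vm \
        -icount shift=auto,sleep=off \
        ...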
I like your thinking. Deterministic replay with QEMU is supplemental to the larger goal of reproducible builds. The communities concerned with reproducible software not only expect cohesive, human-readable code that runs deterministically to produce binary-reproducible results, but their originally stated goals require it.
Deterministic replay with QEMU is a "power tool" in the larger picture of these efforts.
Sounds to me like that wouldn't be quite as good as true reproducible builds (which can run on anybody's computer) because auditing the entire emulated hardware starting state and events log is a harder problem than auditing only your code. Including a virtual machine image for your build effectively makes the VM part of your codebase in terms of users needing to trust it, so verifying the build result means not just engineering but also forensics.
So it'd be good for cases where you otherwise wouldn't be able to provide any verifiability. But for software, it's still not as good as eliminating non-determinism completely.
This is a good point: it does add a lot of weight to a build process. I might try it though. If the ‘weight’ is a lightweight Linux build environment referenced by ISO hash, it might not be that bad (TM) in practice. And, generally, trading bandwidth (cheap) for human refactoring (crazy expensive) is a trade that’s often nice to make.
Any idea if that works in modern VMware Workstation? It's currently on version 17, whereas that post was for version 6.5.
VMware Workstation has such disjointed development spurts that it wouldn't surprise me if the feature had been ripped out at some point. Other useful features such as machine groups have been. :(