I have been down this road a little bit, applying the ideas from jockey to write and ship a deterministic HFT system, so I have some understanding of the difficulties here.
We needed determinism for fault tolerance, so that we could run a hot, synced standby. We did have to record all inputs (and outputs, for sanity checking), though.
We did also get a good taste of the debugging superpowers you mention in your blog article. We could pull down a trace from a day's trading, replay it on our own machines, skip back and forth in time, and find the root cause of anything.
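To make the record/replay idea concrete, here is a minimal sketch of the general shape. It is not our actual system: `Engine`, `record`, and `replay` are hypothetical names, and the "trading" logic is a toy. The point is only that if state depends on nothing but the recorded inputs, you can rebuild the state at any step by re-running the trace, and the recorded outputs act as a sanity check for divergence.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """Toy deterministic state machine: state depends only on applied inputs."""
    position: int = 0

    def apply(self, event):
        kind, qty = event
        if kind == "buy":
            self.position += qty
        elif kind == "sell":
            self.position -= qty
        # The output is a pure function of the state after this event.
        return ("position", self.position)


def record(events):
    """Run live, logging every input together with the output it produced."""
    engine = Engine()
    trace = []
    for event in events:
        trace.append((event, engine.apply(event)))
    return trace


def replay(trace, upto=None):
    """Re-run the first `upto` steps of a trace (all of it if None).

    Each replayed output is compared with the recorded one, so any
    nondeterminism shows up as a divergence at a specific step.
    """
    engine = Engine()
    for step, (event, expected) in enumerate(trace[:upto]):
        actual = engine.apply(event)
        if actual != expected:
            raise AssertionError(f"divergence at step {step}: {actual} != {expected}")
    return engine


trace = record([("buy", 5), ("sell", 2), ("buy", 1)])
state_at_end = replay(trace)             # full replay
state_at_step_2 = replay(trace, upto=2)  # "skip back" to an earlier point
```

In this sketch, skipping around in time is just replaying from the start up to a chosen step; a real system would presumably add periodic snapshots so that it does not have to start from zero every time.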
It sounds like what you have done is something similar, but with your own (AMD64) virtual machine implementation, making it fully deterministic and replayable, and providing useful and custom hardware impls (networking, clock, etc).
That sounds like a lot of hard but also fun work.
I am missing something though, in that you are not using it just for lockstep sync or deterministic replays, but you are using it for fuzzing. That is, you are altering the replay somehow to find crashes or assertion failures.
Ah, I think perhaps you are running a large number of sims, each with a different seed for your VM (for injecting faults or whatnot), and then just recording the seed when something fails.
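If my guess is right, the loop would look roughly like the sketch below. This is only my reading of the approach, not how your system actually works: `run_sim`, `fuzz`, the fault rate, and the duplicate-delivery bug are all made up for illustration. The key assumption is that every source of nondeterminism is drawn from a single seeded RNG, so the seed alone is enough to reproduce a failure.

```python
import random


def run_sim(seed, steps=200, fault_rate=0.002):
    """Run one simulated scenario; return a failure message or None.

    All randomness comes from one seeded RNG, so the same seed always
    produces the same run. The injected fault here is a hypothetical one:
    a message is occasionally delivered twice.
    """
    rng = random.Random(seed)
    seen = set()
    for msg_id in range(steps):
        deliveries = 2 if rng.random() < fault_rate else 1
        for _ in range(deliveries):
            if msg_id in seen:
                # The toy "system under test" has no dedupe logic,
                # so a duplicate delivery trips this check.
                return f"step {msg_id}: duplicate delivery not handled"
            seen.add(msg_id)
    return None


def fuzz(seeds):
    """Try many seeds and keep only the ones that fail."""
    failures = {}
    for seed in seeds:
        failure = run_sim(seed)
        if failure is not None:
            failures[seed] = failure
    return failures


failures = fuzz(range(100))
# Any failing seed can be re-run later and fails in exactly the same way.
reproduced = all(run_sim(seed) == msg for seed, msg in failures.items())
```

Nothing but the seed needs to be stored for a failing run, which is much cheaper than the full input trace we had to record.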