Interesting project. I almost wish I had a concurrency bug to test it on.
> Guest software running in the Antithesis platform still experiences concurrency similar to a multi-core / multi-machine system, thanks to the process scheduling imposed by the guest OS
This might not exercise the full set of race conditions. When two threads are running simultaneously on separate cores (or hyper-threaded on the same core) they can interleave instructions at a much finer granularity than any OS time slicing would cause, even within instructions.
For example, could it find a race condition where two threads are executing INC [addr] on the same memory address, where context switching between instructions doesn't trigger it?
You are correct. There's a set of concurrency bugs that require actual SMP setups to trigger (like stuff with atomic operations, memory ordering, etc.). If you're trying to build a very low-level lock-free data structure where this is your primary threat model, Antithesis is not the right tool for you... for now... until we build a CPU simulator...
> For example, could it find a race condition where two threads are executing INC [addr] on the same memory address, where context switching between instructions doesn't trigger it?
I'm not actually familiar with the details of hardware MMUs, but would they not enforce sequential access of the address? Or do MMUs allow parallel reads and writes?
x86 processors have a LOCK instruction prefix, which makes some instructions atomic, including increment. Increment is nontrivial to make atomic, because there are two memory accesses: read X and write back X+1. It's a bit slow, because it has to inform every other core in the system, "Hey, I just read this memory address and I'm about to write something back to it, so don't use it until I'm done." C++ has functions for this like std::atomic_fetch_add, but they're clunky to use, so people often forget.
The way that's currently implemented is actually just one memory operation these days. L2 (or really wherever coherency is mostly managed) has a tiny ALU so in addition to read or write, atomicop is an operation you can send to L2. It'll gain Modified or Exclusive access to the cache line(s) that op is addressed to and just do the operation right there. That way the normal cache protocol is all you need for atomicity, and really the line is only contended for a single cycle from the L2 controller's (and the rest of the coherency peers) perspective.
That's why they retconned the lock prefix to not be an actual assertion of the #LOCK signal any more.
That's also why TileLink and AMBA include atomic ops like addition and bitwise ops in their coherency protocols rather than just 'claim region'.
That's also why you see newer archs like RISC-V and Arm64 offering both lr/sc-style ops and direct atomic memory ops like amoadd.w; the latter better match the primitives of the underlying memory system.
That's not how it works on x86 as far as I know. The atomic ops are simply performed by the ALU, against the L1 cache, when the instruction is about to retire. Atomicity is guaranteed by not allowing the line to be stolen by another core during the operation, and memory order is ensured by draining the store buffer before executing the op and by other speculation (e.g. loads can speculatively pass even atomic operations).
Ahh, it's 2 separate memory accesses just encapsulated in one instruction; that explains it. Yeah, unless this hypervisor allows context switching at the level of microcode ops, that seems to be undetectable currently.
I love this. Many a moon ago, I worked on a system called Aikido at MIT, which combined a specially built hypervisor with a binary rewriting system (DynamoRIO) to enable efficient time travel debugging and race detection of parallel applications.
If anyone's interested, here's a publication that talks about it in more detail:
The use of performance counters here also reminds me of another project I worked on called Kendo, which was a pthreads-like replacement that used performance counters to enforce a deterministic interleaving of synchronization operations (mutexes, etc.). The system could guarantee determinism for programs that didn't have race conditions. Back then, I found that counting instructions wasn't deterministic on the processors of the time, but counting store operations was. If anyone's interested in that work, here's the publication:
What a tease! They describe in detail two problems they had to “invent workarounds” for, but say nothing about what the workarounds are. I’m very curious, since both of the problems sound quite hard to work around. I wonder if they’re being purposefully vague to make it harder for competitors to replicate their work…
https://rr-project.org/ had the same problem. They use the retired conditional branch counter instead of instruction counter, and I believe then instruction stepping until at the correct address.
I don't understand how this works in the case of testing many applications running on many machines, where many services on many machines need to communicate with each other. We deploy a mix of systemd services and OCI containers (running on podman and Docker) to different machines, the exact mix on each machine depends on the machine's intended purpose.
We currently run CI tests using QEMU VMs. These VMs comprise a few systems representative of those that we deploy to production.
Does adopting Antithesis mean that all non-containerized applications would need to be OCI-ified and every interaction would need to be mocked? There's a sort of combinatorial explosion that I'm concerned about when I'm thinking about testing/adding a new service to a system: All services on which it depends need to be mocked and all services which depend on it require creating a mocked version of it.
Seems like a lot of work. Can someone please help clarify things for me?
Also, how could we test the behavior of non-application code like drivers or the kernel itself?
> The Antithesis environment simulates one or more computers using a collection of containers, all running within a single virtual machine managed by our hypervisor.
No mocking needed, but everything needs to share the single VM.
And it sure sounds like they run a custom kernel in the guest, so this is not for kernelspace testing:
> Since the Antithesis platform controls the guest’s scheduler,
I'm not familiar with the area, so am likely missing something, but how do they do deterministic thread-level context switching? Something like:
    var_1 = 0
    var_2 = 0

    thread_a:
        while true:
            something_complex()
            var_1++

    thread_b:
        while true:
            something_complex()
            var_2++
Under the quoted definition of determinism, for every point in time, var_1 and var_2 should have the same values across all executions. But this would seem to amount to ensuring that exactly the same number of instructions are executed each time a thread is scheduled.
You are exactly correct. Our hypervisor grants you this power (and then we use the power to explore as many possible values of var_1 and var_2 as we can, in case some of them trigger a concurrency bug, e.g.)
The resolution of your deterministic scheduler is probably not based on quanta like "instructions", right? More likely it takes traditional input like a stable clock tick?
Alex discusses this in the post. Performance counters were the first thing we tried, but alas they're non-deterministic about every one-in-a-trillion instructions.
Yes it is, for simple programs anyway. On the other hand, I've tried using Hermit to wrap around tests of a Raft implementation in Rust and Hermit would crash no matter what I did. Perhaps it was me using threads, perhaps it was me using sockets, I have no idea. The error messages were pretty obscure to me. The Hermit devs don't seem to be super responsive in general to issues on the repo. I don't blame them of course.
How does this deal with non-determinism from the outside world? For example, let's say one of my tests is flaky because it asks an external service to give it some data, and that external service is flaky in what it returns?
Or what if my bug is caused by bitflips in failing memory, that lead to impossible control flow paths being hit? Think something like:
    if x != 0:
        return 1/x
Failing with an error because x is 0.
Not hypothetical scenarios, both real bugs I've had to troubleshoot in my career.
If you know somebody who will pay money for us to prioritize this feature, let me know! Otherwise, I'm sure we'll get to it eventually. We have all kinds of crazy ideas for new faults.
Communication with the outside world is something that we obviously have to ban. This means that all of the inputs to your system are being provided by our platform, and that any dependencies have to be mocked, or run inside the hypervisor with you.
In practice this can cause friction for people with a ton of dependencies. We've already mocked some of the most common ones (for example, we have an entire fake AWS that we can run in there with you), and if your dependency is one of our existing customers, we can probably work something out...
The product does not appear to be a record-replay debugging product, it appears to be a precise fault injection test generation product.
In a record-replay debugging product, you want to reproduce the execution of your system to translate what occurred in reality into the debugging lab.
In this product, the goal appears to be creating a deterministic environment so that you can precisely inject non-determinism/faults to probe the response of your product to the environment.
The former is about analyzing bugs your tests found; the latter is about creating tests to produce bugs. In that context, your question is unrelated to the product. It does not seek to reproduce flaky tests; it seeks to help invent new and exciting flaky tests.
You're sort of right, but we're actually sort of both. We use determinism to do controlled fault-injection and to explore your program's state space, but we can also use it for very powerful debugging. Look for future announcements about this.
Certainly. Once you have a deterministic environment, the ability to replay execution is basically free.
In fact, what has been described is basically the replay engine in a record-replay system which “runs” the recorded execution of a non-deterministic system in a deterministic mode to replay the exact same execution. As such, it retains the same benefits available to any replay. The key here is that you can “bypass” the record step since you are already running in the replay engine from the start.
Fun read. It has some similar ideas to https://dedis.cs.yale.edu/2010/det/ but that was actually focused on multicore processing and the communication across cores.
I've long thought about these kind of OS designs, and what great features they can enable (such as time travel debugging). But the non-determinism introduced by inter-CPU interactions is a fundamental limitation, hence the need to run everything on a single isolated core.
One day(^TM) I'm really keen to design a multi-core CPU architecture that allows for deterministic message passing between cores in such a way that you could get this kind of software working with true parallelism.
This is a very promising project. I've seen a lot of attempts at this in the past, but none got to the level of progress that you have. Very impressive work!
I am sad that you decided to give up on solving the multi-core parallelism issue, since each guest running on a single core is a dead giveaway to malware that they're not on a real machine, but it's understandable. I do wonder if that means that some class of bugs will be undetectable to this hypervisor, though.
Not a dead giveaway, I suppose; you're right. And in fact, since you can control precisely the number of instructions per thread timeslice, it does seem non-trivial to decide whether you're running on more than one core, since the hypervisor can inject non-determinism into the context switching.
We do something similar in house where I work. Is it hard to onboard new customers? Since they make a special container for you they basically adopt your build system, which may be hard for them. Does this go beyond mutation testing?
With most of our customers, they don't actually have to change their build system at all. We take their normal CI products, plus a small amount of special configuration, and run them. This is actually a good thing because we're testing something very similar to what runs in production.
The kinds of state space exploration we do are a lot more general than mutation testing. Our current product does exploration by (1) varying the space of faults, packet delivery times, thread schedules, etc., and (2) driving a customer-provided pseudo-randomized workload. We have plans to make both these mechanisms much more expressive, powerful, and configurable; and we have longer-term plans to add entirely new kinds of testing to the same platform.
Disclosure: I'm one of the co-founders of Antithesis.
Thanks! Where does the "want" come from to compare your "got"s to? If the customer code handles errors gracefully, can you still surface undesirable behavior?
We provide a bunch of different ways for you to express your test properties. (1) A lot of stuff we can detect automatically, like crashes, OOMs, etc. (2) You should turn on assertions in all the code you send to us. (3) We can write generalized temporal regexes against your log messages or other output. (4) We offer SDKs that allow you to declare test properties inline in your code. (5) We can integrate with off-the-shelf sanitizers like ASAN, the Go race detector, etc.
Did you decide on a hypervisor instead of an emulator for overhead reasons? I don't work there but I heard Microsoft has something called tkofuzz which similarly emphasizes determinism but uses Bochs internally.
This is for a distributed database product that claims to bypass the CAP theorem or have I misunderstood?
> Back then Spanner wasn’t public yet and a lot of people misinterpreted the CAP theorem to say that a strongly consistent database couldn’t also be highly available in the face of network faults.
Thank you :) The anteater is the Antithesis mascot, but this cyber version of it was the only illustration I could come up with for the Determinator. There were other, much creepier versions :)
Oh - it's just a standard CP database where, in the event of a partition, the partition knows it's the partition because it's smaller than a quorum and won't let you access it.
It turns out that on x86/AMD chips, many of the perf counter events are offset by the (unpredictable) interrupt count, because the interrupt-return instruction's uop gets counted as both a user and a kernel instruction. On many processors the retired-store counter avoids this issue.