You really should just use `volatile` for device drivers when accessing I/O space with side effects. Do not use `volatile` to build your own synchronization primitives.
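Something like this, to be concrete (the base address and register layout are invented for illustration):

```c
#include <stdint.h>

/* Hypothetical memory-mapped UART; address and layout are made up. */
#define UART_BASE 0x10000000UL

typedef struct {
    volatile uint32_t data;    /* writing transmits a byte  */
    volatile uint32_t status;  /* bit 0 = transmitter ready */
} uart_regs;

static uart_regs *const uart = (uart_regs *)UART_BASE;

static void uart_putc(char c)
{
    /* volatile forces the compiler to perform every access, in
       program order, instead of caching or eliminating them. */
    while ((uart->status & 1u) == 0)
        ;  /* spin until the device is ready */
    uart->data = (uint32_t)c;
}
```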
Don't forget about memory barriers, though. Otherwise your driver will fail on more weakly ordered CPU architectures.
Just because it works on x86 doesn't mean it works on ARM, MIPS, POWER, or RISC-V. CPUs other than x86 can reorder stores with other stores and loads with other loads. That can let the store that kicks off the DMA reach the device before the stores that set up the length and address have landed!
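Concretely, the pattern that bites is a descriptor set up in normal RAM followed by an MMIO doorbell. A Linux-kernel-style sketch (the descriptor layout and doorbell are invented; `wmb()` and `writel_relaxed()` are the real kernel primitives):

```c
/* Hypothetical DMA descriptor living in normal (cacheable) RAM. */
struct dma_desc {
    u32 addr;   /* buffer bus address */
    u32 len;    /* transfer length    */
};

static void start_dma(struct dma_desc *desc, void __iomem *doorbell,
                      u32 buf_addr, u32 buf_len)
{
    desc->addr = buf_addr;
    desc->len  = buf_len;

    wmb();  /* order the descriptor stores before the doorbell store;
               without it, a weakly ordered CPU may let the device
               see the kick before the descriptor is complete
               (plain writel() would include this barrier for you) */

    writel_relaxed(1, doorbell);   /* tell the device to start */
}
```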
Or just use the C11 or C++11 memory model. Although that's still not an option in too many places; the curse of having to use an ancient compiler...
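If you do have C11, a release/acquire pair does what people usually (wrongly) reach for `volatile` to do. A toy producer/consumer sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;                 /* plain, non-atomic data */
static atomic_bool ready = false;   /* publication flag       */

void producer(void)
{
    payload = 42;
    /* release: the payload store cannot be reordered after this */
    atomic_store_explicit(&ready, true, memory_order_release);
}

int consumer(void)
{
    /* acquire: once ready reads true, payload is visible too */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin */
    return payload;
}
```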
Even on x86, even with the C11 memory model, you can still get burned by transaction reordering as the MMIO passes through bridges. Plain old PCI can do this.
I thought x86 wasn't allowed to do write-write reordering? Does that rule not apply to peripherals? Is an `mfence` guaranteed to fix it, or are there just no rules at all at that point?
PCI isn't x86, and x86 isn't PCI. PCI itself, for example in a PCI-to-PCI bridge chip, can buffer (post) stores and prefetch reads. The PCI specification lays out what you must do to suppress this.
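The classic countermeasure is the read-back: PCI ordering rules require a read completion to push all posted writes ahead of it, through every bridge on the path. A sketch using Linux accessors, with invented register offsets:

```c
#define CMD_REG    0x00   /* hypothetical command register */
#define STATUS_REG 0x10   /* hypothetical status register  */

static void issue_cmd(void __iomem *regs, u32 cmd)
{
    writel(cmd, regs + CMD_REG);     /* posted: may sit in a bridge */
    (void)readl(regs + STATUS_REG);  /* the read cannot complete until
                                        every posted write ahead of it
                                        has reached the device */
}
```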
There are at least five different sets of ordering rules on x86, one per memory type (UC, WC, WT, WP, WB). It's all in the Intel SDM, along with a table that shows how they interact with each other.
Atomics may be implemented with locks, which makes them unsuitable for signal handlers. The only type guaranteed to be lock-free is `std::atomic_flag`, which is not very useful.
`volatile sig_atomic_t` still seems like the better choice for signals.
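For the record, the textbook pattern; a minimal sketch using POSIX `sigaction`:

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* volatile: re-read on every access; sig_atomic_t: reads and writes
   are atomic with respect to signal delivery */
static volatile sig_atomic_t got_sigint = 0;

static void on_sigint(int sig)
{
    (void)sig;
    got_sigint = 1;   /* setting the flag is all the handler does */
}

int main(void)
{
    struct sigaction sa = {0};
    sa.sa_handler = on_sigint;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    while (!got_sigint)
        sleep(1);   /* polling keeps the sketch simple; real code
                       would block in sigsuspend() to avoid races */

    puts("caught SIGINT, exiting cleanly");
    return 0;
}
```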