The way that's currently implemented is actually just one memory operation these days. The L2 (or really wherever coherency is mostly managed) has a tiny ALU, so in addition to read or write, an atomic op is something you can send to L2. It gains Modified or Exclusive access to the cache line(s) the op is addressed to and just does the operation right there. That way the normal cache protocol is all you need for atomicity, and the line is only contended for a single cycle from the perspective of the L2 controller (and the rest of the coherency peers).
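If it helps to see the software side of this, here's the primitive all of that sits under: plain C11, nothing arch-specific. Where the RMW actually executes (core ALU vs. an ALU at the coherency point) is exactly the implementation detail being discussed, and it's invisible at this level:

    /* The software-visible primitive that can map onto a single
     * memory-side atomic op. Which hardware unit performs the add
     * is an implementation detail. */
    #include <stdatomic.h>
    #include <stdio.h>

    int main(void) {
        _Atomic int counter = 0;

        /* One atomic read-modify-write: conceptually a single
         * "add 5 to this line" request, not a separate load, add,
         * and store the core has to keep glued together. */
        int old = atomic_fetch_add_explicit(&counter, 5,
                                            memory_order_relaxed);

        printf("old=%d new=%d\n", old, atomic_load(&counter));
        return 0;
    }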
That's why they retconned the lock prefix to not be an actual assertion of the #LOCK signal any more.
That's also why TileLink and AMBA include atomic ops like addition and bitwise ops in their coherency protocols rather than just 'claim region'.
That's also why you see newer archs like RISC-V and Arm64 offering both lr/sc-style ops and direct atomic memory ops like amoadd.w: the latter better match the primitives of the underlying memory system.
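A sketch of the two styles in portable C11: the CAS loop is what lr/sc-style code looks like from software, while fetch_add is the form that can lower to a single AMO. The function names are mine, and the instruction names in the comments are typical codegen (varies by compiler and -march), not guarantees:

    /* On RISC-V the CAS loop typically compiles to an lr.w/sc.w
     * retry loop, while the fetch_add typically compiles to a
     * single amoadd.w; on Armv8.1+ with LSE it becomes ldadd. */
    #include <stdatomic.h>

    /* lr/sc flavor: read, compute, try to commit, retry on failure. */
    int add_cas_loop(_Atomic int *p, int n) {
        int old = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                p, &old, old + n,
                memory_order_acq_rel, memory_order_relaxed)) {
            /* 'old' was refreshed by the failed exchange; retry. */
        }
        return old;
    }

    /* Direct atomic-memory-op flavor: one request does it all. */
    int add_amo(_Atomic int *p, int n) {
        return atomic_fetch_add_explicit(p, n, memory_order_acq_rel);
    }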
That's not how it works on x86 as far as I know. The atomic ops are simply performed by the core's ALU, against the L1 cache, when the instruction is about to retire. Atomicity is guaranteed by not allowing the line to be stolen by another core during the operation, and memory order is ensured by draining the store buffer before executing the op; other speculation is still allowed (e.g. later loads can speculatively pass even atomic operations).
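To illustrate the ordering point (variable names here are hypothetical, and the codegen comments describe typical x86 output rather than a guarantee):

    /* On x86 an atomic RMW compiles to a lock-prefixed instruction
     * (e.g. lock xadd), which acts as a full fence; even
     * memory_order_relaxed typically emits the same lock-prefixed
     * instruction, since the prefix itself provides the ordering. */
    #include <stdatomic.h>

    _Atomic int flag;
    int data;

    void publish(int v) {
        data = v;   /* plain store */
        /* The RMW below drains prior stores before it executes, so
         * by the time another core sees flag change, it also sees
         * data = v. */
        atomic_fetch_add_explicit(&flag, 1, memory_order_seq_cst);
    }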