The way that's currently implemented is actually just one memory operation these days. The L2 (or really wherever coherency is mostly managed) has a tiny ALU, so in addition to read or write, an atomic op is something you can send to L2. It gains Modified or Exclusive access to the cache line(s) the op is addressed to and just does the operation right there. That way the normal cache protocol is all you need for atomicity, and the line is only contended for a single cycle from the perspective of the L2 controller (and the rest of the coherency peers).
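If it helps to see the software side of this, here's the primitive all of that sits under: plain C11, nothing arch-specific. Where the RMW actually executes (core ALU vs. an ALU at the coherency point) is exactly the implementation detail being discussed, and it's invisible at this level:

    /* The software-visible primitive that can map onto a single
     * memory-side atomic op. Which hardware unit performs the add
     * is an implementation detail. */
    #include <stdatomic.h>
    #include <stdio.h>

    int main(void) {
        _Atomic int counter = 0;

        /* One atomic read-modify-write: conceptually a single
         * "add 5 to this line" request, not a separate load, add,
         * and store the core has to keep glued together. */
        int old = atomic_fetch_add_explicit(&counter, 5,
                                            memory_order_relaxed);

        printf("old=%d new=%d\n", old, atomic_load(&counter));
        return 0;
    }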
That's why they retconned the lock prefix to not be an actual assertion of the #LOCK signal any more.
That's also why TileLink and AMBA include atomic ops like addition and bitwise ops in their coherency protocols rather than just 'claim region'.
That's also why you see newer archs like RISC-V and Arm64 offering both lr/sc-style ops and direct atomic memory ops like amoadd.w: the latter better match the primitives of the underlying memory system.
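A sketch of the two styles in portable C11: the CAS loop is what lr/sc-style code looks like from software, while fetch_add is the form that can lower to a single AMO. The function names are mine, and the instruction names in the comments are typical codegen (varies by compiler and -march), not guarantees:

    /* On RISC-V the CAS loop typically compiles to an lr.w/sc.w
     * retry loop, while the fetch_add typically compiles to a
     * single amoadd.w; on Armv8.1+ with LSE it becomes ldadd. */
    #include <stdatomic.h>

    /* lr/sc flavor: read, compute, try to commit, retry on failure. */
    int add_cas_loop(_Atomic int *p, int n) {
        int old = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                p, &old, old + n,
                memory_order_acq_rel, memory_order_relaxed)) {
            /* 'old' was refreshed by the failed exchange; retry. */
        }
        return old;
    }

    /* Direct atomic-memory-op flavor: one request does it all. */
    int add_amo(_Atomic int *p, int n) {
        return atomic_fetch_add_explicit(p, n, memory_order_acq_rel);
    }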
That's not how it works on x86 as far as I know. The atomic ops are simply performed by the core's ALU, against the L1 cache, when the instruction is about to retire. Atomicity is guaranteed by not allowing the line to be stolen by another core during the operation, and memory order is ensured by draining the store buffer before executing the op; other speculation is still allowed (e.g. later loads can speculatively pass even atomic operations).
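To illustrate the ordering point (variable names here are hypothetical, and the codegen comments describe typical x86 output rather than a guarantee):

    /* On x86 an atomic RMW compiles to a lock-prefixed instruction
     * (e.g. lock xadd), which acts as a full fence; even
     * memory_order_relaxed typically emits the same lock-prefixed
     * instruction, since the prefix itself provides the ordering. */
    #include <stdatomic.h>

    _Atomic int flag;
    int data;

    void publish(int v) {
        data = v;   /* plain store */
        /* The RMW below drains prior stores before it executes, so
         * by the time another core sees flag change, it also sees
         * data = v. */
        atomic_fetch_add_explicit(&flag, 1, memory_order_seq_cst);
    }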