Thanks, I missed that line regarding the O states. Still, a word about write reordering on ARM would probably be useful (unless I missed that too).
I understand that synchronization in code vs hardware is different, but the blog explicitly moves out of hardware-land into source code land with references to Java volatile and such.
The blog mentions Java volatiles (the same would apply to C++ atomics) precisely to point out that volatile has no cache coherency implications on a typical MESI (or variant) machine. The fences required to maintain language-level memory model guarantees act at a level above the L1 cache; once the data reaches L1 (i.e. the coherence point), the fences have done their job.
[I'm ignoring remote fences, which are a specialized and not yet mainstream feature]
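To make the volatile point concrete, here is a minimal sketch in Java of the usual "publish through a volatile flag" pattern. The class name `VolatilePublish` and the fields `payload` and `ready` are made up for illustration and do not come from the blog. The guarantee it relies on is the language-level one: a volatile write happens-before a later volatile read of the same field, so the reader has to see the earlier plain store. How the JIT enforces that (typically a barrier or an acquire/release instruction on ARM, usually much less on x86) is an ordering matter within each core; the coherence protocol underneath is the same with or without the volatile.

```java
public class VolatilePublish {
    static int payload;              // plain field, no ordering guarantees of its own
    static volatile boolean ready;   // volatile flag used to publish payload

    // Runs one writer/reader pair and returns the payload value the reader saw.
    static int runOnce() throws InterruptedException {
        payload = 0;
        ready = false;
        final int[] seen = new int[1];

        Thread reader = new Thread(() -> {
            while (!ready) {         // volatile read: must be re-read on every iteration
                Thread.onSpinWait(); // spin hint, available since Java 9
            }
            seen[0] = payload;       // ordered after the writer's store to payload
        });
        Thread writer = new Thread(() -> {
            payload = 42;            // plain store
            ready = true;            // volatile store: the store above cannot move below it
        });

        reader.start();
        writer.start();
        writer.join();
        reader.join();               // join() makes seen[0] visible to the caller
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runOnce());
    }
}
```

If `ready` were a plain boolean, the compiler or a weakly ordered CPU such as ARM would be free to reorder the two stores, or the JIT could hoist the read of `ready` out of the loop. Neither of those is a cache coherency failure, which is the distinction being made above.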