Um, does Linux run on such systems? There's a broad assumption that atomic ops work efficiently (or at least, that was true when I dropped out 7 years ago...)
I'm not the right person to ask about this, but some ARM systems definitely have weaker coherence than x86 -- in FreeBSD we have a whole bunch of memory barrier primitives which compile away to nothing on x86 because they exist only for weaker platforms.
I have a vague recollection that it's something to do with whether caches snoop the memory bus for reads of lines they "own" but I could be mistaken. Whatever it was, there were cases where buggy code meant that a core could read a stale value from memory for multiple seconds after another core wrote to the same address.
The terminology is that x86 has a Total Store Order (TSO) memory model, while ARM doesn't: on x86, once a core observes a value that another core stored, it also observes all of that core's earlier stores, even relaxed ones, so plain loads effectively behave like acquires. On ARM a relaxed load gives no such guarantee unless the loaded value carries a data dependency to the later accesses, or until you execute a memory barrier (or use an explicit acquire load).
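For illustration, a small C11-atomics sketch of that pattern (names are made up, not from any particular codebase): with a release store and an acquire load the guarantee holds on both architectures, while on ARM a relaxed load of the flag would not be enough.

    #include <assert.h>
    #include <stdatomic.h>

    int data;                 /* payload, published with ordinary stores */
    atomic_int ready;         /* flag used to publish the payload */

    void producer(void) {
        data = 42;            /* ordinary store */
        atomic_store_explicit(&ready, 1, memory_order_release);
    }

    void consumer(void) {
        if (atomic_load_explicit(&ready, memory_order_acquire)) {
            /* On x86 this would hold even with a relaxed load of the
             * flag (TSO); on ARM it needs the acquire (or a barrier,
             * or a data dependency) to be guaranteed. */
            assert(data == 42);
        }
    }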
ARM still has a coherent cache, however. Basically every modern OS and program depends on having a coherent data cache (though ARM doesn't keep the i-cache and d-cache coherent with each other, which basically only comes into play with self-modifying code).
The details are hazy for me, but all relevant CPUs have coherent caches; they just don't all make the same ordering guarantees.
x86 has "total store ordering", meaning stores made by core 1 will always be observed in-order by core 2. ARM doesn't make that guarantee.
In practice it doesn't matter for writing correct programs unless you write assembly: even if the CPU has total store ordering, the compiler is allowed to reorder stores unless you put an appropriate barrier in the high-level language source.
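A hedged example of that compiler-level point in C (identifiers invented for the sketch): the two plain stores below may be reordered by the compiler even on x86, while the release store in the second version constrains both the compiler and, on ARM, the CPU.

    #include <stdatomic.h>

    int payload;
    int flag_plain;        /* plain int, no ordering guarantees */
    atomic_int flag_rel;   /* atomic flag used in the fixed version */

    void publish_broken(void) {
        payload = 1;
        flag_plain = 1;    /* compiler may emit this store first */
    }

    void publish_ok(void) {
        payload = 1;
        /* The release store is the "appropriate barrier in the
         * high-level language source": the compiler may not sink the
         * payload store below it, and ARM also gets the needed
         * hardware ordering. */
        atomic_store_explicit(&flag_rel, 1, memory_order_release);
    }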
Linux does not work on systems without coherent caches between CPU accesses (some I/O device incoherence is allowed), so no cache-shootdown IPI like that is required[1].
ARM and x86 can both delay stores past later loads. The "timeliness" of when stores become visible is almost never specified exactly by any ISA or implementation (maybe some real-time CPUs, but Linux does not depend on that), but you will never get into a situation where a memory operation reads "stale" data beyond what a fairly strict memory consistency specification allows.
ARM does have somewhat weaker ordering than x86, but this is all about how operations within a single CPU/thread behave. An ARM CPU can perform two loads out of order, and two stores out of order (with respect to what other CPUs can observe). Use barriers to order those and you have roughly the same semantics as x86, and those barriers don't need to "reach out" onto the interconnect or into the caches of other CPUs. Once the data leaves your store queues, on both x86 and ARM, all other CPUs will see the result, because for it to be accepted into the coherent caches, all other copies in other caches have to be invalidated first.
[1] Linux does work on some CPUs where the instruction caches and/or the instruction fetch/execution pipeline are not coherent with data operations, so on some CPUs code modification may need to send IPIs to other CPUs to flush their instruction caches or pipelines. The above is speaking purely about memory data operations and data caches.
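For what it's worth, userspace code that writes machine code at runtime has to handle exactly this; a minimal sketch using the GCC/Clang __builtin___clear_cache builtin (function and buffer names are invented). Note this only covers the calling CPU's view -- cross-CPU cases are where the IPIs (or the architecture's broadcast cache-maintenance ops) come in.

    #include <stddef.h>
    #include <string.h>

    /* Copy freshly generated machine code into an executable buffer and
     * make it visible to the instruction side of the local CPU. */
    void install_code(void *exec_buf, const void *code, size_t len) {
        memcpy(exec_buf, code, len);   /* d-cache side writes */
        /* Compiles to nothing on x86; on ARM it emits the d-cache
         * clean / i-cache invalidate / barrier sequence. */
        __builtin___clear_cache((char *)exec_buf, (char *)exec_buf + len);
    }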