I was thinking about code that relies on TSO but doesn't insert synchronisation primitives for the compiler, so the compiler is free to violate the expected ordering. E.g. the code might break if you increased the optimisation level or switched compilers.
Generally, when you write atomic algorithms you are doing two things: telling the compiler what it's allowed to optimize, and controlling the processor. For code that relies on TSO (which is pretty close to the acquire/release part of the C++ memory model), that means you annotate the code with information that prevents the compiler from doing some optimizations. Then, when the compiler generates native code, on x86 those operations turn into regular loads/stores, but on Arm they get additional fence instructions.
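To make that concrete, here is a minimal sketch of the classic message-passing pattern in C++. The names (`payload`, `ready`, `run_message_passing`) are made up for illustration and are not from anyone's actual code. The release store and acquire load are the "information for the compiler": they forbid moving the payload access across the flag access. On x86 they typically compile to plain loads and stores, while on Arm the compiler typically emits barrier or acquire/release instructions.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, for illustration only.
int payload = 0;                 // plain, non-atomic data
std::atomic<bool> ready{false};  // flag that publishes the payload

void producer() {
    payload = 42;  // ordinary store
    // release: earlier writes may not be moved after this store,
    // neither by the compiler nor by the CPU.
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // acquire: later reads may not be moved before this load.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // sees the value written before the release store
}

// Runs one producer/consumer round and returns what the consumer read.
int run_message_passing() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread reader([&] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    return seen;
}
```

If `ready` were a plain `bool` instead, the program would have a data race, and the compiler would be allowed to hoist the flag load out of the loop or reorder the two stores, even on a TSO machine. That is the "works until you change the optimisation level" case.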
It's not actually that clear. Hardware can do some pretty neat tricks to make the fences basically free when there aren't multiple cores writing to the same memory. As a result, most of those explicit fences cost little more than extra front-end pressure.