I'm pretty sure there's a Herb Sutter C++ talk which explicitly connects newer CPUs gaining instructions well suited to the Acquire/Release model with the C++11 memory model. I have a lot of Herb's talks in my YouTube history, so figuring out which one I meant will be tricky. Maybe one of the versions of "Atomic Weapons"? This idea is out there more generally though.
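For concreteness, this is the release/acquire pattern in question, written in Rust since its atomics reuse the C++11 orderings; a minimal message-passing sketch of my own, not anything from the talk. On AArch64 the release store and acquire load map onto the STLR/LDAR instructions that ARMv8 added, which is exactly the kind of hardware accommodation I mean.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Release: everything written before this store becomes visible
        // to any thread whose acquire load observes `true`.
        READY.store(true, Ordering::Release);
    });

    let consumer = thread::spawn(|| {
        // Acquire: once we see `true`, the earlier write to DATA is visible.
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```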
I don't think I agree that this means the memory model doesn't infect the CPU design. Actually I don't think I agree more generally either. For example, I would say that Rust's strings are fast despite the fact that modern CPUs have gone out of their way to privilege the C-style zero-terminated string. There are several opcodes that exist purely for doing stuff you'd never want except that C-style strings exist. The vendors would love those strings to be faster, but it's just not a very fast type, so there's not much to be done.
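To make the string point concrete, here's a rough sketch (my own illustration, not from any particular codebase): a Rust &str carries its length, so asking for it is a field read, whereas a NUL-terminated string has to be scanned byte by byte, which is the job instructions like x86's REPNE SCASB were built to speed up.

```rust
// Rust strings (&str / String) store their length alongside the pointer,
// so len() is just reading a field.
fn rust_len(s: &str) -> usize {
    s.len()
}

// Illustrative stand-in for C's strlen(): walk the bytes until the
// terminating 0. This linear scan is what the string opcodes exist for.
fn c_style_len(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len())
}

fn main() {
    assert_eq!(rust_len("hello"), 5);
    assert_eq!(c_style_len(b"hello\0leftover"), 5);
}
```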
Contrast this with, say, bit count (popcount), which is a good idea despite the fact that it's tricky to express in C or a C-like language. In a language like C++ or Rust it's provided as an intrinsic, but long before that happened the CPU vendors included the instruction because it's fundamentally a good idea: it's cheap for the CPU, it's useful for the programmer, and the C language is simply in the way here. Compilers used "idiom recognition" to detect that, OK, this function named "count_pop" is actually the bit count instruction, so just use that instruction if the target architecture has it. More fragile than intrinsics (because it's a compiler optimisation) but effective.
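Here's a sketch of both routes (count_pop mirrors the hypothetical name above, it's not a real API): the hand-rolled loop that idiom recognition has to pattern-match, and the library/intrinsic call that simply asks for the instruction.

```rust
// The kind of hand-rolled loop a compiler's idiom recognition would try
// to spot and replace with a single population-count instruction.
fn count_pop(mut x: u32) -> u32 {
    let mut n = 0;
    while x != 0 {
        n += x & 1;
        x >>= 1;
    }
    n
}

fn main() {
    let x: u32 = 0b1011_0010;
    // The intrinsic route: count_ones() lowers to the hardware popcount
    // (e.g. POPCNT on x86) when the target advertises one, and to a short
    // software fallback otherwise.
    assert_eq!(x.count_ones(), 4);
    assert_eq!(count_pop(x), 4);
}
```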
At an even higher level, from the point of view of a CPU designer it would be great to do away with cache coherence. You can go real fast with fewer transistors, if only the stupid end users could accept that there's no good reason why cache A over here, near CPU core #0, should be consistent with cache D, way over on another physical CPU, near CPU core #127. Alas, it turns out that writing software for a non-coherent system hurts people's brains too much, so we've resolutely not been doing that. But that's exactly a model choice: we reject the model where the cache might not be coherent. Products which lack cache coherence struggle to sell.