Like I said, this is absolutely a die-size tradeoff IMO. That 75% IPC gain is on...

rayiner · on Dec 1, 2020

Intel’s 10nm is equivalent to TSMC’s 7nm, so we’re just talking one generation on the process side. I don’t think you can chalk a 75% IPC gain to a die shrink. That’s a much bigger IPC uplift than Intel has achieved from Sandy Bridge to Sunny Cove, which happened over 4-5 die shrinks.

The total performance gain, comparing a 4.7 GHz core to a 3.2 GHz core, is 20%. But there is more to it than bottom line. The conventional wisdom would tell you that increasing clock speed is better than making the core wider because of diminishing returns to chasing IPC. Intel has followed the CW for generations: it has made cores modestly wider and deeper, but has significantly increased clock speed. Intel doubled the size of the reorder buffer from Sandy Bridge to Sunny Cove. Intel increased issue width from 5 to 6 over 10 years.

If your goal was to achieve a 20% speed-up compared to Sunny Cove, in one die shrink, the CW would be to make it a little wider and a little deeper but try to hit a boost clock well north of 5 GHz. It wouldn’t tell you to make it a third wider and twice as deep at the cost of dropping the boost clock by a third. Apple isn’t just enjoying a one-generation process advantage, but is hitting a significantly different point in the design space.

ece · on Dec 1, 2020

Superscalar vs super-pipelining isn't new. If there's no magic, then a third wider would likely exactly decrease the boost clock by a third with perfect code. With SMT off, I get 25-50% more performance on single threaded benchmarks, that's because a thread does get full access to 50% more decode/execution units in the same cycle. It's not that simple again, but that's likely the simplest example.

The M1 is definitely a significantly different point in the design space. Intel is also doing big/little designs with Lakefield, but it's still a bit early to see where that goes for x64. I don't think Intel/AMD have specifically avoided going wider as fast as Apple; AVX/AVX2/AVX512 probably take up more die-area than going 1/3 wider, and that's what they've focused on with extensions over the years. If there is an x64 ISA limitation to going wider, we'll find out, but that's highly unlikely IMO.

rayiner · on Dec 1, 2020

> Superscalar vs super-pipelining isn't new. If there's no magic, then a third wider would likely exactly decrease the boost clock by a third with perfect code.

It's not new, but it's surprising. You're correct that going a third wider at the cost of a third of clockspeed is a wash with "perfect code" but the experience of the last 10-20 years is that most code is far from perfect: https://www.realworldtech.com/shrinking-cpu/2/

> The first sign that the party was over was diminishing returns from wider and wider superscalar designs. As CPUs went from being capable of executing 1, to 2, to 4, to even 6 instructions per cycle, the percentage of cycles during which they actually hit their full potential was dropping rapidly as both a function of increasing width and increasing clock rate. Execution efficiency (actual instruction execution rate divided by peak execution rate) dropped with increasing superscalar issue width because the amount of instruction level parallelism (ILP) in most programs is limited.... The ILP barrier is a major reason that high end x86 MPUs went from fully pipelined scalar designs to 2-way superscalar in three years and then to 3-way superscalar in another 3 years, but have been stuck at 3-way issue superscalar for the last nine years.

Theoretical studies have shown that higher ILP is attainable (http://www.cse.uaa.alaska.edu/~afkjm/cs448/handouts/ILP-limi...) but the M1 suggests some really notable advances in being able to actually extract higher ILP in real-world code.

I agree there's probably no real x86-related limitation to going wider, if you've got a micro-op cache. As noted in the study referenced above, I suspect its the result of very good branch prediction, memory disambiguation, and an extremely deep reorder window. Each of those is an engineering feat. Designing a CPU that extracts 80% more ILP than Zen 3 in branch-heavy integer benchmarks like SPEC GCC is a major engineering feat.