Zen 2 has 8-wide issue in many places, and Ice Lake moves up to 6-wide. Intel/AMD have had 4-wide decode and issue widths for 10 years, and I'm glad they're moving to wider machines.
Could you explain what you mean by "8-wide decode in many places"? How is that possible? Isn't instruction decoding kind of always the same width, i.e. always 4-wide or always 8-wide, but not sometimes one and sometimes the other?
All sources I could find say it is 4-wide, so I'd also be interested if you could perhaps give a link to a source?
The actual instruction decoder is 4-wide. However, the micro-op cache has 8-wide issue, and the dispatch unit can issue 6 instructions per cycle (and can retire 8 per cycle to avoid ever being retire-bound). In practice, Zen 2 generally acts like a 6-wide machine.
Oh, on this terminology: x86 instructions are 1-15 bytes wide (averaging around 3-4 bytes in most code). n-wide decode refers to decoding n instructions at a time.
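To make those width numbers concrete, here's a toy back-of-the-envelope model (a sketch only -- the stage widths are the Zen 2 figures quoted above, but the op-cache hit rate is just an illustrative knob, not a measured value):

    # Toy model: sustained instructions per cycle is capped by the narrowest
    # stage the instruction stream actually has to pass through.
    def sustained_ipc(decode_width, op_cache_width, dispatch_width,
                      retire_width, op_cache_hit_rate):
        # Frontend supply is a blend of the legacy decoders and the op cache.
        frontend = (op_cache_hit_rate * op_cache_width
                    + (1 - op_cache_hit_rate) * decode_width)
        # The backend can't dispatch or retire more than its own widths allow.
        return min(frontend, dispatch_width, retire_width)

    # With a warm op cache the 4-wide decoders stop being the bottleneck and
    # the 6-wide dispatch stage sets the ceiling; cold, you're decode-bound.
    print(sustained_ipc(4, 8, 6, 8, 0.9))   # -> 6
    print(sustained_ipc(4, 8, 6, 8, 0.0))   # -> 4.0

That's the sense in which it "generally acts like a 6-wide machine" in practice.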
Thanks for the link! Yeah, those are basically the numbers I also found -- although the number of instructions decoded per clock cycle is a different metric from the number of µops that can be issued, so that feels a bit like moving the goalposts.
But, fair enough, for practical applications the latter may matter more. For an apples-to-apples comparison (pun not intended) it'd be interesting to know the corresponding number for the M1; while it is ARM and thus RISC, one might still expect that there can be more than one µop per instruction, at least in some cases?
Of course then we might also want to talk about how certain complex instructions on x86 can actually require more than one cycle to decode (at least that was the case for Zen 1) ;-). But I think those are not that common.
Ah well, this is just intellectual curiosity, at the end of the day most of us don't really care, we just want our computers to be as fast as possible ;-).
I have usually heard the top-line number quoted as the issue width, not the decode width (so Zen 2 is a 6-wide issue machine). Most executed instructions come from loops, so the uop cache gives you its full benefit most of the time.
On the Apple chip: I believe the entire M1 decode path is 8-wide, including the dispatch unit, to get the performance it gets. ARM instructions are 4 bytes wide, and don't generally need the same type of micro-op splitting that x86 instructions need, so the frontend on the M1 is probably significantly simpler than the Zen 2 frontend.
Some of the more complex ops may have separate micro-ops, but I don't think they publish that. One thing to note is that ARM cores often do op fusion (x86 cores also do op fusion), but with a fixed issue width, there are very few places where this would move the needle. The textbook example is fusing DIV and MOD into one two-input, two-output instruction (the x86 DIV instruction computes both, but the ARM DIV instruction doesn't).
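To see what that two-input, two-output shape buys in plain arithmetic terms (just an analogy, not a claim about how any particular core implements the fusion):

    a, b = 1234, 7

    # "Fused" form: one operation producing both results, like x86 DIV.
    q, r = divmod(a, b)

    # "Split" form: two dependent operations, like an AArch64 UDIV followed
    # by an MSUB to recover the remainder (r = a - q*b).
    q2 = a // b
    r2 = a - q2 * b

    assert (q, r) == (q2, r2)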
x86 doesn't have fixed-width instructions, so depending on the mix you may be able to decode more of them per cycle. And if you target the common instructions, you can get a lot of benefit in real-world programs.
ARM is different, but probably easier to decode, so you can widen the decoder.
This, I think, is the real answer; for a long time people were saying that "CISC is just compression for RISC, making a virtue of necessity", but it seems like the M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts (and given exclusive access to the world's best manufacturing, TSMC 5nm).
Considering that x86 is less dense than any RISC ISA, the "compression" argument behind CISC falls apart. No surprise a denser, trivial to decode ISA does better.
You have a source for that? The first google result I found for research on that shows it as denser than almost every RISC ISA [1]. It’s just one study and it predates ARM64 fwiw though.
That paper doesn't use actual benchmarks; it grabbed a single system utility and then hand-optimized it. SPEC and Geekbench show x86-64 comes in well over 4 bytes per instruction on average.
Sure, I never claimed it to be the be-all and end-all, just the only real source I could find. Adding "SPEC" or "geekbench" to my search didn't really help.
Doing a little more digging, I have also found this [1], which claims "the results show that the average instruction length is about 2 to 3 bytes". On the other hand, this [2] finds that the average instruction length is 4.25 bytes.
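If anyone wants a ballpark for a binary they actually care about, here's a rough sketch that averages the encoded-byte counts GNU objdump prints per disassembled line. It's approximate (very long instructions that objdump wraps onto a continuation line get counted twice), and the result is only as representative as the binary you feed it:

    import re
    import subprocess
    import sys

    def average_insn_length(path):
        # Parse GNU objdump's default "address: hex-bytes <tab> mnemonic" lines.
        out = subprocess.run(["objdump", "-d", path],
                             capture_output=True, text=True, check=True).stdout
        lengths = []
        for line in out.splitlines():
            m = re.match(r"\s*[0-9a-f]+:\t((?:[0-9a-f]{2} )+)", line)
            if m:
                lengths.append(len(m.group(1).split()))
        return sum(lengths) / len(lengths) if lengths else 0.0

    if __name__ == "__main__":
        print(average_insn_length(sys.argv[1]))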
Bytes per instruction doesn't really say anything useful for code density when talking about RISC vs. CISC though, since (arguably) the whole idea is that individual CISC instructions are supposed to do more than individual RISC instructions. A three instruction CISC routine at five bytes each is still a win over a four instruction RISC routine at four bytes each. Overall code size is what actually matters.
OK, I could see how one could implement a variable-width instruction decoder (e.g. "if there are 8 one-byte instructions in a row, handle them all, otherwise fall back to 4-way decoding" -- of course a much more sophisticated approach could be used).
But is this actually done? I honestly would be interested in a source for that; I just searched again and could find no source supporting it (but of course I may simply not have used the right search terms, I would not be surprised by that in the least). E.g. https://www.agner.org/optimize/microarchitecture.pdf#page216 makes no mention of this for AMD Zen (version 1; it doesn't say anything about Zen 2/3).
I did find various sources which talk about how many instructions / µops can be scheduled at a time, and there it may be 8-way, but that's a completely different metric, isn't it?
As a historical note, the Pentium Pro's P6 microarchitecture uses an interesting approach. It has three decoders, but only one of them can handle "complex macroinstructions" that require micro-operations from the ROM. If a limited-functionality decoder gets a complex instruction, the instruction is redirected to the complex-capable decoder on the next cycle.
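Here's a toy model of that arrangement as described (a sketch only -- the "simple"/"complex" classification and the redirect rule are simplified; real P6 decoding has more constraints than this):

    def decode_cycles(instrs):
        """instrs: list of "simple"/"complex". Cycles to decode them all, with
        three decoder slots per cycle and only slot 0 able to handle complex."""
        cycles, i = 0, 0
        while i < len(instrs):
            cycles += 1
            slot = 0
            while slot < 3 and i < len(instrs):
                if instrs[i] == "complex" and slot != 0:
                    break              # redirect to slot 0 of the next cycle
                slot += 1
                i += 1
        return cycles

    print(decode_cycles(["simple"] * 6))               # 2 cycles
    print(decode_cycles(["simple", "complex"] * 3))    # 4 cycles: each complex
                                                       # instruction that lands
                                                       # off slot 0 waits a cycle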
As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.
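A minimal sketch of why that length step is the hard part on x86 and basically free on a fixed-width ISA: each x86 instruction's start offset depends on the previous instruction's length, so finding boundaries within a fetch window is inherently serial (insn_length below is a hypothetical stand-in; real length decoding has to look at prefixes, opcode, ModRM, SIB, and so on):

    def mark_boundaries_x86(code, insn_length):
        """Start offsets of instructions in a fetch window (serial scan)."""
        starts, pos = [], 0
        while pos < len(code):
            starts.append(pos)
            pos += insn_length(code, pos)   # each step depends on the last
        return starts

    def mark_boundaries_fixed(code, width=4):
        """Fixed-width ISA (e.g. AArch64): every boundary is known up front."""
        return list(range(0, len(code), width))

    # Demo with fake lengths (pretend each byte value is the insn length):
    print(mark_boundaries_x86(bytes([1, 3, 2, 4, 5]), lambda c, p: c[p]))  # [0, 1, 4]

Real hardware fights that serial dependence with predecode bits and speculating on multiple possible start offsets, but that's exactly the extra work a fixed-width ISA never has to do.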
> As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.
And how fast is that able to run on x86? How many instructions can that process at once, compared to an alternate universe where that circuit has the same transistor and time budget but only has to look at the first four bits of an instruction?
Edited "decode" to "issue" for clarity.