Zen 2 has 8-wide issue in many places, and Ice Lake moves up to 6-wide. Intel/AMD have had 4-wide decode and issue widths for 10 years, and I'm glad they're moving to wider machines.
Could you explain what you mean by "8-wide decode in many places"? How is that possible? Isn't instruction decoding kind of always the same width, i.e. always 4-wide or always 8-wide, but not sometimes one and sometimes the other?
All sources I could find say it is 4-wide, so I'd also be interested if you could perhaps give a link to a source?
The actual instruction decoder is 4-wide. However, the micro-op cache has 8-wide issue, and the dispatch unit can issue 6 instructions per cycle (and can retire 8 per cycle to avoid ever being retire-bound). In practice, Zen 2 generally acts like a 6-wide machine.
Oh, on this terminology: x86 instructions are 1-15 bytes wide (averaging around 3-4 bytes in most code). n-wide decode refers to decoding n instructions at a time.
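To make those width numbers concrete, here's a toy back-of-the-envelope model (a sketch only -- the stage widths are the Zen 2 figures quoted above, but the op-cache hit rate is just an illustrative knob, not a measured value):

    # Toy model: sustained instructions per cycle is capped by the narrowest
    # stage the instruction stream actually has to pass through.
    def sustained_ipc(decode_width, op_cache_width, dispatch_width,
                      retire_width, op_cache_hit_rate):
        # Frontend supply is a blend of the legacy decoders and the op cache.
        frontend = (op_cache_hit_rate * op_cache_width
                    + (1 - op_cache_hit_rate) * decode_width)
        # The backend can't dispatch or retire more than its own widths allow.
        return min(frontend, dispatch_width, retire_width)

    # With a warm op cache the 4-wide decoders stop being the bottleneck and
    # the 6-wide dispatch stage sets the ceiling; cold, you're decode-bound.
    print(sustained_ipc(4, 8, 6, 8, 0.9))   # -> 6
    print(sustained_ipc(4, 8, 6, 8, 0.0))   # -> 4.0

That's the sense in which it "generally acts like a 6-wide machine" in practice.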
Thanks for the link! Yeah, those are basically the numbers I also found -- although the number of instructions decoded per clock cycle is a different metric from the number of µops that can be issued, so that feels a bit like moving the goalposts.
But, fair enough, for practical applications the latter may matter more. For an apples-to-apples comparison (pun not intended) it'd be interesting to know the corresponding number for the M1; while it is ARM and thus RISC, one might still expect that there can be more than one µop per instruction, at least in some cases?
Of course then we might also want to talk about how certain complex instructions on x86 can actually require more than one cycle to decode (at least that was the case for Zen 1) ;-). But I think those are not that common.
Ah well, this is just intellectual curiosity, at the end of the day most of us don't really care, we just want our computers to be as fast as possible ;-).
I have usually heard the top-line number quoted as the issue width, not the decode width (so Zen 2 is a 6-wide issue machine). Most executed instructions come from loops, so the uop cache gives you its full benefit most of the time.
On the Apple chip: I believe the entire M1 decode path is 8-wide, including the dispatch unit, to get the performance it gets. ARM instructions are 4 bytes wide, and don't generally need the same type of micro-op splitting that x86 instructions need, so the frontend on the M1 is probably significantly simpler than the Zen 2 frontend.
Some of the more complex ops may have separate micro-ops, but I don't think they publish that. One thing to note is that ARM cores often do op fusion (x86 cores also do op fusion), but with a fixed issue width, there are very few places where this would move the needle. The textbook example is fusing DIV and MOD into one two-input, two-output instruction (the x86 DIV instruction computes both, but the ARM DIV instruction doesn't).
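To see what that two-input, two-output shape buys in plain arithmetic terms (just an analogy, not a claim about how any particular core implements the fusion):

    a, b = 1234, 7

    # "Fused" form: one operation producing both results, like x86 DIV.
    q, r = divmod(a, b)

    # "Split" form: two dependent operations, like an AArch64 UDIV followed
    # by an MSUB to recover the remainder (r = a - q*b).
    q2 = a // b
    r2 = a - q2 * b

    assert (q, r) == (q2, r2)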
x86 doesn't have fixed-width instructions, so depending on the mix you may be able to decode more of them per cycle. And if you target the common instructions, you can get a lot of benefit in real-world programs.
ARM is different, but probably easier to decode, so you can widen the decoder.
This, I think, is the real answer; for a long time people were saying that "CISC is just compression for RISC, making a virtue of necessity", but it seems like the M1 serves as a good counterexample where a simpler ISA is scaled up to modern transistor counts (and given exclusive access to the world's best manufacturing, TSMC 5nm).
Considering that x86 is less dense than any RISC ISA, the "compression" argument behind CISC falls apart. No surprise a denser, trivial to decode ISA does better.
You have a source for that? The first google result I found for research on that shows it as denser than almost every RISC ISA [1]. It’s just one study and it predates ARM64 fwiw though.
That paper doesn't use actual benchmarks; it grabbed a single system utility and then hand-optimized it. SPEC and Geekbench show x86-64 comes in well over 4 bytes per instruction on average.
Sure, I never claimed it to be the be-all and end-all, just the only real source I could find. Adding "SPEC" or "geekbench" to my search didn't really help.
Doing a little more digging, I have also found this [1], which claims "the results show that the average instruction length is about 2 to 3 bytes". On the other hand, this [2] finds that the average instruction length is 4.25 bytes.
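If anyone wants a ballpark for a binary they actually care about, here's a rough sketch that averages the encoded-byte counts GNU objdump prints per disassembled line. It's approximate (very long instructions that objdump wraps onto a continuation line get counted twice), and the result is only as representative as the binary you feed it:

    import re
    import subprocess
    import sys

    def average_insn_length(path):
        # Parse GNU objdump's default "address: hex-bytes <tab> mnemonic" lines.
        out = subprocess.run(["objdump", "-d", path],
                             capture_output=True, text=True, check=True).stdout
        lengths = []
        for line in out.splitlines():
            m = re.match(r"\s*[0-9a-f]+:\t((?:[0-9a-f]{2} )+)", line)
            if m:
                lengths.append(len(m.group(1).split()))
        return sum(lengths) / len(lengths) if lengths else 0.0

    if __name__ == "__main__":
        print(average_insn_length(sys.argv[1]))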
Bytes per instruction doesn't really say anything useful for code density when talking about RISC vs. CISC though, since (arguably) the whole idea is that individual CISC instructions are supposed to do more than individual RISC instructions. A three instruction CISC routine at five bytes each is still a win over a four instruction RISC routine at four bytes each. Overall code size is what actually matters.
OK, I could see how one could implement a variable-width instruction decoder (e.g. "if there are 8 one-byte instructions in a row, handle them all, otherwise fall back to 4-way decoding" -- of course a much more sophisticated approach could be used).
But is this actually done? I honestly would be interested in a source for that; I just searched again and could find no source supporting it (but of course I may simply not have used the right search terms, I would not be surprised by that in the least). E.g. https://www.agner.org/optimize/microarchitecture.pdf#page216 makes no mention of this for AMD Zen (version 1; it doesn't say anything about Zen 2/3).
I did find various sources which talk about how many instructions / µops can be scheduled at a time, and there it may be 8-way, but that's a completely different metric, isn't it?
As a historical note, the Pentium Pro's P6 microarchitecture uses an interesting approach. It has three decoders, but only one of them can handle "complex macroinstructions" that require micro-operations from the ROM. If a limited-functionality decoder gets a complex instruction, the instruction is redirected to the complex-capable decoder on the next cycle.
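Here's a toy model of that arrangement as described (a sketch only -- the "simple"/"complex" classification and the redirect rule are simplified; real P6 decoding has more constraints than this):

    def decode_cycles(instrs):
        """instrs: list of "simple"/"complex". Cycles to decode them all, with
        three decoder slots per cycle and only slot 0 able to handle complex."""
        cycles, i = 0, 0
        while i < len(instrs):
            cycles += 1
            slot = 0
            while slot < 3 and i < len(instrs):
                if instrs[i] == "complex" and slot != 0:
                    break              # redirect to slot 0 of the next cycle
                slot += 1
                i += 1
        return cycles

    print(decode_cycles(["simple"] * 6))               # 2 cycles
    print(decode_cycles(["simple", "complex"] * 3))    # 4 cycles: each complex
                                                       # instruction that lands
                                                       # off slot 0 waits a cycle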
As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.
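A minimal sketch of why that length step is the hard part on x86 and basically free on a fixed-width ISA: each x86 instruction's start offset depends on the previous instruction's length, so finding boundaries within a fetch window is inherently serial (insn_length below is a hypothetical stand-in; real length decoding has to look at prefixes, opcode, ModRM, SIB, and so on):

    def mark_boundaries_x86(code, insn_length):
        """Start offsets of instructions in a fetch window (serial scan)."""
        starts, pos = [], 0
        while pos < len(code):
            starts.append(pos)
            pos += insn_length(code, pos)   # each step depends on the last
        return starts

    def mark_boundaries_fixed(code, width=4):
        """Fixed-width ISA (e.g. AArch64): every boundary is known up front."""
        return list(range(0, len(code), width))

    # Demo with fake lengths (pretend each byte value is the insn length):
    print(mark_boundaries_x86(bytes([1, 3, 2, 4, 5]), lambda c, p: c[p]))  # [0, 1, 4]

Real hardware fights that serial dependence with predecode bits and speculating on multiple possible start offsets, but that's exactly the extra work a fixed-width ISA never has to do.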
> As far as variable-length instructions, a separate Instruction Length Decoder sorts that out before decoding.
And how fast is that able to run on x86? How many instructions can that process at once, compared to an alternate universe where that circuit has the same transistor and time budget but only has to look at the first four bits of an instruction?
Edited "decode" to "issue" for clarity.