One of the advantages of the M1 is that its instructions are a fixed size. With x86 you have to deal with instructions that can be anywhere between 1 and 15 bytes.
As I understand it, this means you can't look at the next insn until you've decoded the current one far enough to know where it ends. Meanwhile, ARM can decode insns in parallel.
Now I wonder why x64 can't re-encode its instructions: put a flag somewhere that enables the new encoding, but keep the semantics. That would keep the cost of switching low. There would be some trouble, e.g. a fixed-width instruction word can't carry a full 64-bit immediate, so large constants would have to be built across several instructions. But mostly it seems manageable.
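To illustrate that immediate problem (a toy sketch, not any real encoder): a fixed 32-bit instruction word has no room for a 64-bit constant, so fixed-width ISAs split it into chunks, the way AArch64 does with MOVZ/MOVK and 16-bit pieces, while x86-64's variable-length encoding can carry the whole value inline in one 10-byte movabs.

```python
# Sketch: splitting a 64-bit immediate into 16-bit chunks, as a
# fixed-width encoding (e.g. AArch64's MOVZ/MOVK) has to do.

def split_immediate(value: int) -> list[tuple[int, int]]:
    """Return (shift, chunk) pairs covering a 64-bit value."""
    return [(shift, (value >> shift) & 0xFFFF) for shift in range(0, 64, 16)]

for shift, chunk in split_immediate(0xDEADBEEFCAFEBABE):
    op = "MOVZ" if shift == 0 else "MOVK"
    print(f"{op} x0, #{chunk:#06x}, LSL #{shift}")  # four 4-byte insns
```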
> As I understand it, this means you can't look at the next insn until you've decoded the current one far enough to know where it ends. Meanwhile, ARM can decode insns in parallel.
You're right, but not for that reason. The important part is the predecoder, which does exactly this merging of bytes into macro-ops. Each cycle, Skylake can convert at most 16 bytes into at most 6 macro-ops. These macro-ops are then passed to the instruction decoders.
Which is impressive, if you think about it. But it is also complicated machinery for something that's basically free when insns are fixed width and you can wire the fetch buffer straight into the instruction decoders. Expanding the predecoder to 32 bytes would take a lot of extra hardware, while with fixed width it just means a few more wires.
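To make the serial dependency concrete, here's a toy Python model of such a predecoder. The `insn_length` function is made up for illustration (real x86 length decoding means looking at prefixes, ModRM, SIB, etc., which is the expensive part in hardware); the shape of the loop is the point.

```python
# Toy model of a variable-length predecoder: one cycle consumes at
# most 16 bytes and emits at most 6 macro-ops.

def insn_length(buf: bytes, pos: int) -> int:
    """Hypothetical length decoder: here the first byte simply encodes
    the instruction's total length (1..15), purely for illustration."""
    return max(1, min(15, buf[pos]))

def predecode_cycle(buf: bytes, pos: int) -> list[tuple[int, int]]:
    """Note the loop is inherently serial: each iteration needs the
    length of the previous instruction to find its own start."""
    macro_ops = []
    window_end = pos + 16  # simplified; real HW also handles straddling
    while len(macro_ops) < 6 and pos < window_end and pos < len(buf):
        n = insn_length(buf, pos)
        if pos + n > len(buf):
            break  # instruction runs past the buffer; wait for more bytes
        macro_ops.append((pos, n))
        pos += n
    return macro_ops

print(predecode_cycle(bytes([1, 3, 0, 0, 2, 7, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]), 0))
# -> [(0, 1), (1, 3), (4, 2), (6, 1), (7, 1), (8, 1)]
```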
The same technique could be extended to cover all of them, and it's not that difficult to implement in Verilog.
As long as this state machine runs at the same throughput as the icache bandwidth, it is not the bottleneck. That shouldn't be too difficult to achieve.
But it is definitely extra complexity, and requires space and power.
Note how this returns a length, i.e. you can't start the state machine that predecodes the next instruction until you've finished decoding the current one. That means longer delays when predecoding more macro-ops per cycle. I don't know what the gate propagation delays are compared to the length of a clock cycle, but this is a very critical path, so I assume it hurts.
Then again, both Intel and AMD make it work, so there must be a way, if you're willing to pay the hardware cost. Now that I think about it, the same linear-to-logarithmic trick used in carry-lookahead adders can be applied here: put a state machine in front of every possible byte, and throw away any result where the previous predecoder said to skip.
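Roughly: compute speculatively at every position, then select. A toy version of the idea in Python (same made-up `insn_length` as above; real hardware would do phase 1 as replicated combinational logic, not a loop):

```python
# Speculative predecode: length-decode at EVERY byte offset "in
# parallel", then a much shallower select pass keeps only the
# offsets that turn out to be real instruction starts.

def insn_length(buf: bytes, pos: int) -> int:
    return max(1, min(15, buf[pos]))  # same toy length decoder as above

def predecode_parallel(buf: bytes) -> list[int]:
    # Phase 1: all independent, so in hardware these run concurrently,
    # one small length decoder in front of every byte.
    lengths = [insn_length(buf, pos) for pos in range(len(buf))]

    # Phase 2: follow the precomputed lengths to mark true starts and
    # discard the speculative results at every other offset. This still
    # chains, but each step is now a trivial lookup, not a full decode.
    starts, pos = [], 0
    while pos < len(buf):
        starts.append(pos)
        pos += lengths[pos]
    return starts

print(predecode_parallel(bytes([1, 3, 0, 0, 2, 7, 0, 0, 0])))
# -> [0, 1, 4, 6, 7, 8]
```

The select pass is itself a parallel-prefix problem (like the carries in an adder), so presumably it can be flattened further, which would be how the real 16-byte predecoders hit their cycle time.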
That's a good solution and it probably wouldn't be too expensive, relative to a Xeon.
This also demonstrates where it really hurts: when you want to do something low-cost and very low-power, with a small die. And that's where ARM and RISC-V shine. The same ISA (and therefore, in theory, the same toolchain) can cover everything from the tiniest microcontroller to a huge server. That is not the case for x86.
The implication of their comment is that x86_64 is EoL and a new architecture is necessary for improvements to continue (this is not my own opinion, just how I read the comment).
People say this without thinking it through. There is no real evidence at all that it is true.
Something like x86 support on an IA-64 chip costs extra transistors. But there is no fundamental reason why it should make anything slower.
This is even more true for AVX512 instructions, which aren't backwards compatible in any way to begin with.
So how, exactly, would dropping backwards compatibility speed up AVX512 division?