One of the advantages of the M1 is that its instructions are a fixed size. With x86 you have to deal with instructions that can be anywhere between 1 and 15 bytes.
As I understand it, this means you can't look at the next insn until you've decoded the current one far enough to know where it ends. Meanwhile, ARM can decode insns in parallel.
Now I wonder why x64 can't re-encode its instructions: put a flag somewhere that enables the new encoding, but keep the semantics. That would keep the cost of switching low. There would be some trouble, e.g. a fixed-width instruction word can't carry a full 64-bit immediate, so large constants would have to be built across several instructions. But mostly it seems manageable.
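To illustrate that immediate problem (a toy sketch, not any real encoder): a fixed 32-bit instruction word has no room for a 64-bit constant, so fixed-width ISAs split it into chunks, the way AArch64 does with MOVZ/MOVK and 16-bit pieces, while x86-64's variable-length encoding can carry the whole value inline in one 10-byte movabs.

```python
# Sketch: splitting a 64-bit immediate into 16-bit chunks, as a
# fixed-width encoding (e.g. AArch64's MOVZ/MOVK) has to do.

def split_immediate(value: int) -> list[tuple[int, int]]:
    """Return (shift, chunk) pairs covering a 64-bit value."""
    return [(shift, (value >> shift) & 0xFFFF) for shift in range(0, 64, 16)]

for shift, chunk in split_immediate(0xDEADBEEFCAFEBABE):
    op = "MOVZ" if shift == 0 else "MOVK"
    print(f"{op} x0, #{chunk:#06x}, LSL #{shift}")  # four 4-byte insns
```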
> As I understand it, this means you can't look at the next insn until you've decoded the current one far enough to know where it ends. Meanwhile, ARM can decode insns in parallel.
You're right, but not for that reason. The important part is the predecoder, which does exactly this merging of bytes into macro-ops. Each cycle, Skylake can convert at most 16 bytes into at most 6 macro-ops. These macro-ops are then passed to the instruction decoders.
Which is impressive, if you think about it. But it is also complicated machinery for something that's basically free when insns are fixed width and you can wire the fetch buffer straight into the instruction decoders. Expanding the predecoder to 32 bytes would take a lot of extra hardware, while with fixed width it just means a few more wires.
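To make the serial dependency concrete, here's a toy Python model of such a predecoder. The `insn_length` function is made up for illustration (real x86 length decoding means looking at prefixes, ModRM, SIB, etc., which is the expensive part in hardware); the shape of the loop is the point.

```python
# Toy model of a variable-length predecoder: one cycle consumes at
# most 16 bytes and emits at most 6 macro-ops.

def insn_length(buf: bytes, pos: int) -> int:
    """Hypothetical length decoder: here the first byte simply encodes
    the instruction's total length (1..15), purely for illustration."""
    return max(1, min(15, buf[pos]))

def predecode_cycle(buf: bytes, pos: int) -> list[tuple[int, int]]:
    """Note the loop is inherently serial: each iteration needs the
    length of the previous instruction to find its own start."""
    macro_ops = []
    window_end = pos + 16  # simplified; real HW also handles straddling
    while len(macro_ops) < 6 and pos < window_end and pos < len(buf):
        n = insn_length(buf, pos)
        if pos + n > len(buf):
            break  # instruction runs past the buffer; wait for more bytes
        macro_ops.append((pos, n))
        pos += n
    return macro_ops

print(predecode_cycle(bytes([1, 3, 0, 0, 2, 7, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]), 0))
# -> [(0, 1), (1, 3), (4, 2), (6, 1), (7, 1), (8, 1)]
```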
The same technique could be extended to cover all of them, and it's not that difficult to implement in Verilog.
As long as this state machine runs at the same throughput as the icache bandwidth, it is not the bottleneck. That shouldn't be too difficult to achieve.
But it is definitely extra complexity, and requires space and power.
Note how this returns a length, i.e. you can't start the state machine that predecodes the next instruction until you've finished decoding the current one. That means longer delays when predecoding more macro-ops per cycle. I don't know what the gate propagation delays are compared to the length of a clock cycle, but this is a very critical path, so I assume it hurts.
Then again, both Intel and AMD make it work, so there must be a way, if you're willing to pay the hardware cost. Now that I think about it, the same linear-to-logarithmic trick used in carry-lookahead adders can be applied here: put a state machine in front of every possible byte, and throw away any result where the previous predecoder said to skip.
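Roughly: compute speculatively at every position, then select. A toy version of the idea in Python (same made-up `insn_length` as above; real hardware would do phase 1 as replicated combinational logic, not a loop):

```python
# Speculative predecode: length-decode at EVERY byte offset "in
# parallel", then a much shallower select pass keeps only the
# offsets that turn out to be real instruction starts.

def insn_length(buf: bytes, pos: int) -> int:
    return max(1, min(15, buf[pos]))  # same toy length decoder as above

def predecode_parallel(buf: bytes) -> list[int]:
    # Phase 1: all independent, so in hardware these run concurrently,
    # one small length decoder in front of every byte.
    lengths = [insn_length(buf, pos) for pos in range(len(buf))]

    # Phase 2: follow the precomputed lengths to mark true starts and
    # discard the speculative results at every other offset. This still
    # chains, but each step is now a trivial lookup, not a full decode.
    starts, pos = [], 0
    while pos < len(buf):
        starts.append(pos)
        pos += lengths[pos]
    return starts

print(predecode_parallel(bytes([1, 3, 0, 0, 2, 7, 0, 0, 0])))
# -> [0, 1, 4, 6, 7, 8]
```

The select pass is itself a parallel-prefix problem (like the carries in an adder), so presumably it can be flattened further, which would be how the real 16-byte predecoders hit their cycle time.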
That's a good solution and it probably wouldn't be too expensive, relative to a Xeon.
This also demonstrates where it really hurts: when you want to do something low-cost and very low-power, with a small die. And that's where ARM and RISC-V shine. The same ISA (and therefore, in theory, the same toolchain) can cover everything from the tiniest microcontroller to a huge server. That is not the case for x86.
The implication of their comment is that x86_64 is EoL and a new architecture is necessary for improvements to continue (this is not my own opinion, just how I read the comment).
People say this without thinking it through. There is no real evidence at all that it is true.
Something like x86 support on an IA-64 chip costs extra transistors. But there is no fundamental reason why it should make anything slower.
This is even more true for AVX512 instructions, which aren't backwards compatible in any way to begin with.
So how, exactly, would dropping backwards compatibility speed up AVX512 division?