Instruction decoding is a bottleneck for x86 these days: Apple M1 can do 8-wide decode, Intel just managed to reach 6-wide in Alder Lake, and AMD Zen 3 only has a 4-wide decoder. One would think that dropping legacy 16-bit and 32-bit instructions would enable simpler and more efficient instruction decoders in future x86 versions.
Sadly not. x86_64’s encoding is extremely similar to the legacy encodings. AIUI the fundamental problem is that x86 is a variable-length encoding, so a fully parallel decoder needs to decode at guessed offsets, many of which will be wrong. ARM64 instructions are a fixed four bytes and always aligned, so the decoder knows every instruction boundary in advance.
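To make the guessed-offsets point concrete, here is a toy model in C. The length rule below is invented purely for illustration (real x86 length decoding involves prefixes, opcode maps and ModR/M, and is far messier), but the shape of the problem is the same: a wide decoder provisionally decodes at every byte offset of the fetch window, and only afterwards can it tell which of those speculative decodes landed on a real instruction boundary.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy variable-length encoding: the low two bits of the first byte give
     * the instruction length (1..4 bytes). This is NOT real x86. */
    static int toy_insn_len(uint8_t first_byte) {
        return (first_byte & 0x3) + 1;
    }

    int main(void) {
        uint8_t window[16] = {             /* one fetch window of raw bytes */
            0x03, 0x10, 0x20, 0x30, 0x01, 0x05, 0x00, 0x02,
            0x11, 0x22, 0x00, 0x03, 0x07, 0x08, 0x09, 0x01 };

        /* Step 1: speculatively "decode" at every offset. Hardware does this
         * in parallel, one length decoder per byte position; most of the
         * results are garbage because they don't start on a real boundary. */
        int len_at[16];
        for (int i = 0; i < 16; i++)
            len_at[i] = toy_insn_len(window[i]);

        /* Step 2: only by chaining from a known starting point can we tell
         * which of those speculative decodes were real instructions. */
        printf("real instruction starts:");
        for (int off = 0; off < 16; off += len_at[off])
            printf(" %d", off);
        printf("\n");
        return 0;
    }

With a fixed 4-byte ISA, step 1 disappears entirely: every fourth offset is known to be a boundary before any decoding happens.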
Dumping legacy features would be great for all kinds of reasons, but not this particular reason.
Legacy features are still needed for 16-bit DOS and Windows programs. You can cut out 16-bit legacy mode and use DOSBox (https://www.dosbox.com/) for your 16-bit programs via emulation.
Considering this ‘needed’ legacy application is running code made for an 8 MHz processor, you can trivially get away with emulating that processor on the multi-GHz CPU in a recent computer.
By the way, 64-bit Windows doesn’t run 16-bit applications and never has. The only relevant 16-bit code is InstallShield, and that is emulated using trickery. For the rest you need to emulate a whole computer.
But Linux does run 16-bit code natively, even on x86_64, via modify_ldt(). The kernel support is straightforward, and the user side is actively supported by DOSEMU2.
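For the curious, here is a minimal sketch of what that interface looks like: installing one 16-bit code segment in the LDT via the raw syscall (there is no glibc wrapper). The mmap'd low region and the lack of error recovery are simplifications for illustration; something like DOSEMU2 obviously does far more than this.

    #define _GNU_SOURCE
    #include <asm/ldt.h>       /* struct user_desc, MODIFY_LDT_CONTENTS_CODE */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        /* Some low memory to act as the segment base (assumption: MAP_32BIT
         * keeps the address representable in the 32-bit base_addr field). */
        void *base = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE | PROT_EXEC,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        struct user_desc desc;
        memset(&desc, 0, sizeof desc);
        desc.entry_number   = 0;                              /* LDT slot 0     */
        desc.base_addr      = (unsigned int)(uintptr_t)base;  /* segment base   */
        desc.limit          = 0xffff;                         /* 64 KiB limit   */
        desc.seg_32bit      = 0;                              /* 16-bit segment */
        desc.contents       = MODIFY_LDT_CONTENTS_CODE;
        desc.read_exec_only = 0;

        /* modify_ldt() func 1 writes an LDT entry. */
        if (syscall(SYS_modify_ldt, 1, &desc, sizeof desc) != 0) {
            perror("modify_ldt");
            return 1;
        }

        /* The ring-3 selector for LDT entry 0 is (0 << 3) | 4 | 3 = 0x07.
         * Something like DOSEMU2 would now far-call into it to run 16-bit code. */
        printf("installed a 16-bit code segment at LDT entry 0 (selector 0x07)\n");
        return 0;
    }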
Sadly, there was a serious design error in the original x86_64 support for i387 floating point, and it was never possible for a 64-bit kernel to correctly context switch some of the more obscure legacy floating point exception state. Rather than fixing it, both AMD and Intel nerfed the hardware support, and new CPUs no longer fully support this state. This makes some older 16-bit software work poorly.
Then again, there is no 16-bit Linux software, as there has never been a 16-bit Linux, and 16-bit software for other operating systems is trivially emulated without requiring hardware support. It might all be somewhat of an interesting exercise, but it doesn’t have real-world value.
This suggests another way forward: re-encode the existing opcodes with new, more regular byte sequences, e.g. 32 bits per instruction, with some escape for things like 64-bit constants. You'll have to redo the backend of the assembler, but most of the compiler and optimization wisdom can be reused as-is. Of course, this breaks backward compatibility completely, so the high-performance mode can only be unlocked for recompiled code.
That was Itanium, and it failed for a variety of reasons; one of which was a compatibility layer that sucked. You can't get rid of x86's backwards compatibility. Intel and AMD have done their best by using vector prefixes (like VEX and EVEX)[a] that massively simplify decoding, but there's only so much that can be done.
People get caught up in the variable-length issue that x86 has, and then claim that the M1 beats x86 because of that. Sure, decoding ARM instructions is easier than decoding x86, but the variable-length aspect is handled in the predecode/cache stage, not the actual decoder. By the time the decoder reaches an instruction, the boundaries and various other bits of info have already been marked.
The RISC vs CISC debate is useless today. M1's big advantage comes from the memory ordering model (and other things)[0], not the instruction format. Apple actually had to create a special mode for the M1 (for Rosetta 2) that enforces the x86 ordering model (TSO with load forwarding), and native performance is slightly worse when doing so.
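To make the ordering point concrete, here is the classic message-passing litmus test written with deliberately relaxed C11 atomics. At the hardware level, x86's TSO never lets the consumer observe the flag without also observing the data, whereas a weakly ordered ARM core may; that gap is exactly why a binary translator must either insert barriers on nearly every memory access or rely on a TSO mode. (This is a generic illustration, not the test program discussed further down the thread.)

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Message-passing litmus test: producer publishes data, then a flag. */
    static atomic_int data = 0;
    static atomic_int flag = 0;

    static void *producer(void *arg) {
        (void)arg;
        atomic_store_explicit(&data, 42, memory_order_relaxed);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&flag, memory_order_relaxed))
            ;  /* spin until the flag becomes visible */
        int d = atomic_load_explicit(&data, memory_order_relaxed);
        /* On x86 hardware (TSO) the two stores cannot be observed out of
         * order, so d is 42 (ignoring compiler reordering of relaxed atomics;
         * this is a hardware-level illustration). On a weakly ordered ARM
         * core, d == 0 is a perfectly legal outcome without extra barriers. */
        printf("data = %d\n", d);
        return NULL;
    }

    int main(void) {   /* build with: cc -pthread litmus.c */
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }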
[a]: There are also others that predate AVX (VEX), such as the 0F38 prefix group, consisting only of opcodes that have a ModR/M byte and no immediate, and the 0F3A prefix group, which is the same but with an 8-bit immediate.
* IA-64 failed primarily because it failed to deliver the promised performance. x86 compatibility isn't and wasn't essential to success (behold the success of Arm, for example).
* M1's advantage has almost nothing to do with the weak memory model; rather, it has to do with everything: wider, deeper, faster (memory). The ISA being Arm64 also helps in many ways. The variable-length x86 instructions can be dealt with via predecoding, sure, to an extent, but that lengthens the pipeline, which hurts the branch-mispredict penalty, and that absolutely matters.
Think of it more as a re-encoding than a full re-architecture. Every existing 16/32/64-bit instruction should receive some place in that re-encoded opcode map, including abominations such as aaa. But the whole prefix business, for example, should go away. The processor's predecode stage should cease to exist.
Let's roughly sketch a proposal, even though many variants are possible:
Instructions start on 64-bit boundaries. If the highest bit is 0 you have an instruction, else a constant, so the whole thing becomes self-synchronizing. A 64-bit constant is now of course missing one bit; that bit has to find a place elsewhere in the encoding.
This means you have 62 bits left. Presuming the prefixes like segment regs get solved, you could encode two instructions in there, one of them a nop if you need the constant. Do away with short encodings for ax. Try to avoid double encodings where reg + modrm combos exist. Above all, resist the temptation to mess with the instruction set; just re-encode.
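A minimal sketch of what the decode side of such a format might look like, with every field width invented purely to match the rough numbers above (one tag bit, one spare bit for a constant's missing bit, and two 31-bit instruction slots):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical 64-bit bundle, widths invented to match the sketch above:
     *   bit 63      : 1 = the word is a constant, 0 = it holds instructions
     *   bit 62      : spare bit that completes a neighbouring constant
     *   bits 31..61 : second 31-bit instruction slot
     *   bits  0..30 : first 31-bit instruction slot                        */
    typedef struct {
        int      is_const;
        uint64_t constant;   /* 63 low bits; the missing top bit lives elsewhere */
        uint32_t insn[2];    /* two re-encoded 31-bit instructions              */
        unsigned spare_bit;  /* the bit lent to a constant in another word      */
    } bundle_t;

    static bundle_t decode_bundle(uint64_t word) {
        bundle_t b = {0};
        if (word >> 63) {
            b.is_const = 1;
            b.constant = word & 0x7fffffffffffffffULL;
        } else {
            b.insn[0]   = (uint32_t)(word & 0x7fffffff);
            b.insn[1]   = (uint32_t)((word >> 31) & 0x7fffffff);
            b.spare_bit = (unsigned)((word >> 62) & 1);
        }
        return b;
    }

    int main(void) {
        bundle_t b = decode_bundle(0x4000000080000001ULL);
        printf("const? %d  insn0=%08x  insn1=%08x  spare=%u\n",
               b.is_const, b.insn[0], b.insn[1], b.spare_bit);
        return 0;
    }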
The entire approach is misguided for single-threaded performance. It turns out that out-of-order execution is pretty important for a number of things, perhaps most importantly dealing with variable memory instruction latencies (cache hits at various points in the hierarchy vs. misses). A compiler simply cannot statically predict those well enough.
M1 doesn't have a special mode for Rosetta. All code is executed with x86 TSO on M1's application processors. How do I know this?
Well, did you know Apple ported Rosetta 2 to Linux? You can get it by running a Linux VM on macOS. It does not require any kernel changes to support in VMs, and if you extract the binary to run it on Asahi Linux, it works just fine too. None of the Asahi team did anything to support x86 TSO. Rosetta also works just fine in m1n1's hypervisor mode, which exists specifically to log all hardware access to detect these sorts of things. If there is a hardware toggle for TSO, it's either part of the chicken bits (and thus enabled all the time anyway) or turned on by iBoot (and thus enabled before any user code runs).
Related point: Hector Martin just upstreamed a patch to Linux that fixes a memory ordering bug in workqueues that's been around since before Linux had Git history. He also found a bug in some ARM litmus tests that he was using to validate whether or not they were implemented correctly. Both of those happened purely because M1 and M2 are so hilariously wide and speculative that they trigger memory reorders no other CPU would.
I'm sorry, but please cite some sources, because this contradicts everything that's been said about M1's x86 emulation that I've read so far.
> Well, did you know Apple ported Rosetta 2 to Linux? You can get it by running a Linux VM on macOS. It does not require any kernel changes to support in VMs, and if you extract the binary to run it on Asahi Linux, it works just fine too. None of the Asahi team did anything to support x86 TSO. Rosetta also works just fine in m1n1's hypervisor mode, which exists specifically to log all hardware access to detect these sorts of things. If there is a hardware toggle for TSO, it's either part of the chicken bits (and thus enabled all the time anyway) or turned on by iBoot (and thus enabled before any user code runs).
Apple tells you to attach a special volume/FS to your Linux VM in order for Rosetta to work. When such a volume is attached, it runs the VM in TSO mode. As simple as that.
The Rosetta binary itself doesn't know whether or not TSO is enabled, so it's not surprising that it runs fine under Asahi. As marcan42 himself said on twitter[1], most x86 applications will run fine even without TSO enabled. You're liable to run into edge cases in heavily multithreaded code, though.
> Both of those happened purely because M1 and M2 are so hilariously wide and speculative that they trigger memory reorders no other CPU would.
In other words, they're not constantly running in TSO mode? Because if they were, why would they trigger such re-orders?
EDIT: I've just run a modified version of the following test program[2] (removing the references to the tso_enable sysctl which requires an extension), both native and under Rosetta.
Running natively, it fails after ~3500 iterations. Under Rosetta, it completes the entire test successfully.
To clarify, the sysctl is related to a 3rd-party kernel extension. That kernel extension is a convenient wrapper around the actual interface, which is a per-thread flag that can be turned on and off.
And not one of those threads indicates that there are no Rosetta-related instructions. The first would indicate that marcan believes that M1 chips do have TSO-related instructions. The second discusses the fact that Rosetta runs on non-Apple CPUs without TSO enabled, meaning that its running on Asahi Linux without Asahi Linux having TSO support does not show that M1 chips always run in TSO mode, but rather that Rosetta itself has no way of detecting it, so it will attempt to run regardless. The third repeats what I just said.
The twitter threads provide more evidence against your points than for them.
I always wondered if there was a case to be made for a JIT that recompiled "normal" x86-64 into a restricted subset designed to be more optimizable.
This could be paired with a sort of two-tiered processor design: really fast on the restricted subset, but mediocre-to-bad performance on anything else. The code only has to activate the "mediocre-to-bad" paths long enough to build a JIT cache, though.
It's more complex than that, if you'll excuse the pun. Instructions on CISC cores aren't 1:1 with RISC instructions, and tend to encode quite a few more micro-ops. Something like inc dword [rbp+16] is one instruction, but cracks into a minimum of three micro-ops (and would be three RISC instructions as well).
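Purely as an illustration of that cracking, here is the same operation written out as its three constituent steps in C; the flat byte-array "memory" and the helper names are made up for the sketch:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    static uint8_t mem[64];   /* toy flat memory; rbp points into it */

    static uint32_t load32(size_t addr)              { uint32_t v; memcpy(&v, &mem[addr], 4); return v; }
    static void     store32(size_t addr, uint32_t v) { memcpy(&mem[addr], &v, 4); }

    int main(void) {
        size_t rbp = 8;

        /* `inc dword [rbp+16]`: one x86 instruction, three micro-ops,
         * which is also roughly what three RISC instructions would do. */
        uint32_t tmp = load32(rbp + 16);   /* micro-op 1: load          */
        tmp = tmp + 1;                     /* micro-op 2: ALU increment */
        store32(rbp + 16, tmp);            /* micro-op 3: store         */

        printf("value after inc: %u\n", load32(rbp + 16));
        return 0;
    }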
Long story short, this isn't really the bottleneck, or we'd see more simple decoders on the tail end of the decode window.
To someone who is interested in bare metal, can you explain the significance of this? Is this how much data a CPU can handle simultaneously? Via instructions from the kernel?
It means how many instructions the CPU can decode at the same time: roughly, "figure out what they mean and dispatch the work they describe to the functional units of the CPU that will actually perform it". It is not directly how much data a superscalar CPU can handle in parallel, but it still plays a role: there is a fixed number of functional units in the CPU, and if you cannot keep them busy with decoded instructions, they sit around unused. So too narrow a decoder can be one of the bottlenecks to optimal CPU usage (but note, as a sibling commenter mentioned, that the complexity of the instructions/architecture also matters, e.g. a single CISC instruction may keep things pretty busy by itself).
Whether the instructions come from the kernel or from userspace does not matter at all, they all go through the same decoder and functional units. The kernel/userspace differentiation is a higher level concept.