The Intel 80376 – A legacy-free i386 with a twist (2010)

allenrb · on Aug 15, 2022

I'd forgotten about the 80376 but it hits at a question I've occasionally had over the last few years. Why have we not seen a "modernized" x86 CPU that strips out everything pre-AMD64? The answer seems likely to be one or both of:

1. There are more users of legacy modes than is obvious to us on HN.

2. The gains in terms of gate lost, critical paths reduced, lower power consumption just don't amount to much.

My guess is that #2 is the dominant factor. If there were actually significant gains to be had on a "clean"(er) x86 design, we'd see it in the market regardless of #1.

Macha · on Aug 15, 2022

As the article points out, modern x86 CPUs boot up in 16 bit mode, then get transferred into 32 bit mode, then 64 bit mode. So right out the gate such a CPU is not compatible with existing operating systems, so now you have a non-compatible architecture. Sure Intel could easily add support to GRUB and push Microsoft to do it for new Windows media, but that won't help the existing install base. Intel tried launching a non-compatible CPU once, it was Itanium, it didn't go so well for them.

Plus I'm sure there's crazy DRM rootkits that depend on these implementation details.

Also, AMD has experimented with not-quite-PC-compatible x86 setups already in the consoles. As the fail0verflow talk about Linux on PS4 emphasised, the PS4 is x86, but not a PC. So despite building a x86 CPU with some less legacy, AMD didn't seem to think it worthwhile bringing it to a more general purpose platform

Also AMD/Intel/VIA are the only companies with the licenses to produce x86, and you'd need both Intel and AMD to sign off on licensing x64 to someone new.

messe · on Aug 15, 2022

> As the article points out, modern x86 CPUs boot up in 16 bit mode, then get transferred into 32 bit mode, then 64 bit mode. So right out the gate such a CPU is not compatible with existing operating systems, so now you have a non-compatible architecture

Except that a modern OS is booted in UEFI mode meaning that the steps of going fro 16 -> 32 -> 64-bit mode are all handled by the firmware, not the kernel or the bootloader. The OS kernel will only (at most) switch to 32-bit compatibility mode (a submode of long-mode not protected mode) when it needs to run 32-bit apps, otherwise staying in 64-bit mode 100% of the time.

anyfoo · on Aug 15, 2022

Yeah. Long mode is a bit of a line in the sand, leaving many features behind that were kept for compatibility (segmentation, vm86...). It came at a time where, fortunately, the mainstream OSes had enough abstraction so that software did not have to be written for what effectively was the bare metal anymore, with DOS almost being more of a "software bootloader with filesystem services".

klelatti · on Aug 15, 2022

> Intel tried launching a non-compatible CPU once, it was Itanium, it didn't go so well for them.

More than once. iAPX432 if anything went worse.

anyfoo · on Aug 15, 2022

Yeah, but that, again, was for far worse reasons than just not being "compatible". In fact, iAPX432 was exceptionally bad. Intel's i960 for example fared much better in the embedded space (where PC-compatibility did not matter).

klelatti · on Aug 15, 2022

Indeed and I suppose in fairness I don’t think 432 was ever intended as a PC cpu replacement whilst Itanium was designed to replace some x86 servers.

As an aside I’m still astonished that 432 and Itanium got as far as they did with so much cash spent on them before without conclusive proof that performance would be competitive. Seems like a prerequisite for projects of this size.

foobiekr · on Aug 15, 2022

As an aside, you might enjoy reading about Japan's 5th Generation project if you find the iapx432 or itanium stories interesting.

klelatti · on Aug 16, 2022

Thanks - I'll have a look.

anyfoo · on Aug 15, 2022

> Intel tried launching a non-compatible CPU once, it was Itanium, it didn't go so well for them.

That may be only secondary, though. Itanium simply failed to deliver performance promises and be competitive. The compiler was supposed to effectively perform instruction scheduling itself, and writing such a compiler turned out more difficult than anticipated.

FullyFunctional · on Aug 15, 2022

I've see this a lot, but IMO the truth is slightly different: the assumption behind EPIC was that a compiler _could_ do the scheduling which turned out to be _impossible_. The EPIC effort roots goes way back, but still I don't understand how they failed to foresee the ever growing tower of caches which unavoidably leads to a crazy wide latency range for loads (3-400+ cycles) which in turn is why we now have these very deep OoO machines. (Tachyum's Prodigy appears to be repeating the EPIC mistake with very limited but undisclosed reordering).

OoO EPIC has been suggested (I recall an old comp.arch posting by an Intel architect) but never got green-lit. I assume they had bet so much on compiler assumption that the complexity would have killed it.

It's really a shame because EPIC did get _some_ things right. The compiler absolutely can make the front-end life easier by making dependences more explicit (though I would do it differently) and by making control transfers much easier to deal with (the 128-bit block alone saves 4 bits in all BTB entries, etc). On the balance, IA-64 was a committee-designed train wreck, piling on way too much complexity, and failed both as a brainiac and speed-daemon.

Disclaimer: I have an Itanic space heater than I occasionally boot up for the chuckle - and then shuts down before the hearing damage gets permanent.

klelatti · on Aug 16, 2022

This is really interesting. Any recommendations for further reading on EPIC and related technologies?

FullyFunctional · on Aug 16, 2022

In practice EPIC = IA-64 and Itanium is the only implementation, but IA-64 is probably the easier thing to search for. The only book I have is “IA-64 and Elementary Functions: Speed and Precision”.

EPIC’s problem is shared with VLIW of which EPIC can be understood as a refinement. VLIW excels in a deterministic world where the compiler can predict latencies and produce a good schedule, but falls apart in face of loads with highly variable latencies (an in-order implementation has no option but to stall when load data doesn’t arrive on time).

EPIC patches this a bit by allowing software prefetching and SW exposed speculation, but it comes at a significant code bloat and it can cover only a small fraction of what dynamic scheduling can cover. At Transmeta, our VLIW engine did this too (with some obvious advantages over IA-64) and we suffered from similar problems.

klelatti · on Aug 16, 2022

Thanks!

FullyFunctional · on Aug 16, 2022

It such a fascinating topic and the story is far from over. VLIW (maybe not EPIC though) has tremendous power efficiency so sometimes it’s the right trade-off (say, if you can cover latencies by switching to another hyper-thread).

Micro-architecture is still a hot topic and everything is a trade-off. Next I’ll start talking about super scalar OoO stack machines …

fanf2 · on Aug 16, 2022

Were there any superscalar stack machines other than the inmos T9000 transputer? (It’s a bit of a cheat, though, because the transputer has a very limited stack, and the T9 worked more like a register machine, treating the very short local addressing mode as a register number)

rwmj · on Aug 15, 2022

Quite a lot of modern Arm 64 bit processors have dropped 32 bit (ie. ARMv7) support. Be careful what you wish for though! It's still useful to be able to run 32 bit i386 code at a decent speed occasionally. Even on my Linux systems I still have hundreds of *.i686.rpms installed.

jleahy · on Aug 15, 2022

The 64-bit ARM instructions were designed in a way that made supporting both modes in parallel very expensive from a silicon perspective. In contrast AMD were very clever with AMD64 and designed it such that very little additional silicon area was required to add it.

repiret · on Aug 16, 2022

Citation needed. I'm told that Apple-designed cores have indeed dropped AArch32, and AFAIK, no others have. I don't think its fair to call "only the ones designed by Apple" "quite a lot".

The Cortex-A78 generation of Arm cores was the first set of Arm-designed cores that didn't support AArch32 at EL3, and they retain AArch32 support at EL0, even the N1 and X1. The early APM designed ARMv8 cores supported AArch32 at (at least) EL0, although I haven't kept up with that lineage. I haven't seen an nVidia-designed ARMv8 that dropped AArch32, although I also haven't worked with some of the newer ones.

I suspect that maintaining support at EL0 is very cheap, because it can be done entirely in the instruction decoder.

octoberfranklin · on Aug 16, 2022

Cavium dropped Arm32 support in 2014: "Only the 64-bit AArch64 execution state is support. No 32-bit AArch32 support" https://en.wikichip.org/wiki/cavium/thunderx

Qualcomm dropped Arm32 in Falkor, which shipped in 2017: "AArch64 only" https://en.wikichip.org/wiki/qualcomm/microarchitectures/fal...

Arm64 is really a wild and radical departure from Arm32, they're night-and-day different which is why dumping 32-bit support is attractive. Compare this with x86 where you have a microcode backdoor^H^H^Hengine to smooth over the differences, or POWER64/MIPS64 which built off of very clean 32-bit designs and stayed true to them. And of course RISC-V is the non-trademark-infringing name for MIPS-VI.

garaetjjte · on Aug 16, 2022

There probably won't be any new high-performance cores supporting AArch32. It's already gone from Cortex-X2, Cortex-X3, Cortex-A715.

tenebrisalietum · on Aug 15, 2022

So I learned about the "hidden" x86 mode called XuCode (https://www.intel.com/content/www/us/en/developer/articles/t...) - which is x86 binary code placed into RAM by the microcode and then "called" by the microcode for certain instructions - particularly SGX ones if I'm remembering correctly.

Wild speculative guess: It's entirely possible some of the pre-AMD64 stuff is actually internally used by modern Intel and AMD CPUs to implement complex instructions.

kmeisthax · on Aug 15, 2022

Oh boy, we've gone all the way back to Transmeta Code Morphing Software. What "ring" does this live on now? Ring -4? :P

Jokes aside, I doubt XuCode would use pre-AMD64 stuff; microcode is lower-level than that. The pre-AMD64 stuff is already handled with sequences of microcode operations because it's not really useful for modern applications[0]. It's entirely possible for microcode to implement other instructions too, and that's what XuCode is doing[1].

The real jank is probably hiding in early boot and SMM, because you need to both jump to modern execution modes for client or server machines and downgrade to BIOS compatibility for all those industrial deployments that want to run ancient software and OSes on modern machines.

[0] The last time I heard someone even talk about x86 segmentation, it was as part of enforcing the Native Client inner sandbox.

[1] Hell, there's no particular reason why you can't have a dual-mode CPU with separate decoders for ARM and x86 ISAs. As far as I'm aware, however, such a thing does not exist... though evidently at one point AMD was intending on shipping Ryzen CPUs with ARM decoders in them.

unnah · on Aug 15, 2022

Instruction decoding is a bottleneck for x86 these days: Apple M1 can do 8-wide decode, Intel just managed to reach 6-wide in Alder Lake, and AMD Zen 3 only has a 4-wide decoder. One would think that dropping legacy 16-bit and 32-bit instructions would enable simpler and more efficient instruction decoders in future x86 versions.

amluto · on Aug 15, 2022

Sadly not. x86_64’s encoding is extremely similar to the legacy encodings. AIUI the fundamental problem is that x86 is a variable-length encoding, so a fully parallel decoder needs to decode at guessed offsets, many of which will be wrong. ARM64 instructions are aligned.

Dumping legacy features would be great for all kinds of reasons, but not this particular reason.

orionblastar · on Aug 16, 2022

Legacy features are still needed for 16 bit DOS and Windows programs. You can cut out 16 bit legacy mode and use DOSBOX: https://www.dosbox.com/ for your 16 bit programs via emulation.

There are those users who still want to use DOS programs: https://www.freedos.org/

tinus_hn · on Aug 16, 2022

Considering this ‘needed’ legacy application is running code made for a 8MHz processor, you can trivially get away with emulating that processor on the multi GHz CPU in a recent computer.

By the way, 64 bit Windows doesn’t run 16 bits applications and never has. The only relevant 16 bit code is InstallShield and that is emulated using trickery. For the rest you need to emulate a whole computer.

https://docs.microsoft.com/en-us/troubleshoot/windows-client...

amluto · on Aug 16, 2022

But Linux does run 16-bit code natively, even on x86_64, via modify_ldt(). The kernel support is straightforward, and the user side is actively supported by DOSEMU2.

Sadly, there was a serious design error in the original x86_64 support for i387 floating point, and it was never possible for a 64-bit kernel to correctly context switch some of the more obscure legacy floating point exception state. Rather than fixing it, both AMD and Intel nerfed the hardware support, and new CPUs no longer fully support this state. This makes some older 16-bit software work poorly.

tinus_hn · on Aug 16, 2022

Then again there is no 16 bit Linux software as there has never been 16 bit Linux, and 16-bit software for other operating systems is trivially emulated without requiring hardware support. It might all be somewhat of an interesting exercise but it doesn’t have real world value.

hyperman1 · on Aug 15, 2022

This suggests another way forward: Re-encode the existing opcodes with new, more regular byte sequences. E.g. 32 bits / instruction, with some escape for e.g. 64bit constants. You'll have to redo the backend of the assembler, but most of the compiler and optimization wisdom can be reused as-is. Of course, this breaks backward compatibility completely so the high performance mode can only be unlocked for recompiles.

colejohnson66 · on Aug 15, 2022

That was Itanium, and it failed for a variety of reasons; one of which was a compatibility layer that sucked. You can't get rid of x86's backwards compatibility. Intel and AMD have done their best by using vector prefixes (like VEX and EVEX)[a] that massively simplify decoding, but there's only so much that can be done.

People get caught up in the variable length issue that x86 has, and then claim that M1 beats x86 because of that. Sure, decoding ARM instructions is easier than x86, but the variable length aspect is handled in the predecode/cache stage, not the actual decoder. The decoder, when it reaches an instruction, already knows various bits of info are.

The RISC vs CISC debate is useless today. M1's big advantage comes from the memory ordering model (and other things)[0], not the instruction format. Apple actually had to create a special mode for the M1 (for Rosetta 2) that enforces the x86 ordering model (TSO with load forwarding), and native performance is slightly worse when doing so.

[0]: https://twitter.com/ErrataRob/status/1331735383193903104

[a]: There's also others that predate AVX (VEX) such as the 0F38 prefix group consisting only of opcodes that have a ModR/M byte and no immediate, and the 0F3A prefix being the same, but with an 8 bit immediate.

FullyFunctional · on Aug 15, 2022

Two factual mistakes:

* IA-64 failed primarily because it failed to deliver the promised performance. x86 comparability isn't and wasn't essential to success (behold the success on Arm for example).

* M1's advantage has almost nothing to do with the weak memory model, but it has to do with everything: wider, deeper, faster (memory). The ISA being Arm64 also help in many ways. The variable length x86 instructions can be dealt with via predecoding, sure to an extent, but that lengthens the pipeline which hurts the branch mispredict penalty, which absolutely matters.

Dylan16807 · on Aug 15, 2022

> That was Itanium

What? No. Itanium was a vastly, wildly different architecture.

hyperman1 · on Aug 16, 2022

Think more of a re-encoding than a full re-architecture. Every existing 16/32/64 bit instruction should receive some place in that re-encoded opcode map, including abominations as aaa. But e.g the whole prefix thing should go away. The processor's predecode stage should cease existing.

Let's roughly sketch a proposal, even if many variants are possible

Instructions start on 64 bit boundarys. If the highest bit is 0 you have an insn else a constant. So the whole thing becomes self-synchronizing. You're now of course missing 1 bit, that should get a place in the encoding of a constant.

This means you have 62 bits left. Presuming the prefixes like segment regs get solved, you could encode 2 instructions in there, 1 of them a nop if you need the constant. Do away with short encodings for ax. Try and avoid double encodings uf reg + modrm combos exist. Above all, resist the temptation to mess with the instruction set, just re encode.

ShroudedNight · on Aug 15, 2022

I thought the critical failure of Itanium was that a priori VLIW scheduling turned out to be a non-starter, at least as far as doing so efficiently

atq2119 · on Aug 15, 2022

The entire approach is misguided for single-threaded performance. It turns out that out-of-order execution is pretty important for a number is things, perhaps most importantly dealing with variable memory instruction latencies (cache hits at various points in the hierarchy vs. misses). A compiler simply cannot statically predict those well enough.

kmeisthax · on Aug 15, 2022

M1 doesn't have a special mode for Rosetta. All code is executed with x86 TSO on M1's application processors. How do I know this?

Well, did you know Apple ported Rosetta 2 to Linux? You can get it by running a Linux VM on macOS. It does not require any kernel changes to support in VMs, and if you extract the binary to run it on Asahi Linux, it works just fine too. None of the Asahi team did anything to support x86 TSO. Rosetta also works just fine in m1n1's hypervisor mode, which exists specifically to log all hardware access to detect these sorts of things. If there is a hardware toggle for TSO, it's either part of the chicken bits (and thus enabled all the time anyway) or turned on by iBoot (and thus enabled before any user code runs).

Related point: Hector Martin just upstreamed a patch to Linux that fixes a memory ordering bug in workqueues that's been around since before Linux had Git history. He also found a bug in some ARM litmus tests that he was using to validate whether or not they were implemented correctly. Both of those happened purely because M1 and M2 are so hilariously wide and speculative that they trigger memory reorders no other CPU would.

messe · on Aug 15, 2022

I'm sorry, but please cite some sources, because this contradicts everything that's been said about M1's x86 emulation that I've read so far.

> Well, did you know Apple ported Rosetta 2 to Linux? You can get it by running a Linux VM on macOS. It does not require any kernel changes to support in VMs, and if you extract the binary to run it on Asahi Linux, it works just fine too. None of the Asahi team did anything to support x86 TSO. Rosetta also works just fine in m1n1's hypervisor mode, which exists specifically to log all hardware access to detect these sorts of things. If there is a hardware toggle for TSO, it's either part of the chicken bits (and thus enabled all the time anyway) or turned on by iBoot (and thus enabled before any user code runs).

Apple tells you to attach a special volume/FS to your linux VM in order for Rosetta to work. When such a volume is attached, it runs the VM in TSO mode. As simple as that.

The rosetta binary itself doesn't know whether or not TSO is enabled, so its not surprising that it runs fine under Asahi. As marcan42 himself said on twitter[1], most x86 applications will run fine even without TSO enabled. You're liable to run into edge cases in heavily multithreaded code though.

[1]: https://twitter.com/marcan42/status/1534054757421432833

> Both of those happened purely because M1 and M2 are so hilariously wide and speculative that they trigger memory reorders no other CPU would.

In other words, they're not constantly running in TSO mode? Because if they were, why would they trigger such re-orders?

EDIT: I've just run a modified version of the following test program[2] (removing the references to the tso_enable sysctl which requires an extension), both native and under Rosetta.

Running natively, it fails after ~3500 iterations. Under Rosetta, it completes the entire test successfully.

[2] https://github.com/losfair/tsotest/

kmeisthax · on Aug 16, 2022

Hmm, wasn't aware that someone found a TSO sysctl that actually does something.

The rest of my opinion is based off of these three Twitter threads:

https://twitter.com/marcan42/status/1534053625110351872

https://twitter.com/never_released/status/153412764108259328...

https://twitter.com/marcan42/status/1534156788224098304

messe · on Aug 16, 2022

To clarify, the sysctl is related to a 3rd-party kernel extension. That kernel extension is a convenient wrapper around the actual interface which is a per-thread flag that can be turned on and off.

And not one of those threads indicates that there is no Rosetta-related instructions. The first would indicate that marcan believes that M1 chips do have TSO related instructions. The second discusses the fact that Rosetta runs on non-Apple CPUs without TSO enabled, meaning that it running on Asahi linux without Asahi linux having TSO support does not show that M1 chips always run in TSO mode; but rather that Rosetta itself has no way of detecting it, so will attempt to run regardless. The third repeats what I just said.

The twitter threads provide more evidence against your points than for them.

anyfoo · on Aug 16, 2022

It's not a sysctl, it gets switched on or off during context switch. So, per thread.

In any case, I'm pretty sure that marcan is aware of that, and I haven't read anything from him that would point to the contrary.

anyfoo · on Aug 15, 2022

> M1 doesn't have a special mode for Rosetta. All code is executed with x86 TSO on M1's application processors.

That's not true. (And doesn't your last paragraph contradict it already?)

You might just have figured out that most stuff will run fine (or appear to run fine for a long time) when TSO isn't enabled.

messe · on Aug 15, 2022

I have no idea why you are being downvoted. You are entirely correct.

anyfoo · on Aug 15, 2022

Thanks, I was puzzled as well. The downvotes seem to have stopped, though.

hakfoo · on Aug 16, 2022

I always wondered if there was a case to be made for a JIT that recompiled "normal" x86-64 into a restricted subset designed to be more optimizable.

This could be paired with a sort of two-tiered processor design-- really fast on the restricted subset, but mediocre-to-bad performance on anything else. The code only has to activate the "mediocre to bad" paths long enough to build a JIT cache though,

therealcamino · on Aug 16, 2022

That's sort of similar to Transmeta's approach, except the underlying machine was pretty different from x86, not a fast subset.

umanwizard · on Aug 15, 2022

If you're inventing a completely incompatible ISA, why not just use ARM64 at that point?

anamax · on Aug 15, 2022

Perhaps because you don't want to commit to ARM compatiblity AND licensing fees.

yjftsjthsd-h · on Aug 15, 2022

Alright, then RISC-V?

monocasa · on Aug 15, 2022

It's more complex than that if you'll excuse the pun. Instructions on CISC cores aren't 1 to 1 with RISC instructions, and tend to encode quite a bit more micro ops. Something like inc dword [rbp+16] is one instruction, but would be a minimum of three micro ops (and would be three RISC instructions as well).

Long story short, this isn't really the bottle neck, or we'd see more simple decoders on the tail end of the decode window.

mhh__ · on Aug 15, 2022

Decode bound performance issues are actually pretty rare. X86 is quite dense.

reactordev · on Aug 15, 2022

To someone who is interested in bare metal, can you explain the significance of this? Is this how much data a CPU can handle simultaneously? Via instructions from the kernel?

anyfoo · on Aug 15, 2022

It means how many instructions the CPU can decode at the same time, roughly to "figure out what they mean and dispatch what they actually have to do to the functional units of the CPU which will perform the work of the instruction". It is not directly how much data a superscalar CPU can handle in parallel, but still plays a role, in the sense that there is a number of functional units available in the CPU, and if you cannot keep those busy with decoded instructions, then they lay around unused. So a too narrow decoder can be one of the bottlenecks in optimal CPU usage (but note how as a sibling commenter mentioned, the complexity of the instructions/architecture is also important, e.g. a single CISC instruction may keep things pretty busy by itself).

Whether the instructions come from the kernel or from userspace does not matter at all, they all go through the same decoder and functional units. The kernel/userspace differentiation is a higher level concept.

cat_plus_plus · on Aug 16, 2022

If you already broke compatibility, might as well go all the way and optimize for your problem domain, such as graphics/AI/low power. And in fact GPUs/TPUs/ARM are doing very well, and there is a strong interest in quantum computing. So this leaves conventional high power, non-vector, core speed over concurrency computing, where you additionally expect existing operating systems and software ported and optimized for your architecture. Plus IBM Power PC mainframes and Apple ARM laptops don't fit the bill. There might just not be enough demand for such a product right now and if there is in future, someone will offer a solution?

rodgerd · on Aug 15, 2022

Think about the disruption that Apple caused when they moved from 32-bit x86 being supported to deprecating it - there was a great deal of angst, and that's on a vertically-integrated platform that is relatively young (yes, I know that NeXT is old, but MacOS isn't, really). Now imagine that on Windows - a much older platform, with much higher expectations for backwards compat. It would be a bloodbath of user rage.

More importantly, though, backward compat has been Intel's moat for a long, long time. Intel have been trying to get people to buy non-16-bit-compat processors for literally 40 years! They've tried introducing a lot of mildly (StrongARM - technically a buy I suppose, i860/i960) and radically (i432, Itanium) innovative processors, and they've all been indifferent or outright failures in the marketplace.

The market has been really clear on this: it doesn't care for Intel graphics cards or SSDs or memory, it hates non-x86 Intel processors. Intel stays in business by shipping 16-bit-compatible processors.

danbolt · on Aug 15, 2022

I feel as though a lot of the consumer value of x86+Windows comes from its wide library of software and compatibility.

> than is obvious to us on HN

I think your average HNer is more likely to interact with Linux/Mac workstations or servers, where binary compatibility isn't as necessary.

Dalewyn · on Aug 16, 2022

This. There is a reason Win32 foiled every single attempt from Microsoft to kill it, and a reason Intel has consistently failed to successfully market anything besides x86 processors.

The reason is the PC world at large is a mountain of x86 and Win32 code piled sky high by decades of developers and users. Replacing all that takes money and time that nobody wants to spend, and chances are the replacement won't be a 1:1 drop-in replacement either which will necessitate further money and time nobody wants to spend.

The average computer enthusiast might scoff at being able to run half-a-century old x86/Win32 code and binaries on modern hardware. Professionals consider that backwards compatibility their literal lifeblood and will pay handsomely to ensure nobody violates it (and refuse to pay anyone who does).

I expect the ARM world will also be singing a similar tune in several decades' time, when backwards compatibility with ancient ARM binaries become steadily more important than pursuing the next new shiny at any cost.

anyfoo · on Aug 16, 2022

Things change. Windows cut out 16-bit code with 64-bit Windows, so that particular half of the last "half century" is gone. Macs transitionedto different architectures several times, iPhones to some extend too (from armv7 to armv8, dropping arm32 support entirely relatively early after).

The various Rosettas, most recently the x86-to-ARM one, showed that software emulation is always an option, and so does WoW64, which is a part of ARM Windows (and a part of x86-64 Windows, but there it's not really a software emulator).

> I expect the ARM world will also be singing a similar tune in several decades' time, when backwards compatibility with ancient ARM binaries become steadily more important than pursuing the next new shiny at any cost.

I am not so sure at all about that. The reason is that half to a quarter decade ago, even normal application code was very close to the metal, with MS-DOS basically being a glorified bootloader with file system services that loaded little mini-OSes calling themselves DOS applications.

We are far, far away from that world. Mature abstractions make ISA transitions more and more painless for end-user application developers. The exponential advances in speed make software emulation well feasible for the remainder. High-performance stuff will want to take advantage of the new architectures anyway.

pjmlp · on Aug 16, 2022

Now make that emulation work at feasible speed on a 300 euro laptop, bought in random shopping mall.

anyfoo · on Aug 16, 2022

I don’t have hard data, but doesn’t seem unrealistic to me. Computers are really unbelievably much faster then they used to be.

anyfoo · on Aug 16, 2022

> The reason is that half to a quarter decade ago

Half to a quarter century, I mistyped here. Half to a quarter decade ago we were long since there already.

rbanffy · on Aug 15, 2022

> Why have we not seen a "modernized" x86 CPU that strips out everything pre-AMD64?

In many segments, binary compatibility with legacy software is still vital. This is why you can run Windows 3 software on Windows 11 (and why you can run x86 software on M1 Macs).

For me, I could spend most of my time on Linux and there binary compatibility is a much smaller issue - I'm sure I'd be very happy with a POWER workstation (birthday is in March, Raptor folks) and perfectly functional there. Same on ARM (biggest constraint for me is memory - 16 GB is enough, 32 is good).

Unless I look too closely, the CPU in the box is the least important part.

nl · on Aug 16, 2022

> and why you can run x86 software on M1 Macs

Seems like a pretty good case for the counterargument: that backwards compatibility can be entirely provided for in software.

(To be clear, the M1 chip doesn't support the x86 instruction set at all, and translation is done entirely in software. I don't think there was ever any 16 bit x86 Mac software though, so not sure if that mode is supported).

toast0 · on Aug 16, 2022

The M1 chip does support the x86 memory ordering, which allows the software instruction set emulation to not have to do memory ordering work in software.

I don't think Macs ever had 16-bit code? 68000 was 32-bit, as was ppc and intel macs were 32-bit to start, with 64-bit later.

JonathonW · on Aug 16, 2022

The original 68k (used in the original 1984 Mac) was funky: it had a fully 32-bit instruction set and 32-bit registers, but 16-bit internal and external data buses and a 16-bit ALU. And a 24-bit address bus, because reasons?

Later variants of the 68k family were fully 32-bit internally, but exactly what was exposed externally varied by processor and Mac model. And 24-bit addressing remained a thing in the Macintosh system software up until System 7-- this capped RAM at 8 MB even if the CPU could handle more.

But Apple never really had the same sort of 16-bit/32-bit transition that Microsoft and Intel did. The 68k-to-PPC shift is kind of analogous (and happened at around the same time as Microsoft was going fully 32-bit), but backwards compatibility there was purely based around software emulation-- same basic approach as Apple's PPC-to-x86 and x64-to-ARM transitions, as opposed to Intel's "just make the new processor a superset of the old one" approach.

rbanffy · on Aug 16, 2022

> because reasons?

Chip size and pin count. Besides, 24 address lines allowed 16 megabytes of RAM, something outrageous back then.

nl · on Aug 16, 2022

> I don't think Macs ever had 16-bit code?

Yeah pretty sure you are right.

> 68000 was 32-bit,

Interestingly I discovered today the 68008 which was a 68000 with an 8-bit external data bus.

rbanffy · on Aug 16, 2022

> backwards compatibility can be entirely provided for in software.

My point was that it's so vital on some platforms that their vendors go to great lengths to make it available, despite the fact that the architecture is completely different.

I don't see the same requirement in open-source stacks.

JonathonW · on Aug 16, 2022

You can't run Windows 3 software on Windows 11. You could on 32-bit versions of Windows 10 (after adding some optional Windows components), but Windows 11 is 64-bit only and 64-bit Windows can't run 16-bit (or DOS) code at all.

So, hypothetically, at least in the Windows world, we're approaching a point where some of the legacy cruft could go. Not quite all of it yet (still need 32-bit application support), but a lot of the 8086-era real-mode stuff will never be used on modern hardware running modern software (except perhaps for the first few instructions after boot, until UEFI pushes the system into 32- or 64-bit protected mode).

yarg · on Aug 16, 2022

Why not both?

Apple's proven it feasible to provide high performance emulation of a prior ISA.

https://en.wikipedia.org/wiki/Rosetta_(software)

klelatti · on Aug 15, 2022

x86 did appear in one context where legacy compatibility is likely to have been a much smaller (or non issue?) and where (eg power consumption) efficiencies would have been even more valuable - that’s on mobile running Android.

The fact that a cleaned version wasn’t used would seem to support your hypothesis.

johnklos · on Aug 15, 2022

> The 80376 doesn’t do paging.

Wait - what? How is that even possible? Do they simply not have an MMU? That makes it unsuitable for both old OSes and for new OSes. No wonder it was so uncommon.

anyfoo · on Aug 15, 2022

It claims itself as an "embedded" processor, so it likely just wasn't meant to run PC OSes. According to Wikipedia at least, Intel did not even expect the 286 to be used for PCs, but for things such as PBXs. And ARM Cortex M still doesn't have an MMU either, for some applications you can just do without. Especially because both the 286 and this 376 beast did have segmentation, which could subsume some of the need for an MMU (separated address spaces, if you don't need paging and are content with e.g. buddy allocation for dividing available memory among tasks).

marssaxman · on Aug 15, 2022

It is very common for embedded systems processors not to have an MMU. If it ran any OS at all it would likely have been some kind of RTOS.

blueflow · on Aug 15, 2022

Using paging/virtual memory is not a requirement for an OS, even when all currently popular OSes make use of it.

Intel CPUs before the 80286 did not have an MMU, either.

anyfoo · on Aug 15, 2022

Did they even call it an MMU already? The 286 only had segmentation, which arguably is just an addressing mode. It introduced descriptors that had to be resolved, but that happened when selecting the descriptor (i.e. when loading the segment register), where a hidden base, limit, and permission "cache" was updated. Unlike paging, where things are resolved when accessing memory.

actionfromafar · on Aug 15, 2022

I don't know what to call it, but Minix on 286 had protection between processes. (Each process had its own 64k instruction + 64k data, IIRC.)

So effectively the same result as an MMU. But Minix is the only OS I know of which used this mode.

anyfoo · on Aug 15, 2022

Yes, that's segmentation. And I was just wondering if you could call that an MMU. It's not the same result as an MMU across the board, since you have no paging: You cannot freely rearrange pages in virtual memory with their frames in physical memory, you can only "segment" linear portions of physical memory. You also cannot fault in pages on-demand at their time of being accessed, only when the whole segment is loaded (i.e. the selector written into the segment register).

If an MMU is something that sits between memory and the rest of the CPU, and abstracts physical addressing away, then I argue the 286 did not have an MMU. Segmented access is more of a mandatory addressing mode, where you always access relatively to a segment's base, and have to keep within the limit.

OS/2 is another OS that made use of pre-paging 286 protected mode.

nineteen999 · on Aug 16, 2022

Wikipedia refers to it as having on-chip MMU "capabilities", albeit only supporting segmentation and not paging:

"The 286 was the first of the x86 CPU family to support protected virtual-address mode, commonly called "protected mode". In addition, it was the first commercially available microprocessor with on-chip MMU capabilities (systems using the contemporaneous Motorola 68010 and NS320xx could be equipped with an optional MMU controller). This would allow IBM compatibles to have advanced multitasking OSes for the first time and compete in the Unix-dominated[citation needed] server/workstation market."

anyfoo · on Aug 16, 2022

Thanks, but I don't think I would call what the 286 had "virtual-address mode". Intel themselves referred to "virtual addresses" only in the context of paging in their 386 manual. Addresses relative to segmentation were called "logical addresses", and they translate to "linear addresses".

That Wikipedia passage seems to be inaccurate, or a least not use the common terminology.

rep_lodsb · on Aug 16, 2022

Intel did you use that term to describe segmented addressing, at least in the 286 manual [1]:

""In Protected Virtual Address Mode, the 80286 provides an advanced architecture that retains substantial compatibility with the 8086 and other processors in the 8086 family.""

""In Protected Mode, application programs deal exclusively with virtual addresses; programs have no access whatsoever to the actual physical addresses generated by the processor. As discussed in Chapter 2, an address is specified by a program in terms of two components: (1) a 16-bit effective address offset that determines the displacement, in bytes, of a location within a segment; and (2) a 16-bit segment selector that uniquely references a particular segment.""

[1] https://ragestorm.net/downloads/286intel.txt

anyfoo · on Aug 16, 2022

You're right, I searched wrong in my 386 manual. So apparently they did call it virtual addressing.

I'm still not sure if I like calling that an "MMU". To me, an MMU acts on every memory access. But at the end of the day, who cares what I think what words mean. :)

The critical distinction to me is that segmentation is not much more than, effectively, an indexed addressing mode, where the base is the segment base that has been loaded into the descriptor cache when the segment register was loaded. Nothing happens at the time of actually accessing memory, except for the limit check.

Teknoman117 · on Aug 16, 2022

> Intel themselves referred to "virtual addresses" only in the context of paging in their 386 manual.

This is not correct.

Intel referred to the 286's memory management as virtual memory:

Quote from the 286's manual - 4th paragraph of section 1.1

> Segmentation also lends itself to efficient implementation of sophisticated memory management, virtual memory, and memory protection.

The 286 protected-mode segmentation is dramatically different from the segmentation on the 8086.

Virtual memory is merely the ability to use logical addresses that don't necessarily have a direct relationship with a physical address. Your OS can relocate data in physical memory without an application having to know.

On the 286, this is facilitated via the Global and Local Descriptor tables. When you loaded a segment register, it would first check whether your current permission level could access that segment. Exception if you couldn't. Then it would check if the segment was marked as present. Exception if it wasn't present. Then it would load the base address (a full 24 bit physical address) and the size into internal, hidden registers. The hardware would bounds check accesses to segments.

The GDT contained segments that were common to the whole OS, and the LDT contained "process" specific segments.

Intel advertised the 286 as having 1 GiB of virtual address space per process - up to 8K entries in GDT, up to 8K entries in LDT, up to a 64K segment size 16K * 64K = 1G.

Some examples of virtual memory behaviors:

A process tries to allocate more memory but the system is out of physical memory. The system decides to swap another process's memory to disk to make room. The data from the other process is written to disk, the descriptor table entry is marked as not present, and the physical memory is free'd back to the OS. The new process's allocation is placed there. When the old process switches back in, accesses throw a segment-not-present exception and the OS must retrieve the segment from disk. It finds physical memory to back the segment, copies the data from disk, updates the base address in the descriptor table, and reloads the segment register. As far as the process is concerned, nothing has changed. The pointer values are still all the same yet the physical address is different.

On a modern processor, this would be done on a page by page basis, on the 286, you had to do it a whole segment at a time. If your segment's size is 256 bytes, awesome, if it was 64 KiB, that sucks. You also can't punch holes in the middle of segments.

Another example:

Your process has a heap from which it performs dynamic allocations. This heap is pointed to by a descriptor in your LDT. You want to make another allocation but you've exhausted your heap. You call into the OS to grow the size of the heap. Unfortunately, a different object was placed in physical memory right after your heap, so, you find a new spot in memory large enough for the new heap. You copy the contents of the heap to the new address and update the descriptor. Again, the program is none the wiser.

This is a major weak point for the 286. On a modern processor, you'd just map more pages in because with a paged MMU, contiguous virtual addresses don't have to be physically contiguous.

Obviously, this is very different to today's demand paged MMUs (well, the last 37 years anyhow) and it's not nearly as flexible, but it's still virtual memory.

That's not even getting into the neat hardware multitasking thing that automated switching all these registers around (which existed on the 386 but was never used because it assumed you'd still want to use segmentation and spent time and memory backing up registers and doing permission checks most software didn't want).

IMHO - the only reason the 286 is merely a footnote in history is because Intel didn't provide a way to cleanly switch from protected mode back into real mode. Intel erroneously assumed applications and operating systems would just be rewritten for the 286's native mode. It was a huge step forward from the 8086 but the largest userbase of Intel CPUs (DOS users) never saw the improvements beyond a simple speed boost.

rep_lodsb · on Aug 16, 2022

>hardware multitasking

The main reason this wasn't used more is that is isn't a good fit for what an OS needs to do.

Jumping to a TSS or task gate saves the current context and restores that for the new task, in one atomic operation. But most kernels save the registers on syscall/interrupt entry (while remaining in the context of the current task), and restore a possibly different context on exit when the current one gets suspended.

There is no way to do "half of a task switch" with the TSS mechanism, so it would always do useless work:

    kernel code does:
        push user registers
        handle syscall / interrupt
        select new task and jump to it
    CPU does:
        save context A (mostly garbage values at this point)
        restore context B (containing previously saved garbage)
    kernel code does:
        pop user registers
        return to ring 3

Teknoman117 · on Aug 16, 2022

Ah, that's a good point.

anyfoo · on Aug 16, 2022

I am very well aware of how segmentation works on the 286 and up. The critical difference I see to a paged MMU is that everything is determined "upfront": When you load the segment register with a selector, that causes the CPU to fetch the descriptor, and to load base and limit in the descriptor cache (the hidden portion of the segment registers, can be accessed with LOADALL). Every subsequent memory access to that segment is basically just an indexed addressing mode with that base as the index register.

In a paged MMU, it's more apt to say that it sits "between" CPU and memory. The frame in physical memory to be accessed is different for every page, and faults can happen while resolving that page. It happens during access, not upfront.

I concede that Intel already called this "virtual memory", I searched in my 386 manual wrong.

toast0 · on Aug 16, 2022

I'm writing a hobby OS and while I am using the MMU, all of my virtual addresses are actually physical addresses too, but the MMU helps my system crash when I've broken something leading to accessing unmapped memory. If there were no MMU, I'd just have to do things right, or debug strange behavior rather than page faults.

kazinator · on Aug 16, 2022

0x80386 - 16 = 0x80376

:)

zymhan · on Aug 16, 2022

Since the page is loading slowly, an archive link:

http://web.archive.org/web/20220816012342/https://www.pageta...