
If you drop backwards compatibility, you can get a lot of headroom for better performance.

Apple is in a unique position of being able to force a new architecture on its customers without losing them. They have done it twice. They aren't even exactly compatible with normal ARM, due to a special agreement with ARM Holdings.

Intel had the i960, quite cool and successful, and could not capitalize on it in the long term for low-power devices. Intel also bought rights to ARM (the StrongARM line, later XScale), and could not capitalize on that in the long term either, even though ARM was well suited for battery-powered devices, and eventually sold it off!

Intel used to be the king of data centres, using an architecture from the 1980s, extended and pimped up to the brim, yet still beholden to backwards compatibility. And a king it still is. But this pillar seems to shake more and more.

I don't think this is a result of poor engineering. It was, to my mind, a set of business bets, which worked well, until they didn't any more.



Apple's CPUs are compatible at the userspace level with normal ARM. At the kernel level, I only know of one architectural violation so far (other than adding custom optional stuff): the M1 has the HCR_EL2.E2H bit forced to 1, which forces hypervisors to use VHE mode (non-VHE operation is not supported, which is a violation of the arch spec). This only matters for hypervisors, as regular OS operation doesn't care about this.
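
For the curious, here's roughly what probing that looks like: a minimal sketch, assuming bare-metal/hypervisor code already running at EL2 with a GCC-style toolchain (this is not something userspace or even EL1 can do). E2H is bit [34] of HCR_EL2 in the ARMv8-A spec; the function names here are made up for illustration.

    /* Minimal sketch: probing whether HCR_EL2.E2H is writable. Assumes
     * bare-metal code already running at EL2 with a GCC-style toolchain. */
    #include <stdint.h>

    #define HCR_E2H (1ULL << 34)   /* E2H is bit [34] of HCR_EL2 */

    static inline uint64_t read_hcr_el2(void)
    {
        uint64_t v;
        __asm__ volatile("mrs %0, hcr_el2" : "=r"(v));
        return v;
    }

    static inline void write_hcr_el2(uint64_t v)
    {
        __asm__ volatile("msr hcr_el2, %0" : : "r"(v));
        __asm__ volatile("isb" ::: "memory");
    }

    /* Returns 1 if the core forces E2H to stay set (what the M1 does),
     * 0 if the bit is writable as the architecture normally requires.
     * Note: on a conforming core, clearing E2H really does switch EL2 out
     * of VHE mode, so a real hypervisor would only run a probe like this
     * very early in boot, before it relies on VHE behaviour. */
    int e2h_is_forced(void)
    {
        uint64_t saved = read_hcr_el2();

        write_hcr_el2(saved & ~HCR_E2H);        /* try to request non-VHE mode */
        int forced = (read_hcr_el2() & HCR_E2H) != 0;

        write_hcr_el2(saved);                   /* restore the original value */
        return forced;
    }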

It is true that Apple implemented a bunch of custom optional features (some of which are, arguably, in violation of architectural expectations), and they definitely have some kind of deal with ARM to be able to do this, but from a developer perspective they are all optional and can be ignored. I don't think Apple exposes any of them directly to iOS/macOS developers. They only use them internally in their own software and libraries (some are for Rosetta, some are used in Accelerate.framework, some are used to implement MAP_JIT and pthread_jit_write_protect_np, some are only used by the kernel).
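
For reference, the MAP_JIT / pthread_jit_write_protect_np path is roughly the following from an app's point of view. This is a sketch using the documented macOS APIs (mmap with MAP_JIT, pthread_jit_write_protect_np, sys_icache_invalidate); error handling is minimal, the two AArch64 encodings are just an illustrative payload, and hardened-runtime builds additionally need the com.apple.security.cs.allow-jit entitlement.

    /* Sketch of the public JIT-memory flow on Apple Silicon macOS. */
    #include <sys/mman.h>
    #include <pthread.h>
    #include <libkern/OSCacheControl.h>
    #include <stdint.h>
    #include <string.h>

    typedef uint64_t (*jit_fn)(void);

    int main(void)
    {
        size_t len = 0x4000;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        uint32_t code[] = { 0xd2800540,   /* mov x0, #42 */
                            0xd65f03c0 }; /* ret         */

        pthread_jit_write_protect_np(0);          /* MAP_JIT pages writable for this thread */
        memcpy(buf, code, sizeof(code));
        pthread_jit_write_protect_np(1);          /* flip back to executable, not writable  */
        sys_icache_invalidate(buf, sizeof(code)); /* keep the I-cache coherent              */

        jit_fn fn = (jit_fn)buf;
        return fn() == 42 ? 0 : 1;                /* exit status 0 if the JITted code ran   */
    }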


Their AMX instructions, and those decompress instructions you found in user space, are exactly the kinds of things even regular architectural license holders aren't allowed to do, even hidden behind a library.

They very clearly have a special, Apple-only relationship with ARM.


AMX is available in userspace, but only used by their Accelerate.framework library. Apple does not document or expect them to be used by any apps, and I expect they'd reject any App Store submissions that use them, as I doubt they guarantee their continued existence in their current form in future CPUs.

FWIW, the compression instructions are used by the kernel, and I don't even know if they work from userspace. I've only ever tried them in EL2.


Sure. But the M1 doesn't drop backwards compatibility with existing (userspace) code, which nine_k arguably suggested it did.


They co-founded Arm. It’s not too surprising that they know the right people there...


I'm not sure if this is the right spec, but it repeatedly references a section "Behavior of HCR_EL2.E2H" that I can't find anywhere in the 4700-page PDF "Arm® Architecture Registers, Armv8, for Armv8-A architecture profile"

https://developer.arm.com/documentation/ddi0595/2021-03/AArc...



This is a link to a PDF document from https://cpu.fyi/, pointing to an entry labeled "ARMv8-A Architecture Reference Manual".


Gah, I copied that link wrong. I meant to reference https://cpu.fyi/d/98dfae#G23.11082057, which is that specific section about HCR_EL2.E2H's behavior.


Ah. Thx!


Thanks, I appreciate the guidance!


> If you drop backwards compatibility, you can get a lot of headroom for better performance.

People say this without thinking. There is no real evidence at all it is true.

Something like x86 support on an IA64 chip costs extra transistors. But there is no real fundamental reason why it should make anything slower.

This is even more so for AVX512 instructions, which aren't backwards compatible in any way.

So - exactly - how would dropping backwards compatibility speed up AVX512 division?


One of the advantages of the M1 is that instructions are fixed size. With x86 you need to deal with instructions that can be anything between 1 and 15 bytes.


Sure.

But that's one switch, implemented in hardware in the decode pipeline.

It makes implementation more complicated, but no reason it has to be slower.


As I understood it, it means you can't look at the next insn before decoding the current one enough to know where the next one starts. Meanwhile, arm can decode insns in parallel.

Now I wonder why x64 can't re-encode the instructions: put a flag somewhere that enables the new encoding, but keep the semantics. This would keep the cost of the switch low. There will be some trouble, e.g. you can't encode full 64-bit values directly. But mostly it seems manageable.


> As I understood it, it means you can't look at the next insn before decoding the current one enough to know where the next one starts. Meanwhile, arm can decode insns in parallel.

This is incorrect.

Intel Skylake has 5 parallel decoders (I think M1 has 8): https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

AMD Zen has 4: https://en.wikichip.org/wiki/amd/microarchitectures/zen#Deco...


You're right, but not for this reason. The important part is the pre-decode stage, which does exactly this merging of bytes into macro-ops. Each cycle, Skylake can convert at most 16 bytes into at most 6 macro-ops. These macro-ops are then passed to the instruction decoders.

Which is impressive, if you think about it. But it is also complicated machinery for a part that's basically free when instructions are fixed width and you can wire the fetch buffer straight to the instruction decoders. Expanding the pre-decoder to 32 bytes would take a lot of hardware, while fixed width just means a few more wires.


It's actually not that complicated. Here is a state machine coded in C that does it for the base set.

https://stackoverflow.com/questions/23788236/get-size-of-ass...

The same technique could be extended to cover all of them, and it's not so difficult to implement this in Verilog.

As long as this state machine runs at the same throughput as the icache bandwidth then it is not the bottleneck. It shouldn't be too difficult to achieve that.

But it is definitely extra complexity, and requires space and power.
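
To make the shape of that state machine concrete, here is a deliberately tiny sketch along the same lines (not the linked answer's code): it only handles 32-bit mode, a handful of opcodes, and no REX/VEX/EVEX or operand-size/address-size prefixes, but the prefix / opcode / ModRM / SIB / displacement / immediate walk is the general pattern. Function names like insn_length are made up for illustration.

    /* Toy x86 length pre-decoder. Deliberately restricted to 32-bit mode,
     * a few legacy prefixes, and five opcodes; real hardware has to cover
     * the whole opcode map, but the structure is the same. */
    #include <stddef.h>
    #include <stdint.h>

    static int is_prefix(uint8_t b)
    {
        switch (b) {
        case 0xF0: case 0xF2: case 0xF3:               /* lock / repne / rep */
        case 0x2E: case 0x36: case 0x3E: case 0x26:
        case 0x64: case 0x65:                          /* segment overrides  */
            return 1;
        default:
            return 0;
        }
    }

    /* Bytes contributed by ModRM + optional SIB + displacement. */
    static size_t modrm_len(const uint8_t *p)
    {
        uint8_t modrm = p[0], mod = modrm >> 6, rm = modrm & 7;
        size_t len = 1;
        if (mod == 3) return len;                       /* register operand, no memory */
        if (rm == 4) {                                  /* SIB byte follows            */
            len += 1;
            if (mod == 0 && (p[1] & 7) == 5) len += 4;  /* SIB with no base: disp32    */
        } else if (mod == 0 && rm == 5) {
            len += 4;                                   /* disp32, no base register    */
        }
        if (mod == 1) len += 1;                         /* disp8  */
        if (mod == 2) len += 4;                         /* disp32 */
        return len;
    }

    /* Returns the instruction length in bytes, or 0 for "outside the toy subset". */
    size_t insn_length(const uint8_t *p)
    {
        size_t len = 0;
        while (is_prefix(p[len])) len++;                /* step over legacy prefixes */
        uint8_t op = p[len++];

        if (op == 0x90 || op == 0xC3) return len;                      /* nop, ret           */
        if (op >= 0xB8 && op <= 0xBF) return len + 4;                  /* mov r32, imm32     */
        if (op == 0x89 || op == 0x01) return len + modrm_len(p + len); /* mov/add r/m32, r32 */
        return 0;
    }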


Note how this returns a length, i.e. you can't start the state machine for predecoding the next instruction until you've finished decoding the current one. This means longer delays when predecoding more macro-ops. I don't know what the gate propagation delays are compared to the clock period, but this is a very critical path, so I assume it will hurt.

Then again, both Intel and AMD make it work, so there must be a way, if you're willing to pay the hardware cost. Now that I think about it, the same linear-to-logarithmic trick used in carry-lookahead adders can be done here: put a predecoder at every possible byte offset, and throw away any result where the previous predecoder said to skip that byte.
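
In software terms, that trick looks roughly like the sketch below. It reuses the toy insn_length() from the sketch above (so the same restrictions apply), and it assumes the 16-byte window starts at an instruction boundary and that the underlying buffer has a few bytes of slack past the window, as a real fetch buffer would; predecode_window is a made-up name.

    /* Sketch of the "predecoder at every byte" idea: speculatively compute a
     * length at every offset (independent, so done in parallel in hardware),
     * then a cheap serial pass keeps only the real instruction starts. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    size_t insn_length(const uint8_t *p);   /* toy length decoder from above */

    #define WINDOW 16

    /* Marks which bytes of a 16-byte fetch window begin an instruction.
     * Returns how many instructions (macro-ops) the window yields. */
    size_t predecode_window(const uint8_t *window, uint8_t starts[WINDOW])
    {
        size_t len_at[WINDOW];

        /* Phase 1: purely parallel in hardware -- every byte gets its own
         * length decoder, each assuming an instruction starts right there. */
        for (size_t i = 0; i < WINDOW; i++)
            len_at[i] = insn_length(window + i);

        /* Phase 2: select the valid chain. Still serial, but now it's a
         * short chase over precomputed lengths, not a byte-by-byte decode. */
        memset(starts, 0, WINDOW);
        size_t count = 0;
        for (size_t i = 0; i < WINDOW && len_at[i] != 0; ) {
            starts[i] = 1;
            count++;
            i += len_at[i];
        }
        return count;
    }

The serial part doesn't disappear, but it shrinks to a short chase over precomputed lengths, which is the win the adder analogy is pointing at.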


That's a good solution and it probably wouldn't be too expensive, relative to a Xeon.

This also demonstrates where it really hurts: when you want to do something low-cost and very low-power, with a small die. And that's where ARM and RISC-V shine. The same ISA (and therefore toolchain, in theory) can do everything from the tiniest microcontroller to a huge server. This is not the case for x86.


The implication of their comment is that x86_64 is EOL and a new architecture is necessary to continue making improvements (this is not my own opinion, just how I read the comment).


> If you drop backwards compatibility, you can get a lot of headroom for better performance.

Microsoft tried that with the Surface X and failed


Did they really try though? The Surface RT always felt like a tentative "what do you think?" that got thrown under the bus pretty quick as soon as there was any whining.

When Apple rolls out a product, there's some transitional overlap, but you can see them getting ready to burn their viking ships in that period. Microsoft's efforts always have lacked that kind of commit factor. IMO.


They halfheartedly tried that on Windows NT with MIPS, PPC, Alpha, and Itanium versions.

Lack of key software from Microsoft doomed them to a niche.


Not really. x86 never stopped being the main product.


> If you drop backwards compatibility, you can get a lot of headroom for better performance.

Far less than a lot of people probably think, though. AMD has an excellent x86 core, faster than the M1 in both single-threaded performance and throughput, on a generation-older process technology, and quite possibly with a smaller design and development budget than Apple's, although not as power efficient.


> Apple is in a unique position of being able to force a new architecture on its customers, without losing them.

Apple's "brand permission", in marketing speak, has given them extraordinary latitude over the past 15 to 20 years. They've been able to get away with making abrupt transitions and other relatively wrenching choices that tech pubs, and doubtless forums like this, wailed about, but which their customers were mostly fine with. They've done things, with respect to ports, limited options, etc., that Microsoft and WinTel laptops, for example, couldn't.


The difference there is that Apple's customer is the actual end customer, the user, while Microsoft's customer is the OEM. It's the OEM that Microsoft sells the software to (generally speaking), and it's the OEM who actually decides what devices get made and what their features are, not Microsoft.

As a result Microsoft has to worry about what the OEMs want and how they will use the product. In contrast Apple only cares what the end user wants, what their experience is, what features they get and how they work.

An OEM cares about whether they're making ARM laptops or Intel laptops. They care about and want input into the implementation details. An end user doesn't care if Photoshop is running on an ARM chip or an Intel chip, they care about how well it runs and what the battery life is. They (generally speaking) don't care about the implementation details.


Apple moved from 68k to PPC, and then to Intel. Binary transitions like those are not new, and are made easier by an OS that is architecture agnostic from very early on.


It's not so hard for the OS to be architecture agnostic. Linux runs on lots of architectures, and Windows NT was portable enough to run on MIPS early on; it also runs on ARM in addition to x86.

The main challenge is seamlessly migrating users to the new platform and Apple did a great job at it using Rosetta.

Both Linux and especially Windows struggle at this because they lack something as well integrated as Rosetta and require all applications to be recompiled for a new architecture.


Microsoft could have ported Office to MIPS, Alpha, PPC, and Itanium. I don't think IE ever had a PPC port. The same applies to Visual Studio. When Alphas started appearing, that's all I would have needed to be happy: a way to use email, to browse the web and use AltaVista, and Visual Studio so I could write programs. It'd have been a different story.


Holy crap, man - I read two sentences into your comment before literally hitting reply to say ‘as a long-time Mac and iOS dev this is one of the most logical and undeniable statements I’ve ever heard’.

I read the rest after typing this.

Intel was...it was king.

To be honest - Apple actually got some pretty damn good performance out of the PowerPC chips and architecture - my Quad-Core G5 tower with 16GB RAM is still used for finalization of my music projects, due to its insanely smooth performance - and tbh coming from a very experienced user of modern Macs it still kicks ass.



