
If you drop backwards compatibility, you can get a lot of headroom for better performance.

Apple is in a unique position of being able to force a new architecture on its customers without losing them. They have done it twice. They aren't even exactly compatible with normal ARM, due to a special agreement with ARM Holdings.

Intel had the i960, quite cool and successful, and could not capitalize on it in the long term for low-power devices. Intel also bought rights to ARM (the StrongARM line, later XScale), and could not capitalize on that in the long term either, even though ARM was well suited for battery-powered devices, and eventually sold it off!

Intel used to be the king of data centres, using an architecture from the 1980s, extended and pimped up to the brim, yet still beholden to backwards compatibility. And a king it still is. But this pillar seems to shake more and more.

I don't think this is a result of poor engineering. It was, to my mind, a set of business bets, which worked well, until they didn't any more.



Apple's CPUs are compatible at the userspace level with normal ARM. At the kernel level, I only know of one architectural violation so far (other than adding custom optional stuff): the M1 has the HCR_EL2.E2H bit forced to 1, which forces hypervisors to use VHE mode (non-VHE operation is not supported, which is a violation of the arch spec). This only matters for hypervisors, as regular OS operation doesn't care about this.
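
For the curious, here's roughly what probing that looks like: a minimal sketch, assuming bare-metal/hypervisor code already running at EL2 with a GCC-style toolchain (this is not something userspace or even EL1 can do). E2H is bit [34] of HCR_EL2 in the ARMv8-A spec; the function names here are made up for illustration.

    /* Minimal sketch: probing whether HCR_EL2.E2H is writable. Assumes
     * bare-metal code already running at EL2 with a GCC-style toolchain. */
    #include <stdint.h>

    #define HCR_E2H (1ULL << 34)   /* E2H is bit [34] of HCR_EL2 */

    static inline uint64_t read_hcr_el2(void)
    {
        uint64_t v;
        __asm__ volatile("mrs %0, hcr_el2" : "=r"(v));
        return v;
    }

    static inline void write_hcr_el2(uint64_t v)
    {
        __asm__ volatile("msr hcr_el2, %0" : : "r"(v));
        __asm__ volatile("isb" ::: "memory");
    }

    /* Returns 1 if the core forces E2H to stay set (what the M1 does),
     * 0 if the bit is writable as the architecture normally requires.
     * Note: on a conforming core, clearing E2H really does switch EL2 out
     * of VHE mode, so a real hypervisor would only run a probe like this
     * very early in boot, before it relies on VHE behaviour. */
    int e2h_is_forced(void)
    {
        uint64_t saved = read_hcr_el2();

        write_hcr_el2(saved & ~HCR_E2H);        /* try to request non-VHE mode */
        int forced = (read_hcr_el2() & HCR_E2H) != 0;

        write_hcr_el2(saved);                   /* restore the original value */
        return forced;
    }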

It is true that Apple implemented a bunch of custom optional features (some of which are, arguably, in violation of architectural expectations), and they definitely have some kind of deal with ARM to be able to do this, but from a developer perspective they are all optional and can be ignored. I don't think Apple exposes any of them directly to iOS/macOS developers. They only use them internally in their own software and libraries (some are for Rosetta, some are used in Accelerate.framework, some are used to implement MAP_JIT and pthread_jit_write_protect_np, some are only used by the kernel).
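
For reference, the MAP_JIT / pthread_jit_write_protect_np path is roughly the following from an app's point of view. This is a sketch using the documented macOS APIs (mmap with MAP_JIT, pthread_jit_write_protect_np, sys_icache_invalidate); error handling is minimal, the two AArch64 encodings are just an illustrative payload, and hardened-runtime builds additionally need the com.apple.security.cs.allow-jit entitlement.

    /* Sketch of the public JIT-memory flow on Apple Silicon macOS. */
    #include <sys/mman.h>
    #include <pthread.h>
    #include <libkern/OSCacheControl.h>
    #include <stdint.h>
    #include <string.h>

    typedef uint64_t (*jit_fn)(void);

    int main(void)
    {
        size_t len = 0x4000;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_JIT, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        uint32_t code[] = { 0xd2800540,   /* mov x0, #42 */
                            0xd65f03c0 }; /* ret         */

        pthread_jit_write_protect_np(0);          /* MAP_JIT pages writable for this thread */
        memcpy(buf, code, sizeof(code));
        pthread_jit_write_protect_np(1);          /* flip back to executable, not writable  */
        sys_icache_invalidate(buf, sizeof(code)); /* keep the I-cache coherent              */

        jit_fn fn = (jit_fn)buf;
        return fn() == 42 ? 0 : 1;                /* exit status 0 if the JITted code ran   */
    }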


Their AMX instructions, and those decompress instructions you found in user space, are exactly the kinds of things even regular architectural license holders aren't allowed to do, even hidden behind a library.

They very clearly have a special, Apple-only relationship with ARM.


AMX is available in userspace, but only used by their Accelerate.framework library. Apple does not document or expect them to be used by any apps, and I expect they'd reject any App Store submissions that use them, as I doubt they guarantee their continued existence in their current form in future CPUs.

FWIW, the compression instructions are used by the kernel, and I don't even know if they work from userspace. I've only ever tried them in EL2.


Sure. But the M1 doesn't drop backwards compatibility with existing (userspace) code, which nine_k arguably suggested it did.


They co-founded Arm. It’s not too surprising that they know the right people there...


I'm not sure if this is the right spec, but it repeatedly references a section "Behavior of HCR_EL2.E2H" that I can't find anywhere in the 4700-page PDF "Arm® Architecture Registers, Armv8, for Armv8-A architecture profile"

https://developer.arm.com/documentation/ddi0595/2021-03/AArc...



This is a link to a PDF document from https://cpu.fyi/, pointing to an entry labeled "ARMv8-A Architecture Reference Manual".


Gah, I copied that link wrong. I meant to reference https://cpu.fyi/d/98dfae#G23.11082057, which is that specific section about HCR_EL2.E2H's behavior.


Ah. Thx!


Thanks, I appreciate the guidance!


> If you drop backwards compatibility, you can get a lot of headroom for better performance.

People say this without thinking. There is no real evidence at all it is true.

Something like x86 support on an IA64 chip costs extra transistors. But there is no real fundamental reason why it should make anything slower.

This is even more so for AVX512 instructions, which aren't backwards compatible in any way.

So - exactly - how would dropping backwards compatibility speed up AVX512 division?


One of the advantages of the M1 is that instructions are fixed size. With x86 you need to deal with instructions that can be anything between 1 and 15 bytes.


Sure.

But that's one switch, implemented in hardware in the decode pipeline.

It makes implementation more complicated, but no reason it has to be slower.


As I understood it, it means you can't look at the next insn before decoding the current one enough to know where the next one starts. Meanwhile, arm can decode insns in parallel.

Now I wonder why x64 can't re-encode the instructions: put a flag somewhere that enables the new encoding, but keep the semantics. This would keep the cost of the switch low. There will be some trouble, e.g. you can't encode full 64-bit values directly. But mostly it seems manageable.


> As I understood it, it means you can't look at the next insn before decoding the current one enough to know where the next one starts. Meanwhile, arm can decode insns in parallel.

This is incorrect.

Intel Skylake has 5 parallel decoders (I think M1 has 8): https://en.wikichip.org/wiki/intel/microarchitectures/skylak...

AMD Zen has 4: https://en.wikichip.org/wiki/amd/microarchitectures/zen#Deco...


You're right, but not for this reason. The important part is the pre-decode stage, which does exactly this merging of bytes into macro-ops. Each cycle, Skylake can convert at most 16 bytes into at most 6 macro-ops. These macro-ops are then passed to the instruction decoders.

Which is impressive, if you think about it. But it is also complicated machinery for a part that's basically free when instructions are fixed width and you can wire the fetch buffer straight to the instruction decoders. Expanding the pre-decoder to 32 bytes would take a lot of hardware, while fixed width just means a few more wires.


It's actually not that complicated. Here is a state machine coded in C that does it for the base set.

https://stackoverflow.com/questions/23788236/get-size-of-ass...

The same technique could be extended to cover all of them, and it's not so difficult to implement this in Verilog.

As long as this state machine runs at the same throughput as the icache bandwidth then it is not the bottleneck. It shouldn't be too difficult to achieve that.

But it is definitely extra complexity, and requires space and power.
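
To make the shape of that state machine concrete, here is a deliberately tiny sketch along the same lines (not the linked answer's code): it only handles 32-bit mode, a handful of opcodes, and no REX/VEX/EVEX or operand-size/address-size prefixes, but the prefix / opcode / ModRM / SIB / displacement / immediate walk is the general pattern. Function names like insn_length are made up for illustration.

    /* Toy x86 length pre-decoder. Deliberately restricted to 32-bit mode,
     * a few legacy prefixes, and five opcodes; real hardware has to cover
     * the whole opcode map, but the structure is the same. */
    #include <stddef.h>
    #include <stdint.h>

    static int is_prefix(uint8_t b)
    {
        switch (b) {
        case 0xF0: case 0xF2: case 0xF3:               /* lock / repne / rep */
        case 0x2E: case 0x36: case 0x3E: case 0x26:
        case 0x64: case 0x65:                          /* segment overrides  */
            return 1;
        default:
            return 0;
        }
    }

    /* Bytes contributed by ModRM + optional SIB + displacement. */
    static size_t modrm_len(const uint8_t *p)
    {
        uint8_t modrm = p[0], mod = modrm >> 6, rm = modrm & 7;
        size_t len = 1;
        if (mod == 3) return len;                       /* register operand, no memory */
        if (rm == 4) {                                  /* SIB byte follows            */
            len += 1;
            if (mod == 0 && (p[1] & 7) == 5) len += 4;  /* SIB with no base: disp32    */
        } else if (mod == 0 && rm == 5) {
            len += 4;                                   /* disp32, no base register    */
        }
        if (mod == 1) len += 1;                         /* disp8  */
        if (mod == 2) len += 4;                         /* disp32 */
        return len;
    }

    /* Returns the instruction length in bytes, or 0 for "outside the toy subset". */
    size_t insn_length(const uint8_t *p)
    {
        size_t len = 0;
        while (is_prefix(p[len])) len++;                /* step over legacy prefixes */
        uint8_t op = p[len++];

        if (op == 0x90 || op == 0xC3) return len;                      /* nop, ret           */
        if (op >= 0xB8 && op <= 0xBF) return len + 4;                  /* mov r32, imm32     */
        if (op == 0x89 || op == 0x01) return len + modrm_len(p + len); /* mov/add r/m32, r32 */
        return 0;
    }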


Note how this returns a length, i.e. you can't start the state machine for predecoding the next instruction until you've finished decoding the current one. This means longer delays when predecoding more macro-ops. I don't know what the gate propagation delays are compared to the clock period, but this is a very critical path, so I assume it will hurt.

Then again, both Intel and AMD make it work, so there must be a way, if you're willing to pay the hardware cost. Now that I think about it, the same linear-to-logarithmic trick used in carry-lookahead adders can be done here: put a predecoder at every possible byte offset, and throw away any result where the previous predecoder said to skip that byte.
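
In software terms, that trick looks roughly like the sketch below. It reuses the toy insn_length() from the sketch above (so the same restrictions apply), and it assumes the 16-byte window starts at an instruction boundary and that the underlying buffer has a few bytes of slack past the window, as a real fetch buffer would; predecode_window is a made-up name.

    /* Sketch of the "predecoder at every byte" idea: speculatively compute a
     * length at every offset (independent, so done in parallel in hardware),
     * then a cheap serial pass keeps only the real instruction starts. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    size_t insn_length(const uint8_t *p);   /* toy length decoder from above */

    #define WINDOW 16

    /* Marks which bytes of a 16-byte fetch window begin an instruction.
     * Returns how many instructions (macro-ops) the window yields. */
    size_t predecode_window(const uint8_t *window, uint8_t starts[WINDOW])
    {
        size_t len_at[WINDOW];

        /* Phase 1: purely parallel in hardware -- every byte gets its own
         * length decoder, each assuming an instruction starts right there. */
        for (size_t i = 0; i < WINDOW; i++)
            len_at[i] = insn_length(window + i);

        /* Phase 2: select the valid chain. Still serial, but now it's a
         * short chase over precomputed lengths, not a byte-by-byte decode. */
        memset(starts, 0, WINDOW);
        size_t count = 0;
        for (size_t i = 0; i < WINDOW && len_at[i] != 0; ) {
            starts[i] = 1;
            count++;
            i += len_at[i];
        }
        return count;
    }

The serial part doesn't disappear, but it shrinks to a short chase over precomputed lengths, which is the win the adder analogy is pointing at.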


That's a good solution and it probably wouldn't be too expensive, relative to a Xeon.

This also demonstrates where it really hurts: when you want to do something low-cost and very low-power, with a small die. And that's where ARM and RISC-V shine. The same ISA (and therefore toolchain, in theory) can do everything from the tiniest microcontroller to a huge server. This is not the case for x86.


The implication of their comment is that x86_64 is EOL and a new architecture is necessary to continue making improvements (this is not my own opinion, just how I read the comment).


> If you drop backwards compatibility, you can get a lot of headroom for better performance.

Microsoft tried that with the Surface X and failed


Did they really try though? The Surface RT always felt like a tentative "what do you think?" that got thrown under the bus pretty quick as soon as there was any whining.

When Apple rolls out a product, there's some transitional overlap, but you can see them getting ready to burn their viking ships in that period. Microsoft's efforts always have lacked that kind of commit factor. IMO.


They halfheartedly tried that on Windows NT with MIPS, PPC, Alpha, and Itanium versions.

Lack of key software from Microsoft doomed them to a niche.


Not really. x86 never stopped being the main product.


> If you drop backwards compatibility, you can get a lot of headroom for better performance.

Far less than a lot of people probably think, though. AMD has an excellent x86 core, faster than the M1 in both single-threaded performance and throughput, on a generation-older process technology, and quite possibly with a smaller design and development budget than Apple's, although not as power efficient.


> Apple is in a unique position of being able to force a new architecture on its customers, without losing them.

Apple's "brand permission", in marketing speak, has given them extraordinary latitude over the past 15 to 20 years. They've been able to get away with making abrupt transitions and other relatively wrenching choices that tech pubs, and doubtless forums like this, wailed about, but which their customers were mostly fine with. They've done things, with respect to ports, limited options, etc., that Microsoft and WinTel laptops, for example, couldn't.


The difference there is that Apple's customer is the actual end customer, the user, while Microsoft's customer is the OEM. It's the OEM that Microsoft sells the software to (generally speaking), and it's the OEM who actually decides what devices get made and what their features are, not Microsoft.

As a result Microsoft has to worry about what the OEMs want and how they will use the product. In contrast Apple only cares what the end user wants, what their experience is, what features they get and how they work.

An OEM cares about whether they're making ARM laptops or Intel laptops. They care about and want input into the implementation details. An end user doesn't care if Photoshop is running on an ARM chip or an Intel chip, they care about how well it runs and what the battery life is. They (generally speaking) don't care about the implementation details.


Apple moved from 68k to PPC, and then to Intel. Binary transitions like those are not new, and are made easier by an OS that is architecture agnostic from very early on.


It's not so hard for the OS to be architecture agnostic. Linux runs on lots of architectures, and Windows NT was portable enough to run on MIPS early on; it also runs on ARM in addition to x86.

The main challenge is seamlessly migrating users to the new platform and Apple did a great job at it using Rosetta.

Both Linux and especially Windows struggle at this because they lack something as well integrated as Rosetta and require all applications to be recompiled for a new architecture.


Microsoft could have ported Office to MIPS, Alpha, PPC, and Itanium. I don't think IE ever had a PPC port. The same applies to Visual Studio. When Alphas started appearing, that's all I would have needed to be happy: a way to use email, to browse the web and use AltaVista, and Visual Studio so I could write programs. It'd have been a different story.


Holy crap, man - I read two sentences into your comment before literally hitting reply to say ‘as a long-time Mac and iOS dev this is one of the most logical and undeniable statements I’ve ever heard’.

I read the rest after typing this.

Intel was...it was king.

To be honest - Apple actually got some pretty damn good performance out of the PowerPC chips and architecture - my Quad-Core G5 tower with 16GB RAM is still used for finalization of my music projects, due to its insanely smooth performance - and tbh coming from a very experienced user of modern Macs it still kicks ass.



