
I'll have to read up on Agner's findings.

My assumptions are largely based on annotated die shots, like this one of Rocket Lake (IIRC): https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_pr...

If they are correct, the uOP cache consumes at least as much silicon as the L1I cache, while it can generally hold fewer instructions.

Some napkin math: x86 instructions are 4 bytes long on average, so a 32 KiB L1I can hold 32K/4 = 8K instructions, while the uOP cache can hold 4K uOPs (how many uOPs does an x86 instruction translate to on average?). That would indicate that uOPs require twice the silicon area to store compared to "raw" x86 instructions - or that the uOP cache is more advanced/complex than the L1I cache (which may very well be the case).
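
Or, in code form (both the 4-byte average and the 4K uOP figure are assumptions, as noted above):

    # Napkin math in code form; the averages are assumptions, not measurements.
    L1I_BYTES = 32 * 1024              # 32 KiB L1 instruction cache
    AVG_X86_INSN_BYTES = 4             # assumed average x86 instruction length
    UOP_CACHE_ENTRIES = 4 * 1024       # 4K uOP entries

    l1i_insns = L1I_BYTES // AVG_X86_INSN_BYTES
    print(l1i_insns)                      # 8192 x86 instructions in the L1I
    print(l1i_insns / UOP_CACHE_ENTRIES)  # 2.0: the L1I holds twice as many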

Also visible from the die shots: decoding and branch prediction are far from free.




According to the label, that block contains both the uop cache AND the microcode ROM (which is actually at least partially RAM to allow for microcode updates). I guess it makes sense to group the two functions together: they are both alternative sources of uOPs that don't come from the instruction decoder.

So it really depends on what the balance is. If the uOP cache were two or three of those memory cell blocks, I agree it's quite big. But if it's just one, it's actually quite small.

Agner's findings are for the Sandybridge implementation. He says Haswell and Skylake share the same limitations, but it doesn't look like he has done much research into the later implementations.

The findings actually point to the uOP cache being much simpler in structure. The instruction cache has to support arbitrary instruction alignment and fetches that cross boundaries. The uOP cache has strict alignment requirements, it delivers one cache line per cycle and always delivers the entire line. If there aren't enough uops, then the rest of the cacheline is unused.
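
A toy model of that delivery rule (the 6-uOP line size is Sandy Bridge's, per Agner; the code is just illustrative):

    # Toy model of the "whole line or nothing" delivery rule described above.
    UOPS_PER_LINE = 6   # Sandy Bridge uOP cache line size per Agner

    def deliver(line):
        """A uOP cache line is delivered whole, one line per cycle.
        Empty slots in a partially filled line are simply wasted bandwidth."""
        used = [u for u in line if u is not None]
        wasted_slots = UOPS_PER_LINE - len(used)
        return used, wasted_slots

    # A line holding only 4 uOPs still costs a full delivery cycle:
    print(deliver(["uop0", "uop1", "uop2", "uop3", None, None]))
    # (['uop0', 'uop1', 'uop2', 'uop3'], 2)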

> Also visible from the die shots: decoding and branch prediction are far from free.

Yeah, it appears to be massive. And I get the impression that block is more branch prediction than decoding.

Nothing is free in CPU design, it's just a massive balancing act.


> According to the label, that block contains both the uop cache AND the microcode ROM

Yes, so it's hard to tell the exact size. We can only conclude that the uOP cache and the microcode ROM combined are about twice the size of the L1I cache (in terms of memory cells).

Another core die shot, this one of the Zen 2 microarchitecture (it appears to be correct, as it is based on official AMD slides): https://forums.anandtech.com/proxy.php?image=https%3A%2F%2Fa...

Here uCode is in a separate area, and if we assume that the SRAM blocks in the area marked "Decode" represent the uOP cache, then we have:

* The uOP cache has the same physical size as the L1I cache

* uOP cache size = 4K uOPs

* L1I cache size = 32 KiB ~= 8K x86 instructions

If all this holds true (it's a big "if"), the number of uOPs that the uOP cache can hold is only half the number of x86 instructions that the L1I cache can hold, and uOP entries are in fact close to 32 KiB / 4K uOPs = 64 bits each (given how similar the SRAM cells for the two caches are on the die shot, I assume that they have the same density).

Furthermore I assume that one x86 instruction translates to more than one uOP instruction on average (e.g. instructions involving memory operands are cracked, and instructions with large immediates occupy more than one uOP slot - even the ARMv8 Vulcan microarchitecture sees a ~15% increase in instructions when cracking ARM instructions into uOPs: https://en.wikichip.org/wiki/cavium/microarchitectures/vulca... ), which would mean that the silicon area efficiency of the uOP cache compared to a regular L1I cache is even less than 50%.
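
Putting those assumptions into code (the 64-bit entry size follows from the area assumption above, and the 1.15 expansion factor is just borrowing Vulcan's number for the sake of argument):

    # Area-efficiency estimate based purely on the assumptions above.
    L1I_BITS = 32 * 1024 * 8    # 32 KiB L1I, assumed same physical size as uOP cache
    UOP_ENTRIES = 4 * 1024
    bits_per_uop = L1I_BITS / UOP_ENTRIES   # implied size of one uOP entry
    uops_per_x86_insn = 1.15                # assumed expansion factor (Vulcan's ~15%)

    x86_insns_in_l1i = 8 * 1024             # from the 4-byte average above
    x86_insns_covered_by_uop_cache = UOP_ENTRIES / uops_per_x86_insn
    print(bits_per_uop)                                       # 64.0
    print(x86_insns_covered_by_uop_cache / x86_insns_in_l1i)  # ~0.43, i.e. <50%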

Edit:

> Nothing is free in CPU design, it's just a massive balancing act.

Yup, and a large part of the x86 balancing act is to keep the x86 ISA alive and profit from the massive x86 ecosystem. Therefore Intel and AMD are prepared to sacrifice certain aspects, like power efficiency (and presumably performance too), and blow lots of energy on the x86 translation front end. That is a balancing act that designers of CPUs with more modern ISAs don't even have to consider.


Yeah, that logic seems to all work out.

I found annotated die shots of Zen 3 and Zen 4 that pretty much confirm the op cache: https://locuza.substack.com/p/zen-evolution-a-small-overview

Pretty strong evidence that AMD are using a much simpler encoding scheme with roughly 64 bits per uop. Also, that uop cache on Zen 4 is starting to look ridiculously large.

But that does give us a good idea of how big the microcode ROM is. If we go back to the previous Intel die shot with its combined microcode ROM + uop cache, it appears Intel's uop cache is actually quite small thanks to their better encoding.

> Furthermore I assume that one x86 instruction translates to more than one uOP instruction on average (e.g. instructions involving memory operands are cracked

I suspect it's not massively higher than one uop per instruction. Remember, the uop cache is in the fused-uop domain (so before memory cracking), and instruction fusion can actually squash some instruction pairs into a single uop.
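
A rough illustration of counting in the fused domain (the per-instruction uOP counts are typical for recent Intel cores per Agner's tables, but treat them as approximations):

    # Counting fused-domain uOPs for a short sequence; counts are approximate.
    sequence = [
        ("mov rax, [rbx]", 1),  # plain load: 1 fused uOP
        ("add rax, [rcx]", 1),  # load+add micro-fuses into 1 fused uOP
        ("cmp rax, 100",   1),  # macro-fuses with the branch below...
        ("jne loop",       0),  # ...so the pair costs a single fused uOP
    ]
    insns = len(sequence)
    fused_uops = sum(n for _, n in sequence)
    print(fused_uops / insns)   # 0.75 fused uOPs per instruction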

The bigger hindrance will be any rules that prevent every uop slot from being filled. Intel appears to have many such rules (at least for Sandybridge/Haswell/Skylake).

> and blow lots of energy on the x86 translation front end

TBH, we have no idea how big the x86 tax is. We can't just assume the difference in performance per watt between the average x86 design and average high performance aarch64 design is entirely caused by the x86 tax.

Intel and AMD simply aren't incentivised to optimise their designs for low power consumption as their cores simply aren't used in mobile phones where ultra low power consumption is absolutely critical.


> I found annotated die shots of Zen 3 and Zen 4

Ooo thanks! Sure looks like strong evidence.

> TBH, we have no idea how big the x86 tax is.

No, and it gets even more uncertain when you consider different design targets. E.g. a 1000W Threadripper targets a completely different segment than a 10W ARM Cortex. Would an ARM chip designed to run at 1000W beat the Threadripper? Who knows?

> Intel and AMD simply aren't incentivised to optimise their designs for low power consumption as their cores simply aren't used in mobile phones where ultra low power consumption is absolutely critical.

They'll keep doing their thing until they can't compete. They lost mobile and embedded, and competitors are eating into laptops and servers, where x86 still has a stronghold. But perf/watt matters in all segments these days, and binary compatibility is dropping in importance (e.g. compared to 20-40 years ago), largely thanks to open source.

IMO the writing is on the wall, but it will take time (especially in the slow-moving server market).


Yeah, I agree that the writing is on the wall for x86. As you said, power consumption does matter for laptops and server farms.

I'm a huge fan of aarch64, it's a very well designed ISA. I bought a Mac so I could get my hands on a high-performance aarch64 core. I love the battery life and I'm not going back.

I only really defend x86 because nobody else does, and then people dog-pile on it, talking it down and misrepresenting what a modern x86 pipeline is.

Though I wouldn't write x86 off yet. I get the impression that Intel are planning to eventually abandon their P core arch (the one with direct lineage all the way back to the original Pentium Pro). They haven't been doing much innovation on it.

Intel's innovation is actually focused on their E core arch, which started as the Intel Atom and wasn't even out-of-order. It has slowly evolved over the years with a continued emphasis on low power consumption, until it's actually pretty competitive with the P core arch.

If you compare Golden Cove and Gracemont, the frontend is radically different. Golden Cove has a stupid 6-wide decoder that can deliver 8 uops per cycle... though it's apparently sitting idle 80% of the time (power gated) thanks to the 4K uop cache.

Gracemont doesn't have a uop cache. Instead it uses the space for a much larger instruction cache and two instruction decoders running in parallel, each 3-wide. It's a much more efficient way to get 6-wide instruction decoding bandwidth; I assume they are tagging decode boundaries in the instruction cache.
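
Something like this, conceptually (purely my guess at the mechanism, not a description of Gracemont's actual design):

    # Hypothetical sketch of decode-boundary tagging: pre-decoded start
    # offsets stored alongside the I-cache line let two 3-wide clusters pick
    # up instructions without re-doing the serial length decode.
    def split_window(start_offsets, width=3):
        """start_offsets: instruction start positions already marked for this
        fetch window. Hand the first 3 to cluster 0, the next 3 to cluster 1."""
        return start_offsets[:width], start_offsets[width:2 * width]

    # Example: 8 instruction starts found in a fetch window
    print(split_window([0, 3, 4, 9, 11, 16, 18, 23]))
    # ([0, 3, 4], [9, 11, 16])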

Gracemont is also stupidly wide. Golden Cove only has 12 execution unit ports, while Gracemont has 17. It's a bit narrow in other places (only 5 uops per cycle between the frontend and backend), but give it a few more generations and a slight shift in focus and it could easily outperform the P core. Perhaps add a simple loop-stream buffer and upgrade to three or four of those 3-wide decoders running in parallel.

Such a design would have a significantly lower x86 tax. Low enough to save them in the laptop and server farm market? I have no idea. I'm just not writing them off.


> I'm a huge fan of aarch64, it's a very well designed ISA

I totally agree. I would go as far as to say that it's the best "general purpose" ISA available today. I am under the impression that the design was heavily data driven, building on decades of industry experience in many different sectors and actually providing efficient instructions for the operations that are used the most in real code.

> I only really defend x86 because nobody else does

:-D I can easily identify with that position.

> I get the impression that Intel are planning to eventually abandon their P core arch

Very interesting observation. It makes a lot of sense.

I also think that we will see more hybrid solutions. Looking at the Samsung Exynos 2200, as an example from the low-power segment, it's obvious that the trend is towards heterogeneous core configurations (1 Cortex-X2 + 3 Cortex-A710 + 4 Cortex-A510): https://locuza.substack.com/p/die-analysis-samsung-exynos-22...

Heterogeneous core configurations have only just recently made it to x86, and I think they can extend the lifetime of x86.

For laptops, I can see an x86 solution where you have a bunch of very simple and power-efficient cores at the bottom, that perhaps even uses something like software-aided decoding (which appears to be more power-efficient than pure hardware decoding) and/or loop buffers (to power down the front end most of the time). Then build on top of that with a few "good" E-cores, and only one or two really fast cores for single-threaded apps.

For servers I think that having many good E-cores would be a better fit. Kind of similar to the direction AMD is taking with their Bergamo EPYC parts (though technically Bergamo is not built around an E-core; it just packs more cores into the same TDP).


> I am under the impression that the design was heavily data driven, building on decades of industry experience in many different sectors and actually providing efficient instructions for the operations that are used the most in real code.

Yeah, that's the impression I get too. I also get the impression they were planning ahead for the very wide GBOoO designs (I think Apple had quite a bit of influence on the design, and they were already working on a very wide GBOoO microarch), so there is a bias towards a very dense fixed-width encoding, at the expense of increased decoding complexity.

ARM weren't even targeting the ultra low end, as they have a completely different -M ISA for that.

This is in contrast to RISC-V. Not only do they target the entire range from ultra low end to high performance, but the resulting ISA feels like it has a bias towards ultra-low gate count designs (the way immediates are encoded points towards this).

---------------

You might hate me for this, but I have to raise the question:

Does AArch64 actually count as a RISC ISA?

It might have the fixed width encoding and load/store arch we typically associate with RISC ISAs. But there is one major difference that arguably disqualifies it on a technicality.

All the historic RISC ISAs were designed in parallel with the first-generation microarchitecture of the first CPU to implement them, and were hyper-optimised for that microarchitecture (often to a fault, leaving limited room for expansion and introducing "mistakes" like branch delay slots). Such ISAs were usually very simple to decode, which led to the famous problems that RISC had with code density.

I lean towards the opinion that this tight coupling between ISA and RISC microarchitecture is another fundamental aspect of a RISC ISA.

But AArch64 was apparently designed by committee, independent of any single microarchitecture. And they apparently had a strong focus on code density.

The result is something that is notably different from any other RISC ISA.

You could make a similar argument about RISC-V: it was also designed by committee, independent of any single microarchitecture. But they did so with an explicit intention to make a RISC ISA, and the end result feels very RISC to me.

> that perhaps even uses something like software-aided decoding (which appears to be more power-efficient than pure hardware decoding)

At this point, I have very little hope for software-aided decoding. Transmeta tried it, Intel kind of tried it with software x86 emulation on Itanium, Nvidia bought the Transmeta patents and tried it again with Denver.

None of these attempts worked well, so I kind of have to conclude it's a flawed idea.

Though the flaw was probably the statically scheduled VLIW arch they were translating into. Maybe if you limited your software decoding to just taking off the rough edges of x86 instruction encoding it could be a net win.


> ARM weren't even targeting the ultra low end, as they have a completely different -M ISA for that.

That's the brilliance of it all, IMO. They didn't have to target the ultra low end, since they already had an ISA that works perfectly in that segment (the -M is basically a dumbed down ARMv7, but most of the ecosystem came for free).

Unlike...

> This is in contrast to RISC-V.

...which is a mistake IMO. When you want to be good at everything, you're not the best at anything.

Footnote: I am convinced that we'll see a fork (of sorts) of RISC-V for the GBOoO segment. Since the ISA is free, companies will do what they see fit with it. Possible candidates for this are NVIDIA and Apple (both known for wanting full control of their HW and SW), or possibly some Chinese company that needs a strong platform now that the US is putting in all kinds of trade restrictions (incl. RISC-V https://techwireasia.com/10/2023/do-risc-v-restrictions-sign...).

> And they apparently had a strong focus on code density.

This is also very interesting. At first you'd think that they abandoned code density when they dropped Thumb, went for fixed-width 32-bit instructions, and even dropped ldm/stm, but when you look closer at the ISA you realize that AArch64 code is very dense. From what I've seen it's usually denser than x86 code. I attribute a large part of this to many key instructions that appear to have been designed for cracking (e.g. ldp x29, x30, [sp], 48 looks suspiciously crackable - it's really 2-3 instructions in one, and it has three (!) outputs).
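
Spelling that example out (my guess at a plausible cracking, not any vendor's documented uOP sequence):

    # How "ldp x29, x30, [sp], 48" might plausibly crack; note the three outputs.
    cracked = [
        ("load", "x29", "[sp + 0]"),   # output 1: first destination register
        ("load", "x30", "[sp + 8]"),   # output 2: second destination register
        ("add",  "sp",  "sp + 48"),    # output 3: post-index writeback of sp
    ]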

So...

> You might hate me for this, but I have to raise the question:

Not at all :-)

> Does AArch64 actually count as a RISC ISA?

No. It has a few obvious traits of a RISC (it's clearly a load-store ISA with many registers), but it also has instructions that clearly were not designed with a 1:1 mapping from architecture to microarchitecture (which I would expect in a RISC ISA).

> I lean towards the opinion that this tight coupling between ISA and RISC microarchitecture is another fundamental aspect of a RISC ISA.

Yes, you have a point. When instructions are designed to decode into a single "internal instruction" that occupies a single slot in the pipeline (one instruction = one clock), we have a RISC ISA. That does not say anything about pipeline topology, though (e.g. parallelism, pipeline lengths, branch delays, etc).

There are obviously edge cases and exceptions that make a precise definition tricky. For instance, I consider MRISC32 to be a RISC ISA, but an implementation may expand vector instructions into multiple operations that take several clock cycles to complete (not entirely different from ldm/stm in ARMv7).

I think that we need to accept a sliding scale, instead of requiring hard limits. Hence the term "RISC-style", rather than "RISC". Perhaps some kind of a Venn diagram would make better sense :-)

> None of these attempts worked well ... the flaw was probably the statically scheduled VLIW

I still think that there is potential in software-aided decoding/translation (heck, most of the software that we run on a daily basis is JIT-translated, so it can't be that bad).

I think that Transmeta failed largely because they initially aimed at a different segment (high performance) but had to adjust their targets to a segment that wasn't really mature at the time (low power), and they didn't have a backup for catering to high performance needs.

However... If software decoding is only used for small, power-efficient cores (and maybe they use something other than VLIW?) that live together with proper high-end cores on an SoC, then I think that the situation would be completely different.

IIRC a main driver for Transmeta was to circumvent x86 licensing issues, so even if big/little had been feasible at the time and they had had the skills and manpower to pull off a high-end x86 core, that would probably not have been an option.


> Footnote: I am convinced that we'll see a fork (of sorts) of RISC-V for the GBOoO segment.

Yeah, seems likely.

Qualcomm has been trying to push RISC-V to be better for GBOoO after ARM fucked them over with their Nuvia purchase. They have a high performance AArch64 core and no AArch64 licence for it.

They have been pushing to drop the compressed 16-bit instruction extension from the core profile, and have proposed a new extension that improves code density by adding new addressing modes stolen from AArch64.

> For instance, I consider MRISC32 to be a RISC ISA, but an implementation may expand vector instructions into multiple operations that take several clock cycles to complete

ARM Inc takes this approach for vector instructions on their little cores (like the A53).

> I still think that there is potential in software-aided decoding/translation (heck, most of the software that we run on a daily basis is JIT-translated, so it can't be that bad).

Ironically, the prevalence of JITs in modern software is one of the major reasons why Project Denver had a hard time. It took a noticeable performance hit when executing JITted code, not that its performance on static code was great. This is despite the fact that Denver had a hardware translator, so it didn't have to send all code through the software translator.

I suspect Transmeta fell into the classic trap of underestimating just how large a performance advantage a GBOoO design gets from hiding the latency of memory ops with out-of-order execution. With hindsight, we now know that advantage is massive, but nobody really knew about it 20 years ago.

I'm not entirely sure what Denver's problem was. I understand they did aggressive memory prefetching to try and compensate. Maybe that just wasn't good enough. Or maybe it was just translation overhead issues; scheduling VLIW code is a hard problem, and the same reason why Itanium failed.

> However... If software decoding is only used for small power efficient cores (and maybe they use something else than VLIW?

Yeah, that might work. Well, more for a medium-sized core than a small one.

I'm thinking maybe you keep the instruction bundling from VLIW so your frontend is significantly simpler, but still use an out-of-order backend so you get the latency-hiding advantage. And because it's only an efficiency core on a heterogeneous SoC, the OS can identify code (or processes) that doesn't work well with software decoding and kick it to the performance cores.

> IIRC a main driver for Transmeta was to circumvent x86 licensing issues

But then Nvidia tried the same approach. Apparently there was a lawsuit, which they lost, and Project Denver was repurposed as an AArch64 core. It might have been a good product if it could have run x86 code.


BTW... Apart from the indications from the die shots, one of the reasons I don't think that uOPs can be as small as 32 bits is that studying fixed-width ISAs and designing MRISC32 have made me appreciate the clever encoding tricks that go into fitting all instructions into 32 bits.

Many of the encoding tricks require compiler heuristics, and you don't want to do that in hardware. E.g. consider the AArch64 encoding of immediate values for bitwise operations (the "logical immediates").
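
For reference, a rough sketch of the decode direction, loosely following DecodeBitMasks() from the Arm ARM (simplified; the encode direction - finding an (N, immr, imms) that reproduces a given constant, or rejecting it - is the part that needs search/heuristics):

    # Rough sketch of decoding an AArch64 logical immediate (N:immr:imms).
    # Error handling and the 32-bit register variant are glossed over.
    def decode_logical_imm(n, immr, imms, datasize=64):
        # Element size: highest set bit of the 7-bit value N:NOT(imms).
        combined = (n << 6) | (~imms & 0x3F)
        length = combined.bit_length() - 1
        if length < 1:
            raise ValueError("reserved encoding")
        esize = 1 << length                 # 2, 4, 8, 16, 32 or 64
        s = imms & (esize - 1)              # (number of ones) - 1
        r = immr & (esize - 1)              # rotate amount
        if s == esize - 1:
            raise ValueError("an all-ones element is not encodable")
        welem = (1 << (s + 1)) - 1          # run of s+1 ones
        rotated = ((welem >> r) | (welem << (esize - r))) & ((1 << esize) - 1)
        # Replicate the element across the register width.
        mask = 0
        for i in range(datasize // esize):
            mask |= rotated << (i * esize)
        return mask

    print(hex(decode_logical_imm(0, 0, 0b111100)))  # 0x5555555555555555

Decoding is cheap, but the encoder has to run this in reverse - find an element size, run length and rotation that reproduce the constant, or report that it can't be encoded - which is exactly the kind of search you'd rather leave to the compiler than do in a pipeline stage.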

Also, even if you manage to do efficient instruction encoding in hardware, you will probably end up in a situation where you need to add an advanced decoder after the uOP cache, which does not make much sense.

The main thing that x86 has going for it in this regard is that most instructions use destructive operands, which probably saves a bunch of bits in the uOP encoding space. But still, it would make much more sense to use more than 32 bits per uOP.


> designing MRISC32 have made me appreciate the clever encoding tricks that go into fitting all instructions into 32 bits.

Keep in mind that the average RISC ISA uses 5-bit register IDs and a three-arg form for most instructions; that's 15 bits gone. AMD64 uses 4-bit register IDs and a two-arg form for most instructions, which is only 8 bits.

Also, the encoding scheme that Agner describes is not a fixed-width encoding. It's variable width with 16-bit, 32-bit, 48-bit and 64-bit uops. There are also some (hopefully rare) uops which don't fit in the uop cache's encoding scheme (forcing a fallback to the instruction decoders). Those two relief valves allow such an encoding to avoid the need for the complex encoding tricks of a proper fixed-width encoding.
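
Conceptually something like this (the widths are the ones Agner lists; the selection rules are made up for illustration):

    # Illustrative only: a made-up width-selection rule in the spirit of the
    # variable-width scheme Agner describes (16/32/48/64-bit uops, plus a
    # fallback to the legacy decoders for anything that doesn't fit).
    def uop_width(imm_bits, needs_extra_fields=False):
        if needs_extra_fields:
            return None        # relief valve #2: refetch via the decoders
        if imm_bits == 0:
            return 16          # register-only uop
        if imm_bits <= 16:
            return 32
        if imm_bits <= 32:
            return 48
        return 64              # relief valve #1: just use a wider slot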

So I find the scheme to be plausible, though what you say about decoders after the uOP cache is true.


> Keep in mind that the average RISC ISA uses 5-bit register IDs and a three-arg form for most instructions; that's 15 bits gone. AMD64 uses 4-bit register IDs and a two-arg form for most instructions, which is only 8 bits.

Yes, I'm aware of that (that's what I meant with "destructive operands").

But then you have AVX-512, which has 32 architectural registers, and some instructions support three register operands (LEA, VEX-encoding, ...). So there you have the choice between multiple encoding formats that are more compact, or a few simple formats that require a wider encoding.

Since x86 isn't known for having a small and consistent instruction set, any attempt to streamline the post-decode encoding will likely gravitate towards a wider encoding (i.e. the worst-case least common denominator).

It would be an interesting exercise to try to fit all the existing x86 instructions (post-cracking) into an encoding of 32, 48 or 64 bits, for instance. But I don't envy the Intel & AMD engineers.


I've had "attempt to implement a wide x86 instruction decoder on an FPGA" on my list of potential hobby projects for a long time. Though I was only planning to do the length decoder. My understanding is that length decoding is the critical path and any additional decoding after that is simple from a timing perspective (and very complex from every other perspective).
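
A toy version of the serial dependency that makes length decoding the critical path (made-up opcode subset; real x86 also needs prefixes, ModRM, SIB, displacements and immediates):

    # Toy serial length decoder over a tiny opcode subset.
    def toy_length(code, pos):
        b = code[pos]
        if b == 0x90 or b == 0xC3:       # NOP, RET
            return 1
        if 0x50 <= b <= 0x5F:            # PUSH/POP r64
            return 1
        if b == 0xE9:                    # JMP rel32
            return 5
        raise NotImplementedError("toy subset only")

    def mark_boundaries(code):
        """Each instruction's start depends on the previous length - the
        serial chain a wide parallel decoder has to break somehow."""
        pos, starts = 0, []
        while pos < len(code):
            starts.append(pos)
            pos += toy_length(code, pos)
        return starts

    print(mark_boundaries(bytes([0x50, 0x90, 0xE9, 0, 0, 0, 0, 0x58, 0xC3])))
    # [0, 1, 2, 7, 8]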

But now you are making me want to experiment with compact uop encodings too.



