
Well, if you're not memory bandwidth bound, you're compute bound by definition.



Not really; the definition you're referring to doesn't exist.

Perf of a GPU can be limited by any one of a thousand little things within the micro-architectural organization of the GPU in question; any on-chip path can become the bottleneck:

1. DRAM bandwidth

2. ALU counts

3. Occupancy

4. Instruction issue port counts

5. Quality of warp scheduling (the scheduling problem)

6. Operand delivery

7. Any given cache bandwidth

8. Register file bandwidth (SRAM port counts)

9. Head-of-line blocking in one of the many queues / paths in the design, whatever that path happens to be responsible for:

- sending memory requests from the instruction processing pipelines to the memory hierarchy, or sending the reply with the data payload back,

- doing the same but with the texture filtering block (rather than the memory hierarchy),

- parsing GPU commands from the command buffers created by the driver,

- processing those already-decoded commands and performing on-chip resource allocation and warp creation / tear-down, all of which needs to spawn work further down fast enough to keep the rest of the design fed;

and so on and so on and so on.

By the time a high-quality design is fully finished, matured, and successful enough on the market to show up on everyone's radar outside of the hardware design space, the limiter usually ends up being 1, 2, or 3, because of the commonly occurring ratios between the costs of solving the various problems above. But that's experimental data + statistics + survivorship bias; there is no "definition" that makes it so.

Further, what's "commonly occurring" changes over time as designs drift into different operational areas in the phase space of operating modes: they pick off the low-hanging fruit, the science and experience behind the micro-architecture grow, common workloads change in nature, and new process nodes with new properties become the norm. The doubling of FP32 ALUs in Ampere is a good example of that; it was done in a way that changed the typical ratios substantially. And now M3 has thrown a giant wrench into the incumbent statistics on the relationships between (3) and the rest of the limiters, as well as between (3) and what a GPU programmer can actually do to mend it.

You can be low on DRAM bandwidth util and ALU util at the same time. How could that be if there were no other limiters?
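
A minimal sketch of how that happens, as a hypothetical CUDA kernel (everything here is made up for illustration, not taken from any real workload): a pointer chase where every load depends on the previous one, launched with only a handful of blocks. Each thread mostly sits waiting on memory latency, so both DRAM bandwidth utilization and ALU utilization stay low; the actual limiter is latency, and the occupancy available to hide it.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Every load depends on the previous one, so each thread has at most one
    // memory request in flight; with only a few warps resident, neither the
    // DRAM interface nor the ALUs come anywhere near saturation. The kernel
    // is limited by memory latency and by how little occupancy hides it.
    __global__ void pointer_chase(const int* __restrict__ next, int steps, int* out)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = 0; i < steps; ++i)
            idx = next[idx];                         // serial dependency chain
        out[blockIdx.x * blockDim.x + threadIdx.x] = idx;
    }

    int main()
    {
        const int n = 1 << 24;                       // 64 MiB chain table
        const int blocks = 32, threads = 64, steps = 100000;

        int *next, *out;
        cudaMallocManaged(&next, n * sizeof(int));
        cudaMallocManaged(&out, blocks * threads * sizeof(int));
        for (int i = 0; i < n; ++i)                  // odd multiplier => a permutation,
            next[i] = (int)((9973LL * i + 1) % n);   // so consecutive hops land far apart

        pointer_chase<<<blocks, threads>>>(next, steps, out);
        cudaDeviceSynchronize();
        printf("done: %d\n", out[0]);

        cudaFree(next);
        cudaFree(out);
        return 0;
    }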

Generally, a component X of a computer system should be the limiter Y% of the time, where Y equals the portion of the system's total cost that X is responsible for.

The principle is easiest to apply in a "calculus of variations" manner: if doubling the key performance metric of X increases the cost of the system as a whole by 5%, but how often X is the limiter drops from 10% of the time to 5% of the time, then doing that doubling would bring the design quite close to proper balance wrt. X.
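
Spelled out with the numbers above, under a deliberately crude model of my own (assume the fraction of runtime where X is the limiter roughly halves when X's key metric is doubled, and nothing else moves):

    #include <cstdio>

    // Back-of-the-envelope model of the argument above (my own simplification,
    // not anything rigorous): the time spent limited by X halves when X's key
    // metric is doubled, and everything else stays the same.
    int main()
    {
        const double frac_limited_before = 0.10;  // X is the limiter 10% of the time
        const double frac_limited_after  = 0.05;  // ...and 5% after doubling X
        const double cost_increase       = 0.05;  // the whole system gets 5% pricier

        const double time_before = 1.0;
        const double time_after  = (1.0 - frac_limited_before) + frac_limited_after;

        const double speedup       = time_before / time_after - 1.0;           // ~5.3%
        const double perf_per_cost = (1.0 + speedup) / (1.0 + cost_increase);  // ~1.0

        printf("speedup: %.1f%%, perf per unit cost vs. before: %.3f\n",
               speedup * 100.0, perf_per_cost);
        return 0;
    }

Perf per unit cost landing near 1.0 is what "quite close to proper balance" means at the margin: the beef-up roughly pays for itself, no more, no less.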

Things that are cheap to beef up are, as a result, rarely the limiter in well-designed systems, while things that are expensive to beef up often are. What is and isn't expensive to do depends heavily on where the current design sits in the design space and where the technology is, all of which changes over time.

FP32 was cheap to double up in Ampere, so they doubled it, even though that provided only a relatively small performance improvement. But as a result, FP32 is now very rarely a limiter (in Ampere and Ada). That doesn't automatically mean these designs are "gimped" in DRAM bandwidth or anything of the kind. Rather, the whole notion that a good GPU design just has to be ALU limited all the time is a mistaken perception, just like "it's either ALU limited or DRAM bandwidth limited by definition" is simply untrue. See "occupancy limited" for a prime example.
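
On the occupancy point: the CUDA runtime can report how many blocks of a given kernel fit per SM, which is a quick first check for whether a kernel is occupancy limited before either the ALUs or DRAM bandwidth even enter the picture. A sketch, with a made-up register-heavy kernel:

    #include <cstdio>
    #include <cuda_runtime.h>

    // A deliberately register-hungry toy kernel, purely for illustration.
    __global__ void heavy_kernel(float* out, const float* in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc[32];                            // lots of live state per thread
        for (int k = 0; k < 32; ++k) acc[k] = in[(i + k) % n];
        float s = 0.f;
        for (int k = 0; k < 32; ++k) s += acc[k] * acc[k];
        out[i] = s;
    }

    int main()
    {
        const int block_size = 256;

        int max_blocks_per_sm = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &max_blocks_per_sm, heavy_kernel, block_size, 0 /* dynamic smem */);

        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, 0);

        const int resident_warps = max_blocks_per_sm * block_size / prop.warpSize;
        const int max_warps      = prop.maxThreadsPerMultiProcessor / prop.warpSize;

        printf("blocks/SM: %d, warps/SM: %d of %d (%.0f%% occupancy)\n",
               max_blocks_per_sm, resident_warps, max_warps,
               100.0 * resident_warps / max_warps);
        return 0;
    }

If that number comes out low and the kernel is stalling on memory, the fix is on the programmer's side (fewer registers, different block sizes, more blocks in flight), which is exactly the kind of limiter the "ALU or DRAM bandwidth" dichotomy has no room for.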



