
NVIDIA are really keen for you to understand their hardware, to the extent that they will give you insanely detailed tutorials on things like avoiding shared memory bank conflicts. It's rare to find so much detail from CPU vendors, in my experience. Perhaps because CPUs, with their out-of-order execution and branch prediction, are just much, much harder to predict and understand than the comparatively simple in-order GPU cores.
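Those tutorials boil down to patterns like the following (a minimal sketch in the spirit of NVIDIA's well-known transpose example, assuming a 32x32 thread block and n a multiple of 32; the details here are mine, not the tutorial's): a single padding column keeps column reads spread across all 32 shared memory banks.

    // Launch with dim3 block(32, 32). Without the +1 padding, reading a
    // column of the tile means all 32 threads of a warp hit the same
    // 32-bit bank -- a 32-way conflict, serialized into 32 transactions.
    __global__ void transpose32(const float* in, float* out, int n) {
        __shared__ float tile[32][32 + 1];   // +1 column skews the banks

        int x = blockIdx.x * 32 + threadIdx.x;
        int y = blockIdx.y * 32 + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();

        // Swapped block indices plus a transposed tile read keep the
        // global accesses coalesced and the shared accesses conflict-free.
        x = blockIdx.y * 32 + threadIdx.x;
        y = blockIdx.x * 32 + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }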


At least for x86, there's an incredible wealth of architectural details out there, both from the vendors themselves and from people who have worked tirelessly to characterize them.

Along the lines of another comment on this post, part of the problem is that the GPU compute model is a lot more abstract than what is presented for the CPU.

That abstraction is really helpful for being able to simply write parallel code. But it also hides the tremendous differences in performance that are possible between functionally equivalent kernels.
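To make that concrete, here is a hedged sketch (the names and the stride parameter are mine, purely illustrative): two kernels the programming model treats identically, where the strided one can be an order of magnitude slower purely because of how each warp touches memory.

    // Adjacent threads read adjacent addresses: one coalesced
    // transaction per warp.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Same abstraction, same result, but each warp now touches up to
    // 32 separate cache lines per load.
    __global__ void copy_strided(const float* in, float* out,
                                 int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }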


Don't GPUs also have out of order execution and instruction level parallelism?

I think the reason Nvidia publishes these resources is that the GPUs are worth nothing if people can't get a reasonable fraction of the advertisable FLOPs with reasonable effort. CUDA wouldn't have taken off if it were harder than it absolutely needs to be.


> Don't GPUs also have out of order execution and instruction level parallelism?

Not any contemporary mainstream GPU I am aware of. Sure, the way these GPUs are marketed does sound like they have superscalar execution, but if you dig a bit deeper this is either about interleaving execution of many programs (similar to SMT) or a SIMD-within-SIMD. Two examples:

1. Nvidia claims they have simultaneous execution of FP and INT operations. What this actually means is that they can schedule an FP and an INT operation simultaneously, but they have to come from different programs. What this actually actually means is that they only schedule one instruction per clock, but it takes two clocks to actually issue, so it kind of looks like issuing two instructions per clock if you squint hard enough. The trick is that their ALUs are 16-wide, but they pretend to be 32-wide. I hope this makes sense. (There is a code sketch after these two examples.)

2. AMD claims they have superscalar execution, but what they really have is a packed instruction that can do two operations on a limited selection of arguments. Because of those limitations the compiler is not always able to emit the packed form, which is why RDNA3's performance improvements are much more modest than the headline throughput numbers suggest, even on compute-dense code.
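To illustrate point 1, a minimal CUDA sketch (my own, purely illustrative): per-thread work that mixes integer address arithmetic with FP math, so the scheduler has independent INT and FP instructions, from different warps, to interleave.

    // Assumes n is a power of two (the mask is just a cheap way to keep
    // the gather index in range). The integer pipe gets the index math,
    // the FP pipe gets the fused multiply-add; how "simultaneous" the
    // issue really is, is the marketing question discussed above.
    __global__ void saxpy_gather(const float* x, const int* idx,
                                 float* y, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int j = idx[i] & (n - 1);   // integer pipe
            y[i] = a * x[j] + y[i];     // FP pipe (FFMA)
        }
    }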


Thanks for the clarification. I appreciate it.


My understanding was that GPU instruction level parallelism is quite limited compared to CPUs (since multiple "threads" are running on each hardware core) and I wasn't aware that GPUs had any meaningful OOO execution.

If I'm wrong, I'd be happy to learn more.


This is arguable, depending on your definition of ILP. CPUs try to extract parallelism from a single instruction stream in order to execute many instructions from the same stream in parallel, which is very costly in silicon area per instruction executed. GPUs don't need to do this, because the programs that run on them are "embarrassingly parallel": they have lots of available parallelism and explicitly tell the GPU where it is. So GPUs execute many more instructions in parallel than CPUs, but they don't usually do any work to try to find implicit parallelism.
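A minimal sketch of the contrast (my example, not the parent's): the same elementwise loop on both sides. The CPU core has to discover at runtime that the iterations are independent; the CUDA grid states it outright.

    // CPU: implicit parallelism. The out-of-order core must discover,
    // at runtime, that iterations don't depend on each other.
    void scale_cpu(const float* in, float* out, float a, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = a * in[i];
    }

    // GPU: explicit parallelism. Thread i owns element i; no hardware
    // effort is spent finding the independence.
    __global__ void scale_gpu(const float* in, float* out, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a * in[i];
    }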


Don't know if NVIDIA is keen to understand their own hardware, but they are obviously very interested in users understanding it. I originally started with x86 assembly and stopped after the i860, which as far as I remember was the first Intel processor with branch prediction. Branch prediction is a nightmare for control freaks, especially on CISC processors with variable clock cycles.

GPU programming with CUDA and PTX feels like programming a single-core CPU with deterministic behavior and no tasks or threads, only in a multidimensional space. And every hour spent avoiding an 'if' pays off in less synchronization and therefore more speed.
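What avoiding an 'if' looks like in practice (a hedged sketch; on code this small the compiler often predicates the branch on its own, but the contrast is the point):

    // Divergent: when threads of a warp disagree on the condition, both
    // sides execute serially, with the losing threads masked off.
    __global__ void clamp_branchy(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (x[i] < 0.0f) x[i] = 0.0f;
            else             x[i] = x[i] * 2.0f;
        }
    }

    // Branchless: a select the compiler can lower to a predicated
    // instruction -- no divergence, no re-convergence point.
    __global__ void clamp_select(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = (x[i] < 0.0f) ? 0.0f : x[i] * 2.0f;
    }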



