The memory wall (also known as the von Neumann bottleneck) is still very much with us. Token generation on Nvidia GPUs is memory bound unless you run very large batch sizes, at which point it becomes compute bound.
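A rough way to see where that crossover sits: decode only becomes compute bound once each weight loaded from HBM is reused across enough batch rows. A minimal sketch, using approximate public H100 SXM figures (my assumptions, not measurements):

    # Rough estimate of the batch size where decode flips from memory bound
    # to compute bound. Specs are approximate H100 SXM numbers.
    peak_flops = 990e12        # ~990 TFLOPS dense BF16 tensor throughput
    bandwidth  = 3.35e12       # ~3.35 TB/s HBM3

    # Per weight and per step: 2 bytes loaded once, 2*B FLOPs (one MAC per batch row).
    # Memory time = 2/bandwidth, compute time = 2*B/peak_flops; they match when:
    critical_batch = peak_flops / bandwidth
    print(f"crossover around batch size ~{critical_batch:.0f}")   # ~300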
That said, more exotic architectures from Cerebras and Groq get far lower tokens-per-second performance than their memory bandwidth suggests they should, so their bottleneck lies elsewhere.
GPUs end up memory bound because they pack so much more compute than memory bandwidth. The H100 has 144 SMs, each issuing 4x32 threads per clock, which works out to 18,432 threads demanding memory.
Now to be fair, those SMs are grouped into 8 clusters, which I assume each hang off their own slice of memory, so it is really more like 2,304 threads (18,432 / 8) sharing each slice of bandwidth. But that is still far more compute than any single processing element could hope to keep fed. You can drown any individual processor in memory bandwidth these days, unless you somehow produce a processor clocked at multiple THz.
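To put a number on "drowning", here is a back-of-the-envelope sketch of how many bytes per clock each of those threads actually gets. The specs are approximate H100 SXM figures I am assuming, not numbers from the parent comment:

    # Bytes of HBM bandwidth available per thread per clock, roughly.
    bandwidth_bytes_s = 3.35e12    # ~3.35 TB/s HBM3
    clock_hz          = 1.8e9      # ~1.8 GHz boost clock
    threads           = 144 * 4 * 32

    bytes_per_thread_per_clock = bandwidth_bytes_s / clock_hz / threads
    print(f"~{bytes_per_thread_per_clock:.2f} bytes per thread per clock")
    # ~0.1 bytes, while a single fp32 FMA wants ~8-12 bytes of fresh operands.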
The problem does not seem to be memory bandwidth, but cost, latency, and finding the cost-efficient compute-bandwidth tradeoff for a given task.
You can predict the token generation rate of a GPU or CPU by dividing its memory bandwidth by the byte size of the active parameters. That is a memory bandwidth bottleneck by definition; I have no idea why you think it is not.
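For concreteness, here is that estimate written out with illustrative numbers (an H100-class ~3.35 TB/s part and a hypothetical 70B dense model in bf16, both my picks):

    # Batch-1 decode ceiling: every token must stream all active weights
    # through memory at least once, so bandwidth / weight bytes bounds tokens/s.
    bandwidth     = 3.35e12   # bytes/s, roughly H100 HBM3
    active_params = 70e9      # dense 70B model: all parameters are active
    bytes_per_w   = 2         # bf16 weights

    tokens_per_s = bandwidth / (active_params * bytes_per_w)
    print(f"upper bound: ~{tokens_per_s:.0f} tokens/s at batch 1")   # ~24

Well-tuned inference stacks land close to, but under, that ceiling, which is exactly what a bandwidth bottleneck looks like.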
Anyone who has worked on inference code knows that memory bandwidth is the principal bottleneck for token generation. For example:
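You can see the same thing on a plain CPU without any GPU at all: a batch-1 decode step is essentially a big matrix-vector product, and its throughput tracks DRAM bandwidth rather than peak FLOPS. A minimal sketch (sizes are arbitrary, not from any real model):

    import time
    import numpy as np

    d = 8192
    W = np.random.rand(d, d).astype(np.float32)   # 256 MB of fake "weights"
    x = np.random.rand(d).astype(np.float32)

    W @ x                                         # warm up
    t0 = time.perf_counter()
    reps = 10
    for _ in range(reps):
        y = W @ x
    dt = (time.perf_counter() - t0) / reps

    print(f"{dt*1e3:.1f} ms per matvec, ~{W.nbytes / dt / 1e9:.0f} GB/s effective")
    # The GB/s figure sits near DRAM bandwidth; the FLOP rate (2*d*d/dt) is a
    # small fraction of what the cores could do on cache-resident data.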