The memory wall (also known as the von Neumann bottleneck) is still very much with us. Token generation on Nvidia GPUs is memory bound unless you run very large batch sizes, at which point it becomes compute bound.
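A rough way to see where that crossover sits: decode only becomes compute bound once each weight loaded from HBM is reused across enough batch rows. A minimal sketch, using approximate public H100 SXM figures (my assumptions, not measurements):

    # Rough estimate of the batch size where decode flips from memory bound
    # to compute bound. Specs are approximate H100 SXM numbers.
    peak_flops = 990e12        # ~990 TFLOPS dense BF16 tensor throughput
    bandwidth  = 3.35e12       # ~3.35 TB/s HBM3

    # Per weight and per step: 2 bytes loaded once, 2*B FLOPs (one MAC per batch row).
    # Memory time = 2/bandwidth, compute time = 2*B/peak_flops; they match when:
    critical_batch = peak_flops / bandwidth
    print(f"crossover around batch size ~{critical_batch:.0f}")   # ~300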
That said, more exotic architectures from Cerebras and Groq get far lower tokens-per-second performance than their memory bandwidth suggests they should, so their bottleneck lies elsewhere.
GPUs end up memory bound because they pack so much more compute than memory bandwidth. The H100 has 144 SMs, each issuing 4x32 threads per clock, which works out to 18,432 threads demanding memory.
Now to be fair, those SMs are grouped into 8 clusters, which I assume each hang off their own slice of memory, so it is really more like 2,304 threads (18,432 / 8) sharing each slice of bandwidth. But that is still far more compute than any single processing element could hope to keep fed. You can drown any individual processor in memory bandwidth these days, unless you somehow produce a processor clocked at multiple THz.
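To put a number on "drowning", here is a back-of-the-envelope sketch of how many bytes per clock each of those threads actually gets. The specs are approximate H100 SXM figures I am assuming, not numbers from the parent comment:

    # Bytes of HBM bandwidth available per thread per clock, roughly.
    bandwidth_bytes_s = 3.35e12    # ~3.35 TB/s HBM3
    clock_hz          = 1.8e9      # ~1.8 GHz boost clock
    threads           = 144 * 4 * 32

    bytes_per_thread_per_clock = bandwidth_bytes_s / clock_hz / threads
    print(f"~{bytes_per_thread_per_clock:.2f} bytes per thread per clock")
    # ~0.1 bytes, while a single fp32 FMA wants ~8-12 bytes of fresh operands.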
The problem does not seem to be memory bandwidth, but cost, latency, and finding the cost-efficient compute-bandwidth tradeoff for a given task.
You can predict the token generation rate of a GPU or CPU by dividing its memory bandwidth by the byte size of the active parameters. That is a memory bandwidth bottleneck by definition; I have no idea why you think it is not.
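For concreteness, here is that estimate written out with illustrative numbers (an H100-class ~3.35 TB/s part and a hypothetical 70B dense model in bf16, both my picks):

    # Batch-1 decode ceiling: every token must stream all active weights
    # through memory at least once, so bandwidth / weight bytes bounds tokens/s.
    bandwidth     = 3.35e12   # bytes/s, roughly H100 HBM3
    active_params = 70e9      # dense 70B model: all parameters are active
    bytes_per_w   = 2         # bf16 weights

    tokens_per_s = bandwidth / (active_params * bytes_per_w)
    print(f"upper bound: ~{tokens_per_s:.0f} tokens/s at batch 1")   # ~24

Well-tuned inference stacks land close to, but under, that ceiling, which is exactly what a bandwidth bottleneck looks like.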
Anyone who has worked on inference code knows that memory bandwidth is the principal bottleneck for token generation. For example:
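You can see the same thing on a plain CPU without any GPU at all: a batch-1 decode step is essentially a big matrix-vector product, and its throughput tracks DRAM bandwidth rather than peak FLOPS. A minimal sketch (sizes are arbitrary, not from any real model):

    import time
    import numpy as np

    d = 8192
    W = np.random.rand(d, d).astype(np.float32)   # 256 MB of fake "weights"
    x = np.random.rand(d).astype(np.float32)

    W @ x                                         # warm up
    t0 = time.perf_counter()
    reps = 10
    for _ in range(reps):
        y = W @ x
    dt = (time.perf_counter() - t0) / reps

    print(f"{dt*1e3:.1f} ms per matvec, ~{W.nbytes / dt / 1e9:.0f} GB/s effective")
    # The GB/s figure sits near DRAM bandwidth; the FLOP rate (2*d*d/dt) is a
    # small fraction of what the cores could do on cache-resident data.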