This was beautifully illustrated in the recent Phoronix 5090 LLM benchmark[1], which I noted here[2]. Across the tested GPUs, generated tokens/s scaled almost perfectly linearly with memory bandwidth in GB/s. The only exception was the 5090, which fell slightly below the line.

My guess is that the 5090 was either just starting to become compute-limited as well, or hitting some other overhead.
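
For anyone who wants the reasoning spelled out, here is a rough sketch of the back-of-the-envelope model I have in mind. It assumes that generating one token means streaming all the model weights from VRAM once, so tokens/s is capped at bandwidth divided by model size, and that some separate ceiling (compute or overhead) can kick in above that. The function names and every number below are made up for illustration; none of them come from the Phoronix results.

```python
def bandwidth_bound_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token requires reading all weights once."""
    return bandwidth_gb_s / model_size_gb


def estimated_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float,
                           other_limit_tokens_s: float) -> float:
    """Roofline-style estimate: the lower of the bandwidth bound and some other ceiling."""
    return min(bandwidth_bound_tokens_per_s(bandwidth_gb_s, model_size_gb),
               other_limit_tokens_s)


if __name__ == "__main__":
    MODEL_SIZE_GB = 5.0      # hypothetical quantized model size
    OTHER_LIMIT = 350.0      # hypothetical compute/overhead ceiling in tokens/s

    # Hypothetical cards with increasing memory bandwidth (GB/s).
    for bandwidth in (500.0, 1000.0, 2000.0):
        linear = bandwidth_bound_tokens_per_s(bandwidth, MODEL_SIZE_GB)
        capped = estimated_tokens_per_s(bandwidth, MODEL_SIZE_GB, OTHER_LIMIT)
        print(f"{bandwidth:6.0f} GB/s: bandwidth bound {linear:5.0f} tok/s, estimate {capped:5.0f} tok/s")
```

With these invented numbers the first two cards sit exactly on the bandwidth line, while the fastest one falls below it because the other ceiling takes over, which is the kind of dip I mean.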

[1]: https://www.phoronix.com/review/nvidia-rtx5090-llama-cpp

[2]: https://news.ycombinator.com/item?id=42847284