This was beautifully illustrated in the recent Phoronix 5090 LLM benchmark[1], which I noted here[2]. The tested GPUs showed an almost perfectly linear relationship between generated tokens/s and memory bandwidth in GB/s, except for the 5090, where it dipped slightly.
I guess the 5090 either started to become ever so slightly compute-limited as well, or hit some overhead limit.
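
For a rough feel of why the relationship would be linear, here is a back-of-the-envelope sketch. It assumes every generated token has to stream all model weights from VRAM once, so bandwidth divided by model size gives a ceiling on tokens/s. The 5 GB model size is a made-up placeholder, not a figure from the benchmark, and the bandwidth numbers are spec-sheet values rather than measured ones.

```python
# Rough ceiling for bandwidth-bound token generation.
# Assumption: each generated token streams all model weights from VRAM once.

def max_tokens_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on tokens/s if memory bandwidth is the only limit."""
    return bandwidth_gb_s / model_size_gb

# Hypothetical quantized model size in GB (placeholder, not from the benchmark).
MODEL_SIZE_GB = 5.0

# Spec-sheet memory bandwidth in GB/s.
gpus = {"RTX 3090": 936.0, "RTX 4090": 1008.0, "RTX 5090": 1792.0}

for name, bandwidth in gpus.items():
    ceiling = max_tokens_per_s(bandwidth, MODEL_SIZE_GB)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

Real numbers will land below this ceiling, and a card that falls further below it than the others is probably running into some other limit, such as compute or per-token overhead.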
[1]: https://www.phoronix.com/review/nvidia-rtx5090-llama-cpp
[2]: https://news.ycombinator.com/item?id=42847284