In March, vLLM picked up some of the improvements from the DeepSeek paper. With these, vLLM v0.7.3's DeepSeek performance jumped to roughly 3x what it was before [1].
What's exciting is that there's still so much room for improvement. With vLLM under high concurrency, we benchmark around 5K total tokens/s on the ShareGPT dataset and 12K total tokens/s on the random dataset with 2000 input / 100 output tokens per request.
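For context, here's roughly what that "random 2000/100" style measurement looks like with vLLM's offline Python API. This is a minimal sketch, assuming a DeepSeek-R1 deployment spread across 8 GPUs; the model id, batch size, and prompt construction are illustrative, not the exact harness behind the numbers above:

    # Minimal offline throughput sketch with vLLM (illustrative, not the exact harness).
    import time
    from vllm import LLM, SamplingParams

    # Assumed deployment: DeepSeek-R1 sharded across 8 GPUs on one node.
    llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8)

    # Roughly 2000 input tokens and 100 output tokens per request, with a large
    # batch submitted at once so the engine runs at high concurrency.
    prompts = ["hello " * 2000 for _ in range(256)]
    params = SamplingParams(max_tokens=100, ignore_eos=True)

    start = time.time()
    outputs = llm.generate(prompts, params)
    elapsed = time.time() - start

    in_tok = sum(len(o.prompt_token_ids) for o in outputs)
    out_tok = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"total throughput: {(in_tok + out_tok) / elapsed:.0f} tokens/s")

Total throughput here counts both prefill (input) and decode (output) tokens, which is what the "total tokens/s" figures above refer to.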
The DeepSeek-V3/R1 Inference System Overview [2] reports: "Each H800 node delivers an average throughput of 73.7k tokens/s input (including cache hits) during prefilling or 14.8k tokens/s output during decoding."
Yes, DeepSeek deploys a different inference architecture. But this goes to show just how much room there is for improvement. Looking forward to more open source!
[1] https://developers.redhat.com/articles/2025/03/19/how-we-opt...
[2] https://github.com/deepseek-ai/open-infra-index/blob/main/20...