These costs don't line up with my own experiments using vLLM on EKS to host small to medium-sized models. For small models (under 10B parameters) on g5 instances, with prefix caching enabled and an agent-style workload of one or a few turns per request, I saw on the order of tens of thousands of tokens/second of prefill (thanks to shared system prompts hitting the cache) and around 900 tokens/second of output.
I think this worked out to around $1 per million output tokens and orders of magnitude less for input tokens, and that's before considering reserved instances or other providers.
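To make the arithmetic concrete, here's a rough back-of-envelope sketch of how per-token cost falls out of throughput and instance price. The instance size and hourly rate are my assumptions (the comment doesn't say which g5 size was used); the throughput figures are the ones quoted above.

```python
# Rough cost-per-token sketch. Hourly price is an assumption (g5 pricing
# varies by size/region); throughput numbers are the ones quoted above.
INSTANCE_PRICE_PER_HOUR = 1.21    # assumed: roughly a g5.2xlarge on-demand rate, USD
OUTPUT_TOKENS_PER_SEC = 900       # decode throughput quoted above
PREFILL_TOKENS_PER_SEC = 20_000   # "tens of thousands" of prefill tokens/sec

def cost_per_million_tokens(tokens_per_sec: float, price_per_hour: float) -> float:
    """Dollars per 1M tokens, given sustained throughput and instance price."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

print(f"output: ${cost_per_million_tokens(OUTPUT_TOKENS_PER_SEC, INSTANCE_PRICE_PER_HOUR):.2f} / 1M tokens")
print(f"input:  ${cost_per_million_tokens(PREFILL_TOKENS_PER_SEC, INSTANCE_PRICE_PER_HOUR):.3f} / 1M tokens")
```

With these assumed numbers the output side comes out in the tens of cents to around a dollar per million tokens depending on instance size, and input tokens are a couple of orders of magnitude cheaper, which is consistent with the estimate above.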
Interesting. I think how the model is served makes a big difference, and I plan to re-run this experiment with different models and different serving setups.