These costs don't line up with my own experiments using vLLM on EKS to host small to medium-sized models. For small models (under 10B parameters) on g5 instances, with prefix caching enabled and an agent-style workload of one or a few turns per request, I saw on the order of tens of thousands of tokens/second of prefill (thanks to shared system prompts hitting the cache) and around 900 tokens/second of output.
I think this worked out to around $1 per million output tokens and orders of magnitude less for input tokens, and that's before considering reserved instances or other providers.
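To make the arithmetic concrete, here's a rough back-of-envelope sketch of how per-token cost falls out of throughput and instance price. The instance size and hourly rate are my assumptions (the comment doesn't say which g5 size was used); the throughput figures are the ones quoted above.

```python
# Rough cost-per-token sketch. Hourly price is an assumption (g5 pricing
# varies by size/region); throughput numbers are the ones quoted above.
INSTANCE_PRICE_PER_HOUR = 1.21    # assumed: roughly a g5.2xlarge on-demand rate, USD
OUTPUT_TOKENS_PER_SEC = 900       # decode throughput quoted above
PREFILL_TOKENS_PER_SEC = 20_000   # "tens of thousands" of prefill tokens/sec

def cost_per_million_tokens(tokens_per_sec: float, price_per_hour: float) -> float:
    """Dollars per 1M tokens, given sustained throughput and instance price."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

print(f"output: ${cost_per_million_tokens(OUTPUT_TOKENS_PER_SEC, INSTANCE_PRICE_PER_HOUR):.2f} / 1M tokens")
print(f"input:  ${cost_per_million_tokens(PREFILL_TOKENS_PER_SEC, INSTANCE_PRICE_PER_HOUR):.3f} / 1M tokens")
```

With these assumed numbers the output side comes out in the tens of cents to around a dollar per million tokens depending on instance size, and input tokens are a couple of orders of magnitude cheaper, which is consistent with the estimate above.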
Interesting. I think how the model is served makes a big difference, and I plan to re-run this experiment with different models and different serving setups.