With that said, this seems quite obvious: the type of customer that chooses Fly seems like the last person to be spinning up dedicated GPU servers for extended periods of time. It's much more likely they'll use something serverless, which requires a ton of DX work to get right (personally I think Modal is killing it here). To compete, Fly would have needed to bet the company on it; it's way too competitive otherwise.
As someone who deploys a lot of models on rented GPU hardware, their pricing is not realistic for continuous usage.
They're charging hyperscaler rates, and anyone willing to pay that much won't go with Fly.
For serverless usage they're only mildly overpriced compared to, say, Runpod, but I don't think of serverless as anything more than an onramp to renting a dedicated machine, so it's not surprising to hear it's not taking off.
GPU workloads tend to have terrible cold-start performance by their nature, and without a lot of application-specific optimizations it rarely makes financial sense to pass up a cheaper continuous option if you have an even mildly consistent workload. (And if you don't, then you're not generating that much money for them anyway.)
My thing here is just: people self-hosting LLMs think about performance in tokens/sec, and we think about performance in terms of ms/rtt; they're just completely different scales. We don't really have a comparative advantage for developers who are comfortable with multisecond response times. And that's fine!
That reminds me of when Cloudflare launched their Workers GPU product: it was specifically aimed at running models, and the pricing was abstracted and based on model output. Did you look at what they were doing when building GPU machines?
As the sibling comment points out, usually cold starts are optimized on the order of milliseconds, so 20 seconds is a while for a user to be sitting around with nothing streamed.
And with the premium for per-second GPUs hovering around 2x that for hourly/monthly rentals, it gets even harder for products with scale to justify.
You'd want to have a lot of time where you're scaled to 0, but that in turn maps to a lot of cold starts.
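The trade-off above is easy to sanity-check with arithmetic. A back-of-the-envelope sketch, using made-up rates (not any provider's real prices) with the ~2x per-second premium mentioned earlier: serverless only wins if your GPU sits busy less than half the time.

```python
# Back-of-the-envelope break-even: at what utilization does per-second
# ("serverless") GPU billing beat a dedicated hourly rental?
# Rates below are illustrative assumptions, not real provider prices.

DEDICATED_RATE = 2.00   # $/hr, dedicated machine billed around the clock
SERVERLESS_RATE = 4.00  # $/hr-equivalent, per-second billing at ~2x premium

def monthly_cost_dedicated(hours: float = 730) -> float:
    # Dedicated: you pay for every hour, busy or idle.
    return DEDICATED_RATE * hours

def monthly_cost_serverless(utilization: float, hours: float = 730) -> float:
    # Serverless: you pay only for busy hours, at the premium rate.
    return SERVERLESS_RATE * utilization * hours

# Break-even: SERVERLESS_RATE * u == DEDICATED_RATE  →  u = 0.5
break_even_utilization = DEDICATED_RATE / SERVERLESS_RATE

for u in (0.25, 0.50, 0.75):
    print(f"utilization {u:.0%}: serverless ${monthly_cost_serverless(u):.0f}"
          f" vs dedicated ${monthly_cost_dedicated():.0f}")
```

At 25% utilization serverless is cheaper, at 75% the dedicated box wins, and every cold start pushes effective utilization (and latency) in the wrong direction at once.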