With that said, this seems quite obvious: the type of customer that chooses Fly seems like the last person to be spinning up dedicated GPU servers for extended periods of time. It's much more likely they'll use something serverless, which requires a ton of DX work to get right (personally I think Modal is killing it here). To compete, Fly would have needed to bet the company on it; it's way too competitive otherwise.
As someone who deploys a lot of models on rented GPU hardware, their pricing is not realistic for continuous usage.
They're charging hyperscaler rates, and anyone willing to pay that much won't go with Fly.
For serverless usage they're only mildly overpriced compared to, say, Runpod, but I don't think of serverless as anything more than an onramp to renting a dedicated machine, so it's not surprising to hear it's not taking off.
GPU workloads tend to have terrible cold-start performance by their nature, and without a lot of application-specific optimizations it rarely makes financial sense to pass up a cheaper continuous option if you have an even mildly consistent workload. (And if you don't, then you're not generating that much money for them anyway.)
My thing here is just: people self-hosting LLMs think about performance in tokens/sec, and we think about performance in terms of ms/rtt; they're just completely different scales. We don't really have a comparative advantage for developers who are comfortable with multisecond response times. And that's fine!
That reminds me of when Cloudflare launched their Workers GPU product: it was specifically aimed at running models, and the pricing was abstracted and based on model output. Did you look at what they were doing when building GPU machines?
As the sibling comment points out, usually cold starts are optimized on the order of milliseconds, so 20 seconds is a while for a user to be sitting around with nothing streamed.
And with the premium for per-second GPUs hovering around 2x that for hourly/monthly rentals, it gets even harder for products with scale to justify.
You'd want to have a lot of time where you're scaled to 0, but that in turn maps to a lot of cold starts.
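The trade-off above is easy to sanity-check with arithmetic. A back-of-the-envelope sketch, using made-up rates (not any provider's real prices) with the ~2x per-second premium mentioned earlier: serverless only wins if your GPU sits busy less than half the time.

```python
# Back-of-the-envelope break-even: at what utilization does per-second
# ("serverless") GPU billing beat a dedicated hourly rental?
# Rates below are illustrative assumptions, not real provider prices.

DEDICATED_RATE = 2.00   # $/hr, dedicated machine billed around the clock
SERVERLESS_RATE = 4.00  # $/hr-equivalent, per-second billing at ~2x premium

def monthly_cost_dedicated(hours: float = 730) -> float:
    # Dedicated: you pay for every hour, busy or idle.
    return DEDICATED_RATE * hours

def monthly_cost_serverless(utilization: float, hours: float = 730) -> float:
    # Serverless: you pay only for busy hours, at the premium rate.
    return SERVERLESS_RATE * utilization * hours

# Break-even: SERVERLESS_RATE * u == DEDICATED_RATE  →  u = 0.5
break_even_utilization = DEDICATED_RATE / SERVERLESS_RATE

for u in (0.25, 0.50, 0.75):
    print(f"utilization {u:.0%}: serverless ${monthly_cost_serverless(u):.0f}"
          f" vs dedicated ${monthly_cost_dedicated():.0f}")
```

At 25% utilization serverless is cheaper, at 75% the dedicated box wins, and every cold start pushes effective utilization (and latency) in the wrong direction at once.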