This. IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input. These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user and perform hundreds of TB worth of computations per query.
How much would I have to charge for this? Are there any products where the users would actually get enough value out of it to pay what it costs?
Compare to the cost of a user session in a normal database-backed web app. Even if that session fans out thousands of backend RPCs across a hundred services, each of those calls executes in milliseconds and requires only a fraction of the LLM's RAM. So I can support thousands of concurrent users per node instead of one.
> IIUC to serve an LLM is to perform an O(n^2) computation on the model weights for every single character of user input.
The computations are not O(n^2) in terms of model weights (parameters), but linear. If it were quadratic, the number would be ludicrously large. Like, "it'll take thousands of years to process a single token" large.
(The classic transformers are quadratic on the context length, but that's a much smaller number. And it seems pretty obvious from the increases in context lengths that this is no longer the case in frontier models.)
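A back-of-the-envelope sketch of what "linear in parameters" means, with made-up but plausible dimensions for a model in the ~40 GB class (the ~2-FLOPs-per-parameter figure is the usual rule of thumb for a dense transformer forward pass):

    # Rough per-token compute for a dense transformer -- illustrative numbers only.
    n_params = 20e9    # ~20B parameters ~= 40 GB of weights at fp16
    n_layers = 48      # made-up but plausible depth for a model this size
    d_model  = 6144    # made-up but plausible hidden size
    ctx_len  = 8192    # tokens of context in play

    # Rule of thumb: ~2 FLOPs per parameter per generated token for the matmuls,
    # i.e. linear in the parameter count.
    matmul_flops = 2 * n_params                    # ~4e10 FLOPs/token

    # The quadratic term is attention over the context, not over the weights:
    # roughly 4 * n_layers * ctx_len * d_model FLOPs per generated token.
    attn_flops = 4 * n_layers * ctx_len * d_model  # ~1e10 FLOPs/token

    print(f"matmul: {matmul_flops:.1e} FLOPs/token, attention: {attn_flops:.1e} FLOPs/token")

Both terms come out in the tens of GFLOPs per token. A cost quadratic in the parameter count would instead scale like n_params^2, around ten orders of magnitude more.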
> These models are 40+GB so that means I need to provision about 40GB RAM per concurrent user
The parameters are static, not mutated during the query. That memory can be shared between the concurrent users. The non-shared per-query memory usage is vastly smaller.
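A rough sketch of that split, again with made-up but plausible dimensions (the actual per-user footprint depends on layer count, KV heads, context length, and cache precision):

    # Shared weight memory vs. per-user KV-cache memory -- illustrative numbers.
    weight_bytes = 40e9   # the ~40 GB of parameters, loaded once and shared

    n_layers     = 40     # made-up but plausible values for a model this size
    n_kv_heads   = 8
    head_dim     = 128
    ctx_len      = 8192   # tokens of context actually in use
    bytes_per_el = 2      # fp16/bf16 cache entries

    # K and V caches: 2 tensors per layer, each of shape [ctx_len, n_kv_heads * head_dim]
    kv_bytes_per_user = 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_el

    print(f"shared weights: {weight_bytes / 1e9:.0f} GB")
    print(f"per-user KV cache at full context: {kv_bytes_per_user / 1e9:.2f} GB")

That works out to roughly 1.3 GB per user at an 8K context, and proportionally less for shorter contexts. The 40 GB of weights is paid once per GPU, and serving stacks batch many users' tokens through that single copy rather than duplicating it per session.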
> How much would I have to charge for this?
Empirically, as little as 0.00001 cents per token.
For context, the Bing search API costs 2.5 cents per query.
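Taking both quoted figures at face value, a quick comparison for an illustrative 1,000-token response:

    # Compare the quoted per-token LLM price with the quoted Bing API price.
    llm_cents_per_token = 0.00001   # figure quoted above
    tokens_per_response = 1_000     # illustrative response length

    llm_cents_per_query  = llm_cents_per_token * tokens_per_response  # 0.01 cents
    bing_cents_per_query = 2.5                                        # figure quoted above

    print(f"LLM: {llm_cents_per_query} cents/query, Bing: {bing_cents_per_query} cents/query")
    print(f"ratio: {bing_cents_per_query / llm_cents_per_query:.0f}x")

At those prices a 1,000-token completion costs about 0.01 cents, roughly 250x less than a single Bing search call.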
Interesting. There's obviously been a precipitous drop in the sticker price, but has there really been a concomitant efficiency increase? It's hard to believe that the sticker price these companies are charging has anything to do with reality, given how massively they're subsidized (free Azure compute, billions upon billions in cash, etc.). Is this efficiency trend real? Do you know of any data demonstrating it?
I have personal anecdotal evidence that they're getting more efficient: I've had the same 64GB M2 laptop for three years now. Back in March 2023 it could just about run LLaMA 1, a rubbish model. Today I'm running Mistral Small 3 on the same hardware and it's giving me a March-2023-GPT-4-era experience and using just 12GB of RAM.
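That 12 GB figure lines up with straightforward quantization arithmetic: Mistral Small 3 is a ~24B-parameter model, and the sketch below (which ignores the KV cache and runtime overhead) shows what different weight precisions imply:

    # Why a ~24B-parameter model can fit in ~12 GB of RAM: weights vs. bits per weight.
    n_params = 24e9   # Mistral Small 3 parameter count, approximately

    for bits in (16, 8, 4):
        gb = n_params * bits / 8 / 1e9
        print(f"{bits:>2}-bit weights: ~{gb:.0f} GB")

    # 16-bit: ~48 GB, 8-bit: ~24 GB, 4-bit: ~12 GB -- a ~4-bit quantization
    # is what makes the 12 GB figure plausible on a 64 GB laptop.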
People who I trust in this space have consistently and credibly talked about these constant efficiency gains. I don't think this is a case of selling compute for less than it costs to run.
People are comparing the current rush to the investments made in the early days of the internet while (purposely?) forgetting how expensive access to it was back then. I'm not saying that AI companies should be turning a profit today, but I don't see or hear that AI usage is becoming essential in any way or form.
Yeah, that's the big problem. The Internet (e-commerce, specifically) is an obviously good idea. Technologies which facilitate it are profitable because they participate in an ecosystem which is self-sustaining. Brick-and-mortar businesses have something to gain by investing in an online presence. As far as I can tell, there's nothing similar with AI. The speech-to-text technology in my phone that I'm using right now to write this post is cool, but it's not a killer app in the same way that an online shopping cart is.