GP's point is that you can cache the model's state after it has processed the super long context but before it ingests your prompt.
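
A minimal sketch of what that looks like with Hugging Face transformers' past_key_values, assuming a local decoder-only model (this is just the mechanism, obviously not OpenAI's actual serving stack):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # 1. Run the long shared context once and keep the per-layer key/value tensors.
    context_ids = tok("<very long shared context>", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(context_ids, use_cache=True)
    kv_cache = out.past_key_values  # this is the state you would persist and reuse

    # 2. Later, run only the new user prompt on top of the cached state.
    prompt_ids = tok(" user question", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(prompt_ids, past_key_values=kv_cache, use_cache=True)
    # out.logits is conditioned on context + prompt, but only the prompt
    # tokens went through the network in step 2.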

If you're going to ask "then why doesn't OpenAI do it now": it takes a lot of storage (and IO), so it may not be worth it for shorter contexts; it adds significant complexity to the entire serving stack; and it doesn't fit how OpenAI originally imagined the "custom-ish" LLM serving game playing out - they bet on finetuning and dedicated instances instead of long context.

The tradeoff can be reflected in the API and pricing; LLM APIs don't have to look like OpenAI's. What if there were an endpoint to generate a "cache" of your context (really, a prefix of your prompt), billed as usual per token, after which you could reuse that prefix for a fixed price no matter how long it is?
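
Purely as a hypothetical sketch of what such an API could look like (the base URL, endpoints, and fields below are made up for illustration, not any real provider's API):

    import requests

    BASE = "https://api.example.com/v1"  # made-up endpoint

    # Step 1: pay per token once to build a cache of the long prefix.
    cache = requests.post(f"{BASE}/context_cache", json={
        "model": "example-model",
        "prefix": open("shared_context.txt").read(),
    }).json()  # e.g. {"cache_id": "ctx_abc123", "prefix_tokens": 180000}

    # Step 2: later calls reference the cache for a fixed price,
    # regardless of how long the cached prefix is.
    answer = requests.post(f"{BASE}/completions", json={
        "model": "example-model",
        "cache_id": cache["cache_id"],
        "prompt": "What does the contract say about termination?",
    }).json()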




Do you have examples of where this has been done? My understanding is that you can do things like cache the embeddings to avoid the tokenization/embedding cost, but you still need to do a forward pass through the model with the new user prompt and the cached context. That is where the naive O(N^2) complexity comes from, and that is the cost that cannot be avoided (because the whole point is to present the next user prompt to the model along with the cached context).
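
For reference, a back-of-envelope count of attention score computations under the two regimes being discussed, ignoring MLP blocks, layer counts, and constant factors (the numbers are illustrative only):

    N = 100_000  # tokens in the long shared context
    M = 500      # tokens in the new user prompt

    full_recompute = (N + M) ** 2   # re-running context + prompt from scratch
    with_kv_cache = M * (N + M)     # only the M new tokens attend over N+M positions

    print(f"full recompute: {full_recompute:.3e} score computations")
    print(f"with KV cache : {with_kv_cache:.3e} score computations")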



