
> Each new interaction requires processing ALL previous context

I was under the impression that some kind of caching mechanism existed to mitigate this



You have to compute attention between all pairs of tokens at each step, making the naive implementation O(N^3). This is optimized by caching the keys and values already computed for previous tokens (the KV cache), so that at each step you only need to compute attention between the new token and all previous ones. That's much better, but still O(N^2) to generate a sequence of N tokens.
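Roughly, a single-head, single-layer sketch of what that KV cache looks like (NumPy; the names and shapes are illustrative, not any particular framework's API):

    import numpy as np

    d = 64                      # head dimension (illustrative)
    Wq = np.random.randn(d, d)  # stand-ins for trained projection weights
    Wk = np.random.randn(d, d)
    Wv = np.random.randn(d, d)

    K_cache, V_cache = [], []   # keys/values of all previously processed tokens

    def decode_step(x_new):
        """One decode step: attend the new token to the cached context.
        This is O(N) per step, so generating N tokens is O(N^2) overall,
        versus recomputing all N^2 pairs at every step (O(N^3))."""
        q = x_new @ Wq
        K_cache.append(x_new @ Wk)   # only the new token's K and V are computed
        V_cache.append(x_new @ Wv)
        K = np.stack(K_cache)        # (N, d)
        V = np.stack(V_cache)
        scores = K @ q / np.sqrt(d)  # new token's attention over the whole context
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V           # attention output for the new token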


Caching only helps keep the context around; it doesn't change the fact that the model still ultimately has to read and process that cached context again.


You can cache the whole inference state, no?

They don't go into implementation details but Gemini docs say you get a 75% discount if there's a context-cache hit: https://cloud.google.com/vertex-ai/generative-ai/docs/contex...


That just avoids having to send the full context for follow-up requests, right? My understanding is that caching helps keep the context around but can't avoid the need to process that context over and over during inference.


The initial context processing is also cached, which is why there's a significant discount on the input token cost.
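Conceptually the provider keys the stored per-layer K/V tensors on the prompt prefix, something like this (a minimal sketch; prefill/decode and the store are made-up names, the real implementations aren't public):

    import hashlib

    kv_store = {}  # hash of prompt prefix -> per-layer K/V tensors from an earlier request

    def prefix_key(prompt_tokens):
        return hashlib.sha256(str(prompt_tokens).encode("utf-8")).hexdigest()

    def run_request(prompt_tokens, model):
        key = prefix_key(prompt_tokens)
        if key in kv_store:
            kv = kv_store[key]                 # hit: skip re-encoding the prefix (the discounted part)
        else:
            kv = model.prefill(prompt_tokens)  # miss: pay the full input-processing cost
            kv_store[key] = kv
        # decoding still attends over the whole cached prefix for every new token
        return model.decode(kv)

So a hit saves the prefill compute, which is where the input-token discount comes from, but it doesn't make decoding over a long context any cheaper.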


What exactly is cached though? Each token-generation step effectively takes in all the context plus all previously generated tokens, right? Are they somehow caching the previously computed state and using it more efficiently than if they just cached the context and ran it all through inference again?


When inference requires maxing out the memory of a gpu, where are you planning to keep this cache? Unless there is a way to compress the context into a more manageable snapshot, the cloud provider surely won’t keep a gpu idling just for holding a conversation in memory.
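Back-of-the-envelope for how big that per-conversation cache actually is, with assumed numbers for a hypothetical 70B-class model using grouped-query attention (none of these figures come from a real provider):

    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per value
    layers, kv_heads, head_dim = 80, 8, 128    # assumed model shape
    bytes_fp16 = 2
    per_token = 2 * layers * kv_heads * head_dim * bytes_fp16   # 327,680 B ~= 320 KiB
    context = 128_000
    print(per_token / 1024, "KiB/token,", per_token * context / 2**30, "GiB total")
    # ~320 KiB per token, ~39 GiB for a 128k-token context: too big to pin on an
    # idle GPU per user, but plausibly small enough to offload to host RAM or SSD
    # between turns and reload on a cache hit.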


Yes, prompt caching helps a lot with the cost. It still adds up if you have some tool outputs with long text. I have found that breaking those out into subtasks makes the overall cost much more reasonable.
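A toy cost model of why it still adds up even with the discount (prices and token counts are made-up assumptions, loosely in the shape of the 75% cache discount mentioned above):

    price_in = 3.00 / 1_000_000   # assumed $/input token
    discount = 0.75               # assumed discount on cache-hit tokens
    turns, tool_output = 20, 8_000

    uncached_cost = cached_cost = 0.0
    prev_context, context = 0, 10_000          # assumed initial prompt size
    for _ in range(turns):
        uncached_cost += context * price_in
        new_tokens = context - prev_context    # not yet in the cache this turn
        cached_cost += new_tokens * price_in + prev_context * price_in * (1 - discount)
        prev_context = context
        context += tool_output                 # long tool output carried forward each turn
    print(f"no caching: ${uncached_cost:.2f}, with caching: ${cached_cost:.2f}")
    # Caching cuts the bill a lot, but the context still grows every turn; moving
    # the long tool outputs into subtasks keeps `context` flat, which helps more.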


My understanding is that caching reduces computation, but the whole input is still processed. I don't think they're fully disclosing how their cache works.

LLMs degrade with long input regardless of caching.


Compact the conversation (CC)



