GP's point is that you can cache the model's state after it has processed the super long context but before it ingests your prompt.
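
A minimal sketch of what that looks like with Hugging Face transformers' past_key_values, assuming a local decoder-only model (this is just the mechanism, obviously not OpenAI's actual serving stack):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    # 1. Run the long shared context once and keep the per-layer key/value tensors.
    context_ids = tok("<very long shared context>", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(context_ids, use_cache=True)
    kv_cache = out.past_key_values  # this is the state you would persist and reuse

    # 2. Later, run only the new user prompt on top of the cached state.
    prompt_ids = tok(" user question", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(prompt_ids, past_key_values=kv_cache, use_cache=True)
    # out.logits is conditioned on context + prompt, but only the prompt
    # tokens went through the network in step 2.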

If you're going to ask "then why doesn't OpenAI do it now": it takes a lot of storage (and IO), so it may not be worth it for shorter contexts; it adds significant complexity to the entire serving stack; and it doesn't fit how OpenAI originally imagined the "custom-ish" LLM serving game playing out - they bet on finetuning and dedicated instances instead of long context.

The tradeoff can be reflected in the API and pricing; LLM APIs don't have to look like OpenAI's. What if there were an endpoint to generate a "cache" of your context (really, a prefix of your prompt), billed as usual per token, after which you could reuse that prefix for a fixed price no matter how long it is?
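
Purely as a hypothetical sketch of what such an API could look like (the base URL, endpoints, and fields below are made up for illustration, not any real provider's API):

    import requests

    BASE = "https://api.example.com/v1"  # made-up endpoint

    # Step 1: pay per token once to build a cache of the long prefix.
    cache = requests.post(f"{BASE}/context_cache", json={
        "model": "example-model",
        "prefix": open("shared_context.txt").read(),
    }).json()  # e.g. {"cache_id": "ctx_abc123", "prefix_tokens": 180000}

    # Step 2: later calls reference the cache for a fixed price,
    # regardless of how long the cached prefix is.
    answer = requests.post(f"{BASE}/completions", json={
        "model": "example-model",
        "cache_id": cache["cache_id"],
        "prompt": "What does the contract say about termination?",
    }).json()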




Do you have examples of where this has been done? My understanding is that you can do things like cache the embeddings to avoid the tokenization/embedding cost, but you still need to do a forward pass through the model with the new user prompt and the cached context. That is where the naive O(N^2) complexity comes from, and that is the cost that cannot be avoided (because the whole point is to present the next user prompt to the model along with the cached context).
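
For reference, a back-of-envelope count of attention score computations under the two regimes being discussed, ignoring MLP blocks, layer counts, and constant factors (the numbers are illustrative only):

    N = 100_000  # tokens in the long shared context
    M = 500      # tokens in the new user prompt

    full_recompute = (N + M) ** 2   # re-running context + prompt from scratch
    with_kv_cache = M * (N + M)     # only the M new tokens attend over N+M positions

    print(f"full recompute: {full_recompute:.3e} score computations")
    print(f"with KV cache : {with_kv_cache:.3e} score computations")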



