
Just from the abstract, this is primarily for batched inference, and batched inference on GPUs gives an order-of-magnitude speed increase, so it's probably not something that usually makes sense to do on CPUs…


Not only batching. It works by serving different requests together as long as they share a prefix.

This basically enables KV cache reuse when there is a matching prefix (from my shallow understanding of how the KV cache works).
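Rough sketch of what I mean, using Hugging Face transformers (model name is just a placeholder; a real serving stack would manage and share the cache per request rather than like this):

    # Minimal sketch of prefix KV-cache reuse, not the paper's implementation:
    # run the shared prefix once, keep its KV cache, and reuse it for a request
    # that starts with the same tokens.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prefix = "Please help me summarize the following document:\n"
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids

    # Compute the prefix once and keep its KV cache.
    with torch.no_grad():
        prefix_cache = model(prefix_ids, use_cache=True).past_key_values

    # Only the suffix tokens need to be computed now; the prefix KV entries are
    # reused. (Recent transformers versions extend the cache in place, so
    # serving a second suffix would need a copy of it.)
    suffix_ids = tokenizer("Document A says ...", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(suffix_ids, past_key_values=prefix_cache, use_cache=True)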

I fail to see how this helps for a locally deployed LLM, unless the odds that you ask the same question, or questions with the same prefix, are high (like always starting with "please help me ...")?


You also have fine-tuned models for specific tasks that may see very similar inputs for a variety of outputs. Think of an LLM trained to pull out specific types of information, no matter where it is stored within the file. E.g. "find the date of the shipment for product# 5432", and then you pass in 10k JSON documents with a similar shape.


Yeah, but I was under the impression that for the same prompt, implementations already share the KV cache. This area is so new that these obvious ideas might not be implemented as widely as I thought.


Maybe if you have a model with a large context window, you stuff a document in the prompt as a prefix, then ask a bunch of different questions about the document?
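For example, something like this (a sketch using vLLM's automatic prefix caching; the model name and the flag are illustrative, check your version):

    from vllm import LLM, SamplingParams

    # enable_prefix_caching lets requests that share a token prefix reuse the
    # KV cache built for that prefix instead of recomputing it.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # example model
              enable_prefix_caching=True)

    document = open("report.txt").read()  # long document used as the shared prefix
    questions = [
        "What is the main conclusion?",
        "Who are the authors?",
        "Summarize section 2 in one sentence.",
    ]
    prompts = [f"{document}\n\nQuestion: {q}\nAnswer:" for q in questions]

    # The document tokens are processed once; each later prompt only pays for
    # its own question and answer tokens.
    outputs = llm.generate(prompts, SamplingParams(max_tokens=128, temperature=0.0))
    for out in outputs:
        print(out.outputs[0].text)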


That would be pretty useful. I'm working on getting ChatGPT to classify a dataset. So basically I use the same big prompt for a bunch of different small texts and ask ChatGPT to generate the class label. Something like initializing the prompt state sounds good: basically trading more memory usage for less processing time. Who knows, maybe OpenAI is doing such an optimization on their side.
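The part I can control is keeping the big shared prompt identical and first in every request, so that any prefix/prompt caching on the provider's side could actually kick in. Roughly (model name and prompt are illustrative):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The big shared prompt goes first and never changes between requests;
    # only the small text to classify varies.
    big_prompt = (
        "You are a classifier. Answer with exactly one of: "
        "POSITIVE, NEGATIVE, NEUTRAL.\n"
        "... long list of labeling rules and examples ..."
    )

    texts = ["the product broke after a week", "works exactly as described"]

    for text in texts:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # example model
            messages=[
                {"role": "system", "content": big_prompt},  # identical prefix
                {"role": "user", "content": text},          # only this varies
            ],
            max_tokens=5,
            temperature=0,
        )
        print(text, "->", resp.choices[0].message.content)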


I might be wrong, but it looks like this could also help with speculative decoding, which already vastly improves inference speed?



