
Just from the abstract, this is primarily for batched inference, and batched inference on GPUs gives an order-of-magnitude speed increase, so it's probably not something that usually makes sense to do on CPUs…


Not only batching. It works by serving different requests together as long as they share a prefix.

This basically enables KV cache reuse when there is a matching prefix (from my shallow understanding of how the KV cache works).
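Rough sketch of what I mean, using Hugging Face transformers (model name is just a placeholder; a real serving stack would manage and share the cache per request rather than like this):

    # Minimal sketch of prefix KV-cache reuse, not the paper's implementation:
    # run the shared prefix once, keep its KV cache, and reuse it for a request
    # that starts with the same tokens.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prefix = "Please help me summarize the following document:\n"
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids

    # Compute the prefix once and keep its KV cache.
    with torch.no_grad():
        prefix_cache = model(prefix_ids, use_cache=True).past_key_values

    # Only the suffix tokens need to be computed now; the prefix KV entries are
    # reused. (Recent transformers versions extend the cache in place, so
    # serving a second suffix would need a copy of it.)
    suffix_ids = tokenizer("Document A says ...", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(suffix_ids, past_key_values=prefix_cache, use_cache=True)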

I fail to see how this helps for a locally deployed LLM, unless the odds that you ask the same question, or questions with the same prefix, are high (like always starting with "please help me ...")?


You also have fine-tuned models for specific tasks that may see very similar inputs for a variety of outputs. Think of an LLM trained to pull out specific types of information, no matter where it is stored within the file. E.g. "find the date of the shipment for product# 5432", and then you pass in 10k JSON documents with a similar shape.


Yeah, but I was under the impression that for the same prompt, implementations already share the KV cache. This area is so new that these obvious ideas might not be implemented as widely as I thought.


Maybe if you have a model with a large context window, you stuff a document in the prompt as a prefix, then ask a bunch of different questions about the document?
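For example, something like this (a sketch using vLLM's automatic prefix caching; the model name and the flag are illustrative, check your version):

    from vllm import LLM, SamplingParams

    # enable_prefix_caching lets requests that share a token prefix reuse the
    # KV cache built for that prefix instead of recomputing it.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # example model
              enable_prefix_caching=True)

    document = open("report.txt").read()  # long document used as the shared prefix
    questions = [
        "What is the main conclusion?",
        "Who are the authors?",
        "Summarize section 2 in one sentence.",
    ]
    prompts = [f"{document}\n\nQuestion: {q}\nAnswer:" for q in questions]

    # The document tokens are processed once; each later prompt only pays for
    # its own question and answer tokens.
    outputs = llm.generate(prompts, SamplingParams(max_tokens=128, temperature=0.0))
    for out in outputs:
        print(out.outputs[0].text)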


That would be pretty useful. I'm working on getting ChatGPT to classify a dataset. So basically I use the same big prompt for a bunch of different small texts and ask ChatGPT to generate the class label. Something like initializing the prompt state sounds good: basically trading more memory usage for less processing time. Who knows, maybe OpenAI is doing such an optimization on their side.
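The part I can control is keeping the big shared prompt identical and first in every request, so that any prefix/prompt caching on the provider's side could actually kick in. Roughly (model name and prompt are illustrative):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # The big shared prompt goes first and never changes between requests;
    # only the small text to classify varies.
    big_prompt = (
        "You are a classifier. Answer with exactly one of: "
        "POSITIVE, NEGATIVE, NEUTRAL.\n"
        "... long list of labeling rules and examples ..."
    )

    texts = ["the product broke after a week", "works exactly as described"]

    for text in texts:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",  # example model
            messages=[
                {"role": "system", "content": big_prompt},  # identical prefix
                {"role": "user", "content": text},          # only this varies
            ],
            max_tokens=5,
            temperature=0,
        )
        print(text, "->", resp.choices[0].message.content)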


I might be wrong, but it looks like this could also help with speculative decoding, which already vastly improves inference speed?



