
Hi, I had a quick question. Would it be correct to say the following?

1. For long inputs and short outputs, inference can be arbitrarily many times faster, since it avoids repeated KV computation.

2. Conversely, for short inputs and long outputs, it might be slightly slower, since loading and storing the KV cache are on the critical path of the execution.



It is almost true for both. For the second case, though, you can simply skip storing the cache when there is little improvement to be gained.
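The trade-off above can be sketched as a simple heuristic. This is a minimal illustration with hypothetical names (`should_store_kv_cache`, `min_input_tokens` are assumptions, not an actual API): store and reuse the KV cache only when the prompt is long enough that skipping prefill outweighs the cache I/O sitting on the decoding critical path.

```python
# Hypothetical sketch: decide whether storing/reusing a KV cache is worthwhile.
# Assumptions: prefill compute grows with input length, while cache
# load/store cost also grows with input length but pays off only when
# the reused prefix is large relative to the generated output.

def should_store_kv_cache(input_tokens: int,
                          output_tokens: int,
                          min_input_tokens: int = 1024) -> bool:
    """Skip storing when the input is short: the saved prefill compute
    would not outweigh loading/storing the cache, which lies on the
    critical path of execution."""
    # Long input, short output: reuse avoids recomputing a large prefix.
    # Short input, long output: cache I/O adds latency for little gain.
    return input_tokens >= min_input_tokens and input_tokens > output_tokens

# Long prompt, short answer -> worth caching
print(should_store_kv_cache(8192, 128))   # True
# Short prompt, long answer -> skip storing
print(should_store_kv_cache(64, 2048))    # False
```

Real systems would of course base this on measured prefill and I/O costs rather than token counts alone; this only illustrates the shape of the decision.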



