
But the inference doesn't necessarily run at the quant precision.
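
To illustrate what I mean, a minimal numpy sketch (the block layout is modelled loosely on GGML's q8_0; the names and sizes here are my own, for illustration only): the weights are stored as int8 plus a per-block scale, but they get dequantized to float before the matmul, so the arithmetic itself runs at float precision, not at the quant precision.

    import numpy as np

    BLOCK = 32  # q8_0-style block size (assumption for illustration)

    def quantize_q8_0(w):
        # Store weights as int8 quants plus one fp16 scale per block of 32 values.
        w = w.reshape(-1, BLOCK)
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scale[scale == 0] = 1.0
        quants = np.round(w / scale).astype(np.int8)
        return quants, scale.astype(np.float16)

    def dequantize_q8_0(quants, scale):
        # Expand back to float32 before any compute happens.
        return (quants.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

    x = np.random.randn(4096).astype(np.float32)
    w = np.random.randn(4096).astype(np.float32)
    q, s = quantize_q8_0(w)                                 # storage: ~1 byte/weight + scales
    print(np.dot(x, w), np.dot(x, dequantize_q8_0(q, s)))   # compute still runs in fp32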


As far as I understand it does if you quantize the KV cache as well (the context). And that's pretty standard now because it can increase maximum context size a lot.
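
Back-of-the-envelope sketch of why that buys so much context (all numbers are assumptions: the dims are roughly an 8B-class GQA model, and ~8.5 bits/element is GGML's q8_0 including block scales):

    def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bits_per_elem):
        # K and V each hold n_layers * n_kv_heads * head_dim values per token.
        return 2 * n_layers * n_kv_heads * head_dim * bits_per_elem / 8

    fp16 = kv_bytes_per_token(32, 8, 128, 16)     # ~128 KiB per token
    q8   = kv_bytes_per_token(32, 8, 128, 8.5)    # ~68 KiB per token

    budget = 8 * 1024**3                          # say 8 GiB reserved for the KV cache
    print(int(budget / fp16), "tokens at fp16")   # ~65k
    print(int(budget / q8), "tokens at q8_0")     # ~123k, i.e. ~1.9x more context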


It is available in most inference engines, but I wouldn't call it standard, as it can degrade quality tremendously.


Even at q8_0? I thought it wasn't bad, just like for the models themselves. But I'm very interested to hear.

And q8_0 already roughly halves the memory usage compared to fp16 (rough arithmetic at the end of this comment).

One of the Ollama devs called the quality impact negligible at q8_0: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...

But perhaps quantizing the KV cache does not scale as gracefully as quantizing the model itself?
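
Rough arithmetic behind "roughly halves the memory", assuming GGML's q8_0 layout of 32 int8 values plus one fp16 scale per block:

    bytes_per_block = 32 * 1 + 2                   # 32 int8 quants + one fp16 scale
    bits_per_value_q8 = bytes_per_block * 8 / 32   # = 8.5 bits per value
    print(16 / bits_per_value_q8)                  # ~1.88x smaller than fp16, so roughly half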


It depends heavily on the model and on how the context is used. A model like Command R, for instance, is practically unaffected by it, but Qwen will go nuts. Likewise, tasks that lean heavily on the context, like translation or evaluation, are hit harder than, say, code generation or creative output.


Qwen is a little fussy about sampler settings, but it does run well quantized. If you were getting infinite repetition loops, try dropping top_p a bit. I think Qwen likes lower temps too.
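
For what it's worth, a sketch of the kind of settings I mean, assuming a local OpenAI-compatible endpoint (llama.cpp's server and Ollama both expose one); the model tag and the exact values are just illustrative, not recommendations:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="qwen2.5:14b",   # hypothetical tag, use whatever you actually run
        messages=[{"role": "user", "content": "Translate this paragraph to German: ..."}],
        temperature=0.5,       # lower temp tends to calm the repetition loops
        top_p=0.8,             # dropping top_p a bit, as suggested above
    )
    print(resp.choices[0].message.content)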


We are talking about dynamically quantizing the KV cache, not the model weights.


I run the KV cache at Q8 even on that model. Is it not working well for you?


Interesting. I didn't know that. I thought it was a basically 'free' space saving. Would you know how Llama 3.1 fares, by any chance?



