Hacker News

Can someone ELI5 what the difference is between using the "quantized version of the Llama 3" from unsloth instead of the one that's on ollama, i.e. `ollama run deepseek-r1:8b`?



The weights are quantized down to fewer bits per parameter to save memory. The quantization error generally degrades generation quality.
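To make the trade-off concrete, here is a minimal sketch of symmetric round-to-nearest quantization (not unsloth's or llama.cpp's actual scheme, which uses per-block scales and more elaborate formats): fewer bits means a coarser grid, so reconstruction error grows as the bit-width shrinks.

```python
# Illustrative only: naive symmetric quantization of float32 weights
# to a signed integer grid, then dequantization back to float.
import numpy as np

def quantize_dequantize(weights, bits):
    """Round weights to a (2**bits - 1)-level symmetric grid and map back."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8-bit, 7 for 4-bit
    scale = np.abs(weights).max() / qmax    # one scale for the whole tensor
    q = np.round(weights / scale).astype(np.int8)  # 4-bit values fit in int8 too
    return q * scale                        # dequantized approximation

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)

for bits in (8, 4):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The 4-bit version stores a quarter of the bits of fp16 but its mean reconstruction error is roughly an order of magnitude larger than the 8-bit version, which is why very low-bit quants show more noticeable quality loss.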


Ollama serves multiple versions, you can get Q8_0 from it too:

ollama run deepseek-r1:8b-llama-distill-q8_0

The real value from the unsloth ones is that they were uploaded before R1 appeared on Ollama's model list.


Unsloth also works diligently to find and fix tokenizer issues and many other problems as soon as they can. I have comparatively little trust in ollama following up and updating everything in a timely manner. Last I checked, there was little information on when the GGUFs etc. on ollama were updated or which llama.cpp version / git commit they were built with. As a result, I believe quality can vary and be significantly lower with the ollama versions of new models.


They are probably the same model. Unsloth does model quants and provides them to the community; AFAIK ollama doesn't, it just indexes publicly available models, whether full or quantized, for convenient use in its frontend.



