Hacker News

Can someone ELI5 what the difference is between using the "quantized version of the Llama 3" from unsloth instead of the one that's on ollama, i.e. `ollama run deepseek-r1:8b`?



The weights are quantized down to fewer bits per parameter to save memory. The quantization error generally degrades generation quality.
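To make the trade-off concrete, here is a minimal sketch of symmetric round-to-nearest quantization (not unsloth's or llama.cpp's actual scheme, which uses per-block scales and more elaborate formats): fewer bits means a coarser grid, so reconstruction error grows as the bit-width shrinks.

```python
# Illustrative only: naive symmetric quantization of float32 weights
# to a signed integer grid, then dequantization back to float.
import numpy as np

def quantize_dequantize(weights, bits):
    """Round weights to a (2**bits - 1)-level symmetric grid and map back."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8-bit, 7 for 4-bit
    scale = np.abs(weights).max() / qmax    # one scale for the whole tensor
    q = np.round(weights / scale).astype(np.int8)  # 4-bit values fit in int8 too
    return q * scale                        # dequantized approximation

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)

for bits in (8, 4):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The 4-bit version stores a quarter of the bits of fp16 but its mean reconstruction error is roughly an order of magnitude larger than the 8-bit version, which is why very low-bit quants show more noticeable quality loss.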


Ollama serves multiple versions, you can get Q8_0 from it too:

ollama run deepseek-r1:8b-llama-distill-q8_0

The real value from the unsloth ones is that they were uploaded before R1 appeared on Ollama's model list.


Unsloth also works diligently to find and fix tokenizer issues and many other problems as soon as they can. I have comparatively little trust in ollama following up and updating everything in a timely manner. Last I checked, there was little information on when the GGUFs etc. on ollama were updated or which llama.cpp version / git commit they were built with. As a result, I believe quality can vary and be significantly lower with the ollama versions of new models.


They are probably the same model. Unsloth does model quants and provides them to the community; AFAIK ollama doesn't, it just indexes publicly available models, whether full or quantized, for convenient use in its frontend.



