I wonder if (when) there will be a GGUF model available for this 8B model. I want to try it out locally in Jan on my base m4 Mac mini. I currently run Llama 3 8B Instruct Q4 at around 20t/s and it sounds like this would be a huge improvement in output quality.
It's a bit harder when they've provided the safetensors in FP8 like for the DS3 series, but these smaller distilled models appear to be BF16, so the normal convert/quant pipeline should work fine.
Edit: Running the DeepSeek-R1-Distill-Llama-8B-Q8_0 gives me about 3t/s and destroys my system performance on the base m4 mini. Trying the Q4_K_M model next.
Not trivial as long as imatrix is concerned: we've found it substantially improves performance in Q4 for long Ukrainian contexts. I imagine, it's similarly effective in various other positions.