
Curious: why the 30B MoE over the 32B dense for local coding?

I do not know much about the benchmarks, but the two coding ones look similar.



The MoE version with 3b active parameters will run significantly faster (tokens/second) on the same hardware, by roughly an order of magnitude (e.g. ~4 t/s vs ~40 t/s).
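As a rough sanity check (all numbers below are assumptions, not measurements): single-stream decode is largely memory-bandwidth-bound, so a ceiling on tokens/s is memory bandwidth divided by the bytes of weights read per token, which for a quantized model is roughly active parameters × bytes per weight. The absolute ceilings are optimistic, but the ~10x gap between 32B dense and 3B active tracks the order-of-magnitude claim:

    # Bandwidth-bound decode ceiling; ignores KV-cache reads, MoE routing
    # overhead, and kernel efficiency, so real numbers land well below this.
    # Bandwidth and bytes/weight are assumed values, not measured ones.
    def est_tok_per_s(active_params_b, bytes_per_weight, mem_bw_gb_s):
        bytes_per_token = active_params_b * 1e9 * bytes_per_weight
        return mem_bw_gb_s * 1e9 / bytes_per_token

    BW = 960   # GB/s, roughly a 7900 XTX / 3090-class card (assumed)
    Q4 = 0.6   # ~4.8 bits per weight for a Q4_K-style quant (assumed)

    print(f"32B dense   ceiling: ~{est_tok_per_s(32, Q4, BW):.0f} tok/s")
    print(f"30B-A3B MoE ceiling: ~{est_tok_per_s(3, Q4, BW):.0f} tok/s")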


> The MoE version with 3b active parameters

~34 tok/s on a Radeon RX 7900 XTX under today's Debian 13.


And VRAM use?


~18.6 GiB, according to nvtop.

ollama 0.6.6 invoked with:

    # server
    OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

    # client
    ollama run --verbose qwen3:30b-a3b
~19.8 GiB with:

    /set parameter num_ctx 32768
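For scripting rather than the interactive session, the same context size can be passed per request through Ollama's HTTP API. A minimal sketch, assuming the default localhost:11434 endpoint, the qwen3:30b-a3b tag from above, and an arbitrary example prompt:

    # Set num_ctx per request via /api/generate instead of the
    # interactive /set command. The prompt is just an example.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "qwen3:30b-a3b",
            "prompt": "Write a Python function that reverses a linked list.",
            "stream": False,
            "options": {"num_ctx": 32768},   # same context size as /set above
        },
        timeout=600,
    )
    data = resp.json()
    print(data["response"])

    # eval_count / eval_duration (ns) are what `ollama run --verbose`
    # turns into its tokens/s figure.
    print(f'{data["eval_count"] / (data["eval_duration"] / 1e9):.1f} tok/s')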


Very nice; it should also run well on a 3090.

TY for this.

Update: wow, it's quite fast: 70-80 t/s in LM Studio, with a few other applications also using the GPU.



