I have a Radeon 7900 XTX 24GB and have been using deepseek-r1:14b for a couple of days. It achieves about 45 tokens/s. Only after reading this article did I realize that the 32B model would also fit entirely in VRAM (23GB used). And since Ollama [0] was already installed, it was as easy as running: ollama run deepseek-r1:32b
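If you want to confirm the model really is resident on the card, a quick sketch (assuming ROCm's rocm-smi utility is installed alongside Ollama; flag names can vary between ROCm releases):

    # start an interactive session with the 32B model
    ollama run deepseek-r1:32b

    # in a second terminal, check VRAM usage on the card
    rocm-smi --showmeminfo vram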
The 32B model achieves about 25 tokens/s, which is faster than I can read. However, the "thinking" phase is mostly low-value overhead, taking roughly 1-4 minutes before the final Solution/Answer appears.
You can view the model performance within ollama using the command: /set verbose
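For example, inside the interactive session (a sketch; once verbose mode is on, ollama prints timing stats after each response):

    ollama run deepseek-r1:32b
    >>> /set verbose
    >>> why is the sky blue?
    # after the answer, ollama reports stats such as prompt eval count,
    # eval count and eval rate (tokens/s)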
I wrote a similar post about a week ago, but for an [unsupported] Radeon RX 5500 with 4 GiB of VRAM, using ollama on Fedora 41. It can only run llama3.2 or deepseek-r1:1.5b, but they're pretty usable if you're OK with a small model for personal use.
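If ROCm refuses to use an officially unsupported card, one commonly cited workaround is to override the reported GFX target before starting the server. This is only a sketch: the HSA_OVERRIDE_GFX_VERSION value below is an assumption for illustration, and the right value (if any works at all) depends on the specific GPU and ROCm build.

    # assumption: 10.1.0 roughly matches RDNA1 cards like the RX 5500
    HSA_OVERRIDE_GFX_VERSION=10.1.0 ollama serve

    # then, in another terminal, run a small model that fits in 4 GiB
    ollama run deepseek-r1:1.5b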
I didn't go into detail about how to set up Open WebUI, but there is documentation for that on the project's site.
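For reference, the quickstart is roughly a single Docker command (quoted from memory of the project's docs, so check their site for the current flags; the host port here is just an example):

    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui --restart always \
      ghcr.io/open-webui/open-webui:main
    # then browse to http://localhost:3000 and point it at the local ollama instance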
A coprocessor exposed via mmap/ioctl over some special device files, slightly different from the existing XDNA support because of a different management interface (the actual platform has been sold for some time as part of high-end FPGAs, but "RyzenAI" has a different integration interface on silicon).
Does that entire model fit in gpu memory? How's it run?
I tried running a model larger than VRAM and it loads some layers onto the GPU but offloads the rest to the CPU. It's faster than CPU alone for me, but not by a lot.
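You can see how the layers were split with ollama's process listing, and (if needed) nudge how many layers get offloaded per session. A small sketch, assuming a recent ollama build; the layer count is just an example to tune:

    ollama ps    # the PROCESSOR column shows the CPU/GPU split for loaded models

    ollama run deepseek-r1:32b
    >>> /set parameter num_gpu 30   # request ~30 layers on the GPU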
Nice, last time I tried out ROCm on Arch a few years ago it was a nightmare. Glad to see it's just one package install away these days, assuming you didn't do any setup beforehand.
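On current Arch it is roughly the following (a sketch; package names and repos may have changed, so check the wiki):

    sudo pacman -S ollama-rocm          # ollama built against ROCm
    sudo systemctl enable --now ollama
    ollama run deepseek-r1:14b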
[0] https://github.com/ollama/ollama