Five years is a normal-ish depreciation time frame. I know they are gaming GPUs, but the RTX 3090 came out ~4.5 years before the RTX 5090. The 5090 has double the performance and 1/3 more memory. The 3090 is still a useful card even after 5 years.
The instruct models are available on Ollama (e.g. `ollama run ministral-3:8b`), but the reasoning models are still a work in progress. I was trying to get them to work last night and they work for single-turn, but are still very flaky with multi-turn.
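If you want to poke at the multi-turn behaviour yourself, the easiest way is to hit Ollama's `/api/chat` endpoint directly with a message history. A rough Go sketch, assuming a local Ollama server on the default port; the prompts are just placeholders:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

func main() {
	// Two prior turns plus a follow-up question, sent as one chat request.
	body, _ := json.Marshal(map[string]any{
		"model": "ministral-3:8b",
		"messages": []message{
			{Role: "user", Content: "Name a prime number."},
			{Role: "assistant", Content: "7 is a prime number."},
			{Role: "user", Content: "And the next one after that?"},
		},
		"stream": false,
	})

	resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Non-streaming responses carry the assistant reply in "message".
	var out struct {
		Message message `json:"message"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Message.Content)
}
```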
The default weights on Ollama use MXFP4 for the feed-forward network and BF16 for the attention weights. The default weights for llama.cpp quantize those tensors as q8_0, which is why llama.cpp can eke out a little bit more performance at the cost of worse output. If you are using this for coding, you definitely want better output.
You can use the command `ollama show -v gpt-oss:120b` to see the datatype of each tensor.
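For anyone wondering what "quantize those tensors as q8_0" means in practice: weights are split into small blocks, and each block stores a single scale plus low-precision integers, so you trade the precision of BF16 for a smaller, faster tensor. Here's a simplified sketch of that idea in Go (not the actual Ollama or llama.cpp code; the real format packs 32 values per block with an fp16 scale):

```go
package main

import (
	"fmt"
	"math"
)

// quantizeQ80 illustrates the q8_0 idea: pick a per-block scale from the
// largest absolute value, then round each weight to an int8 multiple of it.
func quantizeQ80(block []float32) (scale float32, quants []int8) {
	var amax float32
	for _, v := range block {
		if a := float32(math.Abs(float64(v))); a > amax {
			amax = a
		}
	}
	scale = amax / 127
	quants = make([]int8, len(block))
	if scale == 0 {
		return
	}
	for i, v := range block {
		quants[i] = int8(math.Round(float64(v / scale)))
	}
	return
}

// dequantizeQ80 reverses the mapping; the gap between these values and the
// originals is the rounding error that BF16 attention weights avoid.
func dequantizeQ80(scale float32, quants []int8) []float32 {
	out := make([]float32, len(quants))
	for i, q := range quants {
		out[i] = float32(q) * scale
	}
	return out
}

func main() {
	block := []float32{0.013, -0.42, 0.0071, 0.25}
	scale, qs := quantizeQ80(block)
	for i, v := range dequantizeQ80(scale, qs) {
		fmt.Printf("orig %+.4f  q8_0 %+.4f\n", block[i], v)
	}
}
```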
We uploaded gemma3:270m-it-q8_0 and gemma3:270m-it-fp16 late last night, which give better results. The q4_0 is the QAT model, but we're still looking at it as there are some issues.
I worked on the text portion of gemma3 (as well as gemma2) for the Ollama engine, and worked directly with the Gemma team at Google on the implementation. I didn't base the implementation off the llama.cpp implementation, which was done in parallel. We did our implementation in Go, and llama.cpp did theirs in C++. There was no "copy-and-pasting" as you are implying, although I do think collaborating on these new models would help us get them out the door faster. I am really appreciative of Georgi catching a few things we got wrong in our implementation.
There are some fixes coming to speed up pulls across the board. We've been testing them out, but there are a lot of moving pieces with the new engine, so they're not here quite yet.