> They don't, though. If you try to allocate too much VRAM it will either hard fail, or everything suddenly runs like garbage because the driver is constantly swapping it out to shared memory.
What I don't understand is why it can't just check your VRAM and size the allocation accordingly by default. The allocation is not that dynamic AFAIK - it all happens basically upfront when the model loads. ollama even prints out how much VRAM it's allocating for model + context for each layer. But I still have to tune the layer count manually, and any time I change my context size I have to retune.
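For what it's worth, the arithmetic doesn't seem hard. Here's a rough Python sketch (all names and numbers are mine, not ollama's actual code) of how free VRAM plus the per-layer sizes it already prints could be turned into a layer count automatically:

```python
# Hypothetical sketch, not ollama's actual logic: turn the numbers a loader
# already knows at load time into a layer count, instead of manual tuning.

def layers_that_fit(free_vram_bytes: int,
                    layer_weight_bytes: int,
                    kv_cache_bytes_per_layer: int,
                    reserve_bytes: int = 512 * 1024**2) -> int:
    """How many layers fit after keeping a small safety reserve."""
    per_layer = layer_weight_bytes + kv_cache_bytes_per_layer
    usable = max(free_vram_bytes - reserve_bytes, 0)
    return usable // per_layer

# e.g. 20 GiB free, 400 MiB of weights + 64 MiB of KV cache per layer
print(layers_that_fit(20 * 1024**3, 400 * 1024**2, 64 * 1024**2))  # -> 43
```

Change the context size and the KV-cache term changes, but the same calculation still gives you the answer without retuning by hand.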
Some GPUs have quirks where VRAM access slows down near the end of memory, or where the GPU just crashes and disables display output if that memory is actually used. I think it's sort of sensible that they don't use the GPU at all by default.
Wouldn't the sensible default be to use 80% of available VRAM, or total VRAM minus 2GB, or something along those lines? Something that's a tad conservative but works for 99% of cases, with tuning options for those who want to fly closer to the sun.
2GB is a huge amount - you'd be dropping a dozen layers. Saving a few MB should be sufficient, and a layer is generally going to be tens to hundreds of megabytes, so unless your model fits perfectly into VRAM (using 100% of it) you're already going to be leaving at least a few MB, or tens to hundreds of MB, free anyway.
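To put rough numbers on it (illustrative Python; the ~160 MiB per layer is an assumption and varies a lot by model and quantization):

```python
# Illustrative only: how many layers a given VRAM reserve costs,
# assuming ~160 MiB per layer (made-up figure, varies by model/quant).
MiB = 1024**2
layer_bytes = 160 * MiB

for reserve in (64 * MiB, 512 * MiB, 2048 * MiB):
    print(f"{reserve // MiB:>5} MiB reserved -> ~{reserve // layer_bytes} layers dropped")
# ->    64 MiB reserved -> ~0 layers dropped
# ->   512 MiB reserved -> ~3 layers dropped
# ->  2048 MiB reserved -> ~12 layers dropped
```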
Your window manager will already have reserved its VRAM upfront, so it isn't a big deal to use ~all of the rest.
I think in the vast majority of cases the GPU being the default makes sense, and for the incredibly niche cases where it isn't, there is already a tunable.