> They don't, though. If you try to allocate too much VRAM it will either hard fail, or everything suddenly runs like garbage because the driver is constantly swapping it out to shared memory.
What I don't understand is why it can't just check your VRAM and size the allocation accordingly by default. The allocation is not that dynamic AFAIK - it all happens basically upfront when the model loads. ollama even prints out how much VRAM it's allocating for model + context for each layer. But I still have to tune the layer count manually, and any time I change my context size I have to retune.
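For what it's worth, the arithmetic doesn't seem hard. Here's a rough Python sketch (all names and numbers are mine, not ollama's actual code) of how free VRAM plus the per-layer sizes it already prints could be turned into a layer count automatically:

```python
# Hypothetical sketch, not ollama's actual logic: turn the numbers a loader
# already knows at load time into a layer count, instead of manual tuning.

def layers_that_fit(free_vram_bytes: int,
                    layer_weight_bytes: int,
                    kv_cache_bytes_per_layer: int,
                    reserve_bytes: int = 512 * 1024**2) -> int:
    """How many layers fit after keeping a small safety reserve."""
    per_layer = layer_weight_bytes + kv_cache_bytes_per_layer
    usable = max(free_vram_bytes - reserve_bytes, 0)
    return usable // per_layer

# e.g. 20 GiB free, 400 MiB of weights + 64 MiB of KV cache per layer
print(layers_that_fit(20 * 1024**3, 400 * 1024**2, 64 * 1024**2))  # -> 43
```

Change the context size and the KV-cache term changes, but the same calculation still gives you the answer without retuning by hand.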
Some GPUs have quirks where VRAM access slows down near the end of memory, or where the GPU just crashes and disables display output if that memory is actually used. I think it's sort of sensible that they don't use the GPU at all by default.
Wouldn't the sensible default be to use 80% of available VRAM, or total VRAM minus 2GB, or something along those lines? Something that's a tad conservative but works for 99% of cases, with tuning options for those who want to fly closer to the sun.
2GB is a huge amount - you'd be dropping a dozen layers. Saving a few MB should be sufficient, and a layer is generally going to be tens to hundreds of megabytes, so unless your model fits perfectly into VRAM (using 100% of it) you're already going to be leaving at least a few MB, or tens to hundreds of MB, free anyway.
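To put rough numbers on it (illustrative Python; the ~160 MiB per layer is an assumption and varies a lot by model and quantization):

```python
# Illustrative only: how many layers a given VRAM reserve costs,
# assuming ~160 MiB per layer (made-up figure, varies by model/quant).
MiB = 1024**2
layer_bytes = 160 * MiB

for reserve in (64 * MiB, 512 * MiB, 2048 * MiB):
    print(f"{reserve // MiB:>5} MiB reserved -> ~{reserve // layer_bytes} layers dropped")
# ->    64 MiB reserved -> ~0 layers dropped
# ->   512 MiB reserved -> ~3 layers dropped
# ->  2048 MiB reserved -> ~12 layers dropped
```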
Your window manager will already have reserved its VRAM upfront, so it isn't a big deal to use ~all of the rest.
I think in the vast majority of cases the GPU being the default makes sense, and for the incredibly niche cases where it isn't, there is already a tunable.