OH WHAT! So just -ngl? Oh also, do you know if it's possible to automatically fill one GPU and then spill over to the next (i.e. sequentially)? I have to manually set --device CUDA0 for smallish models, and distributing them across, say, all GPUs probably causes communication overhead!
Ah no, I mean we can omit the whole "-ngl N" argument for now, as it is internally set to -1 by default in the C++ code (instead of the traditional 0), and -1 means offload everything to the GPU
I have no idea how to specify custom layer splits in a multi-GPU setup, but that is interesting!
WAIT, so GPU offloading is on by DEFAULT? Oh my, fantastic! For now I have to "guess" via a Python script - i.e. I sum up the file sizes of all the .gguf splits, detect CUDA memory usage, and then specify roughly how many GPUs to use, e.g. --device CUDA0,CUDA1 etc.
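Roughly this kind of thing, in case it helps (a rough sketch, not the exact script - the glob pattern, the 20% headroom factor, and the nvidia-smi dependency are just assumptions/placeholders):

```python
# Rough sketch of the "guess" script: sum the .gguf shard sizes, query free
# VRAM per GPU via nvidia-smi, and pick just enough devices to fit the model.
import glob
import os
import subprocess

def model_size_mib(pattern: str) -> float:
    """Total size of all split .gguf files matching the glob pattern, in MiB."""
    return sum(os.path.getsize(p) for p in glob.glob(pattern)) / (1024 ** 2)

def free_vram_mib() -> list[int]:
    """Free memory per visible GPU in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(line) for line in out.splitlines() if line.strip()]

def pick_devices(pattern: str, headroom: float = 1.2) -> str:
    """Return a --device string like 'CUDA0' or 'CUDA0,CUDA1' that should fit the model."""
    need = model_size_mib(pattern) * headroom  # some slack for KV cache, buffers, etc.
    devices, have = [], 0.0
    for i, free in enumerate(free_vram_mib()):
        devices.append(f"CUDA{i}")
        have += free
        if have >= need:
            return ",".join(devices)
    raise RuntimeError("model does not fit in the available VRAM")

# Print the guess, then pass it to llama-cli as the --device argument.
print(pick_devices("Model-*.gguf"))
```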
Ahhh no, sorry, I forgot that the actual code controlling this is inside llama-model.cpp; sorry for the misinfo - -ngl is only set to the max by default if you're using the Metal backend
(See the code inside llama_model_default_params())
Edit: sorry, this is only true on Metal. For CUDA and other GPU backends, you still need to manually specify -ngl
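So on CUDA you still end up launching with an explicit -ngl, e.g. something like this from the Python side to go with the guess script above (a minimal sketch - the model path is a placeholder and it assumes llama-cli is on your PATH):

```python
# Minimal sketch: launch llama-cli with explicit offload flags for CUDA.
import subprocess

cmd = [
    "llama-cli",
    "-m", "model-00001-of-00002.gguf",  # placeholder path
    "-ngl", "99",                       # offload up to 99 layers, i.e. everything for most models
    "--device", "CUDA0",                # pin a small model to one GPU to avoid cross-GPU traffic
    "-p", "Hello",
]
subprocess.run(cmd, check=True)
```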