
And btw, -ngl is automatically set to the max value now, so you don't need to pass -ngl 99 anymore!

Edit: sorry, this is only true on Metal. For CUDA and other GPU backends, you still need to specify -ngl manually.



OH WHAT! So just -ngl on its own? Oh also, do you know if it's possible to automatically fill one GPU then the next (i.e. sequentially)? I have to manually set --device CUDA0 for smallish models, and presumably distributing a model across, say, all GPUs causes communication overhead!


Ah no, I mean we can omit the whole "-ngl N" argument now, as it is internally set to -1 by default in the C++ code (instead of 0, as it was traditionally), with -1 meaning offload everything to the GPU - see the sketch below.

I have no idea how to specify custom layer specs with multiple GPUs, but that is interesting!
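
In C API terms, "-ngl N" basically just fills in n_gpu_layers on llama_model_params. A rough sketch (the model path is a placeholder, and a couple of the function names differ between llama.cpp versions):

    #include "llama.h"
    #include <cstdio>

    int main() {
        llama_backend_init(); // older releases take a bool numa argument here

        llama_model_params mp = llama_model_default_params();
        // leaving mp.n_gpu_layers at its default == omitting -ngl on the CLI
        // mp.n_gpu_layers = 99; // == passing "-ngl 99" explicitly (e.g. for CUDA)

        // placeholder path; llama_load_model_from_file in older releases
        llama_model * model = llama_model_load_from_file("model.gguf", mp);
        if (!model) {
            std::fprintf(stderr, "failed to load model\n");
            return 1;
        }

        std::printf("default n_gpu_layers on this backend: %d\n", mp.n_gpu_layers);

        llama_model_free(model); // llama_free_model in older releases
        llama_backend_free();
        return 0;
    }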


WAIT, so GPU offloading is on by DEFAULT? Fantastic! For now I have to "guess" via a Python script - i.e. I sum up the file sizes of all the .gguf split files, then detect CUDA memory usage, and specify approximately how many GPUs to use, i.e. --device CUDA0,CUDA1 etc.
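
The logic is basically this (sketched here in C++ just to make it concrete - my actual script is Python, and the ~20% headroom and "fill one GPU, then the next" order are rough guesses, not tuned numbers):

    // sketch only: sum the shard sizes, then take CUDA devices in order until
    // their free VRAM covers the model (plus some made-up headroom)
    #include <cuda_runtime.h>
    #include <cstdint>
    #include <cstdio>
    #include <filesystem>
    #include <string>

    int main(int argc, char ** argv) {
        uint64_t need = 0;
        for (int i = 1; i < argc; ++i) {
            need += std::filesystem::file_size(argv[i]); // pass the .gguf splits as arguments
        }
        need += need / 5; // ~20% headroom for KV cache etc. (rough guess)

        int n_dev = 0;
        cudaGetDeviceCount(&n_dev);

        std::string devices;
        uint64_t have = 0;
        for (int d = 0; d < n_dev && have < need; ++d) {
            size_t free_b = 0, total_b = 0;
            cudaSetDevice(d);
            cudaMemGetInfo(&free_b, &total_b);
            have += free_b;
            if (!devices.empty()) devices += ",";
            devices += "CUDA" + std::to_string(d);
        }

        if (have < need) {
            std::fprintf(stderr, "warning: model may not fit in available VRAM\n");
        }
        std::printf("--device %s\n", devices.c_str());
        return 0;
    }

Then I paste the printed --device list into the llama.cpp command line.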


Ahhh no, sorry, I forgot that the actual code controlling this is inside llama-model.cpp. Sorry for the misinfo - -ngl is only set to max by default if you're using the Metal backend.

(See the code inside llama_model_default_params())
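
From memory it's something along these lines (paraphrased, not a verbatim copy - check the real function for the current fields and values):

    struct llama_model_params llama_model_default_params() {
        struct llama_model_params result = {
            /*.n_gpu_layers =*/ 0, // CPU-only unless -ngl is passed
            // ... other fields elided ...
        };

    #ifdef GGML_USE_METAL
        // unified memory on Apple Silicon, so offload everything by default
        result.n_gpu_layers = 999; // some large "all layers" sentinel
    #endif

        return result;
    }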


Oh no worries! I re-edited my comment to account for it :)



