
Usually you want to split each layer across GPUs with tensor parallelism, which works best when each KV head can be assigned to a specific GPU. All currently popular models have a power-of-2 number of KV heads.
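To make the head-to-GPU mapping concrete, here is a rough sketch of the bookkeeping, not how any particular inference engine does it. The function name `assign_kv_heads` and its parameters are made up for illustration, and it assumes the simple case where the KV head count divides evenly by the GPU count.

```python
def assign_kv_heads(num_kv_heads: int, num_gpus: int) -> list[list[int]]:
    """Return, for each GPU, the KV head indices it would own.

    Hypothetical helper: assumes every GPU gets the same number of
    whole KV heads, and raises ValueError when that is not possible.
    """
    if num_gpus <= 0 or num_kv_heads % num_gpus != 0:
        raise ValueError(
            f"{num_kv_heads} KV heads do not split evenly across {num_gpus} GPUs"
        )
    per_gpu = num_kv_heads // num_gpus
    # GPU g owns the contiguous block of heads [g * per_gpu, (g + 1) * per_gpu).
    return [list(range(g * per_gpu, (g + 1) * per_gpu)) for g in range(num_gpus)]


print(assign_kv_heads(8, 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With 8 KV heads, 2, 4, or 8 GPUs each get whole heads, while something like 3 or 6 GPUs hits the `ValueError` branch, which is the case where you would have to split or replicate heads instead.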


Interesting, thank you for the pointers.



