
10 tokens/second is roughly what you get running llama-30b entirely on the GPU. A 65B model will be slower than that, since there's more compute involved per generated token.
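A rough way to see why the bigger model is slower: single-batch decoding is largely memory-bandwidth bound, since every generated token streams all of the weights through the GPU once. The sketch below estimates tokens/sec from parameter count and bandwidth; the 150 GB/s effective-bandwidth figure and 4-bit quantization are assumptions picked for illustration (they happen to land near the ~10 tok/s quoted above), not measurements.

```python
# Back-of-envelope estimate: decode speed ~ effective bandwidth
# divided by the bytes of weights read per token.

def est_tokens_per_sec(n_params_billion: float,
                       bytes_per_param: float,
                       eff_bandwidth_gb_s: float) -> float:
    weight_gb = n_params_billion * bytes_per_param  # weights streamed per token
    return eff_bandwidth_gb_s / weight_gb

# 30B model, 4-bit quantized (0.5 bytes/param), assumed 150 GB/s effective:
print(est_tokens_per_sec(30, 0.5, 150))  # 10.0 tok/s
# 65B model, same quantization and bandwidth:
print(est_tokens_per_sec(65, 0.5, 150))  # ~4.6 tok/s, slower as expected
```

Under this model the 65B network is slower by roughly the ratio of parameter counts (65/30 ≈ 2.2x), independent of raw FLOPs.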

