
This is roughly a 47B model. (Mixtral 8x7b — only the feed-forward experts are replicated, the attention layers are shared, so the total is well under 8×7B.)


Oh, sorry, I assumed the 8 was for quantization. 8x7b is a new syntax for me.

Still, the NVIDIA chart shows Llama v2 70B at 750 tok/s, no?


I guess that's total throughput, rather than per user? You can increase total throughput by scaling horizontally. You can't increase throughput per user that way.
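To make the distinction concrete, here's a rough back-of-the-envelope sketch. The batch sizes and per-step latencies are made-up illustrative numbers, not taken from the NVIDIA chart; the point is just that batching raises aggregate tokens/s while each individual user still sees roughly one token per decode step:

    # Rough illustration of total vs. per-user throughput under batching.
    # All numbers are invented for illustration, not from the NVIDIA chart.

    def throughput(batch_size: int, step_latency_s: float) -> tuple[float, float]:
        """One decode step emits one token per sequence in the batch."""
        total_tok_s = batch_size / step_latency_s   # tokens/s across all users
        per_user_tok_s = 1.0 / step_latency_s       # tokens/s seen by one user
        return total_tok_s, per_user_tok_s

    for batch in (1, 8, 64):
        # Assume step latency grows only slightly with batch size
        # (decode is typically memory-bandwidth bound).
        latency = 0.02 + 0.0005 * batch
        total, per_user = throughput(batch, latency)
        print(f"batch={batch:3d}: total {total:7.1f} tok/s, per-user {per_user:5.1f} tok/s")

With these numbers, total throughput climbs from ~49 to ~1230 tok/s as the batch grows, while per-user throughput actually dips slightly — which is why a big aggregate tok/s figure on a chart says little about single-user latency.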



