
That is my reading too: if you treat latency as the most important inference metric, then you need all the models resident in memory all the time.

What is your 70B configuration? Did you try TP=8 for the 70B model, for a fair comparison?
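
For context on what TP=8 means here (not the posters' actual setup), a minimal sketch of serving a 70B model with 8-way tensor parallelism in vLLM might look like this; the model name is just an illustrative placeholder:

    # Hypothetical vLLM setup: shard one 70B model across 8 GPUs via tensor parallelism.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model, not necessarily the one discussed
        tensor_parallel_size=8,                     # TP=8: each layer's weights are split across 8 GPUs
    )

    outputs = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
    print(outputs[0].outputs[0].text)

With TP=8 the per-GPU memory footprint drops roughly eightfold, which is why the comparison against running several smaller models side by side hinges on whether the 70B baseline was given all eight GPUs.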


