
We don't think so, but you be the judge! I believe we quantize both Mixtral and Llama 2 in this way.


Is your confidence rooted in quantified testing, or just vibes? I'm sure you're right, just curious. (My reasoning: running inference at full fp16 is borderline wasteful. You can use q7 with almost no loss.)
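(To make the "almost no loss" claim concrete: here is a minimal, hypothetical sketch of naive per-tensor round-to-nearest quantization, showing how the reconstruction error grows as bits per weight shrink. This is not Groq's or llama.cpp's actual scheme, just the simplest possible illustration.)

```python
import numpy as np

def quantize_dequantize(weights, bits=8):
    """Quantize to signed ints with one per-tensor scale, then dequantize.

    Naive symmetric round-to-nearest; real schemes use per-group scales
    and smarter rounding, so treat these numbers as an upper bound.
    """
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for 8 bits
    scale = np.max(np.abs(weights)) / qmax          # one scale for the tensor
    q = np.clip(np.round(weights / scale), -qmax, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)    # stand-in weight tensor
for bits in (8, 6, 4):
    err = np.abs(w - quantize_dequantize(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.5f}")
```

The mean error is tiny at 8 bits and grows fast below 6, which is roughly where the subjective-quality arguments in this thread start.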


I know some fancy benchmark says "almost no loss", but subjectively there is a clear quality loss. You can try for yourself: I can run Mixtral at 5.8 bpw, and besides Groq's sound-barrier-shattering speed, there is an OBVIOUS difference between its output and my local setup's. I didn't know Mixtral could output such nice code, and I have used it A LOT locally.
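(For context on why people run at 5.8 bpw at all, here is back-of-envelope memory arithmetic for Mixtral 8x7B, assuming the commonly cited figure of roughly 46.7B total parameters; the numbers are illustrative, not measurements of any particular runtime.)

```python
# Approximate weight-memory footprint at various bits-per-weight (bpw).
# 46.7e9 is an assumed parameter count for Mixtral 8x7B.
params = 46.7e9
for bpw in (16, 8, 5.8, 4):
    gib = params * bpw / 8 / 2**30      # bits -> bytes -> GiB
    print(f"{bpw:>4} bpw -> {gib:6.1f} GiB")
```

At fp16 the weights alone are far beyond a single consumer GPU, which is why local setups end up in the 4-6 bpw range being debated here.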


Yes, but this gray-area underperformance, which lets them claim to be the cheapest and fastest, appeals to people for whom qualitative (aka real) performance doesn't matter.


What quantified testing would you like to see? We've had a lot of very good feedback from our users, particularly about Mixtral.



