
To this day, Mixtral 8x7B remains the best model you can run on a single 48GB GPU. This has the potential to become the best model you can run on two such GPUs, or on an MBP with maxed-out RAM, when 4-bit quantized.



I'm looking forward to the prices of those dropping. It's a shame that high-memory graphics cards aren't mainstream.


I hope I can get it to run on my 96GB M2 at Q4.


It actually does, in case anybody wonders. But it seems as if it's not fine-tuned for chat, or I'm doing something wrong at the moment. I'm getting a lot of repetition and unhelpful answers.


They might've tweaked the prompt tokens.


My first thought was: how much RAM? Will it work on a 64GB M1?


It's ~260GB with (presumably) fp16 weights. It should fit into 64GB at 3-bit quantization (~49GB).

Edit: To add to this, I've had good luck getting solid output out of Mixtral 8x7B at 3-bit, so that bit width isn't small enough to completely kill the model's quality.
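
For anyone checking the arithmetic, here's a rough Python sketch. The parameter counts are assumptions: ~47B for Mixtral 8x7B, and ~130B for this model, inferred from the ~260GB fp16 size. Real quantized files run somewhat larger, since formats like Q3/Q4 also store per-block scales at higher precision.

    # Back-of-the-envelope weight sizes at various bit widths.
    def weight_gb(params_billions: float, bits: float) -> float:
        # 1e9 params * (bits / 8) bytes each = params_billions * bits / 8 GB
        return params_billions * bits / 8

    for name, params in [("Mixtral 8x7B", 47), ("this model", 130)]:
        for bits in (16, 4, 3):
            print(f"{name} @ {bits}-bit: ~{weight_gb(params, bits):.0f} GB")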


I wonder: can you quantize it yourself with some tool?
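

Yes. llama.cpp ships conversion and quantization tooling (assuming it supports this architecture): you convert the original weights to GGUF, then run its quantize tool to produce a Q3/Q4 file. As a purely illustrative sketch of the idea behind block-wise 4-bit quantization, not llama.cpp's actual implementation:

    import numpy as np

    def quantize_block_q4(w: np.ndarray):
        # Per-block absmax scaling down to 4-bit signed ints (-8..7).
        # Illustrative only; real formats pack the ints tightly, store
        # scales compactly, and handle outliers more carefully.
        scale = float(np.abs(w).max()) / 7.0
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(32).astype(np.float32)  # one 32-weight block
    q, s = quantize_block_q4(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())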



Thanks!!


Nope. The weights alone would take ~88GB at 4-bit. A 128GB MBP ought to be able to run it. If I had to guess, a version for Apple MLX should be available within a few days, for those of us fortunate enough to own such a thing.


It’s already available. I had it running yesterday morning on an M3 Max with 128GB. I get about 6 tokens/sec.

https://www.reddit.com/r/LocalLLaMA/s/MSsrqWHYga
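

For anyone who wants to try the same thing, the mlx-lm package makes it a few lines of Python. A minimal sketch, assuming you've installed it with pip install mlx-lm; the model id below is hypothetical, so substitute whichever 4-bit MLX conversion actually shows up on the Hugging Face hub:

    from mlx_lm import load, generate

    # Hypothetical repo id: point this at the real 4-bit MLX conversion.
    model, tokenizer = load("mlx-community/<model>-4bit")
    print(generate(model, tokenizer, prompt="Hello!", max_tokens=128))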



