
To this day, Mixtral 8x7B remains the best model you can run on a single 48GB GPU. This has the potential to become the best model you can run on two such GPUs, or on an MBP with maxed-out RAM, when 4-bit quantized.



I'm looking forward to the prices of those dropping. It's a shame that high-memory graphics cards aren't mainstream.


I hope I can get it to run on my 96GB M2 at Q4.


It actually does, in case anybody wonders. But it seems as if it's not fine-tuned for chat, or I'm doing something wrong at the moment. I'm getting a lot of repetition and unhelpful answers.


They might've tweaked the prompt tokens.


My first thought was: how much RAM? Will it work on a 64GB M1?


It's ~260GB with (presumably) fp16 weights. It should fit into 64GB at 3-bit quantization (~49GB).

Edit: To add to this, I've had good luck getting solid output out of Mixtral 8x7B at 3-bit, so that bit width isn't small enough to completely kill the model's quality.
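
For anyone checking the arithmetic, here's a rough Python sketch. The parameter counts are assumptions: ~47B for Mixtral 8x7B, and ~130B for this model, inferred from the ~260GB fp16 size. Real quantized files run somewhat larger, since formats like Q3/Q4 also store per-block scales at higher precision.

    # Back-of-the-envelope weight sizes at various bit widths.
    def weight_gb(params_billions: float, bits: float) -> float:
        # 1e9 params * (bits / 8) bytes each = params_billions * bits / 8 GB
        return params_billions * bits / 8

    for name, params in [("Mixtral 8x7B", 47), ("this model", 130)]:
        for bits in (16, 4, 3):
            print(f"{name} @ {bits}-bit: ~{weight_gb(params, bits):.0f} GB")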


I wonder: can you quantize it yourself with some tool?
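

Yes. llama.cpp ships conversion and quantization tooling (assuming it supports this architecture): you convert the original weights to GGUF, then run its quantize tool to produce a Q3/Q4 file. As a purely illustrative sketch of the idea behind block-wise 4-bit quantization, not llama.cpp's actual implementation:

    import numpy as np

    def quantize_block_q4(w: np.ndarray):
        # Per-block absmax scaling down to 4-bit signed ints (-8..7).
        # Illustrative only; real formats pack the ints tightly, store
        # scales compactly, and handle outliers more carefully.
        scale = float(np.abs(w).max()) / 7.0
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(32).astype(np.float32)  # one 32-weight block
    q, s = quantize_block_q4(w)
    print("max abs error:", np.abs(w - dequantize(q, s)).max())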



Thanks!!


Nope. The weights alone would take ~88GB at 4-bit. A 128GB MBP ought to be able to run it. If I had to guess, a version for Apple MLX should be available within a few days, for those of us fortunate enough to own such a thing.


It’s already available. I had it running yesterday morning on an M3 Max with 128GB. I get about 6 tokens/sec.

https://www.reddit.com/r/LocalLLaMA/s/MSsrqWHYga
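

For anyone who wants to try the same thing, the mlx-lm package makes it a few lines of Python. A minimal sketch, assuming you've installed it with pip install mlx-lm; the model id below is hypothetical, so substitute whichever 4-bit MLX conversion actually shows up on the Hugging Face hub:

    from mlx_lm import load, generate

    # Hypothetical repo id: point this at the real 4-bit MLX conversion.
    model, tokenizer = load("mlx-community/<model>-4bit")
    print(generate(model, tokenizer, prompt="Hello!", max_tokens=128))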



