The (smaller) Scout model is really attractive for Apple Silicon. It is 109B parameters in total, but split up into 16 experts, so the actual processing happens in 17B, which means responses will be as fast as those of current 17B models. I just asked a local 7B model (Qwen 2.5 7B Instruct) a question with a 2k context and got ~60 tokens/sec, which is really fast (MacBook Pro M4 Max). So this could hit 30 tokens/sec. Time to first token (the processing time before it starts responding) will probably still be slow because (I think) all experts have to be used for that.
In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.
This is a common misconception of how MoE models work. To be clear, 17B parameters are activated for each token generated.
In practice you will almost certainly be pulling the full 109B parameters through the CPU/GPU cache hierarchy to generate non-trivial output, or at least a significant fraction of that.
I agree the OP’s description is wrong. That said, I think his conclusions are right, in that a quant of this that fits in 512GB of RAM is going to run about 8x faster than a quant of a dense model that fits in the same RAM, especially on Macs, as they are heavily memory-bandwidth bound.
For all intents and purposes the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token: 17B active parameters runs ~6x faster than 109B simply because less data needs to be read from RAM per token.
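A back-of-envelope sketch of why activating fewer parameters matters so much here, assuming generation is purely memory-bandwidth bound and ignoring cache effects, KV-cache reads, and compute (the bandwidth figure is the nominal M4 Max spec; all numbers are assumptions, not measurements):

```python
# Back-of-envelope generation-speed estimate, assuming token generation is purely
# memory-bandwidth bound: every new token streams the active weights once from RAM.
# The 546 GB/s figure is the nominal M4 Max bandwidth; everything here is an assumption.

def tokens_per_sec(active_params_b: float, bytes_per_param: float, bandwidth_gbs: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

BW = 546  # GB/s, assumed unified-memory bandwidth
print(f"17B active (MoE) @ q4: ~{tokens_per_sec(17, 0.5, BW):.0f} tok/s ceiling")
print(f"109B dense @ q4: ~{tokens_per_sec(109, 0.5, BW):.0f} tok/s ceiling")
```

The ratio is just 109/17 ≈ 6.4x; the absolute values are optimistic ceilings rather than predictions.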
Yes, loaded from RAM per token versus loaded into RAM once is the big distinction here.
It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.
> while achieving comparable results to the new DeepSeek v3 on reasoning and coding
If that's true, it will certainly be interesting for some to load up this model on a private M3 Studio 512GB. Response time will be fast enough for interaction in Roo Code or Cline. Prompt processing is a bit slower but could be manageable depending on how much code context is given to the model.
The upside being that it can be used on codebases without having to share any code with an LLM provider.
Small point of order: "a bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about 1 minute per 10K tokens(!) of prompt processing time with the smaller model. I agree, and I contribute to llama.cpp. If anything, that estimate is quite generous.
I don't think the time grows linearly. The more context, the slower it gets (at least in my experience, because the system has to throttle). I just tried 2k tokens in the same model I used for the 120k test some weeks ago, and processing took 12 seconds to first token (Qwen 2.5 32B q8).
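For a feel of what the "1 minute per 10K tokens" figure from the earlier post implies at different context sizes, here is a deliberately naive linear sketch (as noted above, real prompt processing degrades further at large contexts due to attention cost and throttling):

```python
# Naive time-to-first-token estimate from an assumed prompt-processing rate of
# ~167 tok/s (i.e. "1 minute per 10K tokens"). Real numbers vary with hardware,
# quant, and context length, and get worse than linear at large contexts.

PP_TOKENS_PER_SEC = 167  # assumed, from the 1 min / 10K tokens figure above

for ctx in (2_000, 10_000, 32_000, 120_000):
    seconds = ctx / PP_TOKENS_PER_SEC
    print(f"{ctx:>7} prompt tokens -> ~{seconds / 60:.1f} min to first token")
```

The 2k row works out to about 12 seconds, which happens to be in the same ballpark as the measurement above, and the 120k row lands near the 10-minute figure mentioned further down.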
To clarify, you're still gonna want enough RAM for the entire model plus context. Scout at 109B params is roughly 55GB of weights at q4, so on a 64GB machine your context and other applications have about 9GB left to work with.
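A minimal sketch of that budget, assuming a ~4-bit quant and a made-up attention configuration (the layer/head numbers below are placeholders for illustration, not Scout's published architecture):

```python
# Rough memory-budget sketch: quantized weights plus KV cache. The layer count,
# KV-head count, and head dimension below are placeholders, not Scout's real config.

def model_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx_tokens: int,
                bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, f16 cache entries by default
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

weights = model_gb(109, 4.0)          # ~q4, ignoring quant-format overhead
kv = kv_cache_gb(48, 8, 128, 32_768)  # hypothetical GQA config, 32k context
print(f"weights ~{weights:.1f} GB + KV cache ~{kv:.1f} GB = ~{weights + kv:.0f} GB")
```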
That would be great. I’ve been hacking at ROCm and using Ryzen iGPUs for industrial scenarios, and the HX chipsets look like a massive improvement over what you’d get from folks like ASRock Industrial.
I don't understand Framework's desktop offerings. For laptops their open approach makes sense, but desktops are already about as hackable and DIY as they come.
We took the Ryzen AI Max, which is nominally a high-end laptop processor, and built it into a standard PC form factor (Mini-ITX). It’s a more open/extensible mini PC using mobile technology.
I love the look of it, and if I were in the market right now it would be high on the list, but I do understand the confusion here: is it just a cool product you wanted to make, or does it somehow link to what I assumed your mission was, reducing e-waste?
A big part of our mission is accessibility and consumer empowerment. We were able to build a smaller/simpler PC for gamers new to it that still leverages PC standards, and the processor we used also makes local inference of large models more accessible to people who want to tinker with them.
Considering the Framework Desktop or something like it for a combo homelab / home assistant / HTPC. The new gen of AMD APUs looks to be the sweet spot for a lot of really interesting products.
And given that some people are afraid of malicious software in some brands of mini-PCs on the market, having a more trusted product around will also be an asset.
It’s an x86 PC with unified RAM based on AMD’s new AI CPUs. Pretty unique offering. Similar to a Mac Studio, but you can run Linux or Windows on it, and it’s cheaper too.
Is it public (or even known by the developers) how the experts are split up? Is it by topic, so physics questions go to one and biology goes to another one? Or just by language, so every English question is handled by one expert? That’s dynamically decided during training and not set before, right?
This is a common misunderstanding. Experts are learned during training via gating networks that route dynamically, per token, at each MoE layer. You might have an expert for the word "apple" in one layer, as a slightly lossy example.
"That’s dynamically decided during training and not set before, right?"
^ right. I can't recall off the top of my head, but there was a recent paper showing that if you tried dictating this sort of split, performance fell off a cliff (I presume there's some base layer of knowledge X that each expert needs).
It can be either, but typically it's "learned" without a defined mapping (which I'm guessing is the case here), although some experts may end up heavily correlated with certain domains.
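A minimal sketch of what "learned via gating networks" means in practice: a small router scores all experts for each token at each MoE layer and sends the token to the top-k of them. Shapes, expert count, and top_k here are illustrative, not Llama 4's actual router or dimensions.

```python
import numpy as np

# Minimal top-k gating sketch for one MoE layer. Routing happens per token, per
# layer: a small linear "router" scores every expert for each token, and only
# the top_k experts run for that token. Illustrative only.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 1

W_router = rng.normal(size=(d_model, n_experts))                           # learned jointly with the experts
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # stand-ins for expert FFNs

def moe_layer(x):                                      # x: (n_tokens, d_model)
    logits = x @ W_router                              # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # chosen expert ids per token
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                        # only top_k of n_experts touched per token
        for k in range(top_k):
            out[t] += gates[t, k] * (x[t] @ experts[top[t, k]])
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)  # (4, 64)
```

Nothing in this setup assigns topics to experts up front; whatever specialization emerges is a side effect of training the router and the experts jointly.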
I read somewhere that the Ryzen AI 370 chip can run Gemma 3 14B at 7 tokens/second, so I would expect performance to be somewhere in that range for Llama 4 Scout with 17B active parameters.
Sure, but the upside of Apple Silicon is that larger memory sizes are comparatively cheap (compared to buying the equivalent amount of 5090s or 4090s). Also, you can download quantizations.
I have Apple Silicon and it's the worst when it comes to prompt processing time. So unless you stick to small contexts, it's not fast enough to let you do any real work with it.
Apple should've invested more in bandwidth, but it's Apple, and it has lost its visionary. Imagine having 512GB on an M3 Ultra and not being able to run even a 70B model on it with a decent context window.
At a 4-bit quant (which needs ~64GB), the price of the Mac (~$4.2K) is almost exactly the same as 2x 5090s (provided we ever see them in stock). But the 2x 5090s have ~6x the memory bandwidth and probably close to 50x the matmul compute at int4.
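To put rough numbers on the bandwidth side of that comparison, here is a purely bandwidth-bound sketch of token generation for a Scout-sized model (spec-sheet bandwidths, 17B active weights at 4 bits; prompt processing is compute-bound, which is where the ~50x matmul gap would show up instead):

```python
# Illustrative token-generation ceilings for a Scout-sized model (17B active, q4),
# assuming generation is limited purely by memory bandwidth. Spec-sheet numbers only;
# the two GPUs are treated as roughly additive under tensor parallelism.

def gen_ceiling_tok_s(active_params_b: float, bits: float, bandwidth_gbs: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

for name, bw_gbs in [("Mac (M4 Max class, ~546 GB/s)", 546),
                     ("2x RTX 5090 (~1792 GB/s each)", 2 * 1792)]:
    print(f"{name}: ~{gen_ceiling_tok_s(17, 4, bw_gbs):.0f} tok/s ceiling")
```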
Maybe I'm missing something but I don't think I've ever seen quants lower memory reqs. I assumed that was because they still have to be unpacked for inference. (please do correct me if I'm wrong, I contribute to llama.cpp and am attempting to land a client on everything from Android CPU to Mac GPU)
Quantizing definitely lowers memory requirements. It's a pretty direct effect because you're straight up using fewer bits per parameter across the board, so the representation of the weights in memory is smaller, at the cost of precision.
Needing less memory for inference is the entire point of quantization. Saving the disk space or having a smaller download could not justify any level of quality degradation.
> entire point...smaller download could not justify...
Q4_K_M has layers and layers of consensus and polling and surveying and A/B testing and benchmarking to show there's ~0 quality degradation. Built over a couple years.
Quantization by definition lowers memory requirements: instead of using f16 for weights, you are using q8, q6, q4, or q2, which makes the weights smaller by 2x, ~2.7x, 4x, or 8x respectively.
That doesn’t necessarily translate to the full memory reduction because of intermediate compute tensors and the KV cache, but those can also be quantized.
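Spelled out for a Scout-sized model, using idealized bits-per-weight (real GGUF quants such as Q4_K_M carry a little overhead for scales and mixed-precision layers):

```python
# Weight-memory footprint of a 109B-parameter model at common quant levels,
# using idealized bits-per-weight (real formats add some overhead for scales).

PARAMS = 109e9

for name, bits in [("f16", 16), ("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>4}: {gb:6.1f} GB  ({16 / bits:.1f}x smaller than f16)")
```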
Nvidia GPUs can natively operate in FP8, FP6, FP4, etc so naturally they have reduced memory requirements when running quantized.
As for CPUs, Intel can only go down to FP16, so you’ll be doing some “unpacking”. But hopefully that is “on the fly” and not when you load the model into memory?
No need to unpack for inference. Since things like CUDA kernels are fully programmable, you can code them to work with 4-bit integers, no problem at all.
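A sketch of what "no need to unpack" can look like: weights live in memory as packed 4-bit blocks with per-block scales, and the kernel expands them block by block as it computes. This is an illustrative scheme, not llama.cpp's actual Q4 layout or a real CUDA kernel.

```python
import numpy as np

# Illustrative block-wise 4-bit quantization: weights stay packed in memory
# (two 4-bit values per byte plus one scale per block) and are only expanded
# block by block inside the dot-product loop.

BLOCK = 32

def quantize_q4(w):                                   # w: 1-D float32 row
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7 + 1e-12
    q = (np.clip(np.round(w / scale), -8, 7) + 8).astype(np.uint8)  # 0..15
    packed = (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)      # 2 nibbles per byte
    return packed, scale.astype(np.float32)

def dot_q4(packed, scale, x):                         # dequantize on the fly, per block
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty((packed.shape[0], BLOCK), dtype=np.float32)
    q[:, 0::2], q[:, 1::2] = lo, hi
    return float(((q * scale) * x.reshape(-1, BLOCK)).sum())

w = np.random.randn(4096).astype(np.float32)
x = np.random.randn(4096).astype(np.float32)
packed, scale = quantize_q4(w)
print("packed bytes:", packed.nbytes + scale.nbytes, "vs f32:", w.nbytes)
print("approx dot:", dot_q4(packed, scale, x), "exact:", float(w @ x))
```

The packed buffer plus scales is what actually sits in RAM/VRAM; nothing is expanded to f16/f32 up front, which is why the memory savings are real.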
Unless I'm missing something, I don't really think it looks that attractive. They're comparing it to Mistral Small 24B and Gemma 3 27B and post numbers showing that it is a little better than those models. But at 4x the memory footprint, is it worth it? (Personally, I was hoping to see Meta's version of a 24-32B dense model, since that size is clearly very capable, or something like an updated version of Mixtral 8x7B.)
Yes, that's what I tried to express. Large prompts will probably be slow; I tried a 120k prompt once and it took 10 minutes to process. But you still get a ton of world knowledge and fast response times, and smaller prompts will process quickly.