
The (smaller) Scout model is really attractive for Apple Silicon. It is 109B parameters in total but split up into 16 experts. This means that the actual processing happens in 17B, which means responses will be as fast as current 17B models. I just asked a local 7B model (qwen 2.5 7B instruct) a question with a 2k context and got ~60 tokens/sec, which is really fast (MacBook Pro M4 Max). So this could hit 30 tokens/sec. Time to first token (the processing time before it starts responding) will probably still be slow because (I think) all experts have to be used for that.

In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.



> the actual processing happens in 17B

This is a common misconception of how MoE models work. To be clear, 17B parameters are activated for each token generated.

In practice you will almost certainly be pulling the full 109B parameters through the CPU/GPU cache hierarchy to generate non-trivial output, or at least a significant fraction of that.


I agree the OP’s description is wrong. That said, I think his conclusions are right, in that a quant of this that fits in 512GB of RAM is going to run about 8x faster than a quant of a dense model that fits in the same RAM, esp. on Macs as they are heavily throughput bound.


For all intents and purposes, the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token: 17B active parameters run ~6x faster than 109B simply because less data needs to be loaded from RAM.
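Back-of-the-envelope on that ~6x, with assumed numbers (not measurements): ~546 GB/s memory bandwidth for an M4 Max and ~4.5 effective bits/weight for a Q4-style quant:

    # Token generation is roughly memory-bandwidth bound: each token has to stream
    # the active weights out of RAM, so tok/s is capped at bandwidth / bytes per token.
    BANDWIDTH = 546e9        # assumed M4 Max bandwidth, bytes/s
    BITS_PER_WEIGHT = 4.5    # assumed effective bits for a Q4_K-style quant

    def tok_per_sec(active_params):
        return BANDWIDTH / (active_params * BITS_PER_WEIGHT / 8)

    print(f"17B active (MoE):    ~{tok_per_sec(17e9):.0f} tok/s ceiling")   # ~57
    print(f"109B streamed dense: ~{tok_per_sec(109e9):.0f} tok/s ceiling")  # ~9

Real numbers come in lower than these ceilings, but the ratio is the point.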


Yes, loaded from RAM vs. loaded into RAM is the big distinction here.

It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.


It's not too expensive of a MacBook to fit 109B 4-bit parameters in RAM.


Is a 64GiB RAM MacBook really that expensive, especially compared against Nvidia GPUs?


That's why I said it's not too expensive.


Apologies, I misread your comment.


To add, they say about the 400B "Maverick" model:

> while achieving comparable results to the new DeepSeek v3 on reasoning and coding

If that's true, it will certainly be interesting for some to load up this model on a private M3 Studio 512GB. Response time will be fast enough for interaction in Roo Code or Cline. Prompt processing is a bit slower but could be manageable depending on how much code context is given to the model.

The upside being that it can be used on codebases without having to share any code with an LLM provider.


Small point of order: "a bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about 1 minute per 10K tokens(!) of prompt processing time with the smaller model. I agree, and I contribute to llama.cpp. If anything, that estimate is quite generous.

[^1] https://news.ycombinator.com/item?id=43595888


I don't think the time grows linearly. The more context the slower (at least in my experience because the system has to throttle). I just tried 2k tokens in the same model that I used for the 120k test some weeks ago and processing took 12 sec to first token (qwen 2.5 32b q8).


Hmmm, I might be rounding off wrong? Or reading it wrong?

IIUC the data we have:

2K tokens / 12 seconds = 166 tokens/s prefill

120K tokens / (10 minutes == 600 seconds) = 200 tokens/s prefill


> The more context the slower

It seems the other way around?

120k : 2k = 600s : 10s


To clarify, you're still gonna want enough RAM for the entire model plus context. Scout at 109B params is roughly 55GB at q4, so on a 64GB machine your context and other applications will have about 9GB left to work with.
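For the context side, a rough KV-cache estimate (every dimension below is a made-up placeholder, not Scout's published config):

    # KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/element
    LAYERS, KV_HEADS, HEAD_DIM, BYTES = 48, 8, 128, 2        # assumed dims, fp16 cache
    per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES     # ~192 KiB/token
    print(per_token * 32_000 / 1e9, "GB for a 32k context")  # ~6.3 of those ~9 GB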


109B at Q6 is also nice for Framework Desktop 128GB.


Yes, this announcement was a nice surprise for us. We’re going to test out exactly that setup.


Awesome, where can we find out the results?


We’ll likely post on our social accounts to start with, but eventually we plan to write more blog posts about using Framework Desktop for inference.


That would be great. I’ve been hacking at ROCm and using Ryzen iGPUs for industrial scenarios, and the HX chipsets look like a massive improvement over what you’d get from folks like ASRock Industrial.


Can’t wait.


Is the AMD GPU stack reliable for running models like llama these days?


Running, yes; training is questionable.


I don't understand Framework's desktop offerings. For laptops their open approach makes sense, but desktops are already about as hackable and DIY as they come.


We took the Ryzen AI Max, which is nominally a high-end laptop processor, and built it into a standard PC form factor (Mini-ITX). It’s a more open/extensible mini PC using mobile technology.


I love the look of it and if I were in the market right now it would be high on the list, but I do understand the confusion here - is it just a cool product you wanted to make or does it somehow link to what I assumed your mission was - to reduce e-waste?


A big part of our mission is accessibility and consumer empowerment. We were able to build a smaller/simpler PC for gamers new to it that still leverages PC standards, and the processor we used also makes local inference of large models more accessible to people who want to tinker with them.


Considering the Framework Desktop or something like it for a combo homelab / home assistant / HTPC. The new gen of AMD APUs looks to be the sweet spot for a lot of really interesting products.

Love what you guys are doing!!


And given that some people are afraid of malicious software in some brands of mini-PCs on the market, having a more trusted product around will also be an asset.


Lenovo shipped backdoors as preinstalled software, including their own TLS certificate authorities.

Name whom you're referring to every time!


Is that still a thing?


It’s an x86 PC with unified RAM based on AMD’s new AI CPUs. Pretty unique offering. Similar to a Mac Studio, but you can run Linux or Windows on it, and it’s cheaper too.


It's a lot slower than a Mac Studio. Significantly slower CPU, GPU, memory bandwidth.


interesting to know, thanks. any link to some concrete benchmarks to share?


Yes. Geekbench 6 for CPU. Notebookcheck for GPU. Youtube/X for LLM inference.


Is it public (or even known by the developers) how the experts are split up? Is it by topic, so physics questions go to one and biology goes to another one? Or just by language, so every English question is handled by one expert? That’s dynamically decided during training and not set before, right?


This is a common misunderstanding. Experts are learned via gating networks during training that route tokens dynamically at each layer. You might have an expert on the word "apple" in one layer, for a slightly lossy example.

Queries are then also dynamically routed.
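If it helps, here is a minimal sketch of learned top-k routing (tiny made-up sizes, purely illustrative, not Llama 4's actual code):

    # Minimal mixture-of-experts layer: a learned gate picks top-k experts per token.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyMoELayer(nn.Module):
        def __init__(self, d_model=64, n_experts=16, top_k=1):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)   # the learned router
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, 4 * d_model),
                              nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts)])
            self.top_k = top_k

        def forward(self, x):                            # x: [n_tokens, d_model]
            probs = F.softmax(self.gate(x), dim=-1)      # routing probabilities
            weights, idx = probs.topk(self.top_k, dim=-1)
            out = torch.zeros_like(x)
            for k in range(self.top_k):
                for e, expert in enumerate(self.experts):
                    mask = idx[:, k] == e                # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, k:k+1] * expert(x[mask])
            return out                                   # each token only ran its chosen expert(s)

    print(TinyMoELayer()(torch.randn(8, 64)).shape)      # torch.Size([8, 64])

The gate is just another trainable layer, which is why the expert split is discovered during training rather than assigned by topic up front.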


"That’s dynamically decided during training and not set before, right?"

^ right. I can't recall it off the top of my head, but there was a recent paper showing that if you tried dictating this sort of thing, the perf fell off a cliff (I presume there's some layer of base knowledge $X that each expert needs).


It can be either, but typically it's "learned" without a defined mapping (which I'm guessing is the case here), although some experts may end up heavily correlating with certain domains.


Looks like 109B would fit in a 64GiB machine's RAM at 4-bit quantization. Looking forward to trying this.


I read somewhere that the Ryzen AI 370 chip can run Gemma 3 14B at 7 tokens/second, so I would expect the performance to be somewhere in that range for Llama 4 Scout with 17B active.


At 109B params you’ll need a ton of memory. We’ll have to wait for evals of the quants to know how much.


Sure, but the upside of Apple Silicon is that larger memory sizes are comparatively cheap (compared to buying the equivalent amount of 5090s or 4090s). Also, you can download quantizations.


I have Apple Silicon and it's the worst when it comes to prompt processing time. So unless you stick to small contexts, it's not fast enough to let you do any real work with it.

Apple should've invested more in bandwidth, but it's Apple and it has lost its visionary. Imagine having 512GB on an M3 Ultra and not being able to run even a 70B model on it with a decent context window.


Prompt processing (prefill) is heavily compute-bound, so it depends mostly on raw processing power. Bandwidth mostly affects token generation speed.
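Rough illustration with assumed numbers (~34 TFLOPS usable compute, ~546 GB/s bandwidth, 17B active params at ~4.5 bits/weight):

    # Prefill does ~2 FLOPs per active parameter per prompt token (compute-bound),
    # decode mostly streams the active weights once per generated token (bandwidth-bound).
    FLOPS, BW, ACTIVE = 34e12, 546e9, 17e9
    print(f"prefill ceiling: ~{FLOPS / (2 * ACTIVE):.0f} tok/s")     # ~1000 tok/s
    print(f"decode ceiling:  ~{BW / (ACTIVE * 4.5 / 8):.0f} tok/s")  # ~57 tok/s

In practice prefill lands well below that ceiling, which is why 120k-token prompts take minutes.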


At 17B active params MoE should be much faster than monolithic 70B, right?


Imagine


At 4-bit quant (requires ~64GB), the price of a Mac ($4.2K) is almost exactly the same as 2x 5090 (provided we ever see them in stock). But 2x 5090 have ~6x the memory bandwidth and probably close to 50x the matmul compute at int4.


$2.8k-$3.6k for a 64GB-128GB Mac Studio (M3 Max).


If you go a gen or two back, you can get 3x3090 for the same price.


You can also buy cheaper second-hand Apple Silicon Macs with plenty of RAM. I only buy second-hand M1 Macs, for what it's worth.


Maybe I'm missing something but I don't think I've ever seen quants lower memory reqs. I assumed that was because they still have to be unpacked for inference. (please do correct me if I'm wrong, I contribute to llama.cpp and am attempting to land a client on everything from Android CPU to Mac GPU)


Quantizing definitely lowers memory requirements; it's a pretty direct effect because you're straight up using fewer bits per parameter across the board, so the representation of the weights in memory is smaller, at the cost of precision.


Needing less memory for inference is the entire point of quantization. Saving the disk space or having a smaller download could not justify any level of quality degradation.


Small point of order:

> entire point...smaller download could not justify...

Q4_K_M has layers and layers of consensus and polling and surveying and A/B testing and benchmarking to show there's ~0 quality degradation. Built over a couple of years.


> Q4_K_M has ~0 quality degradation

Llama 3.3 already shows a degradation from Q5 to Q4.

As compression improves over the years, the effects of even Q5 quantization will begin to appear.


Quantization by definition lowers memory requirements - instead of using f16 for weights, you are using q8, q6, q4, or q2, which means the weights are smaller by 2x, ~2.7x, 4x, or 8x respectively.

That doesn’t necessarily translate to the full memory reduction because of interim compute tensors and KV cache, but those can also be quantized.
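Weights-only sizes for 109B params at nominal bit widths (real quant files run a little larger because of per-block scales, and KV cache/overhead come on top):

    PARAMS = 109e9
    for name, bits in [("f16", 16), ("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
        print(f"{name:>4}: ~{PARAMS * bits / 8 / 1e9:.0f} GB")
    # f16 ~218 GB, q8 ~109 GB, q6 ~82 GB, q4 ~55 GB, q2 ~27 GB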


Nvidia GPUs can natively operate in FP8, FP6, FP4, etc., so naturally they have reduced memory requirements when running quantized.

As for CPUs, Intel can only go down to FP16, so you’ll be doing some “unpacking”. But hopefully that is “on the fly” and not when you load the model into memory?


I just loaded two models of different quants into LM Studio:

qwen 2.5 coder 1.5b @ q4_k_m: 1.21 GB memory

qwen 2.5 coder 1.5b @ q8: 1.83 GB memory

I always assumed this to be the case (also because of the smaller download sizes) but never really thought about it.


No need to unpack for inference. As things like CUDA kernels are fully programmable, you can code them to work with 4-bit integers, no problem at all.
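Toy illustration of the packed layout (numpy, no scales/offsets, so far simpler than real llama.cpp/CUDA Q4 blocks):

    # Two 4-bit weights share one byte; kernels expand them on the fly at compute
    # time, so nothing is blown back up to fp16 in memory ahead of time.
    import numpy as np

    def pack4(w):                            # w: values in [0, 15], even length
        w = np.asarray(w, dtype=np.uint8)
        return w[0::2] | (w[1::2] << 4)      # low nibble, high nibble

    def unpack4(p):
        return np.stack([p & 0x0F, p >> 4], axis=1).reshape(-1)

    weights = np.array([3, 12, 7, 0, 15, 9, 1, 4], dtype=np.uint8)
    packed = pack4(weights)
    assert np.array_equal(unpack4(packed), weights)
    print(packed.nbytes, "bytes packed vs", weights.nbytes, "unpacked")  # 4 vs 8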


Unless I'm missing something, I don't really think it looks that attractive. They're comparing it to Mistral Small 24B and Gemma 3 27B and posting numbers showing that it is a little better than those models. But at 4x the memory footprint, is it worth it? (Personally, I was hoping to see Meta's version of a 24-32B dense model, since that size is clearly very capable, or something like an updated version of Mixtral 8x7B.)


Won’t prompt processing need the full model though, and be quite slow on a Mac?


Yes, that's what I tried to express. Large prompts will probably be slow. I tried a 120k prompt once and it took 10min to process. But you still get a ton of world knowledge and fast response times, and smaller prompts will process fast.


Not as fast as other 17B models if it has to attend to a 10M context window.



