The (smaller) Scout model is really attractive for Apple Silicon. It is 109B parameters in total, but split up into 16 experts, so the actual processing happens in 17B, which means responses will be as fast as those of current 17B models. I just asked a local 7B model (Qwen 2.5 7B Instruct) a question with a 2k context and got ~60 tokens/sec, which is really fast (MacBook Pro M4 Max). So this could hit 30 tokens/sec. Time to first token (the processing time before it starts responding) will probably still be slow because (I think) all experts have to be used for that.
In addition, the model has a 10M token context window, which is huge. Not sure how well it can keep track of the context at such sizes, but just not being restricted to ~32k is already great, 256k even better.
This is a common misconception of how MoE models work. To be clear, 17B parameters are activated for each token generated.
In practice you will almost certainly be pulling the full 109B parameters through the CPU/GPU cache hierarchy to generate non-trivial output, or at least a significant fraction of that.
I agree the OP’s description is wrong. That said, I think his conclusions are right, in that a quant of this that fits in 512GB of RAM is going to run about 8x faster than a quant of a dense model that fits in the same RAM, especially on Macs, as they are heavily memory-bandwidth bound.
For all intents and purposes the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token: 17B active parameters runs ~6x faster than 109B simply because less data needs to be read from RAM per token.
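A back-of-envelope sketch of why activating fewer parameters matters so much here, assuming generation is purely memory-bandwidth bound and ignoring cache effects, KV-cache reads, and compute (the bandwidth figure is the nominal M4 Max spec; all numbers are assumptions, not measurements):

```python
# Back-of-envelope generation-speed estimate, assuming token generation is purely
# memory-bandwidth bound: every new token streams the active weights once from RAM.
# The 546 GB/s figure is the nominal M4 Max bandwidth; everything here is an assumption.

def tokens_per_sec(active_params_b: float, bytes_per_param: float, bandwidth_gbs: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

BW = 546  # GB/s, assumed unified-memory bandwidth
print(f"17B active (MoE) @ q4: ~{tokens_per_sec(17, 0.5, BW):.0f} tok/s ceiling")
print(f"109B dense @ q4: ~{tokens_per_sec(109, 0.5, BW):.0f} tok/s ceiling")
```

The ratio is just 109/17 ≈ 6.4x; the absolute values are optimistic ceilings rather than predictions.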
Yes, loaded from RAM per token versus loaded into RAM once is the big distinction here.
It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.
> while achieving comparable results to the new DeepSeek v3 on reasoning and coding
If that's true, it will certainly be interesting for some to load up this model on a private M3 Studio 512GB. Response time will be fast enough for interaction in Roo Code or Cline. Prompt processing is a bit slower but could be manageable depending on how much code context is given to the model.
The upside being that it can be used on codebases without having to share any code with an LLM provider.
Small point of order: "a bit slower" might not set expectations accurately. You noted in a previous post in the same thread[^1] that we'd expect about 1 minute per 10K tokens(!) of prompt processing time with the smaller model. I agree, and I contribute to llama.cpp. If anything, that estimate is quite generous.
I don't think the time grows linearly. The more context, the slower it gets (at least in my experience, because the system has to throttle). I just tried 2k tokens in the same model I used for the 120k test some weeks ago, and processing took 12 seconds to first token (Qwen 2.5 32B q8).
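For a feel of what the "1 minute per 10K tokens" figure from the earlier post implies at different context sizes, here is a deliberately naive linear sketch (as noted above, real prompt processing degrades further at large contexts due to attention cost and throttling):

```python
# Naive time-to-first-token estimate from an assumed prompt-processing rate of
# ~167 tok/s (i.e. "1 minute per 10K tokens"). Real numbers vary with hardware,
# quant, and context length, and get worse than linear at large contexts.

PP_TOKENS_PER_SEC = 167  # assumed, from the 1 min / 10K tokens figure above

for ctx in (2_000, 10_000, 32_000, 120_000):
    seconds = ctx / PP_TOKENS_PER_SEC
    print(f"{ctx:>7} prompt tokens -> ~{seconds / 60:.1f} min to first token")
```

The 2k row works out to about 12 seconds, which happens to be in the same ballpark as the measurement above, and the 120k row lands near the 10-minute figure mentioned further down.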
To clarify, you're still gonna want enough RAM for the entire model plus context. Scout at 109B params is roughly 55GB of weights at q4, so on a 64GB machine your context and other applications have about 9GB left to work with.
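A minimal sketch of that budget, assuming a ~4-bit quant and a made-up attention configuration (the layer/head numbers below are placeholders for illustration, not Scout's published architecture):

```python
# Rough memory-budget sketch: quantized weights plus KV cache. The layer count,
# KV-head count, and head dimension below are placeholders, not Scout's real config.

def model_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, ctx_tokens: int,
                bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, f16 cache entries by default
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bytes_per_elem / 1e9

weights = model_gb(109, 4.0)          # ~q4, ignoring quant-format overhead
kv = kv_cache_gb(48, 8, 128, 32_768)  # hypothetical GQA config, 32k context
print(f"weights ~{weights:.1f} GB + KV cache ~{kv:.1f} GB = ~{weights + kv:.0f} GB")
```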
That would be great. I’ve been hacking at ROCm and using Ryzen iGPUs for industrial scenarios, and the HX chipsets look like a massive improvement over what you’d get from folks like ASRock Industrial.
I don't understand Framework's desktop offerings. For laptops their open approach makes sense, but desktops are already about as hackable and DIY as they come.
We took the Ryzen AI Max, which is nominally a high-end laptop processor, and built it into a standard PC form factor (Mini-ITX). It’s a more open/extensible mini PC using mobile technology.
I love the look of it, and if I were in the market right now it would be high on the list, but I do understand the confusion here: is it just a cool product you wanted to make, or does it somehow link to what I assumed your mission was, reducing e-waste?
A big part of our mission is accessibility and consumer empowerment. We were able to build a smaller/simpler PC for gamers new to it that still leverages PC standards, and the processor we used also makes local inference of large models more accessible to people who want to tinker with them.
Considering the Framework Desktop or something like it for a combo homelab / home assistant / HTPC. The new gen of AMD APUs looks to be the sweet spot for a lot of really interesting products.
And given that some people are afraid of malicious software in some brands of mini-PCs on the market, having a more trusted product around will also be an asset.
It’s an x86 PC with unified RAM based on AMD’s new AI CPUs. Pretty unique offering. Similar to a Mac Studio, but you can run Linux or Windows on it, and it’s cheaper too.
Is it public (or even known by the developers) how the experts are split up? Is it by topic, so physics questions go to one and biology goes to another one? Or just by language, so every English question is handled by one expert? That’s dynamically decided during training and not set before, right?
This is a common misunderstanding. Experts are learned during training via gating networks that route dynamically, per token, at each MoE layer. You might have an expert for the word "apple" in one layer, as a slightly lossy example.
"That’s dynamically decided during training and not set before, right?"
^ right. I can't recall off the top of my head, but there was a recent paper showing that if you tried dictating this sort of split, performance fell off a cliff (I presume there's some base layer of knowledge X that each expert needs).
It can be either, but typically it's "learned" without a defined mapping (which I'm guessing is the case here), although some experts may end up heavily correlated with certain domains.
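A minimal sketch of what "learned via gating networks" means in practice: a small router scores all experts for each token at each MoE layer and sends the token to the top-k of them. Shapes, expert count, and top_k here are illustrative, not Llama 4's actual router or dimensions.

```python
import numpy as np

# Minimal top-k gating sketch for one MoE layer. Routing happens per token, per
# layer: a small linear "router" scores every expert for each token, and only
# the top_k experts run for that token. Illustrative only.

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 16, 1

W_router = rng.normal(size=(d_model, n_experts))                           # learned jointly with the experts
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]  # stand-ins for expert FFNs

def moe_layer(x):                                      # x: (n_tokens, d_model)
    logits = x @ W_router                              # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # chosen expert ids per token
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                        # only top_k of n_experts touched per token
        for k in range(top_k):
            out[t] += gates[t, k] * (x[t] @ experts[top[t, k]])
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)  # (4, 64)
```

Nothing in this setup assigns topics to experts up front; whatever specialization emerges is a side effect of training the router and the experts jointly.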
I read somewhere that the Ryzen AI 370 chip can run Gemma 3 14B at 7 tokens/second, so I would expect performance to be somewhere in that range for Llama 4 Scout with 17B active parameters.
Sure, but the upside of Apple Silicon is that larger memory sizes are comparatively cheap (compared to buying the equivalent amount of 5090s or 4090s). Also, you can download quantizations.
I have Apple Silicon and it's the worst when it comes to prompt processing time. So unless you stick to small contexts, it's not fast enough to let you do any real work with it.
Apple should've invested more in bandwidth, but it's Apple, and it has lost its visionary. Imagine having 512GB on an M3 Ultra and not being able to run even a 70B model on it with a decent context window.
At a 4-bit quant (which needs ~64GB), the price of the Mac (~$4.2K) is almost exactly the same as 2x 5090s (provided we ever see them in stock). But the 2x 5090s have ~6x the memory bandwidth and probably close to 50x the matmul compute at int4.
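To put rough numbers on the bandwidth side of that comparison, here is a purely bandwidth-bound sketch of token generation for a Scout-sized model (spec-sheet bandwidths, 17B active weights at 4 bits; prompt processing is compute-bound, which is where the ~50x matmul gap would show up instead):

```python
# Illustrative token-generation ceilings for a Scout-sized model (17B active, q4),
# assuming generation is limited purely by memory bandwidth. Spec-sheet numbers only;
# the two GPUs are treated as roughly additive under tensor parallelism.

def gen_ceiling_tok_s(active_params_b: float, bits: float, bandwidth_gbs: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

for name, bw_gbs in [("Mac (M4 Max class, ~546 GB/s)", 546),
                     ("2x RTX 5090 (~1792 GB/s each)", 2 * 1792)]:
    print(f"{name}: ~{gen_ceiling_tok_s(17, 4, bw_gbs):.0f} tok/s ceiling")
```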
Maybe I'm missing something but I don't think I've ever seen quants lower memory reqs. I assumed that was because they still have to be unpacked for inference. (please do correct me if I'm wrong, I contribute to llama.cpp and am attempting to land a client on everything from Android CPU to Mac GPU)
Quantizing definitely lowers memory requirements. It's a pretty direct effect because you're straight up using fewer bits per parameter across the board, so the representation of the weights in memory is smaller, at the cost of precision.
Needing less memory for inference is the entire point of quantization. Saving the disk space or having a smaller download could not justify any level of quality degradation.
> entire point...smaller download could not justify...
Q4_K_M has layers and layers of consensus and polling and surveying and A/B testing and benchmarking to show there's ~0 quality degradation. Built over a couple years.
Quantization by definition lowers memory requirements: instead of using f16 for weights, you are using q8, q6, q4, or q2, which makes the weights smaller by 2x, ~2.7x, 4x, or 8x respectively.
That doesn’t necessarily translate to the full memory reduction because of intermediate compute tensors and the KV cache, but those can also be quantized.
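Spelled out for a Scout-sized model, using idealized bits-per-weight (real GGUF quants such as Q4_K_M carry a little overhead for scales and mixed-precision layers):

```python
# Weight-memory footprint of a 109B-parameter model at common quant levels,
# using idealized bits-per-weight (real formats add some overhead for scales).

PARAMS = 109e9

for name, bits in [("f16", 16), ("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name:>4}: {gb:6.1f} GB  ({16 / bits:.1f}x smaller than f16)")
```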
Nvidia GPUs can natively operate in FP8, FP6, FP4, etc so naturally they have reduced memory requirements when running quantized.
As for CPUs, Intel can only go down to FP16, so you’ll be doing some “unpacking”. But hopefully that is “on the fly” and not when you load the model into memory?
No need to unpack for inference. Since things like CUDA kernels are fully programmable, you can code them to work with 4-bit integers, no problem at all.
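A sketch of what "no need to unpack" can look like: weights live in memory as packed 4-bit blocks with per-block scales, and the kernel expands them block by block as it computes. This is an illustrative scheme, not llama.cpp's actual Q4 layout or a real CUDA kernel.

```python
import numpy as np

# Illustrative block-wise 4-bit quantization: weights stay packed in memory
# (two 4-bit values per byte plus one scale per block) and are only expanded
# block by block inside the dot-product loop.

BLOCK = 32

def quantize_q4(w):                                   # w: 1-D float32 row
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7 + 1e-12
    q = (np.clip(np.round(w / scale), -8, 7) + 8).astype(np.uint8)  # 0..15
    packed = (q[:, 0::2] | (q[:, 1::2] << 4)).astype(np.uint8)      # 2 nibbles per byte
    return packed, scale.astype(np.float32)

def dot_q4(packed, scale, x):                         # dequantize on the fly, per block
    lo = (packed & 0x0F).astype(np.int8) - 8
    hi = (packed >> 4).astype(np.int8) - 8
    q = np.empty((packed.shape[0], BLOCK), dtype=np.float32)
    q[:, 0::2], q[:, 1::2] = lo, hi
    return float(((q * scale) * x.reshape(-1, BLOCK)).sum())

w = np.random.randn(4096).astype(np.float32)
x = np.random.randn(4096).astype(np.float32)
packed, scale = quantize_q4(w)
print("packed bytes:", packed.nbytes + scale.nbytes, "vs f32:", w.nbytes)
print("approx dot:", dot_q4(packed, scale, x), "exact:", float(w @ x))
```

The packed buffer plus scales is what actually sits in RAM/VRAM; nothing is expanded to f16/f32 up front, which is why the memory savings are real.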
Unless I'm missing something, I don't really think it looks that attractive. They're comparing it to Mistral Small 24B and Gemma 3 27B and post numbers showing that it is a little better than those models. But at 4x the memory footprint, is it worth it? (Personally, I was hoping to see Meta's version of a 24-32B dense model, since that size is clearly very capable, or something like an updated version of Mixtral 8x7B.)
Yes, that's what I tried to express. Large prompts will probably be slow; I tried a 120k prompt once and it took 10 minutes to process. But you still get a ton of world knowledge and fast response times, and smaller prompts will process quickly.