More

mstaoru · 2026-02-28T22:59:05 1772319545

I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge model's knowledge.

So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.

Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.

Wonder what am I doing wrong?.. How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.

lm28469 · 2026-02-28T23:14:21 1772320461

> Wonder what am I doing wrong?

You're comparing 100b parameters open models running on a consumer laptop VS private models with at the very least 1t parameters running on racks of bleeding edge professional gpus

Local agentic coding is closer to "shit me the boiler plate for an android app" not "deep research questions", especially on your machine

vlovich123 · 2026-02-28T23:26:43 1772321203

The hardware difference explains runtime performance differences, not task performance.

Speculation is that the frontier models are all below 200B parameters but a 2x size difference wouldn’t fully explain task performance differences

nl · 2026-03-01T03:01:32 1772334092

> Speculation is that the frontier models are all below 200B parameters

Some versions of some the models are around that size, which you might hit for example with the ChatGPT auto-router.

But the frontier models are all over 1T parameters. Source: watch interview with people who have left one of the big three labs and now work at the Chinese labs and are talking about how to train 1T+ models.

NamlchakKhandro · 2026-03-01T02:08:58 1772330938

> The hardware difference explains runtime performance differences, not task performance.

Yes it does.

MrDrMcCoy · 2026-03-01T05:11:29 1772341889

Care to elaborate?

BoredomIsFun · 2026-03-01T10:17:02 1772360222

Certainly not Opus. That beast feels very heavy - the coherence of longer form prose is usually a good marker, and it is able to spit 4000 words coherent short stories from a single shot.

827a · 2026-03-01T03:20:29 1772335229

He's running a 35B parameter model. Frontier models are well over a trillion parameters at this point. Parameters = smarts. There are 1T+ open source models (e.g. GLM5), and they're actually getting to the point of being comparable with the closed source models; but you cannot remotely run them on any hardware available to us.

Core speed/count and memory bandwidth determines your performance. Memory size determines your model size which determines your smarts. Broadly speaking.

regularfry · 2026-03-01T15:38:17 1772379497

The architecture is also important: there's a trade-off for MoE. There used to be a rough rule of thumb that a 35bxa3b model would be equivalent in smarts to an 11b dense model, give or take, but that's not been accurate for a while.

BoredomIsFun · 2026-03-01T10:18:16 1772360296

> There are 1T+ open source models (e.g. GLM5),

GLM-5 is ~750B model.

ses1984 · 2026-02-28T23:30:06 1772321406

Who would have thought ai labs with billions upon billions of r&d budget would have better models than a free alternative.

shlomo_z · 2026-03-01T06:22:08 1772346128

I'll add, AI Labs put a lot of resources into allowing the AI to search the web.. that makes a big difference

mstaoru · 2026-03-01T09:06:27 1772355987

I use search as well via openwebui + searxng.

delaminator · 2026-02-28T23:32:44 1772321564

Looks at the headline: Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

lm28469 · 2026-02-28T23:40:41 1772322041

Yes and Devstral 2 24b q4 is supposed to be 90% as good but it can't even reliably write to a file on my machine.

There are the benchmarks, the promises, and what everybody can try at home

8note · 2026-02-28T23:59:42 1772323182

maybe a harness problem?

SyneRyder · 2026-03-01T15:20:31 1772378431

Having tried the Mistral Vibe harness that was supposedly designed for Devstral, that thing is abysmal. I feel sorry for whatever they did to that model, it didn't deserve it.

The thing I most noticed was asking it for help with configuring local MCP servers in Mistral Vibe - something it supports, it literally shows how many MCP servers are connected on the startup screen - it then begins scanning my local machine for servers running "MineCraft Protocol".

I want Mistral to do well, and I use their Voxtral Transcribe 2, that one has been useful. I'd even like a well made Mistral Vibe (c'mon, "oui oui baguette" is a hilarious replacement for "thinking"). But Mistral are so far behind, and they don't seem to even know or accept that they are.

aspenmartin · 2026-02-28T23:08:48 1772320128

Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment. Local models are inherently inferior; even the best Mac that money can buy will never hold a candle to latest generation Nvidia inference hardware, and the local models, even the largest, are still not quite at the frontier. The ones you can plausibly run on a laptop (where "plausible" really is "45 minutes and making my laptop sound like it is going to take off at any moment". Like they said -- you're getting sonnet 4.5 performance which is 2 generations ago; speaking from experience opus 4.6 is night and day compared to sonnet 4.5

zozbot234 · 2026-02-28T23:15:20 1772320520

> Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment.

But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground. that may not give you the absolute best in performance but will be found broadly acceptable and still be quite viable for a home lab.

aspenmartin · 2026-02-28T23:50:38 1772322638

Batching helps with efficiency but you can’t fit opus into anything less than hundreds of thousands of dollars in equipment

Local models are more than a useful middle ground they are essential and will never go away, I was just addressing the OPs question about why he observed the difference he did. One is an API call to the worlds most advanced compute infrastructure and another is running on a $500 CPU.

Lots of uses for small, medium, and larger models they all have important places!!

holoduke · 2026-03-01T06:51:25 1772347885

Your Gemini or Opus question got send to a Texas datacenter where it got queued and processed by a subunit of 80 h200 140gb 1000w cards running a many billion or trillion parameter model. It took less that 200ms to process a single request. Your Claude cliënt decided to spawn 30 sub agents and iterated in a total of 90 requests totalling about 45000ms. Now compare that to your 100b transistor cpu doing something similar. Yes that would be slow.

mstaoru · 2026-03-01T09:13:29 1772356409

Right, it was more of a rhetorical question :) With my point being - how are these local models really useful to me now? Is the Only Way ™ to sell my house and build a 8x5090 monster?.. How does that compare to $20/month Opus? (Privacy aside.)

The second order thought from this is... will we get a value-based price leveling soon? If the alternative to a hosted LLM is to build $10-20k+ machine with $500+ monthly energy bills, will hosted price asymptotically climb up to reflect this reality?

Something to think about.

regularfry · 2026-03-01T14:03:41 1772373821

Looked at from the other end of the telescope, the other factor is how fast low-end local models can gain capability. This 35b model is absolutely fine on a 4090 in a machine that was about £3000 when I bought it three years ago. Where will what you can run on a 4090, or a 5090, be in six months? That's the interesting question, but we're already well past the point where the uses to which you will be able to put a local model dramatically increase within the depreciation lifespan of the hardware.

etyhhgfff · 2026-03-01T10:13:30 1772360010

We would need a super high end AI accelerator with specialised cooling for less than 3k bucks to make it happen. Consumer gaming graphics card wont fit the bill. Problem is all TSMC capacity is already booked for years to come by the big players to build data center grade hardware with price tags and setup requirements out of consumer reach.

wolvoleo · 2026-03-01T01:10:27 1772327427

Well first of all you're running a long intense task on a thermally constrained machine. Your MacBook Pro is optimised for portability and battery life, not max performance under load. And apple's obsession with thinness overrules thermal performance for them. Short peaks will be ok but a 45 minute task will thoroughly saturate the cooling system.

Even on servers this can happen. At work we have a 2U sized server with two 250W class GPUs. And I found that by pinning the case fans at 100% I can get 30% more performance out of GPU tasks which translates to several days faster for our usecase. It does mean I can literally hear the fans screaming in the hallway outside the equipment room but ok lol. Who cares. But a laptop just can't compare.

Something with a desktop GPU or even better something with HBM3 would run much better. Local models get slow when you use a ton of context and the memory bandwidth of a MacBook Pro while better than a pc is still not amazing.

And yeah the heaviest tasks are not great on local models. I tend to run the low hanging fruit locally and the stuff where I really need the best in the cloud. I don't agree local models are on par, however I don't think they really need to be for a lot of tasks.

pamcake · 2026-03-01T01:41:20 1772329280

To your point, one can get a great performance boost by propping the laptop onto a roost-like stand in front of a large fan. Nothing like a cooling system actually built for sustained load but still.

meatmanek · 2026-03-01T03:54:16 1772337256

I've seen reports of qwen3.5-35b-a3b spending a ton of time reasoning if the context window is nearly empty-- supposedly it reasons less if you provide a long system prompt or some file contents, like if you use it in a coding agent.

I'm too GPU-poor to run it, but r/LocalLLaMa is full of people using it.

boutell · 2026-03-01T12:14:54 1772367294

Can confirm. I gave it a variant of the car wash question on a MacBook M4 with 32 GB of RAM. It produced output at a conversational speed, sure, but that started with 6 minutes of thinking output. 6 minutes.

On the plus side, it did figure out the question even without the first sentence that's intended as a bit of a giveaway.

regularfry · 2026-03-01T13:54:32 1772373272

There's definitely something wrong with the thinking mode on this one. I wouldn't be surprised if it gets fixed, either by qwen themselves or with a fine-tune.

adam_patarino · 2026-03-01T12:48:14 1772369294

The biggest gaps are not in hardware or model size. There is a lot of logical fallacy in the industry. Most people believe bigger is better. For model size, compute, tools, etc.

The reality in ML is that small models can perform better at a narrow problem set than large ones.

The key is the narrow problem set. Opus can write you a poem, create a shopping list, and analyze your massive code base.

We trained our model to only focus on coding with our specific agent harness, tools, and context engine. And it’s small enough to fit on an M2 16GB. It’s as good as sonnet 4.5 and way better than qwen3.5:35b-a3b

Our beta will be out soon / rig.ai

amritananda · 2026-03-02T00:44:42 1772412282

No benchmarks, no information about training methods/datasets, template placeholder vibe-coded website. Waste of time.

__mharrison__ · 2026-02-28T23:37:00 1772321820

Were you using mlx-lm? I've had good performance with that on Macs. (Sadly, the lead developer just left Apple.)

Admittedly, I haven't tried these models on my Mac, but I have on my DGX Spark, and they ran fine. I didn't see the slowdown you're mentioning.

mstaoru · 2026-03-01T09:09:11 1772356151

(I think) yes, via the latest openwebui + ollama.

CamperBob2 · 2026-02-28T23:56:39 1772322999

Try the 27B dense model. It will likely do much better than the 35b MoE with only 3B active experts.

Also, performance on research-y questions isn't always a good indicator of how the model will do for code generation or agent orchestration.

regularfry · 2026-03-01T14:05:10 1772373910

Currently sat waiting for the unsloth fixed quants to drop, but I'm on the edge of my seat for this.

Balinares · 2026-03-01T21:56:51 1772402211

Wait, didn't they drop like two days ago?

regularfry · 2026-03-02T15:22:50 1772464970

The 35b did but not the 27b. Looks like the latter has been updated in the last half hour.

Balinares · 2026-03-02T22:22:16 1772490136

Neat! Thanks for correcting me there. I'll go and take a look.

zozbot234 · 2026-02-28T23:01:57 1772319717

Running local AI models on a laptop is a weird choice. The Mini and especially the Studio form factor will have better cooling, lower prices for comparable specs and a much higher ceiling in performance and memory capacity.

stavros · 2026-02-28T23:09:49 1772320189

I can never see the point, though. Performance isn't anywhere near Opus, and even that gets confused following instructions or making tool calls in demanding scenarios. Open weights models are just light years behind.

I really, really want open weights models to be great, but I've been disappointed with them. I don't even run them locally, I try them from providers, but they're never as good as even the current Sonnet.

vunderba · 2026-03-01T00:16:29 1772324189

I can't speak to using local models as agentic coding assistants, but I have a headless 128GB RAM machine serving llama.cpp with a number of local models that I use on a daily basis.

- Qwen3-VL picks up new images in a NAS, auto captions and adds the text descriptions as a hidden EXIF layer into the image, which is used for fast search and organization in conjunction with a Qdrant vector database.

- Gemma3:27b is used for personal translation work (mostly English and Chinese).

- Llama3.1 spins up for sentiment analysis on text.

stavros · 2026-03-01T00:34:47 1772325287

Ah yeah, self-contained tasks like these are ideal, true. I'm more using it for coding, or for running a personal assistant, or for doing research, where open weights models aren't as strong yet.

vunderba · 2026-03-01T00:58:23 1772326703

Understood. Research would make me especially leery; I’d be afraid of losing any potential gains as I'd feel compelled to always go and validate its claims (though I suppose you could mitigate it a little bit with search engine tooling like Kagi's MCP system).

andoando · 2026-02-28T23:18:45 1772320725

They're great for some product use cases where you dont need frontier models.

stavros · 2026-02-28T23:21:40 1772320900

Yeah, for sure, I just don't have many of those. For example, the only use I have for Haiku is for summarizing webpages, or Sonnet for coding something after Opus produces a very detailed plan.

Maybe I should try local models for home automation, Qwen must be great at that.

lm28469 · 2026-02-28T23:17:28 1772320648

They're like 6 months away on most benchmarks, people already claimed coding wad solved 6 months ago, so which is it? The current version is the baseline that solves everything but as soon as the new version is out it becomes utter trash and barely usable

zozbot234 · 2026-02-28T23:21:46 1772320906

That's very large models at full quantization though. Stuff that will crawl even on a decent homelab, despite being largely MoE based and even quantization-aware, hence reducing the amount and size of active parameters.

stavros · 2026-02-28T23:20:01 1772320801

That's just a straw man. Each frontier model version is better than the previous one, and I use it for harder and harder things, so I have very little use for a version that's six months behind. Maybe for simple scripts they're great, but for a personal assistant bot, even Opus 4.6 isn't as good as I'd like.

mstaoru · 2026-03-01T15:32:27 1772379147

So it's back to the original question, why spend $5-10k on the Studio, when it will still be 10x slower and half the intelligence vs. $20 Sonnet?.. What is the point (besides privacy) to use local models now for coding?

PS: I can understand that isolated "valuable" problems like sorting photo collection or feeding a cat via ESPHome can be solved with local models.

NorwegianDude · 2026-03-01T22:37:20 1772404640

At least for me, it's cheap. Even Claude Haiku 4.5 would cost over $60 each day for the same token amount, after accounting for electricity costs. I have the hardware for other reasons anyway, so why not use it, avoid privacy issues and save money.

Are the LLMs very useful? That is a whole other discussion...

zozbot234 · 2026-03-01T17:43:11 1772386991

You can't use a $20 Sonnet subscription for general agentic use cases, you have to pay for API use on a per-token basis. The $20 and $200 subscriptions are widely considered unsustainable as such. If anything, the real competition is third-party cheap inference providers.

wat10000 · 2026-03-01T00:04:35 1772323475

I have a laptop already, so that's what I'm going to use.

satvikpendem · 2026-02-28T23:38:57 1772321937

I can take a laptop on the train.

notreallya · 2026-02-28T23:04:33 1772319873

Sonnet 4.5 level isn't Opus 4.6 level, simple as

xtn · 2026-03-01T03:57:49 1772337469

I think knowledge of frontier research certainly scale with number of parameters. Also, US labs can pay more money to have researchers provide training data on these frontier research areas.

On the other hand, if indeed open source models and Macbooks can be as powerful as those SOTA models from Google, etc, then stock prices of many companies would already collapsed.

jonaustin · 2026-03-08T22:38:10 1773009490

Try with qwen 3.5 122b; it has more parameters so a larger corpus of knowledge to draw from than 35b.

gigatexal · 2026-03-01T03:53:54 1772337234

I have the exact same hardware. Was going to do the same thing with the 122B model … I’ll just keep paying Anthropic and he models are just that good. Trying out Gemini too. But won’t pay OpenAI as they’re going to be helping Pete Kegseth to develop autonomous killing machines.

muyuu · 2026-03-01T07:12:51 1772349171

Depending on the specificity of the research, having a model with fewer parameters will come with a higher penalty. If you want a model to perform better at something specific while staying smaller, generally it will take specific training to achieve that.

culi · 2026-02-28T23:08:31 1772320111

Well you can't run Gemini Pro or Opus 4.6 locally so are you comparing a locally run model to cloud platforms?

furyofantares · 2026-02-28T23:08:53 1772320133

Can you try asking Sonnet 4.5 the same question, since that is what this model is claimed to be on par with?

rienko · 2026-02-28T23:39:45 1772321985

use a larger model like Qwen3.5-122B-A10B quantized to 4/5/6 bits depending on how much context you desire, MLX versions if you want best tok/s on Mac HW.

if you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100gb), my guess if the results are much better than 35b-a3b.

andxor · 2026-02-28T23:55:49 1772322949

You're not doing anything wrong. The Chinese models are not as good as advertised. Surprise surprise!

mstaoru · 2026-02-28T08:34:10 1772267650

Every now and then I will google "books like Hyperion", read something, and conclude that it was nothing like Hyperion. Wonderful books, wonderful writer. A loss.

mstaoru · 2026-02-27T19:29:39 1772220579

Health insurance? Remote role taxed where? Would it even count for Portuguese PR?

Spent €10k on visa consultants?.. What did they do exactly?

I wouldn't also fully assume that Portugal is more stable than Turkey. EU is not what it used to be 10 years ago.

...

I would stay in Turkey.

Slaboli · 2026-02-27T21:07:02 1772226422

Thanks for the advice! The position is taxed in Portugal. Tt's a US company with a local base, (paying 1100 monthly lol) so everything is legally compliant regarding taxes and residency. I spent 10k EUR in total, including my stays in Lisbon, just to handle the paperwork (it took six months for them to process my residency card)

That is precious insight about stability. You're saying Portugal, while in a better position, isn't significantly more stable than Turkey in the long run. For EU, I agree that it probably won't improve much over the long term.

mstaoru · 2026-02-21T11:16:40 1771672600

I guess it says something about OAuth when you implement it "at scale" and still have multiple misconceptions (all very common though).

Most importantly, OAuth is an authorization framework, OIDC is an authentication extension built on top.

Refresh tokens are part of authorization, not authentication.

HTTP header is Authorization: Bearer..., not Authentication.

There's no such thing as "HMAC encryption", it's a message authentication code. RSA in OAuth is also typically used for signing, not encryption. Not much "encryption" encryption going on in OAuth overall TBH.

Nonce and client IDs are not "salts", but ok that's nitpicking :)

reactordev · 2026-02-21T13:53:03 1771681983

Baby steps my guy, baby steps. Yes, I don’t even mention OIDC, but I think the way I explained it was the middle schoolers version we all can understand (even if there are some minor mistakes in nomenclature).

The point I was trying to make at 2am is that it’s not scary or super advanced stuff and that you can get away with OAuth-like (as so many do). But yes, OAuth is authorization, OIDC is authentication. The refresh token is an authorization but it makes sense to people who have never done it to think of it as a “post-login marker”.

mstaoru · 2026-02-01T21:55:41 1769982941

We moved into a new flat with really bad lighting and I decided to buy those "AmazeFun" (or whatever generic named CN brand) "smart" LED ceiling lights. Bought one for each of four rooms.

Installed, tested them with the app, everything works, great!

Got out the remotes since pulling out the phone to use the app every time you want to turn on the light in the room is a bit much for me. Pressed Power, boom, the whole house is powered on. Dimmer, light temperature, everything syncs between all four lights. Power off turns them all off.

Wrote to "AmazeFun" support, turns out it's "normal behavior". Right.

paradox460 · 2026-02-02T00:46:31 1769993191

Fwiw, get bulbs that run something like wled. You can pair them with esp-now remotes, like the wiiz remote

https://www.athom.tech/blank-1/wled-15w-color-bulb

mstaoru · 2026-02-01T14:21:42 1769955702

Headscale is good. We're using to manage a two isolated networks of about 400 devices each. It just works. It's in China so official Tailscale DERPs do not work, but enabling built-in DERP was very easy.

mstaoru · 2025-11-28T09:18:31 1764321511

In a similar vein, but different concept - https://hazeover.com/ - dimming inactive windows.

mstaoru · 2025-11-21T09:08:35 1763716115

Social media (FB, Twitter, Instagram...) - never had any, never felt the need.

TikTok etc - I strongly believe these are brain cancer.

Crypto - never had any need or interest.

Smartwatches - never solved any need for me either. Same with tablets.

Apple ecosystem - I have a Macbook but all the other stuff IMHO is pretty bad or I don't need it.

Pokemon - no interest.

Home IoT - despite working for many years in (commercial) IoT, home IoT never clicked for me, it's all really clunky and useless, at least in my experience.

VR - we have Quest 3 but rarely play it, it's just not fun somehow after the initial novelty wears off, I much prefer PS5.

mstaoru · 2025-11-02T11:05:55 1762081555

I brought my WiFi 7-capable ASUS RT-BE96U to Germany (from China) and I proudly notice that my average download speed is up to ~105 Mbit from ~95 Mbit with the stock Vodafone router.

"Silicon Valley of Europe", my a*s.

mstaoru · 2025-10-06T11:27:19 1759750039

> Big Tech is spending $364B on infrastructure instead of fixing the code

You mean CrowdStrike still crashes? Spotlight still writes 26TB every night? (Which only happened in beta, AFAIK...) Of course, they are fixing the code. Conflating infrastructure spending is not helpful.

The bitter truth is that complex software will always contain some bugs, it's close to impossible to ship a completely mathematically perfect software. It's how we react to bugs and the report/fix/update pipeline that truly matters.