The first number I look at these days is the file size via Ollama, which for thi...

nico · 2025-05-21T21:32:42 1747863162

Any agentic dev software you could recommend that runs well with local models?

I’ve been using Cursor and I’m kind of disappointed. I get better results just going back and forth between the editor and ChatGPT

I tried localforge and aider, but they are kinda slow with local models

ynniv · 2025-05-22T00:28:37 1747873717

https://github.com/block/goose

zackify · 2025-05-22T03:26:30 1747884390

I used devstral today with cline and open hands. Worked great in both.

About 1 minute initial prompt processing time on an m4 max

Using LM studio because the ollama api breaks if you set the context to 128k.

elAhmo · 2025-05-22T10:08:46 1747908526

How is it great that it takes 1 minute for initial prompt processing?

zackify · 2025-05-28T01:45:18 1748396718

Haha great as in surprisingly good at some simple things that nothing has been able to do locally for me.

The 1 minute first token sucks and has me dreaming for the day of 3-4x the bandwidth

cheema33 · 2025-05-24T12:12:14 1748088734

That time is just for the very first prompt. It is basically the startup time for the model. Once it is loaded, it is much much faster in responding to your queries. Depending on your hardware of course.

nico · 2025-05-22T13:47:04 1747921624

Have you tried using mlx or Simon Wilson’s llm?

https://llm.datasette.io/en/stable/

https://simonwillison.net/tags/llm/

zackify · 2025-05-28T01:44:37 1748396677

On lm studio I was using mlx

asimovDev · 2025-05-22T06:28:14 1747895294

you can use ollama in VS Code's copilot. I haven't personally tried it but I am interested in how it would perform with devstral

jabroni_salad · 2025-05-21T21:50:00 1747864200

Do you have any other interface for the model? what kind of tokens/sec are you getting?

Try hooking aider up to gemini and see how the speed is. I have noticed that people in the localllama scene do not like to talk about their TPS.

nico · 2025-05-21T22:20:30 1747866030

The models feel pretty snappy when interacting with them directly via ollama, not sure about the TPS

However I've also ran into 2 things: 1) most models don't support tools, sometimes it's hard to find a version of the model that correctly uses tools, 2) even with good TPS, since the agents are usually doing chain-of-thought and running multiple chained prompts, the experience feels slow - this is even true with Cursor using their models/apis

segmondy · 2025-05-24T13:42:32 1748094152

People have all sorts of hardware, TPS is meaningless without the full spec of the hardware, and GPU is not the only thing, CPU, ram speed, memory channel, PCIe speed, inference software, partial CPU offload? RPC? even OS, all of these things add up. So if someone tells you TPS for a given model, it's meaningless unless you understand their entire setup.

ivanvanderbyl · 2025-05-23T10:35:09 1747996509

I’ve been playing around with Zed, supports local and cloud models, really fast, nice UX. It does lack some of the deeper features of VSCode/Cursor but very capable.

mrshu · 2025-05-22T17:00:03 1747933203

ra-aid works pretty well with Ollama (haven't tried it with Devstral yet though)

https://docs.ra-aid.ai/configuration/ollama/

lis · 2025-05-21T18:48:17 1747853297

Yes, I agree. I've just ran the model locally and it's making a good impression. I've tested it with some ruby/rspec gotchas, which it handled nicely.

I'll give it a try with aider to test the large context as well.

ericb · 2025-05-21T19:56:48 1747857408

In ollama, how do you set up the larger context, and figure out what settings to use? I've yet to find a good guide. I'm also not quite sure how I should figure out what those settings should be for each model.

There's context length, but then, how does that relate to input length and output length? Should I just make the numbers match? 32k is 32k? Any pointers?

lis · 2025-05-21T20:33:11 1747859591

For aider and ollama, see: https://aider.chat/docs/llms/ollama.html

Just for ollama, see: https://github.com/ollama/ollama/blob/main/docs/faq.md#how-c...

I’m using llama.cpp though, so I can’t confirm these methods.

nico · 2025-05-21T21:34:34 1747863274

Are you using it with aider? If so, how has your experience been?

zackify · 2025-05-22T03:28:48 1747884528

Ollama breaks for me. If I manually set the context higher. The next api call from clone resets it back.

And ollama keeps taking it out of memory every 4 minutes.

LM studio with MLX on Mac is performing perfectly and I can keep it in my ram indefinitely.

Ollama keep alive is broken as a new rest api call resets it after. I’m surprised it’s this glitched with longer running calls and custom context length.

davedx · 2025-05-22T08:15:28 1747901728

I couldn’t run it on my 16gb MBP (I tried, it just froze up, probably lots of swapping), they say it needs 32gb

ics · 2025-05-22T17:04:51 1747933491

I was able to run it on my M2 Air with 24GB. Startup was very slow but less than 10 minutes. After that responses were reasonably quick.

Edit: I should point out that I had many other things open at the time. Mail, Safari, Messages, and more. I imagine startup would be quicker otherwise but it does mean you can run with less than 32GB.

rahimnathwani · 2025-05-22T06:09:08 1747894148

Almost all models listed in the ollama model library have a version that's under 20GB. But whether that's a 4-bit quantization (as in this case) or more/fewer bits varies.

AFAICT they usually set the default tag to sa version around 15GB.