More

htsh · 2025-05-16T15:29:17 1747409357

I have been doing this with claude code and openai codex and/or cline. One of the three takes the first pass (usually claude code, sometimes codex), then I will have cline / gemini 2.5 do a "code review" and offer suggestions for fixes before it applies them.

htsh · 2025-04-28T22:03:51 1745877831

curious, why the 30b MoE over the 32b dense for local coding?

I do not know much about the benchmarks but the two coding ones look similar.

Casteil · 2025-04-28T22:10:06 1745878206

The MoE version with 3b active parameters will run significantly faster (tokens/second) on the same hardware, by about an order of magnitude (i.e. ~4t/s vs ~40t/s)

genpfault · 2025-04-28T23:06:14 1745881574

> The MoE version with 3b active parameters

~34 tok/s on a Radeon RX 7900 XTX under today's Debian 13.

tgtweak · 2025-04-29T02:21:39 1745893299

And vmem use?

genpfault · 2025-04-29T13:30:43 1745933443

~18.6 GiB, according to nvtop.

ollama 0.6.6 invoked with:

    # server
    OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve

    # client
    ollama run --verbose qwen3:30b-a3b

~19.8 GiB with:

    /set parameter num_ctx 32768

tgtweak · 2025-04-29T13:58:19 1745935099

Very nice, should run nicely on a 3090 as well.

TY for this.

update: wow, it's quite fast - 70-80t/s on LM Studio with a few other applications using GPU.

htsh · 2025-02-16T16:20:50 1739722850

A lot of us have ryzen / nvidia combos... hopefully, soon, though.

bigbones · 2025-02-16T18:14:09 1739729649

Openvino runs fine on AMD last I checked

_carbyau_ · 2025-02-16T23:08:49 1739747329

Maybe it does, however the system requirements page makes it looks like it supports everything BUT AMD.

https://docs.openvino.ai/2024/about-openvino/release-notes-o...

7speter · 2025-02-17T02:16:02 1739758562

It supports AMD cpus because, if I understand correctly, AMD licenses x86 from Intel, so it shares the same bits needed to run openVINO as Intel’s cpus.

Go look at CPUs benchmarks on Phoronix; AMD Ryzen cpus regularly trounce Intel cpus using openVINO inference.

htsh · on Jan 21, 2025

assuming you want to run entirely in GPU, with 12gb vram, your sweet spot is likely the distill 14b qwen at a 4bit quant. so just run:

ollama run deepseek-r1:14b

generally, if the model file size < your vram, it is gonna run well. this file is 9gb.

if you don't mind slower generation, you can run models that fit within your vram + ram, and ollama will handle that offloading of layers for you.

so the 32b should run on your system, but it is gonna be much slower as it will be using GPU + CPU.

prob of interest: https://simonwillison.net/2025/Jan/20/deepseek-r1/

-h

jordiburgos · 2025-01-22T23:18:51 1737587931

Thank you!! I just loaded it a fits in memory as you said.

I am testing it now and seems quite fast giving the responses for a local model.

htsh · on Feb 27, 2024

As a longtime user of nodemailer, thank you.

I am gonna check out emailengine for future work.

htsh · on Feb 1, 2024

Dreambooth was kinda great?

That said, I agree that I wish there were more done post-research towards products with some of this stuff.

htsh · on Jan 27, 2024

Yes, offloading some layers to the GPU and VRAM should still help. And 11gb isn't bad.

If you're on linux or wsl2, I would run oobabooga with --verbose. Load a GGUF, start with a small number of GPU layers and creep up, keeping an eye on VRAM usage.

If you're on windows, you can try out LM Studio and fiddle with layers while you monitor VRAM usage, though windows may be doing some weird stuff sharing ram.

Would be curious to see the diffs. Specifically if there's a complexity tax in offloading that makes the CPU-alone faster but in my experience with a 3060 and a mobile 3080, offloading what I can makes a big diff.

macNchz · on Jan 27, 2024

> Specifically if there's a complexity tax in offloading that makes the CPU-alone faster

Anecdotal, but I played with a bunch of models recently on a machine with a 16GB AMD GPU and 64GB of system memory/12 core CPU. I found offloading to significantly speed things up when dealing with large models, but there was seemingly an inflection point as I tested models that approached the limits of the system, where offloading did seem to significantly slow things down vs just running on the CPU.

htsh · on Jan 27, 2024

openrouter, fireworks, together.

we use openrouter but have had some inconsistency with speed. i hear fireworks is faster, swapping it out soon.

htsh · on Jan 19, 2024

Can one enter their own opeanai URL and api-key? (so we can use openai-compatible things like openrouter or lm-studio)?

nullstyle · on Jan 19, 2024

Doesn't look like it: https://gitlab.com/literally-useful/voxos/-/blob/dev/voxos/s...

edit: shouldn't be hard to enable though

Falimonda · on Jan 19, 2024

Yes, you can define your own key in either the .env, CLI call on run.sh, or in your environment.

https://gitlab.com/literally-useful/voxos/-/blob/dev/.env?re...

nullstyle · on Jan 19, 2024

That doesn't let me send requests to my local litellm instance, though. You have to be able to configure the endpoint that requests are sent against as well.

Falimonda · on Jan 20, 2024

Nice. LiteLLM was just the thing I've been looking for and hoping to integrate.

nullstyle · on Jan 20, 2024

Hell yeah. Good luck!

Falimonda · on Jan 20, 2024

Do you know if there's anything out there like LiteLLM that includes OpenAI's whisper model? I took a look at the litellm package and it doesn't appear they support the audio module. :/

nullstyle · on Jan 20, 2024

I'm not sure if it is _fully_ openai compatible, but whispercpp has a server bundled that says it is "OAI-like": https://github.com/ggerganov/whisper.cpp/tree/master/example...

I don't have any direct experience with it... I've only played around with whisper locally, using scripts.

khimaros · on Jan 20, 2024

https://github.com/mudler/LocalAI

htsh · on Dec 10, 2023

That is what the RAG system does. The PDF is chunked and thrown into a vector store. And then when prompted, only the relevant bits are retrieved and stuffed into the context and sent to the LLM.

So yeah it's kinda smoke and mirrors. In some cases, for some long PDFs, it works really well. If it's a 500 page PDF with many disparate topics, it may do fine.

freedomben · on Dec 10, 2023

Indeed. Would only add, context windows are continually multiplying in size. Who knows how long Moore's Law will apply here, but it's a continually improving window.

saberience · on Dec 10, 2023

I've found that the longer context windows don't seem to be a linear improvement in responses though. It's like the longer the context window, the quality of the response is perhaps broader, but less sharp or accurate. I've been using GPT4-turbo with the longer context window for coding tasks but it doesn't seem to have improved the responses as much as you would think, it seems to be more "distracted" now, which perhaps makes some intuitive sense.

I can give gpt4-turbo many full code files to try and solve a complex coding task but despite the larger window it seems to fail more often or ignore parts of the context window or just doesn't really answer the question.

saberience · on Dec 10, 2023

That assumes that only one part of the PDF, which fits in the context window, is relevant to the prompt, which seems like a fairly big assumption.