From that table, the A100 tok/sec numbers (higher is faster) are:
- Eager: 28
- Compiled: 128
And
- KV cache eager: 26
- KV cache compiled: 99
The reason the KV cache is slower here is likely that it's not GPU-optimized code; on CPU the KV cache is faster. To make it faster on GPU, you would, for example, pre-allocate the tensors on the device instead of `torch.cat`-ing them on the fly.
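A minimal sketch of what pre-allocation could look like (the class and shape names here are illustrative assumptions, not the actual code from the repo):

```python
import torch

class PreallocKVCache:
    """Pre-allocated KV cache: buffers are created once on the target
    device, and new keys/values are written in place. This avoids the
    allocate-and-copy that `torch.cat` performs on every decode step."""

    def __init__(self, batch, n_heads, max_len, head_dim, device="cpu"):
        # Allocate the full-size buffers up front, directly on the device.
        self.k = torch.zeros(batch, n_heads, max_len, head_dim, device=device)
        self.v = torch.zeros(batch, n_heads, max_len, head_dim, device=device)
        self.pos = 0  # number of positions filled so far

    def update(self, k_new, v_new):
        # In-place slice assignment instead of torch.cat.
        t = k_new.shape[2]  # number of new time steps
        self.k[:, :, self.pos:self.pos + t] = k_new
        self.v[:, :, self.pos:self.pos + t] = v_new
        self.pos += t
        # Return views over the filled portion (no copy).
        return self.k[:, :, :self.pos], self.v[:, :, :self.pos]
```

The key difference from the `torch.cat` approach: each decode step does an in-place write into an existing buffer rather than allocating a new, slightly larger tensor and copying everything over, which is what tends to hurt on GPU.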
It could also be an artifact of the small model size not fully utilizing the GPU. For example, for the slightly larger Qwen3 0.6B model, the A100 is faster (you can see it when scrolling to the bottom here: https://github.com/rasbt/LLMs-from-scratch/tree/main/ch05/11...)
- private, on-device models (possibly with lower latency than models via web API); also edge devices
- algorithm research (faster and cheaper to prototype new ideas)
- cheap tasks, like classification/categorization; sure, you don't strictly need a decoder-style LLM for that, but it has the advantage of being more free-form, which is useful in many scenarios; or maybe a sanity checker for grammar; or even a router to other models (GPT-5 style)
I think GPT-4.5 was potentially the original GPT-5: a larger model pre-trained on more data. Too bad it was too expensive to deploy at scale, so we never saw the RL-ed version.
Good point. LLMs lower the barrier to entry for anyone with enough resources, because those architectures are robust to tweaks as long as you throw enough compute and data at them. You can even violate the scaling laws and still get a good model (as Llama 3 showed back then).
I’ve been using the ollama version (uses about 13 GB of RAM on macOS) and haven’t had that issue yet. I wonder if it’s maybe an issue with the llama.cpp port?