> Will LLMs be cheaper than humans once the subsidies for tokens go away? At this point we have little visibility to what the true cost of tokens is now, let alone what it will be in a few years time. It could be so cheap that we don’t care how many tokens we send to LLMs, or it could be high enough that we have to be very careful.

We do have some idea. Kimi K2 is a relatively high-performing open-source model. People have it running at 24 tokens/second on a pair of Mac Studios, which cost about $20k. That setup draws less than a kilowatt of power, so the $0.08-$0.15 per hour spent on electricity is negligible compared to a developer's time. And even if this is close to the cheapest way to run it locally, it's almost certain that the cost per token is far lower on specialized hardware at scale.
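
(Back-of-the-envelope, with assumed numbers for the power draw and electricity price, the marginal electricity cost works out to roughly a dollar per million tokens:)

  # assumed: ~0.7 kW draw, $0.15/kWh, 24 tokens/s sustained
  awk 'BEGIN { kw=0.7; price=0.15; tps=24;
               printf "~$%.2f per million tokens (electricity only)\n",
                      (kw * price) / (tps * 3600 / 1e6) }'
  # prints roughly $1.22; amortizing the $20k hardware over a few years of
  # continuous use adds several more dollars per million tokens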

In other words, a near-frontier model is running at a cost that a (somewhat wealthy) hobbyist can afford. And it's hard to imagine that the hardware costs don't come down quite a bit. I don't doubt that tokens are heavily subsidized but I think this might be overblown [1].

[1] training models is still extraordinarily expensive and that is certainly being subsidized, but you can amortize that cost over a lot of inference, especially once we reach a plateau for ideas and stop running training runs as frequently.


> a near-frontier model

Is Kimi K2 near-frontier though? At least when run in an agent harness, and for general coding questions, it seems pretty far from it. I know what the benchmarks say, they always say it's great and close to frontier models, but is that others' impression in practice? Maybe my prompting style works best with GPT-type models, but I'm just not seeing it for the type of engineering work I do, which is fairly typical stuff.


I’ve been running K2.5 (through the API) as my daily driver for coding through Kimi Code CLI and it’s been pretty much flawless. It’s also notably cheaper and I like the option that if my vibe coded side projects became more than side projects I could run everything in house.

I’ve been pretty active in the open model space and 2 years ago you would have had to pay 20k to run models that were nowhere near as powerful. It wouldn’t surprise me if in two more years we continue to see more powerful open models on even cheaper hardware.


I agree with this statement. Kimi K2.5 is at least as good as the best closed source models today for my purposes. I've switched from Claude Code w/ Opus 4.5 to OpenCode w/ Kimi K2.5 provided by Fireworks AI. I never run into time-based limits, whereas before I was running into daily/hourly/weekly/monthly limits all the time. And I'm paying a fraction of what Anthropic was charging (from well over $100 per month to less than $50 per month).

Saw you wrote that you moved away from Opus 4.5. If you haven’t tried Opus 4.6, there’s only one number different in the name, but the common experience is it’s significantly better.

Have you tried 4.6 as a comparison to Kimi K2.5?


> OpenCode w/ Kimi K2.5 provided by Fireworks AI

Are you just using the API mode?


> it’s been pretty much flawless

So above and beyond frontier models? Because they certainly aren't "flawless" yet, or we have very different understandings of that word.


I have increasingly changed my view on LLMs and what they're good for. I still strongly believe LLMs cannot replace software engineers (they can assist yes, but software engineering requires too much 'other' stuff that LLMs really can't do), but LLMs can replace the need for software.

During the day I am working on building systems that move lots of data around, where context and understanding of the business problem is everything. I largely use LLMs for assistance. This is because I need the system to be robust, scalable, maintainable by other people and adaptable to a large range of future needs. LLMs will never be flawless in a meaningful sense in this space (at least in my opinion).

When I'm using Kimi I'm using it for purely vibe coded projects where I don't look at the code (and if I do I consider this a sign I'm not thinking about the problem correctly). Are these programs robust, scalable, generalizable, adaptable to future use cases? No, not at all. But they don't need to be, they need to serve a single user for exactly the purpose I have. There are tasks that used to take me hours that now run in the background while I'm at work.

In this latter sense I say "flawless" because 90% of my requests solve the problem on the first pass, and the 10% of the time where there is some error, it is resolved in a single request, and I don't have to ever look at the code. For me that "don't have to look at the code" is a big part of my definition of "flawless".


Your definition of flawless is fine for you, but it requires a big asterisk. Without being called out on it, look how your message would have read to someone who's not in the know about LLM limitations; it would have contributed further to the disillusionment with the field and the gaslighting that's already going on by big companies.

Depends what you see as flawless. From my perspective, even GPT 5.2 produces mostly garbage-grade code (yes, it often works, but it is not suitable for anywhere near production) and takes several iterations to get to a remotely workable state.

> not suitable for anywhere near production

This is what I've been increasingly understanding is the wrong way to understand how LLMs are changing things.

I fully agree that LLMs are not suitable for creating production code. But the bigger question you need to ask is 'why do we need production code?' (and to be clear, there are and always will be cases where this is true, just increasingly fewer of them)

The entire paradigm of modern software engineering is fairly new. I mean, it wasn't until the invention of the programmable computer that we even had the concept of software, and that was less than 100 years ago. Even if you go back to the 80s, a lot of software didn't need to be distributed or serve an endless variety of users. I've been reading a lot of old Common Lisp books recently and it's fascinating how often you're really programming Lisp for yourself and your own experiments. But since the advent of the web and scaling software to many users with diverse needs, we've increasingly needed to maintain systems that have all the assumed properties of "production" software.

Scalable, robust, adaptable software is only a requirement because it was previously infeasible for individuals to build non-trivial systems for solving any more than one or two personal problems. Even software engineers couldn't write their own text editor and still have enough time to also write software.

All of the standard requirements of good software exist for reasons that are increasingly becoming less relevant. You shouldn't rely on agents/LLMs to write production code, but you also should increasingly question "do I need production code?"


> Scalable, robust, adaptable software is only a requirement because it was previously infeasible for individuals to build non-trivial systems for solving any more than one or two personal problems. Even software engineers couldn't write their own text editor and still have enough time to also write software.

That's a wild assumption. I personally know engineers who _alone_ wrote things like compilers, emulators, editors, complex games and management systems for factories, robots. That was before the internet was widely available and they had to use physical books to learn.


In terms of security: yes, everyone needs production code.

In my mind, "yolo ai" application (throwaway code on one hand, unrestrained assistants on the other) - is a little like better spreadsheets and smart documents were in the 90s; just run macros! Everywhere! No need for developers - just Word an macros!

Then came macro viruses - and practically everyone cut back hard on distributing code via Word and Excel (in favour of web apps, and we got the dot-com bubble).


Regardless, it's been 3 years since the release of ChatGPT. Literally 3. Imagine how much low-hanging fruit (or even big breakthroughs) will make it into the pricing in just 5 more years, things like quantization, etc. There's no doubt in my mind the question of "price per token" will head towards 0.

You don't even need to go this expensive. An AMD Ryzen Strix Halo (AI Max+ 395) machine with 128 GiB of unified RAM will set you back about $2500 these days. I can get about 20 tokens/s on Qwen3 Coder Next at an 8 bit quant, or 17 tokens per second on Minimax M2.5 at a 3 bit quant.

Now, these models are a bit weaker, but they're in the realm of Claude Sonnet to Claude Opus 4: 6-12 months behind SOTA on something that's well within a personal hobby budget.
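
(For anyone sizing this up: a rough rule of thumb, with made-up example numbers, is that quantized weights take about parameter count x bits per weight / 8 bytes, and you still need headroom for the KV cache and the rest of the system:)

  # hedged rule of thumb; 80B parameters and 4-bit weights are example values only
  awk 'BEGIN { params_b=80; bits=4;
               printf "~%d GB for weights alone\n", params_b * bits / 8 }'
  # -> ~40 GB, leaving the rest of the 128 GiB for KV cache, OS, and apps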


I was testing the 4-bit Qwen3 Coder Next on my 395+ board last night. IIRC it was maintaining around 30 tokens a second even with a large context window.

I haven't tried Minimax M2.5 yet. How do its capabilities compare to Qwen3 Coder Next in your testing?

I'm working on getting a good agentic coding workflow going with OpenCode and I had some issues with the Qwen model getting stuck in a tool calling loop.


I've literally just gotten Minimax M2.5 set up, the only test I've done is the "car wash" test that has been popular recently: https://mastodon.world/@knowmadd/116072773118828295

Minimax passed this test, which even some SOTA models don't pass. But I haven't tried any agentic coding yet.

I wasn't able to allocate the full context length for Minimax with my current setup, I'm going to try quantizing the KV cache to see if I can fit the full context length into the RAM I've allocated to the GPU. Even at a 3 bit quant MiniMax is pretty heavy. Need to find a big enough context window, otherwise it'll be less useful for agentic coding. With Qwen3 Coder Next, I can use the full context window.
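
(In case it's useful: recent llama.cpp builds expose KV-cache quantization through server flags along these lines; the exact flag spellings vary by version, and the model path here is just a placeholder.)

  # hedged sketch: q8_0 keys/values roughly halve KV-cache memory vs. f16;
  # a quantized V cache generally needs flash attention enabled
  llama-server -m MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf \
    --ctx-size 131072 --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0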

Yeah, I've also seen the occasional tool call looping in Qwen3 Coder Next, that seems to be an easy failure mode for that model to hit.


OK, with MiniMax M2.5 UD-Q3_K_XL (101 GiB), I can't really seem to fit the full context in even at smaller quants. Going up much above 64k tokens, I start to get OOM errors when running Firefox and Zed alongside the model, or just failure to allocate the buffers, even going down to 4 bit KV cache quants (oddly, 8 bit worked better than 4 or 5 bit, but I still ran into OOM errors).

I might be able to squeeze a bit more out if I were running fully headless with my development on another machine, but I'm running everything on a single laptop.

So looks like for my setup, 64k context with an 8 bit quant is about as good as I can do, and I need to drop down to a smaller model like Qwen3 Coder Next or GPT-OSS 120B if I want to be able to use longer contexts.


It's crazy to me that it's that slow. 4-bit quants don't lose much with Qwen3 Coder Next, and unsloth/Qwen3-Coder-Next-UD-Q4_K_XL gets 32 tps on a 3090 (24 GB) in a VM with a 256k context size under llama.cpp.

Similarly, unsloth/gpt-oss-120b-GGUF:F16 gets 25 tps and gpt-oss-20b gets 195 tps!!!

The advantage is that you can use the APU for booting the host, pass the GPU through to a VM, and have nice, safer VMs for agents at the same time, all while using DDR4, IMHO.


Yeah, this is an AMD laptop integrated GPU, not a discrete NVIDIA GPU on a desktop. Also, I haven't really done much to try tweaking performance, this is just the first setup I've gotten that works.

The memory bandwidth of the Laptop CPU is better for fine tuning, but MoE really works well for inference.

I won’t use a public model for my secret sauce, no reason to help the foundation models on my secret sauce.

Even an old 1080 Ti works well for FIM (fill-in-the-middle) completions in IDEs.

IMHO the above setup works well for boilerplate, and even the SOTA models fail on the domain-specific portions.

While I lucked out and foresaw the huge price increases, you can still find some good deals. Old gaming computers work pretty well, especially if you have Claude code locally churn on the boring parts while you work on the hard parts.


Yeah, I have a lot of problems with the idea of handing our ability to write code over to a few big Silicon Valley companies, and also have privacy concerns, environmental concerns, etc, so I've refused to touch any agentic coding until I could run open weights models locally.

I'm still not sold on the idea, but this allows me to experiment with it fully locally, without paying rent to some companies I find quite questionable. I know exactly how much power I'm drawing, the money is already spent, and I'm not spending hundreds a month on a subscription.

And yes, the Strix Halo isn't the only way to run models locally for a relatively affordable price; it's just the one I happened to pick, mostly because I already needed a new laptop, and that 128 GiB of unified RAM is pretty nice even when I'm not using most of it for a model.


If you don't mind saying, what distro and/or Docker container are you using to get Qwen3 Coder Next going?

I'm running Fedora Silverblue as my host OS, this is the kernel:

  $ uname -a
  Linux fedora 6.18.9-200.fc43.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Feb  6 21:43:09 UTC 2026 x86_64 GNU/Linux

You also need to set a few kernel command-line parameters to allow it to use most of your memory as graphics memory. I have the following on my kernel command line; those values are each 110 GiB expressed as a number of pages (I figure leaving 18 GiB or so for CPU memory is probably a good idea):

  ttm.pages_limit=28835840 ttm.page_pool_size=28835840
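
For reference, that number is just 110 GiB expressed in 4 KiB pages:

  $ echo $((110 * 1024 * 1024 / 4))
  28835840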

Then I'm running llama.cpp in the official llama.cpp Docker containers. The Vulkan one works out of the box. I had to build the container myself for ROCm; the llama.cpp container has ROCm 7.0 but I need 7.2 to be compatible with my kernel. I haven't actually compared the speed directly between Vulkan and ROCm yet; I'm pretty much at the point where I've just gotten everything working.

In a checkout of the llama.cpp repo:

  podman build -t llama.cpp-rocm7.2 -f .devops/rocm.Dockerfile --build-arg ROCM_VERSION=7.2 --build-arg ROCM_DOCKER_ARCH='gfx1151' .

Then I run the container with something like:

  podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2  --model unsloth/MiniMax-M2.5-GGUF/UD-Q3_K_XL/MiniMax-M2.5-UD-Q3_K_XL-00001-of-00004.gguf --jinja --ctx-size 16384 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio

Still getting my setup dialed in, but this is working for now.

Edit: Oh, yeah, you had asked about Qwen3 Coder Next. That command was:

  podman run -p 8080:8080 --device /dev/kfd --device /dev/dri --security-opt seccomp=unconfined --security-opt label=disable \
    --rm -it -v ~/.cache/llama.cpp/:/root/.cache/llama.cpp/ -v ./unsloth:/app/unsloth llama.cpp-rocm7.2  -hf unsloth/Qwen3-Coder-Next-GGUF:UD-Q6_K_XL \
    --jinja --ctx-size 262144 --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --port 8080 --host 0.0.0.0 -dio

(As mentioned, I'm still just getting this set up, so I've been moving between using `-hf` to pull directly from HuggingFace and using `uvx hf download` in advance; sorry that these commands are a bit messy. The problem with using `-hf` in llama.cpp is that you'll sometimes get surprise updates where it has to download many gigabytes before starting up.)

I can't answer for the OP but it works fine under llama.cpp's container.

$20k for such a setup for a hobbyist? You can leave the "somewhat" away and go into the sub-1% region globally. A kW of power is still $2k/year at least for me; not that I expect it will run continuously, but it's still not negligible if you can get by with $100-200 a year on cheap subscriptions.

There are plenty of normal people with hobbies that cost much more. Off the top of my head, recreational vehicles like racecars and motorcycles, but I'm sure there are others.

You might be correct when you say the global 1%, but that's still 83 million people.


I used to think photography was an expensive hobby until my wife got back into the horse world.

"a (somewhat wealthy) hobbyist"

Reminder to others that $20k is the one-time startup cost, which amortizes to perhaps $2-4k/year (plus power). That is in the realm of a mere gym membership around me for a family.

So 5-10 years to amortize the cost. You could get 10 years of Claude Max and your $20k could stay in the bank in case the robots steal your job or you need to take an ambulance ride in the US.

> And it's hard to imagine that the hardware costs don't come down quite a bit.

have you paid any attention to the hardware situation over the last year?

this week they've bought up the 2026 supply of disks


90% of companies would go bankrupt in a year if you replaced their engineering team with execs talking to k2...

Most execs I've worked with couldn't tell their engineering team what they wanted with any specificity. That won't magically get any better when they talk to an LLM.

If you can't write requirements an engineering team can use, you won't be able to write requirements for the robots either.


> a cost that a (somewhat wealthy) hobbyist can afford

$20,000 is a lot to drop on a hobby. We're probably talking less than 10%, maybe less than 5% of all hobbyists could afford that.


You can rent compute from someone else to greatly reduce the spend. If you just pay for tokens, it will be cheaper than buying the entire computer outright.

Up front, yeah. But people with hobbies on the more expensive end can definitely put out $4k a year. I'm thinking of people who have a workshop and like to buy new tools and start projects.

Horrific comparison point. LLM inference is way more expensive locally for single users than running batch inference at scale in a datacenter on actual GPUs/TPUs.

How is that horrific? It sets an upper bound on the cost, which turns out to be not very high.

If I remember correctly, Dario has claimed that AI inference gross profit margins are 40%-50%.

Why do you people trust what he has to say? Like omg dude. These folks play with numbers all the time to suit their narrative. They are not independently audited. What do you think scares them about going public? Things like this. They cannot massage the numbers the same way they do in the private market.

The naivete on here is crazy tbh.


>24 tokens/second

this is marketing not reality.

Get a few lines of code and it becomes unusable.


The problem is that OpenClaw is kind of like a self-driving car that works 90% of the time. As we have seen, that last 10% (and billions of dollars) is the difference between Waymo today and prototypes 10 years ago.

Being Apple is just a structural disadvantage. Everyone knows that OpenClaw is not secure, and it’s not like I blame the solo developer. He is just trying to get a new tool to market. But imagine that this got deployed by Apple and now all of your friends, parents and grandparents have it and implicitly trust it because Apple released it. Having it occasionally drain some bank accounts isn’t going to cut it.

This is not to say Apple isn’t behind. But OpenClaw is doing stuff that even the AI labs aren’t comfortable touching yet.


"job losses" is BBC editorializing. They do not use that term in their letter: https://www.aboutamazon.com/news/company-news/amazon-workfor...


I sincerely suspect the BBC would only ever use "fired"/"firings" if the employees were being dismissed for conduct reasons, since that's the common usage in British English. I've been let go -- indeed, I've lost my job (it's the employees who suffer job losses, not the employer) -- but I've never been fired.


"Firing" is becoming a bit more common in Britain, but still sounds like an Americanism to my British ears.

I would use "sacking" for performance related termination, and "losing ones job" in all other cases. I suspect BBC would use the same.


"Made redundant" is another term for the latter.


Which, at least in American English, comes across like corporate jargon/weasel words. Lost their job is literally true and would probably take a bunch more words to describe the precise reasons.


Both things can be literally true. I've lost my job by being made redundant, twice. In Britain redundancy is a very specific thing, where your role no longer exists and you must be let go in a fair way according to employment law. It's quite the opposite of jargon or weasel words here: https://www.gov.uk/redundancy-your-rights


Synergized is the term I typically hear.


I think we may be hitting an issue in translation between English and American; in British English "fired" implies "for cause", while a "blameless" process of headcount reduction is known as "redundancy". "Job losses" is a perfectly reasonable neutral phrase here. Indeed, under UK law and job contracts you generally cannot just chuck someone out of their job without either notice or cause or, for large companies, a statutory redundancy process.

People like to make too much out of active/passive word choices. Granted it can get very propagandistic ("officer-involved shooting"), but if you try to make people say "unhoused" instead of "homeless" everyone just back-translates it in their head.


> Indeed, under UK law and job contracts you generally cannot just chuck someone out of their job without either notice or cause or, for large companies, a statutory redundancy process.

This is only true when an employee has worked for a company for 2 or more years.


I think American English is the same colloquially. “I got fired” means I didn’t perform or did something wrong. “I got laid off” is our “I was made redundant”.

“Fired” is also a technical term for both cases, in academic/economist speak.


Fired means terminated for any reason to many Americans. And academics, economists, and lawyers avoid it in my experience.


> in British English "fired" implies "for cause", while a "blameless" process of headcount reduction is known as "redundancy"

OK. I was fired for no stated cause in a process that didn't involve headcount reduction, or the firing of anyone except me specifically. (The unstated cause seems to have been that I had been offered a perk by the manager who hired me that the new manager didn't want to honor after the original guy was promoted.)

How would you describe that, in British English?


"Breach of contract"? "Sacked" would probably work colloquially.


Indeed, Amazon use the euphemism, "making organizational changes".


It’s equivalent to “restructuring”, which doesn’t directly mean a reduction in force, but it does mean that indirectly.


Makes it sound like they shuffled desks and gave everyone new team names. How fun!

(Not like, you know, some people getting divorced soon, some people biting a revolver soon)


And by applying these organizational changes, each person can become more load bearing and have so much more scope and impact. This is not a loss, it's a great win for everyone! /s


The math is wrong:

> Cost: $1,000
> Case 1 (90%): OpenAI goes bankrupt. Return: $0
> Case 2 (9%): OpenAI becomes a big successful company and goes 10x. Return: $1,000 + 5% interest = $1,050
> Case 3 (1%): OpenAI becomes the big new thing and goes 100x. Return: $1,000 + 5% interest = $1,050

The actual math is that if OpenAI succeeds, then there's a nod and a wink that JPM will land the lead role in the IPO or any mergers/acquisitions, which translates into huge fees.


This is correct.

This isn't a financial transaction. This is a "relationship" transaction.


Not to mention that the risk that OpenAI, even if it does go bankrupt, sells for less than $4B is not anywhere close to 90%.


For a company with 800 million weekly active users that is only losing $10B-$15B before implementing ads - which IMO is coming fast and soon to the LLM world - I would never calculate a 90% chance their shares end up at $0 before an exit option.

This is the easiest money and best relationship JPM could imagine


> a company with 800 million weekly active users

Wow, that's slightly more than Yahoo has. Well, had.


Yahoo is a disingenuous parallel here. Yahoo lost because they didn't correctly embrace their market position in what's otherwise the very ripe industry of search engines. Search engines created the 4th most valuable company in the world (Google).

We don't know how ripe OpenAI's industry or market position is, yet. Yahoo knew what it had lost pretty early into its spiral.


Also, if OpenAI goes bankrupt, you _much_ prefer to have loaned them money to having bought shares in the company. People who own shares in a bankruptcy only recover anything after all the people that loaned them money are paid back in full.


So if I'm reading this correctly, it's essentially prompt engineering here and there's no guarantee for the output. Why not enforce a guaranteed output structure by restricting the allowed logits at each step (e.g., what the outlines library does)?


So, in short, there's no guarantee about the output from any LLM, whether it's Gemma or any other (ignoring some details like setting a random seed or parameters like temperature to 0). Like you mentioned, though, libraries like outlines can constrain the output, whereas hosted models often already include this in their API; they can do so because it's a model plus some server-side code.

With Gemma, or any open model, you can use the open libraries in conjunction to get what you want. Some inference frameworks like Ollama include structured output as part of their functionality.

But you mentioned all of this already in your question so I feel like I'm missing something. Let me know!


With OpenAI models, my understanding is that token output is restricted so that each next token must conform to the specified grammar (i.e., a JSON schema), so you’re guaranteed to get either a function call or an error.

Edit: per simonw’s sibling comment, ollama also has this feature.


Ah, there's a distinction here between the model and the inference framework. The Ollama inference framework supports token output restriction. Gemma in AI Studio also does, as does Gemini (there's a toggle in the right-hand panel), but that's because both of those models are being served through an API where the functionality is present on the server.

The Gemma model by itself does not though, nor does any "raw" model, but many open libraries exist for you to plug into whatever local framework you decide to use.


If you run Gemma via Ollama (as recommended in the Gemma docs) you get exactly that feature, because Ollama provides that for any model that they run for you: https://ollama.com/blog/structured-outputs

Under the hood, it is using the llama.cpp grammars mechanism that restricts allowed logits at each step, similar to Outlines.
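
(If it helps, the shape of the request is roughly the following; the model name and schema are placeholders here, see the blog post above for the exact API.)

  curl http://localhost:11434/api/chat -d '{
    "model": "gemma3",
    "messages": [{"role": "user", "content": "Tell me about Canada."}],
    "stream": false,
    "format": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "capital": {"type": "string"},
        "languages": {"type": "array", "items": {"type": "string"}}
      },
      "required": ["name", "capital", "languages"]
    }
  }'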


I've been working on tool calling in llama.cpp for Phi-4 and have a client that can switch between local and remote models for agentic work/search/etc. I learned a lot about this situation recently:

- We can constrain the output with a JSON grammar (old-school llama.cpp)

- We can format inputs to make sure they match the model's expected format.

- Both of these combined is what llama.cpp does, via @ochafik, in inter alia, https://github.com/ggml-org/llama.cpp/pull/9639.

- ollama isn't plugged into this system AFAIK

To OP's question, specifying a format unlocks the training the model specifically had on function calling: what I sometimes call an "agentic loop", i.e. we're dramatically increasing the odds that we're singing the right tune for the model to do the right thing in this situation.
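
A minimal sketch of what the constraint looks like from the CLI (the model file name is a placeholder; recent llama.cpp builds accept a JSON schema directly and compile it down to a GBNF grammar under the hood):

  # hedged example: constrain sampling so the output must match this schema
  llama-cli -m phi-4-Q4_K_M.gguf \
    --json-schema '{"type":"object","properties":{"answer":{"type":"string"}},"required":["answer"]}' \
    -p "In JSON, what is the capital of France?"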


Do you have thoughts on the code-style agents recommended by huggingface? The pitch for them is compelling, since structuring complex tasks in code is something very natural for LLMs. But then, I don’t see as much about this approach outside of HF.


Why is there a 5 year gap on your resume? It sounds like you didn't sit around twiddling your thumbs... you built stuff during that time. On your resume, treat it like you were working a job and talk about what you built. Highlight your open source contributions and if possible, tie them to your resume. Sure, some people will treat an ex-founder as a negative, but many will see it positive. You only need one job.

There is definitely ageism in tech, but 39 isn't old. I'd be happy to take a look at your resume and provide advice if you give me a way to contact you. But it sounds like the issue isn't your resume... indeed you are getting lots of interviews, so maybe it's something you are doing in the interview process. Do you have a sense of where things go wrong? Are you often getting to the final stage before hearing no?


Of those 100 interviews I'd estimate:

- ~70% never go past the recruiter screen

- ~20% rejection after the technical screen (i.e., leetcoding)

- ~10% after a full round of interviews

Presumably I do or say something in interviews that is working against me. I also think that my background gives people the expectation that I should be able to reach a very high bar, and perhaps I set their expectations much higher than I can actually achieve (at least in interviews). Given my real-life interactions, I can confidently say I'm a fairly gregarious and affable person, so I doubt it's my personality that's the issue. But who knows, perhaps my ego is the problem.


I get the impression from a quick skim of your resume that you are all in on Rust.

Do you mainly apply to rust jobs? I feel like I read something here recently with others really struggling to get Rust jobs.


I'm a bit confused - where did the OP post their résumé?


They must have edited it out.


> Ten years is a very long time in tech.

Well, they certainly _might_ be functioning ten years from now. Conservatively, you get 5 years of use out of this, which isn't bad for $15-20, depending on your use case.


I replace AirTag batteries every 6 months to be safe, so $15 (plus batteries) isn’t significantly more expensive over 5 years.


Obviously in practice, this isn't always true, but in general, each employee should output more than their pay, or they wouldn't be there. As an example, Exxon Mobil supposedly has profits of $899,000/employee [1]. Their average pay is probably significantly lower than that, but let's say it's $300k, so a 20% boost in productivity increases profit per employee by $180k. An increase of 50% in salary (and I don't think it takes that much to get people to work in an office) costs them $150k.

[1] https://www.lifehealth.com/top-25-us-companies-ranked-by-pro...


I think that software engineering is about two things: building things the right way and building the right things.

The second one is more important than the first one. If you don't build the right product, it doesn't matter how well it scales or how it has amazing test coverage or wonderful documentation. To that end, I think that too many managers (and companies) do too much shielding of engineers from customers. If you are just given a figma mockup and told "build this", it's easy to get bogged down for a week with the details of building a search bar at the bottom of the page only to realize that the stakeholders would have been OK with a dropdown select. Better to understand the problem you are solving and the only way to really do this is to have some kind of interaction with customers. As an engineering manager, I try to encourage engineers to get on sales calls and see product demos. When you see it from a high level, you a) almost always notice things that need fixing or can be improved and b) see where the piece that you are working on fits into the larger picture.

That said, I find that many engineers don't want to get on customer calls, and usually there's room for those engineers in an organization as well. For example, "build a new video conferencing service for artists to collaborate" would be a very challenging problem (I think) that is not well defined and therefore requires deep customer understanding. "Make Google searches run with 10% fewer CPU milliseconds" is arguably a much harder problem to solve, but it's so well defined that it really doesn't need customer understanding (setting aside the initial decision about whether it makes sense to work on).


As a fellow engineering manager -- I 100% agree. The more your engineers know about the customers, the more they will code in the right direction just by understanding.


You are given a figma that wasn't already researched and validated against requirements? If it takes a week for a team to fiddle around with a design asset only to learn customers/clients would be fine with a simpler approach, everyone failed the assignment. This was intended as a rhetorical question... I know many teams let designers waste tons of time in a vacuum while PMs are off in lalaland focused on the wrong activities, when they should be focused on "building the right thing", carefully validating that, and communicating outcomes to the team and customers. I'm all for bringing engineers along for the ride, but too often they (more the jr mid-level ones) are checked out during that process, not asking implementation questions or contributing to the research process, until it's all hindsight.


> but too often they (more the jr mid-level ones) are checked out during that process

Reading this kinda threw me for a loop. Yes, juniors and regulars are usually not able to do things generally attributed to senior developers.

Expecting someone that neither knows the implementation nor the language in-depth to be able to do that is kinda monkas


I disagree; it's great practice, and starting early in a career will only pay dividends later on. It requires the desire to actually participate and contribute professionally and not just hide behind a keyboard all day. I get that some people just want to hide behind a keyboard all day - fine. These people often complain the most later on... in my experience.


Only a small fraction of all future AI projects have even gotten started. So they aren't only fighting over what's out there now, they're fighting over what will emerge.


This is true, and yet many orgs that have experimented with OpenAI are likely to return to them when a project "becomes real". When you google around for how to do XYZ thing using LLMs, OpenAI is usually in whatever web results you read. Other models and APIs now use OpenAI's API format since it's the apparent winner. And for anyone who's already sent out subprocessor notifications with them as a vendor, they're locked in.

This isn't to say it's only going to be an OpenAI market. Enterprise worlds move differently, such as those in G Cloud who will buy a few million $$ of Vertex expecting to "figure out that gemini stuff later". In that sense, Google has a moat with those slices of their customers.

But when people think OpenAI has no moat because "the models will be a commodity", I think that's (a) some wishful thinking about the models and (b) a view that doesn't consider the sociological factors, which matter a lot more than how powerful a model is or where it runs.

