Depends on what you’re doing. Using the smaller / cheaper LLMs will generally make it way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you’re worried about people messing with it, about weird behavior on edge cases, about stability, you’d have to do a bunch of robustness testing as well, and bigger models will be better.
Another big problem is that it’s hard to set objectives in many cases, and for example maybe your customer service chat still passes but comes across worse with a smaller model.
One point in favor of smaller/self-hosted LLMs: more consistent performance, and you control your upgrade cadence, not the model providers.
I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.
Where can I find information on self-hosting models success stories? All of it seems like throwing tens of thousands away on compute for it to work worse than the standard providers.
The self-hosted models seem to get out of date, too. Or there end up being good reasons (improved performance) to replace them.
How much you value control is one part of the optimization problem. Obviously self-hosting gives you more control, but it costs more. And regarding evals: I trust GPT, Gemini, and Claude a lot more than some smaller thing I self-host, and would end up wanting to do way more evals if I self-hosted a smaller model.
(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)
You may also be getting a worse result for higher cost.
For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. We were pleasantly surprised when the LLM-as-judge scored gpt5-mini as the clear winner. I don't think I would have considered using it for these specific use cases, assuming higher reasoning was necessary.
Still waiting on human evaluation to confirm the LLM Judge was correct.
That's interesting. Similarly, we found out that for very simple tasks the older Haiku models are interesting as they're cheaper than the latest Haiku models and often perform equally well.
You obviously know what you’re looking for better than me, but personally I’d want to see a narrative that made sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation, but it feels very dicey without one.
You just need a robust benchmark. As long as you understand your benchmark, you can trust the results.
We have a hard OCR problem.
It's very easy to make high-confidence benchmarks for OCR problems (just type out the ground truth by hand), so it's easy to trust the benchmark. Think accuracy and token F1. I'm talking about highly complex OCR that requires a heavyweight model.
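(To make "token F1" concrete, here's a minimal sketch of how we score it, assuming plain whitespace tokenization against the hand-typed ground truth; the function name and tokenization choice are just illustrative.)

    from collections import Counter

    def token_f1(predicted: str, ground_truth: str) -> float:
        """Token-level F1 between OCR output and hand-typed ground truth."""
        pred_tokens = predicted.split()      # illustrative: whitespace tokens
        true_tokens = ground_truth.split()
        if not pred_tokens or not true_tokens:
            return 0.0
        # Overlap is the multiset intersection of the two token lists.
        overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(true_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("Total due: $42.00", "Total due: 42.00"))  # partial credit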
Scout (Meta), a very small/weak model, is outperforming Gemini Flash. This is highly unexpected and a huge cost savings.
Volume and statistical significance? I'm not sure what kind of narrative I would trust beyond the actual data.
It's the hard part of using LLMs and a mistake I think many people make. The only way to really understand or know is to have repeatable and consistent frameworks to validate your hypothesis (or in my case, have my hypothesis be proved wrong).
You're right. We did a few use cases, and I have to admit that while customer service is easiest to explain, it's also where I'd not choose the cheapest model, for said reasons.
Is anyone aware of a more thorough argument for why this must be the case? Is it a commonly held view? It sounds realistic, but not necessarily an immutable law; I’d like to know what thought has been given to this.
It’s an incentive problem. If even one party defects in a society of pacifists, the pacifists have no real method of recourse besides refusing to interact with the defector, and how many people are going to do that if the defector starts killing people to enforce compliance?
Some subscribe to a soft pacifism where non-destructive violent resistance, like disarming the defector or disabling the defector using less-lethal technologies like a taser, would be fine. Pure pacifists who don’t believe in any kind of physical resistance whatsoever are almost exclusively religious practitioners who don’t ascribe a high degree of value to life in this world because they believe non-resistance will bear spiritual fruit in the next world.
It's also appropriate to remember that MLK was friends with Malcolm X, and both chose their own means to support the same end goal.
MLK chose nonviolent shows of force, whereas Malcolm X chose more direct forms of violence.
Governments could save face by negotiating with MLK, as he used nonviolent means. They couldn't negotiate with Malcolm X because of the whole "we cannot negotiate with criminals and terrorists" thing.
It's because people in positions of power can safely ignore nonviolence. They can't ignore the other option. Nonviolence on its own is not productive.
That's what disturbs me about yesteryear's protest marches, like MLK's March on Washington, compared to 50501 and No Kings.
MLK wanted a non-violent show of force so as to stay "legal", but with a strong implicit threat of "well, you know, there's a LOT of us. We're peaceful for now." The bus boycotts down in Atlanta almost bankrupted the bus system, so money attacks also work.
But now we have No Kings and 50501. The whole idea of mass protest as a 'nonviolent but imminent threat' is completely gone. Protests were a prelude to something to be done. Now it's more of a political action rally, with not much of anything to follow up the initial energy.
Which is also why the protests (the pussy hat rebellion, 50501, No Kings) have all failed. There are no goals. It's just chanting and some signs.
Imo this is what happened once protests became a “right”. I know most people here won’t agree with the Canada trucker protest, but I remember when it happened, people were saying “ok, you’ve had your protest and exercised your rights, you’ve been heard, you can go home now” - framing it just like that, as a rally to show an opinion rather than a threat. It felt to me like “the establishment” just treats them as performative, because as you say they usually are, and then doesn’t know what to do when it’s actually something they have to react to.
A commonly cited example: during the Battle of Seattle, the cops wanted to beat the shit out of a nonviolent sit-in, and the black bloc protected them through a combination of strength and diversion. The nonviolent people are there for the optics, and the violent people are there to ensure that any move made on the nonviolent protesters will be answered swiftly.
The important part is that the violence mostly doesn't start until someone tries to hurt those who are there peacefully. Good was there peacefully, so retaliation is becoming a possibility.
Yeah, the thorough argument is that people in power don't want people to rise up and challenge their authority.
It's absolutely not realistic. Every right we have was fought for, and people died trying to get it. This is especially true in America, where a fifth of the population was enslaved at inception. Nothing has ever been given to us; it had to be taken from abusers of power, and there have always been abusers of power in this country.
I mean, Trump is no different than Washington. Washington routinely ignored laws; he tried to have his lackeys go get his "property" from free states while never being willing to go to court (a provision of the Fugitive Slave Act).
John Adams called Shays's Rebellion terrorists because they had the audacity to close down courts to stop foreclosures of farms (fun fact: that was the first time since the revolution that Americans fired artillery at other Americans, and it was a mercenary army paid by Boston merchants, killing over credit).
You can go down the list; it's always been there, but luckily there have always been people fighting against it, trying to better society against those who simply dragged us down.
I recently learned Bear blog (a small blogging platform, posts on which often appear on HN) has a “discover” section with a front-page-style ranking. Their algorithm is on the page:
This page is ranked according to the following algorithm:
Score = log10(U) + (S / (B * 86,400))
Where,
U = Upvotes of a post
S = Seconds since Jan 1st, 2020
B = Buoyancy modifier (currently at 14)
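A quick sketch of that scoring in Python (assuming S is measured from the post's publish time, which the page doesn't spell out, and clamping zero upvotes so log10 stays defined):

    import math
    from datetime import datetime, timezone

    EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)
    BUOYANCY = 14  # B, "currently at 14"

    def bear_score(upvotes: int, published: datetime) -> float:
        """Score = log10(U) + (S / (B * 86,400)) as described above."""
        seconds_since_2020 = (published - EPOCH).total_seconds()  # S
        return math.log10(max(upvotes, 1)) + seconds_since_2020 / (BUOYANCY * 86_400)

    # With B = 14, a 10x jump in upvotes is worth about 14 days of recency.
    print(bear_score(100, datetime(2025, 6, 1, tzinfo=timezone.utc)))

So a post needs roughly 10x the upvotes to outrank one published two weeks later.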
There would be competition from API wrappers; if you want to pay, there will always be lots of options to chat without ads. I hate to think what they and others might come up with to try and thwart this.
I think ads will take the form of insidious but convincing product placement invisibly woven into model outputs. This will both prevent any blocking of ad content, and also be much more effective: after all, we allude to companies and products all the time in regular human conversation, and the best form of marketing is organic word-of-mouth.
I just saw a sibling post about Kagi; maybe this is how the industry will end up, with a main provider like OpenAI and niche wrappers on top (I know Kagi is not just a Google wrapper, but at least they used to return Google search results that they paid for).
I thought you were going to say “that comment recommending Kagi is exactly what those ads would look like: native responses making product recommendations as if they’re natural responses in the conversation”
That is a weird definition of advertising. It's not an ad if I mention (or even recommend) a product in a post, without going off-topic and without getting any financial benefit.
The New Oxford American Dictionary defines "advertisement" as "a notice or announcement in a public medium promoting a product, service, or event." By that definition, anything that mentions a product in a neutral light (thereby building brand awareness) or a positive light (explicitly promotional) is an ad. The fact that it may not be paid for is irrelevant.
A chatbot tuned to casually drop product references like in this thread would build a huge amount of brand awareness and be worth an incredible amount. A chatbot tuned to be insidiously promotional in a surgically targeted way would be worth even more.
I took a quick look at your comment history. If OpenAI/Anthropic/etc. were paid by JuliaHub/Dan Simmons' publisher/Humble Bundle to make these comments in their chatbots, we would unambiguously call them ads:
Precisely; today Julia already solves many of those problems.
It also removes many of Matlab's footguns like `[1,2,3] + [4;5;6]`, or also `diag(rand(m,n))` doing two different things depending on whether m or n are 1.
(for the sake of argument, pretend Julia is commercial software like Matlab.)
> Name a game distribution platform that respects its customers
Humble Bundle.
You seem like a pretty smart, levelheaded person, and I would be much more likely to check out Julia, read Hyperion, or download a Humble Bundle based on your comments than I would be from out-of-context advertisements. The very best advertising is organic word-of-mouth, and chatbots will do their damnedest to emulate it.
I don’t know how subtle or stealthy you can be in text. In movies there’s a lot of stuff going on that I may not particularly notice, but I’m going to notice “Susie, while at home drinking her delicious ice cold coca-cola….”
> I’m going to notice “Susie, while at home drinking her delicious ice cold coca-cola….”
It will be much more subtle. Think asking an LLM to help you sift through reviews before you spend $250 on some appliance, or asking what good options are for hotels on your next trip…
Basically the same queries people throw into Google but then have to manually open a bunch of tabs and do their own comparison; except now the LLM isn’t doing a neutral evaluation, it’s going to always suggest one particular hotel despite it not being best for your query.
Not all answers are conducive to such subtle manipulation, though. If the user asks for an algorithm to solve the knapsack problem, it's kind of hard to stealthily go "now let's see how many Coca Colas will fit in the knapsack". If the user asks for a cyberpunk story, "the decker prepared his Microsoft Cyberdeck" would sound off, too.
Biasing actual buying advice would be feasible, but it would have to be handled very carefully to not be too obvious.
Right, I just don’t see how it can be subtle. Maybe it will be the opposite, where I assume things are ads that aren’t: any time I see a specific brand or solution, I will assume it’s an ad.
It’s not like a movie where I’m engrossed by the narrative or acting and only subliminally see the can of coke on the table (though even then)
Maybe image generation ads will be a bit more subtle.
You have no guarantee the API models won’t be tampered with to serve ads. I suspect ads (particularly on those models) will eventually be “native”: the models themselves will be subtly biased to promote advertisers’ interests, in a way that might be hard to distinguish from a genuinely helpful reply.
> You have no guarantee the API models won’t be tampered with to serve ads. I suspect ads (particularly on those models) will eventually be “native”: the models themselves will be subtly biased to promote advertisers’ interests, in a way that might be hard to distinguish from a genuinely helpful reply.
I admit I don't see how that will happen. What are they gonna do? Maintain a model (LoRA, maybe) for every single advertiser?
When both Pepsi and Coke pay you to advertise, you advertise both. The minute one reduces ad-spend, you need to advertise that less.
This sort of thing is computationally fast currently - ad space is auctioned off in milliseconds. How will they introduce ads into the content returned by an LLM while satisfying the ad spend of the advertiser?
Retraining models every time an advertiser wins a bid on a keyword is unwieldy. The most likely solution is training the model to emit tokens that represent ontological entries used by the ad platform, so that "<SODA>" can be bid on by PepsiCo/Coca-Cola under food > beverage > chilled > carbonated. Auction cycles have to match ad campaign durations for quicker price discovery and more competition among bidders.
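A rough sketch of the serving side of that idea (everything here is made up: the slot names, the auction table, the fallback): the model emits ontology placeholders, and the ad platform swaps in whichever brand currently holds the winning bid for that slot.

    import re

    # Hypothetical auction state, refreshed per campaign cycle rather than per
    # request: ontology slot -> brand currently holding the winning bid.
    AUCTION_WINNERS = {
        "<SODA>": "Brand-A cola",        # food > beverage > chilled > carbonated
        "<LAPTOP>": "Brand-B ultrabook",
    }

    PLACEHOLDER = re.compile(r"<[A-Z_]+>")

    def resolve_ad_slots(model_output: str) -> str:
        """Replace ontology placeholders emitted by the model with the brands
        that won the current auction; fall back to a generic term otherwise."""
        def substitute(match: re.Match) -> str:
            slot = match.group(0)
            return AUCTION_WINNERS.get(slot, slot.strip("<>").lower())
        return PLACEHOLDER.sub(substitute, model_output)

    print(resolve_ad_slots("Grab a cold <SODA> before the movie starts."))

That keeps the millisecond auction machinery intact: the expensive part (teaching the model the ontology) happens at training time, and the advertiser-specific part stays a cheap lookup at inference.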
More akin to something like the Twitter verified program, where companies can bid for relevance in the training set to buy a greater weight, so the model will be trained to prefer them. It would be especially applicable for software if Azure and AWS start bidding on whose platform the model should recommend. Or, something like when Convex first came out to compete with the depth of Supabase/Firebase training in the current model, they could be offered a retrain that gives extra weight to their hand-picked code bases for a mere $Xb.
Companies pay for entire sports stadiums for brand recognition. That’s also not something you can change on the fly; it’s a huge upfront cost and takes significant effort to change. That doesn’t stop it from happening; it’s just a different ad model.
Companies will pay OpenAI to prioritize more of their content during training. The weights for the product category will now be nudged more towards your product. Gartner Magic Quadrant for all businesses!
Everything about this is so stupid, from the fact that these pills are expected to change population weight so much to the silly napkin math that some analyst has pulled out of their ass to turn it into airline EPS.
Given some airlines (Hawaiian?) do sometimes weigh passengers on their smaller craft, I could believe fuel load would drop if pax weight dropped, but I don't actually believe either fuel cost or weight really strongly informs pricing. It's used as an excuse. A350s are significantly more efficient than older craft, but if you do more business suites, you carry fewer passengers and make even more money. The price does not drop; profit rises.
(The weight is about balance and total load as I understand it)
What if the only moat is domains where it’s hard to judge (non superficial) quality?
Code generation: you don’t see what’s wrong right away; it’s only later in the project lifecycle that you pay for it. Writing looks good to skim but is embarrassingly bad once you start reading it.
Some things (slides apparently) you notice right away how crappy they are.
I don’t think it’s just better training data, I think LLMs apply largely the same kind of zeal to different tasks. It’s the places where coherent nonsense ends up being acceptable.
I’m actually a big LLM proponent and see a bright future, but believe a critical assessment of how they work and what they do is important.
If I had to answer this question 2 years ago, I wouldn't have said software was a "don't see it's bad until later" category, what with compilers and it needing to actually do something very specific. However, business slides are full of exacting facts and definitely never contain generic business speak masquerading as real insight /s.
This feels like telling a story after the fact to make it fit.
I agree, and by all accounts the success of coding agents is due to code being amenable to very fast feedback (tests, screenshots) so you can immediately detect bad code.
That's in terms of functionality, not necessarily quality though. But linters can provide some quick feedback on that in limited ways.
This isn’t really true, at least as I interpret the statement: little if any of the “logic”, or the appearance of such, is learned from language. It’s trained in with reinforcement learning as pattern recognition.
Point being, it’s deliberate training, not just some emergent property of language modeling. Not sure if the above post meant this, but it does seem to be a common misconception.
No, this describes the common understanding of LLMs and adds little to just calling it AI. The search is the more accurate model when considering their actual capabilities and understanding weaknesses. “Lossy compression of human knowledge” is marketing.
It is fundamentally and provably different than search because it captures things on two dimensions that can be used combinatorially to infer desired behavior for unobserved examples.
1. Conceptual Distillation - Proven by research: we can find weights that capture/influence outputs that align with higher-level concepts.
2. Conceptual Relations - The internal relationships capture how these concepts are related to each other.
This is how the model can perform acts and infer information way outside of its training data: if the details map to concepts, then the conceptual relations can be used to infer desirable output.
(The conceptual distillation also appears to include meta-cognitive behavior, as evidenced by Anthropic's research. Which makes sense to me: what is the most efficient way to be able to replicate irony and humor for an arbitrary subject? Compressing some spectrum of meta-cognitive behavior...)
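(If it helps, the kind of result I mean for point 1 is the activation-steering / concept-vector line of work: take the difference of mean hidden activations between text that does and doesn't express a concept, then add that direction back in at inference. A toy sketch with random stand-in data; a real version would pull activations from an actual model layer.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for hidden activations at one layer (rows are examples).
    # Real use: run the model on contrastive prompts and record activations.
    with_concept = rng.normal(loc=1.0, size=(64, 512))     # e.g. ironic text
    without_concept = rng.normal(loc=0.0, size=(64, 512))  # matched plain text

    # The "concept direction": difference of the two activation means.
    concept_direction = with_concept.mean(axis=0) - without_concept.mean(axis=0)

    def steer(hidden_state: np.ndarray, strength: float = 2.0) -> np.ndarray:
        """Nudge a hidden state along the concept direction at inference time."""
        return hidden_state + strength * concept_direction

    steered = steer(rng.normal(size=512))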
Aren't the conceptual relations you describe still, at their core, just search (even if that's extremely reductive)? We know models can interpolate well, but it's still the same probabilistic pattern matching. They identify conceptual relationships based on associations seen in vast training data. It's my understanding that models are still not at all good at extrapolation, handling data "way outside" of their training set.
Also, I was under the impression LLMs can replicate irony and humor simply because that text has specific stylistic properties, and they've been trained on it.
I don't know honestly, I think really the only big hole the current models have is if you have tokens that never get exposed enough to have a good learned embedding value. Those can blow the system out of the water because they cause activation problems in the low layers.
Other than that the model should be able to learn in context for most things based on the component concepts. Similar to how you learn in context.
There aren't a lot of limits in my experience. Rarely you'll hit patterns that are too powerful where it is hard for context to alter behavior, but those are pretty rare.
The models can mix and match concepts quite deeply. Certainly, if it is a completely novel concept that can't be described by a union or subtraction of similar concepts, then the model probably wouldn't handle it. In practice, a completely isolated concept is pretty rare.
> Another big problem is that it’s hard to set objectives in many cases, and for example maybe your customer service chat still passes but comes across worse with a smaller model.
I'd be careful is all.