Depends on what you’re doing. Using the smaller / cheaper LLMs will generally make it way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you’re worried about people messing with it, about weird behavior on edge cases, about stability, you’d have to do a bunch of robustness testing as well, and bigger models will be better.
Another big problem is that it’s hard to set objectives in many cases, and for example maybe your customer service chat still passes but comes across worse with a smaller model.
One point in favor of smaller/self-hosted LLMs: more consistent performance, and you control your upgrade cadence, not the model providers.
I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.
Where can I find information on self-hosting models success stories? All of it seems like throwing tens of thousands away on compute for it to work worse than the standard providers.
The self-hosted models seem to get out of date, too. Or there end up being good reasons (improved performance) to replace them.
How much you value control is one part of the optimization problem. Obviously self-hosting gives you more control, but it costs more. And regarding evals: I trust GPT, Gemini, and Claude a lot more than some smaller thing I self-host, and would end up wanting to do way more evals if I self-hosted a smaller model.
(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)
You may also be getting a worse result for higher cost.
For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. We were pleasantly surprised when the LLM-as-judge scored gpt5-mini as the clear winner. I don't think I would have considered using it for these specific use cases, assuming higher reasoning was necessary.
Still waiting on human evaluation to confirm the LLM Judge was correct.
That's interesting. Similarly, we found out that for very simple tasks the older Haiku models are interesting as they're cheaper than the latest Haiku models and often perform equally well.
You obviously know what you’re looking for better than me, but personally I’d want to see a narrative that made sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation, but it feels very dicey without one.
You just need a robust benchmark. As long as you understand your benchmark, you can trust the results.
We have a hard OCR problem.
It's very easy to make high-confidence benchmarks for OCR problems (just type out the ground truth by hand), so it's easy to trust the benchmark. Think accuracy and token F1. I'm talking about highly complex OCR that requires a heavyweight model.
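(To make "token F1" concrete, here's a minimal sketch of how we score it, assuming plain whitespace tokenization against the hand-typed ground truth; the function name and tokenization choice are just illustrative.)

    from collections import Counter

    def token_f1(predicted: str, ground_truth: str) -> float:
        """Token-level F1 between OCR output and hand-typed ground truth."""
        pred_tokens = predicted.split()      # illustrative: whitespace tokens
        true_tokens = ground_truth.split()
        if not pred_tokens or not true_tokens:
            return 0.0
        # Overlap is the multiset intersection of the two token lists.
        overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(true_tokens)
        return 2 * precision * recall / (precision + recall)

    print(token_f1("Total due: $42.00", "Total due: 42.00"))  # partial credit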
Scout (Meta), a very small/weak model, is outperforming Gemini Flash. This is highly unexpected and a huge cost savings.
Volume and statistical significance? I'm not sure what kind of narrative I would trust beyond the actual data.
It's the hard part of using LLMs and a mistake I think many people make. The only way to really understand or know is to have repeatable and consistent frameworks to validate your hypothesis (or in my case, have my hypothesis be proved wrong).
You're right. We did a few use cases, and I have to admit that while customer service is easiest to explain, it's also where I'd not choose the cheapest model, for said reasons.
Is anyone aware of a more thorough argument for why this must be the case? Is it a commonly held view? It sounds realistic, but not necessarily an immutable law; I’d like to know what thought has been given to this.
It’s an incentive problem. If even one party defects in a society of pacifists, the pacifists have no real method of recourse besides refusing to interact with the defector, and how many people are going to do that if the defector starts killing people to enforce compliance?
Some subscribe to a soft pacifism where non-destructive violent resistance, like disarming the defector or disabling the defector using less-lethal technologies like a taser, would be fine. Pure pacifists who don’t believe in any kind of physical resistance whatsoever are almost exclusively religious practitioners who don’t ascribe a high degree of value to life in this world because they believe non-resistance will bear spiritual fruit in the next world.
It's also appropriate to remember that MLK was friends with Malcolm X, and both chose their own means to support the same end goal.
MLK chose nonviolent shows of force, whereas Malcolm X chose more direct forms of violence.
Governments could save face by negotiating with MLK, as he used nonviolent means. They couldn't negotiate with Malcolm X because of the whole "we cannot negotiate with criminals and terrorists" thing.
It's because people in positions of power can safely ignore nonviolence. They can't ignore the other option. Nonviolence on its own is not productive.
That's what disturbs me about yesteryear's protest marches, like MLK's March on Washington, compared to 50501 and No Kings.
MLK wanted a non-violent show of force so as to stay "legal", but with a strong implicit threat of "well, you know, there's a LOT of us. We're peaceful for now." The bus boycotts down in Atlanta almost bankrupted the bus system, so money attacks also work.
But now we have No Kings and 50501. The whole idea of mass protest as a 'nonviolent but imminent threat' is completely gone. Protests were a prelude to something to be done. Now it's more of a political action rally, with not much of anything to follow up the initial energy.
Which is also why the protests (the pussy hat rebellion, 50501, No Kings) have all failed. There are no goals. It's just chanting and some signs.
Imo this is what happened once protests became a “right”. I know most people here won’t agree with the Canada trucker protest, but I remember when it happened, people were saying “ok, you’ve had your protest and exercised your rights, you’ve been heard, you can go home now” - framing it just like that, as a rally to show an opinion rather than a threat. It felt to me like “the establishment” just treats them as performative, because as you say they usually are, and then doesn’t know what to do when it’s actually something they have to react to.
A commonly cited example: during the Battle of Seattle, the cops wanted to beat the shit out of a nonviolent sit-in, and the black bloc protected them through a combination of strength and diversion. The nonviolent people are there for the optics, and the violent people are there to ensure that any move made on the nonviolent protesters will be answered swiftly.
The important part is that the violence mostly doesn't start until someone tries to hurt those who are there peacefully. Good was there peacefully, so retaliation is becoming a possibility.
Yeah, the thorough argument is that people in power don't want people to rise up and challenge their authority.
It's absolutely not realistic. Every right we have was fought for, and people died trying to get it. This is especially true in America, where a fifth of the population was enslaved at inception. Nothing has ever been given to us; it had to be taken from abusers of power, and there have always been abusers of power in this country.
I mean, Trump is no different than Washington. Washington routinely ignored laws; he tried to have his lackeys go get his "property" from free states while never being willing to go to court (a provision of the Fugitive Slave Act).
John Adams called Shays's Rebellion terrorists because they had the audacity to close down courts to stop foreclosures of farms (fun fact: that was the first time since the revolution that Americans fired artillery at other Americans, and it was a mercenary army paid by Boston merchants, killing over credit).
You can go down the list; it's always been there, but luckily there have always been people fighting against it, trying to better society against those who simply dragged us down.
I recently learned Bear blog (a small blogging platform, posts on which often appear on HN) has a “discover” section with a front-page-style ranking. Their algorithm is on the page:
This page is ranked according to the following algorithm:
Score = log10(U) + (S / (B * 86,400))
Where,
U = Upvotes of a post
S = Seconds since Jan 1st, 2020
B = Buoyancy modifier (currently at 14)
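A quick sketch of that scoring in Python (assuming S is measured from the post's publish time, which the page doesn't spell out, and clamping zero upvotes so log10 stays defined):

    import math
    from datetime import datetime, timezone

    EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)
    BUOYANCY = 14  # B, "currently at 14"

    def bear_score(upvotes: int, published: datetime) -> float:
        """Score = log10(U) + (S / (B * 86,400)) as described above."""
        seconds_since_2020 = (published - EPOCH).total_seconds()  # S
        return math.log10(max(upvotes, 1)) + seconds_since_2020 / (BUOYANCY * 86_400)

    # With B = 14, a 10x jump in upvotes is worth about 14 days of recency.
    print(bear_score(100, datetime(2025, 6, 1, tzinfo=timezone.utc)))

So a post needs roughly 10x the upvotes to outrank one published two weeks later.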
There would be competition from API wrappers; if you want to pay, there will always be lots of options to chat without ads. I hate to think what they and others might come up with to try and thwart this.
I think ads will take the form of insidious but convincing product placement invisibly woven into model outputs. This will both prevent any blocking of ad content, and also be much more effective: after all, we allude to companies and products all the time in regular human conversation, and the best form of marketing is organic word-of-mouth.
I just saw a sibling post about Kagi; maybe this is how the industry will end up, with a main provider like OpenAI and niche wrappers on top (I know Kagi is not just a Google wrapper, but at least they used to return Google search results that they paid for).
I thought you were going to say “that comment recommending Kagi is exactly what those ads would look like: native responses making product recommendations as if they’re natural responses in the conversation”
That is a weird definition of advertising. It's not an ad if I mention (or even recommend) a product in a post, without going off-topic and without getting any financial benefit.
The New Oxford American Dictionary defines "advertisement" as "a notice or announcement in a public medium promoting a product, service, or event." By that definition, anything that mentions a product in a neutral light (thereby building brand awareness) or a positive light (explicitly promotional) is an ad. The fact that it may not be paid for is irrelevant.
A chatbot tuned to casually drop product references like in this thread would build a huge amount of brand awareness and be worth an incredible amount. A chatbot tuned to be insidiously promotional in a surgically targeted way would be worth even more.
I took a quick look at your comment history. If OpenAI/Anthropic/etc. were paid by JuliaHub/Dan Simmons' publisher/Humble Bundle to make these comments in their chatbots, we would unambiguously call them ads:
Precisely; today Julia already solves many of those problems.
It also removes many of Matlab's footguns like `[1,2,3] + [4;5;6]`, or also `diag(rand(m,n))` doing two different things depending on whether m or n are 1.
(for the sake of argument, pretend Julia is commercial software like Matlab.)
> Name a game distribution platform that respects its customers
Humble Bundle.
You seem like a pretty smart, levelheaded person, and I would be much more likely to check out Julia, read Hyperion, or download a Humble Bundle based on your comments than I would be from out-of-context advertisements. The very best advertising is organic word-of-mouth, and chatbots will do their damnedest to emulate it.
I don’t know how subtle or stealthy you can be in text. In movies there’s a lot of stuff going on that I may not particularly notice, but I’m going to notice “Susie, while at home drinking her delicious ice cold coca-cola….”
> I’m going to notice “Susie, while at home drinking her delicious ice cold coca-cola….”
It will be much more subtle. Think asking an LLM to help you sift through reviews before you spend $250 on some appliance, or asking what good options are for hotels on your next trip…
Basically the same queries people throw into Google but then have to manually open a bunch of tabs and do their own comparison; except now the LLM isn’t doing a neutral evaluation, it’s going to always suggest one particular hotel despite it not being best for your query.
Not all answers are conducive to such subtle manipulation, though. If the user asks for an algorithm to solve the knapsack problem, it's kind of hard to stealthily go "now let's see how many Coca Colas will fit in the knapsack". If the user asks for a cyberpunk story, "the decker prepared his Microsoft Cyberdeck" would sound off, too.
Biasing actual buying advice would be feasible, but it would have to be handled very carefully to not be too obvious.
Right, I just don’t see how it can be subtle. Maybe it will be the opposite, where I assume things are ads that aren’t: any time I see a specific brand or solution, I will assume it’s an ad.
It’s not like a movie where I’m engrossed by the narrative or acting and only subliminally see the can of coke on the table (though even then)
Maybe image generation ads will be a bit more subtle.
You have no guarantee the API models won’t be tampered with to serve ads. I suspect ads (particularly on those models) will eventually be “native”: the models themselves will be subtly biased to promote advertisers’ interests, in a way that might be hard to distinguish from a genuinely helpful reply.
> You have no guarantee the API models won’t be tampered with to serve ads. I suspect ads (particularly on those models) will eventually be “native”: the models themselves will be subtly biased to promote advertisers’ interests, in a way that might be hard to distinguish from a genuinely helpful reply.
I admit I don't see how that will happen. What are they gonna do? Maintain a model (LoRA, maybe) for every single advertiser?
When both Pepsi and Coke pay you to advertise, you advertise both. The minute one reduces ad-spend, you need to advertise that less.
This sort of thing is computationally fast currently - ad space is auctioned off in milliseconds. How will they introduce ads into the content returned by an LLM while satisfying the ad spend of the advertiser?
Retraining models every time an advertiser wins a bid on a keyword is unwieldy. The most likely solution is training the model to emit tokens that represent ontological entries used by the ad platform, so that "<SODA>" can be bid on by PepsiCo/Coca-Cola under food > beverage > chilled > carbonated. Auction cycles have to match ad campaign durations for quicker price discovery and more competition among bidders.
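A rough sketch of the serving side of that idea (everything here is made up: the slot names, the auction table, the fallback): the model emits ontology placeholders, and the ad platform swaps in whichever brand currently holds the winning bid for that slot.

    import re

    # Hypothetical auction state, refreshed per campaign cycle rather than per
    # request: ontology slot -> brand currently holding the winning bid.
    AUCTION_WINNERS = {
        "<SODA>": "Brand-A cola",        # food > beverage > chilled > carbonated
        "<LAPTOP>": "Brand-B ultrabook",
    }

    PLACEHOLDER = re.compile(r"<[A-Z_]+>")

    def resolve_ad_slots(model_output: str) -> str:
        """Replace ontology placeholders emitted by the model with the brands
        that won the current auction; fall back to a generic term otherwise."""
        def substitute(match: re.Match) -> str:
            slot = match.group(0)
            return AUCTION_WINNERS.get(slot, slot.strip("<>").lower())
        return PLACEHOLDER.sub(substitute, model_output)

    print(resolve_ad_slots("Grab a cold <SODA> before the movie starts."))

That keeps the millisecond auction machinery intact: the expensive part (teaching the model the ontology) happens at training time, and the advertiser-specific part stays a cheap lookup at inference.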
More akin to something like the Twitter verified program, where companies can bid for relevance in the training set to buy a greater weight, so the model will be trained to prefer them. It would be especially applicable for software if Azure and AWS start bidding on whose platform the model should recommend. Or, something like when Convex first came out to compete with the depth of Supabase/Firebase training in the current model, they could be offered a retrain that gives extra weight to their hand-picked code bases for a mere $Xb.
Companies pay for entire sports stadiums for brand recognition. That’s also not something you can change on the fly; it’s a huge upfront cost and takes significant effort to change. That doesn’t stop it from happening; it’s just a different ad model.
Companies will pay OpenAI to prioritize more of their content during training. The weights for the product category will now be nudged more towards your product. Gartner Magic Quadrant for all businesses!
Everything about this is so stupid, from the fact that these pills are expected to change population weight so much to the silly napkin math that some analyst has pulled out of their ass to turn it into airline EPS.
Given some airlines (Hawaiian?) do sometimes weigh passengers on their smaller craft, I could believe fuel load would drop if pax weight dropped, but I don't actually believe either fuel cost or weight really strongly informs pricing. It's used as an excuse. A350s are significantly more efficient than older craft, but if you do more business suites, you carry fewer passengers and make even more money. The price does not drop; profit rises.
(The weight is about balance and total load as I understand it)
What if the only moat is domains where it’s hard to judge (non superficial) quality?
Code generation: you don’t see what’s wrong right away; it’s only later in the project lifecycle that you pay for it. Writing looks good to skim but is embarrassingly bad once you start reading it.
Some things (slides apparently) you notice right away how crappy they are.
I don’t think it’s just better training data, I think LLMs apply largely the same kind of zeal to different tasks. It’s the places where coherent nonsense ends up being acceptable.
I’m actually a big LLM proponent and see a bright future, but believe a critical assessment of how they work and what they do is important.
If I had to answer this question 2 years ago, I wouldn't have said software was a "don't see it's bad until later" category, what with compilers and it needing to actually do something very specific. However, business slides are full of exacting facts and definitely never contain generic business speak masquerading as real insight /s.
This feels like telling a story after the fact to make it fit.
I agree, and by all accounts the success of coding agents is due to code being amenable to very fast feedback (tests, screenshots) so you can immediately detect bad code.
That's in terms of functionality, not necessarily quality though. But linters can provide some quick feedback on that in limited ways.
This isn’t really true, at least as I interpret the statement: little if any of the “logic”, or the appearance of such, is learned from language. It’s trained in with reinforcement learning as pattern recognition.
Point being, it’s deliberate training, not just some emergent property of language modeling. Not sure if the above post meant this, but it does seem to be a common misconception.
No, this describes the common understanding of LLMs and adds little to just calling it AI. The search is the more accurate model when considering their actual capabilities and understanding weaknesses. “Lossy compression of human knowledge” is marketing.
It is fundamentally and provably different than search because it captures things on two dimensions that can be used combinatorially to infer desired behavior for unobserved examples.
1. Conceptual Distillation - Proven by research: we can find weights that capture/influence outputs that align with higher-level concepts.
2. Conceptual Relations - The internal relationships capture how these concepts are related to each other.
This is how the model can perform acts and infer information way outside of its training data: if the details map to concepts, then the conceptual relations can be used to infer desirable output.
(The conceptual distillation also appears to include meta-cognitive behavior, as evidenced by Anthropic's research. Which makes sense to me: what is the most efficient way to be able to replicate irony and humor for an arbitrary subject? Compressing some spectrum of meta-cognitive behavior...)
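(If it helps, the kind of result I mean for point 1 is the activation-steering / concept-vector line of work: take the difference of mean hidden activations between text that does and doesn't express a concept, then add that direction back in at inference. A toy sketch with random stand-in data; a real version would pull activations from an actual model layer.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for hidden activations at one layer (rows are examples).
    # Real use: run the model on contrastive prompts and record activations.
    with_concept = rng.normal(loc=1.0, size=(64, 512))     # e.g. ironic text
    without_concept = rng.normal(loc=0.0, size=(64, 512))  # matched plain text

    # The "concept direction": difference of the two activation means.
    concept_direction = with_concept.mean(axis=0) - without_concept.mean(axis=0)

    def steer(hidden_state: np.ndarray, strength: float = 2.0) -> np.ndarray:
        """Nudge a hidden state along the concept direction at inference time."""
        return hidden_state + strength * concept_direction

    steered = steer(rng.normal(size=512))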
Aren't the conceptual relations you describe still, at their core, just search (even if that's extremely reductive)? We know models can interpolate well, but it's still the same probabilistic pattern matching. They identify conceptual relationships based on associations seen in vast training data. It's my understanding that models are still not at all good at extrapolation, handling data "way outside" of their training set.
Also, I was under the impression LLMs can replicate irony and humor simply because that text has specific stylistic properties, and they've been trained on it.
I don't know honestly, I think really the only big hole the current models have is if you have tokens that never get exposed enough to have a good learned embedding value. Those can blow the system out of the water because they cause activation problems in the low layers.
Other than that the model should be able to learn in context for most things based on the component concepts. Similar to how you learn in context.
There aren't a lot of limits in my experience. Rarely you'll hit patterns that are too powerful where it is hard for context to alter behavior, but those are pretty rare.
The models can mix and match concepts quite deeply. Certainly, if it is a completely novel concept that can't be described by a union or subtraction of similar concepts, then the model probably wouldn't handle it. In practice, a completely isolated concept is pretty rare.
> Another big problem is that it’s hard to set objectives in many cases, and for example maybe your customer service chat still passes but comes across worse with a smaller model.
I'd be careful is all.