Hacker News | codeflo's comments

All of the examples on the linked page seem to be "good" outputs. Attribution sounds most useful to me in cases where an LLM produces the typical kind of garbage response: wrong information in the training data, hallucinations, sycophancy, over-eagerly pattern matching to similar, well-known questions that weren't actually asked. Can you give an example of a bad output, and show what the attribution tells us?

You got it exactly right. Guilty as charged. Over the coming weeks, we will be showcasing exactly how you can debug all of these examples.

I agree that attribution is most useful for debugging and auditing. This is a prime use case for us. We have a post with exciting results lined up for exactly this; it should be out in about a week. We wanted to get the initial model out first :)


What I am reading here is that when the model is wrong, it still (at least sometimes) confidently attributes the answer to some knowledge base, is that correct? If that is the case, how is this different from simply predicting the vibe of a given corpus and assigning provenance to it? Much less impressive imo and something most models can do without explicit training. All precision, no recall, as it were.

I think this was answered before: it's a constraint of the model's architecture. You can't expect something fundamentally different from an LLM, because that's how they work. It's different from other models because they were not designed for this. Maybe you were expecting more, but that's not OP's fault or demerit.

What you're saying fits my understanding/expectations. However, the post and the user I am replying to seem to imply otherwise. This makes me wonder, is my understanding incomplete or is this post marketing hype dressed up as insight? So I am asking for transparency.

It is not hype. You can try the model on huggingface yourself to see its capabilities. My reply here was clarifying that the examples we showed were ones where the model didn't make a mistake. This is intentional: over the next few weeks, we will show how the concepts and attribution we enable can allow you to fix these mistakes more easily. All the claims in the post are supported by evidence, no marketing here.

We are probably at the point where hype and insight aren't easily distinguishable, other than by what bears fruit in the future, but I agree with you.

This is often quoted, but I wonder whether it's actually strictly true, at least if you keep to a reasonable definition of "works". It's certainly not true in mechanical engineering.


The definition of a complex system is the qualifier for the quote. Many systems that are designed, implemented, and found working are not complex systems. They may be complicated systems. To paraphrase Dr. Richard I. Cook's "How Complex Systems Fail": complex systems are inherently hazardous, operate near the edge of failure, and cannot be understood by analyzing individual components. These systems are not just complicated (like a machine with fixed parts) but dynamic, constantly evolving, and prone to multiple, coincidental failures.

A system of services that interact, where many of them depend on each other in informal ways, may be a complex system. Especially if humans are also involved.

Such a system is not something you design. You just happen to find yourself in it. Like the road to hell, the road to a complex system is paved with good intentions.


Then what precisely is the definition of complex? If "complex" just means "not designed", then the original quote that complex systems can't be designed is true but circular.

If the definition of "complex" is instead something more like "a system of services that interact", "prone to multiple, coincidental failures", then I don't think it's impossible to design them. It's just very hard. Manufacturing lines would be examples, they are certainly designed.


The manufacturing lines would be designed, and they'd be designed in an attempt to affect the "design" of the ultimate resulting supply chain they're a part of. But the relationship between the design of some lines and the behavior of the larger supply chain is non-linear, hard to predict, and ultimately undesigned, and therefore complex.

The design of the manufacturing lines and the resulting supply chain are not independent of each other -- you can trace features from one to the other -- but you cannot take apart the supply chain and analyze the designs of its constituent manufacturing lines and actually predict the behavior of the larger system.

AFAIK there's not a great definition of a complex system, just a set of traits that tend to indicate you're looking at one. Non-linearity, feedbacks, lack of predictability, resistance to analysis (the "you can't take it apart to reason about the whole" characteristic mentioned above). All of these traits are also kind of the same thing... they tend to come bundled with one another.


A complex system is one that has chaotic behavior.

(And no, this is not "my" definition, it's how it's defined in the systems-related disciplines.)


What's considered chaotic? Multiple causes, hard to track?


Consider systems that require continuous active stabilization to not fail because the system has no naturally stable equilibrium state even in theory. Some of our most sophisticated engineering systems have this property e.g. the flight control systems that allow a B-2 bomber to fly. In a software context you see these kinds of design problems in large-scale data infrastructure systems.

The set of system designs that exhibit naturally stable behavior doesn't overlap much with the set of system designs that deliver maximum performance and efficiency. The capability gap between the two can be large but most people choose easy/simple.
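A toy sketch of the idea, with a hypothetical one-dimensional plant (nothing like a real flight control system): without feedback the state diverges, because the system has no stable equilibrium on its own, and a simple proportional controller is required just to keep it bounded.

```python
def step(x, gain=0.0):
    # Toy unstable plant: without control, deviations grow 10% per step.
    # A proportional controller (u = gain * x) counteracts the growth.
    u = gain * x
    return 1.1 * x - u

def simulate(x0, gain, steps=50):
    # Run the plant for a number of steps and return the final deviation.
    x = x0
    for _ in range(steps):
        x = step(x, gain)
    return abs(x)

uncontrolled = simulate(1.0, gain=0.0)  # diverges: 1.1**50, roughly 117
controlled = simulate(1.0, gain=0.5)    # converges: 0.6**50, effectively zero
```

The point is that "works" for such a system means the controller is running, continuously; the design isn't stable, the design-plus-active-stabilization is.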

There is an enormous amount of low-hanging opportunity here but most people, including engineers, struggle with systems thinking.



IMHO, the key is where you add complexity. In software you have different abstraction layers. If you make a layer too fat, it becomes unwieldy. A simple system evolves well if you're adding the complexity in the right layer, avoiding making a layer responsible for tasks outside its scope. It still "works" if you don't, but it becomes increasingly difficult to maintain.

The law is maybe a little too simplistic in its formulation, but it's fundamentally true.


You built this gear using the knowledge from your last gear. You didn't start with no knowledge, read a manual on operating a lathe, grab a hunk of metal and make a perfect gear the first time.


> It's certainly not true in mechanical engineering.

Care to exemplify?


Why the hell would you need a blockchain for automatic payments? Bots that performed financial transactions existed long before “crypto”.


The ordinary financial system is constrained by a lot of regulation and it won't let an AI open an account.


Good, KYC exists for a reason. Why does AI need to open an account, anyway? Just give it a debit card with a limit, not a whole new account and contract with a bank.


The limit of a debit card is the money in your account.

The bank would argue that an AI using your account on your behalf is fraud.


Those are much easier problems to solve (and surely already solved by some fintechs) than bringing cryptocurrencies up to minimum legal compliance and meeting performance requirements.


They don't intend to meet the legal compliance requirements. That's the reason for using cryptocurrencies — avoiding compliance.


My debit card has specific limits, far less than all my funds. There also exist pre-paid cards, ideal for things like this.


How unreasonable that I can’t make my computer pretend to be a fiduciary.

Those awful regulations won’t let me say the “computer ate my homework”. Imagine.


"We promise that we will not enforce" is perhaps a funny way not to grant a license, but making it sound like they do. This seems almost purposefully designed to look open-source to laypeople, while being carefully written in a way that ensures it will be vetoed by any corporate lawyer vetting the license.


Is "we promise not to enforce" legally binding I wonder, in which case it is de-facto a license? IANAL but it's an interesting concept.


This is intentional? I think delivering lower quality than what was advertised and benchmarked is borderline fraud, but YMMV.


Per Anthropic's RCA, linked in the OP's post, for the September 2025 issues:

“… To state it plainly: We never reduce model quality due to demand, time of day, or server load. …”

So according to Anthropic, they are not tweaking quality settings due to demand.


And according to Google, they always delete data if requested.

And according to Meta, they always give you ALL the data they have on you when requested.


>And according to Google, they always delete data if requested.

However, the request form is on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying 'Beware of the Leopard'.


What would you like?


An SLA-style contractually binding agreement.


I bet this is available in large enterprise agreements. How much are you willing to pay for it?


Priced in.


I guess I just don't know how to square that with my actual experiences then.

I've seen sporadic drops in reasoning skills that made me feel like it was January 2025, not 2026 ... inconsistent.


LLMs sample the next token from a conditional probability distribution, the hope is that dumb sequences are less probable but they will just happen naturally.


Funny how those probabilities consistently dip at 2pm UK time when all the Americans come online...


It's more like the choice between "the" and "a" than "yes" and "no".


I wouldn't doubt that these companies would deliberately degrade performance to manage load, but it's also true that humans are notoriously terrible at identifying random distributions, even with something as simple as a coin flip. It's very possible that what you view as degradation is just "bad RNG".
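The coin-flip point is easy to demonstrate with a quick simulation (a generic illustration, not a claim about any particular model): in 100 fair flips, a run of five or six identical outcomes is the norm, yet most people would read such a streak as evidence of bias.

```python
import random

def longest_streak(flips):
    """Length of the longest run of identical consecutive outcomes."""
    best = cur = 1
    for prev, nxt in zip(flips, flips[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

# 100 fair coin flips; the expected longest run is around log2(100) ~ 6-7,
# which "feels" non-random to a human observer.
rng = random.Random(42)
flips = [rng.choice("HT") for _ in range(100)]
streak = longest_streak(flips)
```

The same goes for LLM outputs: a cluster of bad responses in one afternoon is exactly what a stationary distribution of quality would occasionally produce.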


yep stochastic fantastic

these things are by definition hard to reason about


That's about model quality. Nothing about output quality.


That's what's called an "overly specific denial". It sounds more palatable if you say "we deployed a newly quantized model of Opus and here are cherry-picked benchmarks to show it's the same", and even that they don't announce publicly.


Personally, I'd rather get queued up with a longer wait time. I mean, not ridiculously long, but I am OK waiting five minutes to get correct, or at least more correct, responses.

Sure, I'll take a cup of coffee while I wait (:


i’d wait any amount of time lol.

at least i would KNOW it’s overloaded and i should use a different model, try again later, or just skip AI assistance for the task altogether.


They don't advertise a certain quality. You take what they have or leave it.


> I think delivering lower quality than what was advertised and benchmarked is borderline fraud

welcome to Silicon Valley, I guess. everything from Google Search to Uber is fraud. Uber is a classic example of this playbook, even.


If there's no way to check, then how can you claim it's fraud? :)


There is no level of quality advertised, as far as I can see.


What is "level of quality"? Doesn't this apply to any product?


In this case, it is benchmark performance. See the root post.


[flagged]


That number is a sliding window, isn't it?


I'm not fully onboard with the logic that we just have to live with a certain type of criminal behavior because the technology that could prevent it can be misused to enable another type of criminal behavior. We should aim to stop any kind of criminal behavior.


> I'm not fully onboard with the logic that we just have to live with a certain type of criminal behavior because the technology that could prevent it can be misused to enable another type of criminal behavior. We should aim to stop any kind of criminal behavior.

I don’t think anyone is making a claim that we should live with this according to first principles. I think people are saying this trade-off currently exists because it doesn’t seem to be economically or technologically feasible to solve both well.

How do you propose making an improvement to tracking technology that reduces theft while at the same time not assisting stalking?

One idea: if you report your AirTag as stolen, then it can continue to track the item, but you lose the ability to see where it is. In so doing you hand off tracking capability to some authority. This could be an improvement to the extent that the authority is trustworthy and well behaved. Unfortunately, such properties are not guaranteed across the globe. This would create more incentives for bribery for example.


Even in most first world countries the police won't help for the theft of an item of small value like a bag or even a bike.


We should, but also we should prioritise more harmful behaviour being prevented over less harmful behaviour, and stalking/harassment is in my opinion more harmful than property theft.


Not on Earth, no.

It would be if stalking happened at the same frequency as property theft, but the rates are ridiculously lopsided.

So much property theft happens that we don't bother reporting almost any of it.


Frequency isn't really an issue here. I don't care that much if someone steals my luggage. I'd be a little mad if someone took my bike, but I have redundant protection for it, along with other things of more importance, or I keep them on me.

But I'd really, really not like to find out someone was following me around.


If society didn't have to spend the amount of resources that it does dealing with the consequences of personal theft then it would have more resources to direct towards issues like stalking.

I bet Apple could produce some really interesting data from these tools and others that could be used to proactively target stalkers and investigate them before their actions escalate to violence.


Hell yeah, thoughtcrime!

Let's get Tom Cruise in here and whoop some ass!


Now try traveling with $30k of equipment in your luggage, like millions do every year.


You're well beyond the scope of an AirTag at that point. Either you've insured the gear, or you ship it in some more secure fashion, or you have a satellite tracker in it, or whatever other mitigation you can do here. AirTags are great for things you might misplace, more than anything.


> It would be if stalking happened at the same frequency as property theft, but the rates are ridiculously lopsided.

But the impact of the two activities is also lopsided:

* https://en.wikipedia.org/wiki/Risk_matrix

Stalking can potentially result in rape and death, even if there's a low probability of stalking happening in general.


In my experience, there seems to be a limitless supply of newly crowned "AI shamans" sprouting from the deepest corners of LinkedIn. All of them make the laughable claim that hallucinations can be fixed by prompting. And of course it's only their prompt that works -- don't listen to the other shamans, those are charlatans.

If you disagree with them by explaining how LLMs actually work, you get two or three screenfuls of text in response, invariably starting with "That's a great point! You're correct to point out that..."

Avoid those people if you want to keep your sanity.


> Not sure whether the obfuscation is fully synchronous, i.e waiting for the server response before continuing.

The people who designed SSH aren't idiots, and also, you can answer this question by simple observation: When you connect to a server with ~200ms ping, which is somewhat common in the scenarios you describe and which I've done many times, it does not take 20 seconds to show a keystroke.


Postponed.


> apparently

When someone takes the time to explain undergrad-level concepts in a comment, responding with "are you an expert?" is a level of skepticism that's bordering on hostile. The person you're responding to is correct, it's rare that the theorem statement itself is particularly hard to formalize. Whatever you read likely refers to the difficulty of formalizing a proof.


To be fair, the comment did not explain any concept that I can see, or why this statement is simple. It gave the statement and said it was simple to formalise. It does seem simple enough to me (basic arithmetic statement with a few variables and a bunch of quantifiers) but if somebody has no expertise/intuition, I think it is a fair question, without any hostile intent assumed.
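For a concrete sense of what "simple to formalise" means for a statement of this shape (quantifiers over naturals plus a basic arithmetic predicate), here is a sketch in Lean 4 with Mathlib, using the infinitude of primes as a stand-in example, since the thread doesn't name the actual theorem:

```lean
import Mathlib.Data.Nat.Prime.Basic

-- "There are infinitely many primes", stated as: for every n there is
-- a prime at least as large. The statement itself is one line; Mathlib's
-- `Nat.exists_infinite_primes` happens to prove exactly this form.
theorem infinitude_of_primes : ∀ n : ℕ, ∃ p, n ≤ p ∧ p.Prime :=
  fun n => Nat.exists_infinite_primes n
```

The statement fits on one line; it's proofs, not statements, that usually blow up in length when formalized.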


> it's rare that the theorem statement itself is particularly hard to formalize

That's very dependent on the problem area. For example, there's a gap between the high school explanation of the central limit theorem and an actual formalization of it. And when dealing with Turing machines, sometimes you'll say that something grows as, e.g., Omega(n), when what actually happens is that there's some subsequence of inputs for which it does. Generally, for complexity theory, plain-language explanations can be very vague, because of how insensitive the theory is to small changes; you need to operate on a higher level of abstraction to have a chance of explaining a proof in reasonable time.
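The two readings of "grows as Omega(n)" can be written out explicitly (standard textbook notation, not taken from the thread): the strong Knuth-style reading that holds for all sufficiently large inputs, versus the "infinitely often" reading common in complexity lower bounds, where only a subsequence of inputs witnesses the growth.

```latex
% Knuth-style \Omega: the bound holds for every sufficiently large n.
f(n) = \Omega(n) \iff \exists c > 0,\ \exists n_0,\ \forall n \ge n_0:\ f(n) \ge c \cdot n

% "Infinitely often" \Omega: only a subsequence of inputs attains the bound.
f(n) = \Omega_{\mathrm{i.o.}}(n) \iff \exists c > 0,\ \forall n_0,\ \exists n \ge n_0:\ f(n) \ge c \cdot n
```

A plain-language claim that elides the difference between these two quantifier orders is exactly the kind of thing a formalization forces you to pin down.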


Yes, if the theorem statement itself is "hard to formalize" even given our current tools, formal foundations etc. for this task, this suggests that the underlying math itself is still half-baked in some sense, and could be improved to better capture the concepts we're interested in. Much of analysis-heavy math is in that boat at present, compared to algebra.


Lol, it's weird seeing high school redditors saying "gatekeeping" and "are you an expert" in the same thread as university professors, all talking about the same topic. But I guess that's HN for you.


I think it was a fine question to ask in the context of a discussion of epistemology.

