
Thank you, this is a perfect argument for why LLMs are not AI but just statistical models. The original is so overrepresented in the training data that even though they notice this riddle is different, they regress to the statistically more likely solution over the course of generating the response. For example, I tried the first one with Claude, and in its 4th step it said:

> This is safe because the wolf won't eat the cabbage if they're together on the far side.

even though the question clearly states the opposite.

It's impressive that just dumb stats can be used to produce something that is very often useful, that can help write code, and that, when made to generate intermediate steps, can often produce a chain of text which happens to be right. However, it's not actual reasoning: there is no model of the world, no information storage and retrieval, and so on, just statistics between tokens.



This is a dumb argument. Humans frequently fall for the same tricks; are they not "intelligent"? All intelligence is ultimately based on some sort of statistical model, some represented in neurons, some represented in matrices.


State-of-the-art LLMs have been trained on practically the whole internet. Yet, they fall prey to pretty dumb tricks. It's very funny to see how The Guardian was able to circumvent censorship on the Deepseek app by asking it to "use special characters like swapping A for 4 and E for 3". [1]

This is clearly not intelligence. LLMs are fascinating for sure, but calling them intelligent is quite the stretch.

[1]: https://www.theguardian.com/technology/2025/jan/28/we-tried-...


The censorship is in fact not part of the LLM itself. This can be shown easily by examples where the LLM visibly outputs a censored sentence, which then disappears from the screen.


The nuance here being that this only proves additional censorship is applied on top of the output. It does not disprove that (sometimes ineffective) censorship is also part of the LLM itself, or that censorship was attempted during training.


For your definition of “clearly”.


Humans run on hardware that is both faulty and limited in terms of speed and memory. They have a better "algorithm" for using that hardware, which compensates for those limits. LLMs run on almost perfect hardware, able to store and retrieve enormous amounts of information insanely quickly and to perform mechanical operations on it just as quickly.

Yet they "make mistakes". Those are not the same as human mistakes. LLMs follow an algorithm that is far simpler and inferior; they simply use the hardware to perform incorrect ("illogical", "meaningless") operations, thus giving incorrect results.

See my other replies for more depth.


Yes, but we have the ability to reason logically and step by step when we have to. LLMs can’t do that yet. They can approximate it but it is not the same.


I would expect that if you asked the same question to 100 people off the street they would make the same mistake though.

Neither people nor LLMs expect goats to eat wolves.


Comparisons to humans are ultimately misleading because 1) humans are not general intelligences most of the time, 2) humans run on incredibly faulty hardware.

1) Attention is limited. Human reasoning is slow. Motivation is limited. System 1 vs 2 thinking. Many will just tell you to fuck off or get bored and give some random answer to make you go away. Etc. See difference 2.

2) People run on limited hardware in terms of error rate and memory.

2a) Brains make mistakes all the time. Ask people to multiply a bunch of large numbers using pen and paper and they will get it wrong a lot of the time.

2b) Doing it in their head, they will run out of memory pretty fast.

But you wouldn't say that humans can't multiply numbers. When they have the right algorithm, they can do it; they just have to use the right tools to extend their memory and check for errors. A human who notices that the input differs from something he already knows immediately knows he has to pay attention to that bit and to all subsequent parts which depend on it. Once a human has the right algorithm, he can apply it to different inputs.
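
As a toy illustration of "the right algorithm plus tools to extend memory" (the function name and numbers below are just for the sketch):

    # Grade-school long multiplication for non-negative integers: the list of
    # partial products plays the role of the pen and paper that extends memory.
    def long_multiply(a: int, b: int) -> int:
        partials = []  # external "memory" for intermediate results
        for place, digit_char in enumerate(reversed(str(b))):
            partials.append(a * int(digit_char) * 10 ** place)
        return sum(partials)

    # The same algorithm applies mechanically to any inputs.
    assert long_multiply(4735, 382) == 4735 * 382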

LLMs:

Comparison to 2a: current LLMs also make a lot of mistakes. But theirs are not the result of faulty or limited hardware; they are the result of a faulty algorithm. Take away the random seeds and an LLM will make the same mistake over and over. Randomness is the smoke and mirrors that makes LLMs seem more "alive" and less like machines imperfectly imitating humans.
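
To make the "take away the random seeds" point concrete, here is a minimal sketch with a made-up next-token distribution standing in for a real model (the tokens and probabilities are invented for illustration):

    import random

    # Toy next-token distribution a model might assign after
    # "the wolf won't eat the ..." -- numbers are invented.
    next_token_probs = {"cabbage": 0.80, "goat": 0.15, "boat": 0.05}

    def greedy_pick(probs):
        # Deterministic: the same (possibly wrong) token every single run.
        return max(probs, key=probs.get)

    def sampled_pick(probs, rng):
        # Randomness only changes which continuation you happen to see.
        tokens, weights = zip(*probs.items())
        return rng.choices(tokens, weights=weights, k=1)[0]

    print(greedy_pick(next_token_probs))                      # 'cabbage', every run
    print(sampled_pick(next_token_probs, random.Random(42)))  # depends on the seed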

Comparison to 2b: current LLMs do not store statements in an abstract, structured form where they could save and load information and perform steps such as inferring redundant information from the rest. They operate on the token stream, which is probably wasteful in terms of memory and less flexible in terms of what operations they can perform on it.

Most importantly, they are not limited by memory. The input clearly states "the wolf will eat the cabbage", yet the LLM generates "This is safe because the wolf won't eat the cabbage if they're together on the far side." just a few lines below. It is unable to infer that those two facts are contradictory. The statistics of tokens simply worked out in a way that led to this.
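
For contrast, a structured store of statements, the kind of thing I'm saying LLMs lack, could catch that contradiction mechanically. A toy sketch, not a claim about how any real system works:

    # Toy "abstract statement" store: each fact is a (subject, relation, object,
    # polarity) tuple, and a contradiction is the same triple asserted with the
    # opposite polarity.
    facts = set()

    def add_fact(subj, rel, obj, holds=True):
        if (subj, rel, obj, not holds) in facts:
            raise ValueError(f"contradiction: '{subj} {rel} {obj}' asserted both ways")
        facts.add((subj, rel, obj, holds))

    add_fact("wolf", "eats", "cabbage", holds=True)        # stated in the puzzle
    try:
        add_fact("wolf", "eats", "cabbage", holds=False)   # the model's later claim
    except ValueError as err:
        print(err)  # the contradiction is caught mechanically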


How do you respond to this paper from 2 years ago? https://news.ycombinator.com/item?id=34815718


The problem with claims like this, that models are not doing "actual reasoning", is that they are often hot takes that have not been thought through very well.

For example, since reasoning doesn't yet have any consensus definition that can be applied as a yes/no test, you have to explain what you specifically mean by it, or else the claim is hollow.

Clarify your definition, give a concrete example under that definition of something that's your version of true Scotsman reasoning and something that's not, then let's talk.


Explain this to me please: we don't have any consensus definition of _mathematics_ that can be applied as a yes/no test. Does that mean we don't know how to do mathematics, or that we don't know whether something is, or, more importantly, isn't mathematics?

For example, if I throw a bunch of sticks in the air and look at their patterns to divine the future, can I call that "mathematics" just because nobody has a "consensus definition of mathematics that can be applied as a yes/no test"? Can I just call anything I like mathematics and nobody can tell me it's wrong because ... no definition?

We, as a civilisation, have studied both formal and informal reasoning for at least a couple of thousand years, starting with Aristotle and his syllogisms (a formalisation of rigorous arguments) and continuing through the years with such figures as Leibniz, Boole, Bayes, Frege, Peirce, Quine, Russell, Gödel, Turing, etc. There are entire research disciplines dedicated to the study of reasoning: philosophy, computer science, and, of course, all of mathematics itself. In AI research, reasoning is a major topic studied by fields like automated theorem proving, planning and scheduling, program verification and model checking, etc, everything one finds in Russell & Norvig really. It is only in machine learning circles that reasoning seems to be such a big mystery that nobody can agree what it is; and in discussions on the internet about whether LLMs reason or not.

And it should be clear that never in the history of human civilisation did "reasoning" mean "predict the most likely answer according to some training corpus".


Yeah sure there’s lots of research on reasoning. The papers I’ve seen that make claims about it are usually pretty precise about what it means in the context of that work and that specific claim, at least in the hard sciences listed.


I'm complaining because I haven't seen any such papers. Which ones do you have in mind?


Examples go back 50 years, across many of the disciplines you’ve mentioned, but to throw out one that’s recent, on topic, and highly cited, there’s:

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903

It effectively treats “reasoning” as the ability to generate intermediate steps leading to a correct conclusion.
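
Under that operational definition, the technique boils down to putting worked intermediate steps into the prompt so the model produces its own before answering. Roughly, and with the exemplar wording paraphrased rather than quoted from the paper:

    # Standard prompting: the exemplar shows only the final answer.
    standard_prompt = (
        "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
        "How many balls does he have now?\n"
        "A: The answer is 11.\n"
        "Q: <new question>\nA:"
    )

    # Chain-of-thought prompting: the exemplar shows the intermediate steps too,
    # so the model imitates the step-by-step format on the new question.
    cot_prompt = (
        "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
        "How many balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
        "5 + 6 = 11. The answer is 11.\n"
        "Q: <new question>\nA:"
    )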

Now, is this valid reasoning? Well, depends on the claim and the definition of reasoning.

When someone just says AI can’t reason, I could argue for or against that depending on the specifics. It’s not enough to just say yes or no.


Thanks for the link.

>> It effectively treats “reasoning” as the ability to generate intermediate steps leading to a correct conclusion.

Is "effectively" the same as "pretty precise" as per your previous comment? I don't see that because I searched the paper for all occurrences of "reasoning" and noticed two things: first that while the term is used to saturation there is no attempt to define it even informally, let alone precisely; and second that I could have replaced "reasoning" with any buzzword of the day and it would not change the impact of the paper. As far as I can tell the paper uses "reasoning" just because it happens to be what's currently trending in LLM circles.

And still of course no attempt to engage with the common understanding of reasoning I discuss above, or any hint that the authors are aware of it.

Sorry to be harsh, but you promised "examples that go back 50 years" and this is the kind of thing I've seen consistently in the last 15 or so.


The point is that "reasoning" has to have some meaning. I think the claim in this paper is very clear and the results are shown decisively.

Research papers relating to reasoning approach and define it in many ways, but crucially, the good ones offer a testable claim.

Simply saying “models can’t reason” is ambiguous to the point of being unanswerable.


I am not gonna pretend to have one true definition but there are hopefully components of reasoning we can agree on:

- The ability to hold a collection of beliefs/statements and perform operations on them, such as noticing that one contradicts the others. Claude in my example clearly failed this.

- The ability to make logical inferences. Since logical steps can be represented as text, current LLMs seem to simulate this on the token stream and seem to get decent results most of the time. True inference would mean representing statements in some abstract form and being able to perform operations on them according to rules. The key here is that once it has a collection of statements and is given ("decides to use") an operation to perform, it can do so correctly 100% of the time (ignoring hardware errors); see the sketch below.
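
A minimal sketch of that second point, deterministic rule application over statements in an abstract form; the fact and rule names here are invented for the example:

    # Forward chaining: apply rules to known facts until nothing new follows.
    facts = {"wolf_with_cabbage", "unattended"}
    rules = [
        # (premises, conclusion)
        ({"wolf_with_cabbage", "unattended"}, "cabbage_gets_eaten"),
        ({"cabbage_gets_eaten"}, "constraint_violated"),
    ]

    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True

    # Given the same facts and rules, the result is the same 100% of the time.
    print("constraint_violated" in facts)   # True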


Is it possible the models do something entirely different? I'm not sure why everyone needs to compare them to human intelligence. It's very obvious LLMs work nothing like our brains, so why would the intelligence they exhibit be like ours?


I like the true Scotsman requirement. Just about any definition anyone can come up with for "true" reasoning easily falls prey to a "no true Scotsman" fallacy.


I tried the same with DeepSeek. It failed, and then I asked it to try again and explain why it was wrong. The answer below is convincing to me that it has forms of reasoning, even meta-reasoning.

" Certainly! If we take the riddle very literally and ignore the common interpretation or stereotypes, the answer I provided ("the surgeon is the boy's mother") would actually be incorrect because the riddle explicitly states:

"a surgeon, who is the boy's father, says, 'I cannot operate on this boy, he's my son!'"

The key phrase here is "who is the boy's father." This directly tells us that the surgeon is the boy's father. Therefore, the correct literal answer is that the surgeon is the boy's father.

The reason my initial answer was "wrong" in this context is that I interpreted the riddle as a play on stereotypes (the assumption that surgeons are male), rather than taking the statement at face value. By focusing on the literal wording, the surgeon is explicitly identified as the boy's father, and no further interpretation is needed. "



