This paper, among other things, shows that LLMs have dramatically worse performance on basic algebra questions when you add in irrelevant information. The examples are things like "John picked 43 kiwis on Monday, 24 kiwis on Tuesday. On Wednesday, 5 of the kiwis he picked were smaller than usual. Altogether, on Monday, Tuesday, and Wednesday, John picked 87 kiwis. How many kiwis did John pick on Wednesday?" In this question, the remark about some of the kiwis on Wednesday being small is irrelevant, but adding things like this reduces performance on a popular benchmark from 95% to 77% for GPT-4o, for example.
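For reference, the intended arithmetic ignores the small-kiwi remark entirely; the trap being tested is subtracting those 5 smaller kiwis as well. A quick sketch of the two readings (mine, not the paper's):

    monday, tuesday, total = 43, 24, 87
    wednesday = total - monday - tuesday   # the "smaller than usual" remark never enters the calculation
    print(wednesday)                       # 20
    print(wednesday - 5)                   # 15: the distracted answer that also subtracts the small kiwis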
I don't find this very impressive. Forget LLMs for a second. Let's say _you_ read a question of that kind with some bit of irrelevant information. There are two possibilities you have to consider: the question may as well have excluded the irrelevant information, or the question was miswritten and the irrelevant information was meant to be relevant. The latter is a perfectly live possibility, and I don't think it's a dramatic failure to assume that this is correct. I have to confess that when I read some people's LLM gotcha questions, where they take some popular logic puzzle and invert things, I think I would get them "wrong" too. And not wrong because I don't understand the question, but wrong because with no context I'd just assume the inversion was a typo.
The problem here is that throwing in little gotchas like that is a tactic used by math and physics educators to ensure that students actually understand the topic by reasoning through new problems, rather than mindlessly turning the crank from learning the "surface structure" of earlier problem sets. The argument here is that the LLM is not reasoning, it's mindlessly turning a crank.
I don't think this exact question would be out of place on a 6th grade math test. I distinctly remember being taught this skill in "word problems," learning to identify information that actually pertains to the question rather than being distracted by red herrings the teacher threw in.
Indeed, and the ability to make heads or tails of slightly-slippery problems of this sort is an extremely important real-world math skill. It's not extraneous at all.
And their poor performance on these tasks highlights deficits in exactly the kind of higher-order, off-the-page reasoning skills the models are supposed to develop: not just reasoning about the apparent objects in the stream (the kiwis and the numbers in this case), but reasoning about the token stream itself ("okay, these tokens are important, but these others I can leave out"), efficiently and seamlessly, like humans do.
This whole attention business, they're calling it.
In particular, the fact that humans sometimes don't do this, taking the bait with extraneous distractions, is almost always a fairly shallow psychological thing rather than an actual cognitive deficit, e.g. OP hypothetically assuming the question had a typo and trying to read the examiner's mind. In education the gotchas really can be unfair if the (human) student has been conditioned to bark answers but the teacher changes things drastically on an exam. I don't think that's an accurate characterization of this study; even if it were, that would be a problem with shallow LLM training, not mean-spirited evaluation. But I suspect that "barking answers according to surface characteristics" is as far as transformers can go. It certainly is possible that we just need to train transformers better... but there have been some theoretical results suggesting otherwise. [E.g. transformer LLMs + chain-of-thought are pretty good at O(n) problems but struggle with O(n^2), even when the O(n^2) task is an obvious combination of two O(n) tasks they can do.]
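(A concrete, made-up illustration of the kind of composition those results are about, not the exact tasks they used: membership testing in a list is a single O(n) scan; asking whether any element's double is also present is just that same scan repeated n times, i.e. O(n^2).)

    def contains(xs, target):                 # a single scan: O(n)
        return any(x == target for x in xs)

    def any_double_present(xs):               # the same scan run once per element: O(n^2)
        return any(contains(xs, 2 * x) for x in xs)

    print(contains([3, 1, 2], 2))             # True
    print(any_double_present([3, 1, 2]))      # True (2 is the double of 1)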
That leads to a serious annoyance I have with discussing LLMs - humans' capacity for boredom / cynicism / distraction / laziness being used to excuse away what seems to be deep-rooted limitations in LLMs. It simultaneously misunderstands what a human is and what a machine is. ("Sometimes humans also refuse to work" would be a bad excuse from an auto dealer.)
My argument is not that slippery problems are unimportant or extraneous, it's that this paper does not convincingly demonstrate that these models are actually especially bad at this kind of reasoning.
To be clear, the paper's argument isn't that they're "bad at" the reasoning problems, so much as that they're not using reasoning to solve them. In terms of getting the answer, "turning the crank" with a canned solution can be more effective than reasoning through on deeper principles.
Noted, and thanks for clarifying. BTW when I get questions with typos/inversions (that are supposed to be logical or mathy questions), I tend to throw them back at the person asking, rather than simply ploughing forward. But I guess I'm the kind of person who does that sort of thing.
Real discourse has tons of irrelevant information for all sorts of reasons.
There are some contexts, academic or professional, where questions are posed carefully and specifically, but these are narrow contexts.
A useful general purpose assistant needs to be able to find what's relevant among what's irrelevant.
Excellence at just solving math problems that are especially well specified can be a useful domain assistant (no small win!), but is not the same thing.
That said, if you've got a hundred billion dollars betting on your AI project achieving AGI, you benefit a lot by conflating those contexts. In that case, grinding on formal SAT, LSAT, GRE, etc problems amounts to tuning for microbenchmarks rather than real world use cases.
Real discourse is also full of typos which accidentally invert the meaning of things, asking the wrong question for deep reasons, asking the wrong question for shallow reasons, and all of the other things that justify subtracting the below average size kiwis from the final answer.
> Real discourse has tons of irrelevant information for all sorts of reasons.
Real discourse was not carefully crafted to test you.
So, when something is off in real discourse you can usually dismiss it or apply a correction yourself, but when you find it in a test you have to understand the person writing the test and what their intention was.
In real discourse you can also go back and forth with the other person to get clarification, and errors don't matter much because they're temporary on both sides.
I hate academic problems because too often the answer depends on how you interpret that intention. Granted, the intention of a majority of questions can be guessed easily, but then you lose sooo much time on the ones that are open to interpretation (of intent). Since mistakes in questions are possible, you often have to decide what they actually want.
Example, from a truck driver theory test a long time ago: the one question I "failed" (multiple choice). There was a legal limit on how much air pressure a tire was allowed to lose per day, and I knew that limit. The multiple-choice question asked about that; I forget the exact wording, but if I took a mathematically logical approach, then all values over that limit were forbidden. The wording was so strange, though, that I suspected they were actually asking for the concrete limit. I fought with myself for a while, then assumed high intelligence in the person asking the question and clicked not just the exact limit but also the value with an even greater loss of air pressure.
There is also the problem that those academic questions want to steer you down some narrow corridor. The more you know about the problem and its complexities, the harder it is to answer some of those questions! It is often best if the only thing you know about the subject is exactly what was recently taught; any more and you may find yourself in a pickle.
Many of those questions are social constructs as much as tests of one's subject knowledge, assuming some tiny idealized model that you have to know, one that ignores many practical aspects. I'm not talking about explicit models like the Bohr model; those are easy precisely because they are explicit, and you would not get confused by a question assuming the Bohr model just because you know about orbitals. What I mean are the many unstated assumptions that one may not even be aware of until running into an ambiguity.
Filtering out irrelevant info is taught in grade school and is a tested skill on the SAT, for example.
Basically any kind of model (not just LLMs/ML) has to distill out irrelevant info.
The point is having an answer that you can defend logically and that most people would agree with.
If the model said "I'm not sure if this portion is a typo", I guarantee you the model creators would take the RLHF in a different direction, because that is somewhat reasonable and defensible. For your specific question, I personally think there is a single objective answer, though to be fair that isn't always the case for misleading/irrelevant prompts. Either way, the models are being fooled, judging by how they respond.
I say this as an RLHF'er who sees and is told to write similar questions at times.
At the end of the day, this is how the model creators want their models to predict language, and anyone using them is along for the ride.
I think this is valid though. Transformer models don't explicitly do logic but implicitly "vibe" out the answer from the input sequence (using the attention mechanism) and learnt knowledge - they're predicting text sequences, after all. So adding more irrelevant context to the input would quite likely influence the output.
I could see attention possibly being able to overcome this, but if not, that would be a pretty big gotcha for reliability in real-world applications where, as others have said, it's not immediately clear what is relevant info. These models would be a lot less useful if a human had to decide which information to feed them and the output were dependent on human judgement. I understand that's where we're at right now and that they are quite useful already, but the valuations hint at investors expecting more imo.
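To make the "vibing out via attention" point concrete, here is a toy scaled-dot-product attention sketch (made-up vectors, nothing to do with a real model's weights): the distractor token gets a nonzero weight, so its value leaks into the mixture and shifts the output.

    import numpy as np

    def attention(q, K, V):
        # scaled dot-product attention over a handful of toy tokens
        scores = K @ q / np.sqrt(len(q))
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w, w @ V

    q = np.array([1.0, 0.0])                            # what the "question" is asking about
    K = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # two relevant tokens, one distractor
    V = np.array([43.0, 24.0, 5.0])                     # the numbers those tokens carry

    print(attention(q, K, V))          # distractor weight ~0.2; output pulled toward 5 (~28)
    print(attention(q, K[:2], V[:2]))  # drop the distractor and the mixture shifts (~34)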
That's not even the problem I encounter. They literally crap out on stupidly simple tasks. Recent ones:
1. Bing was gaslighting me into 9.11 being greater than 9.9
2. ChatGPT said that 7x7/7+7/7+7/7 was 24.
3. When expanding (x+1)^2 the output was 2x^2+2.
Regardless of any level of interpretation and irrelevant information, if it can't deterministically understand correctness and the semantics of the operations in question, then it's fucking useless.
What's worse, in an educational context it is actively harmful.
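For what it's worth, all three of those are trivially machine-checkable, e.g. with Python and sympy:

    from sympy import symbols, expand

    print(9.9 > 9.11)             # True: 9.9 is the larger number
    print(7*7/7 + 7/7 + 7/7)      # 9.0, not 24
    x = symbols('x')
    print(expand((x + 1)**2))     # x**2 + 2*x + 1, not 2*x**2 + 2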
I think it's necessary to remember that they're not a general artificial intelligence, but language models. For example, they're pretty good (not perfect) at translating things, including translating arbitrary instructions into code or machine-readable forms.
Seriously? Those are pretty simple HS math questions; I find it a bit hard to believe most average people can't solve them. Don't most people graduate HS?
Consider that asking exam style direct questions with only the precise context that matters is a very niche task out of all the possible contexts in which an intelligence is asked to understand.
I agree it wasn't that convincing; moreover, the variation wasn't that dramatic for the large SOTA models.
Why should they write a paper about the inherent reasoning capabilities of "large" language models and then, in the abstract, cherry-pick a number from a tiny 1B-parameter model?
I agree that it's not particularly surprising that trying to trick an LLM with irrelevant text makes it perform worse.
I don't see this as a material limitation of LLMs but rather something that can be addressed at the application level by stripping out irrelevant information.
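As a sketch of what I mean by "application level" (hypothetical; ask_llm stands in for whichever completion call you actually use), one could add a pre-pass that asks the model to drop sentences that don't affect the answer before the real solve:

    def strip_distractors(question, ask_llm):
        # ask_llm: any callable that takes a prompt string and returns the model's text
        prompt = (
            "Rewrite the following word problem, removing every sentence that is "
            "not needed to compute the answer. Do not solve it.\n\n" + question
        )
        return ask_llm(prompt)

    def solve(question, ask_llm):
        cleaned = strip_distractors(question, ask_llm)   # pre-pass: filter irrelevant info
        return ask_llm("Solve step by step:\n\n" + cleaned)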
Interestingly, I use deliberately artificial remarks to encourage more "creative" or random outputs from LLMs. With this approach, I'm not seeking an exact or precise response to prompts, but rather something more open-ended.