This paper, among other things, shows that LLMs have dramatically worse performance on basic algebra questions when you add in irrelevant information. The examples are things like "John picked 43 kiwis on Monday, 24 kiwis on Tuesday. On Wednesday, 5 of the kiwis he picked were smaller than usual. Altogether, on Monday, Tuesday, and Wednesday, John picked 87 kiwis. How many kiwis did John pick on Wednesday?" In this question, the remark about some of the kiwis on Wednesday being small is irrelevant, but adding things like this reduces performance on a popular benchmark from 95% to 77% for GPT-4o, for example.
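For reference, the intended arithmetic ignores the small-kiwi remark entirely; the trap being tested is subtracting those 5 smaller kiwis as well. A quick sketch of the two readings (mine, not the paper's):

    monday, tuesday, total = 43, 24, 87
    wednesday = total - monday - tuesday   # the "smaller than usual" remark never enters the calculation
    print(wednesday)                       # 20
    print(wednesday - 5)                   # 15: the distracted answer that also subtracts the small kiwis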
I don't find this very impressive. Forget LLMs for a second. Let's say _you_ read a question of that kind with some bit of irrelevant information. There are two possibilities you have to consider: the question may as well have excluded the irrelevant information, or the question was miswritten and the irrelevant information was meant to be relevant. The latter is a perfectly live possibility, and I don't think it's a dramatic failure to assume that this is correct. I have to confess that when I read some people's LLM gotcha questions, where they take some popular logic puzzle and invert things, I think I would get them "wrong" too. And not wrong because I don't understand the question, but wrong because with no context I'd just assume the inversion was a typo.
The problem here is that throwing in little gotchas like that is a tactic used by math and physics educators to ensure that students actually understand the topic by reasoning through new problems, rather than mindlessly turning the crank from learning the "surface structure" of earlier problem sets. The argument here is that the LLM is not reasoning, it's mindlessly turning a crank.
I don't think this exact question would be out of place on a 6th grade math test. I distinctly remember being taught this skill in "word problems," learning to identify information that actually pertains to the question rather than being distracted by red herrings the teacher threw in.
Indeed, and the ability to make heads or tails of slightly-slippery problems of this sort is an extremely important real-world math skill. It's not extraneous at all.
And their poor performance on these tasks highlights deficits in exactly the kind of higher-order, off-the-page reasoning skills the models are supposed to develop: not just reasoning about the apparent objects in the stream (the kiwis and the numbers in this case), but reasoning about the token stream itself ("okay, these tokens are important, but these others I can leave out"), efficiently and seamlessly, like humans do.
This whole attention business, they're calling it.
In particular, the fact that humans sometimes don't do this, taking the bait with extraneous distractions, is almost always a fairly shallow psychological thing rather than an actual cognitive deficit, e.g. OP hypothetically assuming the question had a typo and trying to read the examiner's mind. In education the gotchas really can be unfair if the (human) student has been conditioned to bark answers but the teacher changes things drastically on an exam. I don't think that's an accurate characterization of this study; even if it were, that would be a problem with shallow LLM training, not mean-spirited evaluation. But I suspect that "barking answers according to surface characteristics" is as far as transformers can go. It certainly is possible that we just need to train transformers better... but there have been some theoretical results suggesting otherwise. [E.g. transformer LLMs + chain-of-thought are pretty good at O(n) problems but struggle with O(n^2), even when the O(n^2) task is an obvious combination of two O(n) tasks they can do.]
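(A concrete, made-up illustration of the kind of composition those results are about, not the exact tasks they used: membership testing in a list is a single O(n) scan; asking whether any element's double is also present is just that same scan repeated n times, i.e. O(n^2).)

    def contains(xs, target):                 # a single scan: O(n)
        return any(x == target for x in xs)

    def any_double_present(xs):               # the same scan run once per element: O(n^2)
        return any(contains(xs, 2 * x) for x in xs)

    print(contains([3, 1, 2], 2))             # True
    print(any_double_present([3, 1, 2]))      # True (2 is the double of 1)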
That leads to a serious annoyance I have with discussing LLMs - humans' capacity for boredom / cynicism / distraction / laziness being used to excuse away what seems to be deep-rooted limitations in LLMs. It simultaneously misunderstands what a human is and what a machine is. ("Sometimes humans also refuse to work" would be a bad excuse from an auto dealer.)
My argument is not that slippery problems are unimportant or extraneous, it's that this paper does not convincingly demonstrate that these models are actually especially bad at this kind of reasoning.
To be clear, the paper's argument isn't that they're "bad at" the reasoning problems, so much as that they're not using reasoning to solve them. In terms of getting the answer, "turning the crank" with a canned solution can be more effective than reasoning through on deeper principles.
Noted, and thanks for clarifying. BTW when I get questions with typos/inversions (that are supposed to be logical or mathy questions), I tend to throw them back at the person asking, rather than simply ploughing forward. But I guess I'm the kind of person who does that sort of thing.
Real discourse has tons of irrelevant information for all sorts of reasons.
There are some contexts, academic or professional, where questions are posed carefully and specifically, but these are narrow contexts.
A useful general purpose assistant needs to be able to find what's relevant among what's irrelevant.
Excellence at just solving math problems that are especially well specified can be a useful domain assistant (no small win!), but is not the same thing.
That said, if you've got a hundred billion dollars betting on your AI project achieving AGI, you benefit a lot by conflating those contexts. In that case, grinding on formal SAT, LSAT, GRE, etc problems amounts to tuning for microbenchmarks rather than real world use cases.
Real discourse is also full of typos which accidentally invert the meaning of things, asking the wrong question for deep reasons, asking the wrong question for shallow reasons, and all of the other things that justify subtracting the below average size kiwis from the final answer.
> Real discourse has tons of irrelevant information for all sorts of reasons.
Real discourse was not carefully crafted to test you.
So, when something is off in real discourse you can usually dismiss it or apply a correction yourself, but when you find it in a test you have to understand the person writing the test and what their intention was.
In real discourse you can also go back and forth with the other person to get clarification, and errors don't matter much because they're temporary on both sides.
I hate academic problems because too often the answer depends on how you interpret that intention. Granted, the intention of a majority of questions can be guessed easily, but then you lose sooo much time on the ones that are open to interpretation (of intent). Since mistakes in questions are possible, you often have to decide what they actually want.
Example, from a truck driver theory test a long time ago: the one question I "failed" (multiple choice). There was a legal limit on how much air pressure a tire was allowed to lose per day, and I knew that limit. The multiple-choice question asked about that; I forget the exact wording, but if I took a mathematically logical approach, then all values over that limit were forbidden. The wording was so strange, though, that I suspected they were actually asking for the concrete limit. I fought with myself for a while, then assumed high intelligence in the person asking the question and clicked not just the exact limit but also the value with an even greater loss of air pressure.
There is also the problem that those academic questions want to steer you down some narrow corridor. The more you know about the problem and its complexities, the harder it is to answer some of those questions! It is often best if the only thing you know about the subject is exactly what was recently taught; any more and you may find yourself in a pickle.
Many of those questions are social constructs as much as tests of one's subject knowledge, assuming some tiny idealized model that you have to know, one that ignores many practical aspects. I'm not talking about explicit models like the Bohr model; those are easy precisely because they are explicit, and you would not get confused by a question assuming the Bohr model just because you know about orbitals. What I mean are the many unstated assumptions that one may not even be aware of until running into an ambiguity.
Filtering out irrelevant info is taught in grade school and is a tested skill on the SAT, for example.
Basically any kind of model (not just LLMs/ML) has to distill out irrelevant info.
The point is having an answer that you can defend logically and that most people would agree with.
If the model said "I'm not sure if this portion is a typo", I guarantee you the model creators would take the RLHF in a different direction, because that is somewhat reasonable and defensible. For your specific question, I personally think there is a single objective answer, though to be fair that isn't always the case for misleading/irrelevant prompts. Either way, the models are being fooled, judging by how they respond.
I say this as an RLHF'er who sees and is told to write similar questions at times.
At the end of the day, this is how the model creators want their models to predict language, and anyone using them is along for the ride.
I think this is valid though. Transformer models don't explicitly do logic but implicitly "vibe" out the answer from the input sequence (using the attention mechanism) and learnt knowledge - they're predicting text sequences, after all. So adding more irrelevant context to the input would quite likely influence the output.
I could see attention possibly being able to overcome this, but if not, that would be a pretty big gotcha for reliability in real-world applications where, as others have said, it's not immediately clear what is relevant info. These models would be a lot less useful if a human had to decide which information to feed them and the output were dependent on human judgement. I understand that's where we're at right now and that they are quite useful already, but the valuations hint at investors expecting more imo.
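To make the "vibing out via attention" point concrete, here is a toy scaled-dot-product attention sketch (made-up vectors, nothing to do with a real model's weights): the distractor token gets a nonzero weight, so its value leaks into the mixture and shifts the output.

    import numpy as np

    def attention(q, K, V):
        # scaled dot-product attention over a handful of toy tokens
        scores = K @ q / np.sqrt(len(q))
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w, w @ V

    q = np.array([1.0, 0.0])                            # what the "question" is asking about
    K = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])  # two relevant tokens, one distractor
    V = np.array([43.0, 24.0, 5.0])                     # the numbers those tokens carry

    print(attention(q, K, V))          # distractor weight ~0.2; output pulled toward 5 (~28)
    print(attention(q, K[:2], V[:2]))  # drop the distractor and the mixture shifts (~34)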
That's not even the problem I encounter. They literally crap out on stupidly simple tasks. Recent ones:
1. Bing was gaslighting me into 9.11 being greater than 9.9
2. ChatGPT said that 7x7/7+7/7+7/7 was 24.
3. When expanding (x+1)^2 the output was 2x^2+2.
Regardless of any level of interpretation and irrelevant information, if it can't deterministically understand correctness and the semantics of the operations in question, then it's fucking useless.
What's worse, in an educational context it is actively harmful.
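For what it's worth, all three of those are trivially machine-checkable, e.g. with Python and sympy:

    from sympy import symbols, expand

    print(9.9 > 9.11)             # True: 9.9 is the larger number
    print(7*7/7 + 7/7 + 7/7)      # 9.0, not 24
    x = symbols('x')
    print(expand((x + 1)**2))     # x**2 + 2*x + 1, not 2*x**2 + 2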
I think it's necessary to remember that they're not a general artificial intelligence, but language models. For example, they're pretty good (not perfect) at translating things, including translating arbitrary instructions into code or machine-readable forms.
Seriously? Those are pretty simple HS math questions; I find it a bit hard to believe most average people can't solve them. Don't most people graduate HS?
Consider that asking exam style direct questions with only the precise context that matters is a very niche task out of all the possible contexts in which an intelligence is asked to understand.
I agree it wasn't that convincing; moreover, the variation wasn't that dramatic for the large SOTA models.
Why should they write a paper about the inherent reasoning capabilities of "large" language models and then, in the abstract, cherry-pick a number from a tiny 1B-parameter model?
I agree that it's not particularly surprising that trying to trick an LLM with irrelevant text makes it perform worse.
I don't see this as a material limitation of LLMs but rather something that can be addressed at the application level by stripping out irrelevant information.
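As a sketch of what I mean by "application level" (hypothetical; ask_llm stands in for whichever completion call you actually use), one could add a pre-pass that asks the model to drop sentences that don't affect the answer before the real solve:

    def strip_distractors(question, ask_llm):
        # ask_llm: any callable that takes a prompt string and returns the model's text
        prompt = (
            "Rewrite the following word problem, removing every sentence that is "
            "not needed to compute the answer. Do not solve it.\n\n" + question
        )
        return ask_llm(prompt)

    def solve(question, ask_llm):
        cleaned = strip_distractors(question, ask_llm)   # pre-pass: filter irrelevant info
        return ask_llm("Solve step by step:\n\n" + cleaned)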
Interestingly, I use deliberately artificial remarks to encourage more "creative" or random outputs from LLMs. With this approach, I'm not seeking an exact or precise response to prompts, but rather something more open-ended.