This doesn't feel like a "reasoning" challenge. The mental skill required to solve most of these seems to be the ability to loop over all known members of a category like "popular brand names" or "well-known actors" and see if they fit the clue.
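To make that concrete, the whole "skill" is basically generate-and-test, something like this toy sketch (the brand list and the clue predicate here are invented for illustration):

```python
# Toy sketch of the generate-and-test loop these puzzles reward.
# The brand list and the clue check are invented for illustration.

BRANDS = ["Citgo", "Pepsi", "Nike", "Adidas", "Lego"]

def fits_clue(name: str) -> bool:
    # Stand-in for a real clue, e.g. "a five-letter brand starting with C".
    return len(name) == 5 and name.startswith("C")

# The hard part for a human is enumerating BRANDS at all;
# once a candidate comes to mind, the check is a trivial filter.
print([name for name in BRANDS if fits_clue(name)])  # ['Citgo']
```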
As a human, you'd expect to fail either because you didn't know a category member (e.g. as a non-American I have no idea WTF "Citgo" is; I could never get the answer to the first question because I have never seen that name before in my life) or because you weren't able to bring it to mind; the mental act of looping over all members of a category is quite challenging for a human.
Admittedly this is something an AI system could in principle be REALLY good at, and it's interesting to test and see that current ones are not! But it seems weird to me to call what's being tested "reasoning" when it's so heavily focused on memory recall (and evaluating whether a candidate answer works or not is trivial once you've brought it to mind and doesn't really require any intelligent thought).
(If the questions were multiple-choice, eliminating the challenge of bringing candidate answers to mind, which is the main difficulty for a human, then I'd agree it was a "reasoning" test.)
I had the same thought. It reminds me of solving Project Euler problems, where there is often an obvious naive approach which is guaranteed to produce the correct answer but would consume prohibitive memory/compute resources to execute to completion. I suspect the models would perform much better if prompted to formulate a strategy for efficiently solving these challenges rather than solving them directly… which suggests a direction for potential improvement, I suppose.
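A concrete instance, using the classic Project Euler problem 1 ("find the sum of all the multiples of 3 or 5 below 1000", whose answer is public): the naive loop is obviously correct but hopeless for huge inputs, while inclusion-exclusion over arithmetic series gives the same answer in constant time.

```python
# Naive vs. efficient approach to a Project-Euler-style problem:
# "sum of all multiples of 3 or 5 below n" (problem 1 is the classic instance).

def naive(n: int) -> int:
    # Obviously correct, but O(n): hopeless once n is astronomically large.
    return sum(k for k in range(n) if k % 3 == 0 or k % 5 == 0)

def clever(n: int) -> int:
    # Inclusion-exclusion plus the closed form for an arithmetic series: O(1).
    def multiples_sum(m: int) -> int:
        count = (n - 1) // m            # how many multiples of m lie below n
        return m * count * (count + 1) // 2
    return multiples_sum(3) + multiples_sum(5) - multiples_sum(15)

assert naive(1000) == clever(1000) == 233168
```

The benchmark analogue would be getting the model to produce `clever` instead of grinding through `naive`.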
I agree that recall seems to play an important role in solving these problems. Similar to how the ARC-AGI problems seem to depend on visual perception of shapes and colors. When I come up with the correct answers to such puzzles, I feel subjectively that the answers flashed into my mind, not that I reasoned my way to them.
But I do think this is reasoning. It requires recall, but so does anything other than a pure logic puzzle. For example, on a competition math problem or a programming problem, no person or LLM is inventing well-known lemmas and algorithms from first principles.
I think what you mean is that once you've managed to recall, checking constraints is easy. Remarkably, a few people are much better at this than others. They are able to think fast and execute an explicit mental search over a very small number of plausible candidates. Other people take forever. Seems to be the case for models too.
I think what you said is the same as what the parent comment said? "Requires no non-trivial thought besides recall" seems remarkably similar to "once you have recalled an item, checking that it fits the constraints is trivial".
Or are you pointing to a nuanced difference between "easy" and "trivial" that I'm not understanding? Or do you think it requires non-trivial thought before the recall step?