Seems like they caught up because I have posted this before, including in ChatGPT. All that means is you have to change it up slightly.
Unfortunately, “change it up slightly” is not specific enough for anyone to act on, and anything more specific eventually ends up in the LLM's training, so it stops proving the point.
It also means that you should update your belief about the reasoning capabilities of LLMs at least slightly. If disconfirming evidence doesn’t shake your beliefs at all, you don’t really have beliefs, you have an ideology.
The observation here is far too easily explained in other ways to be considered particularly strong evidence.
Memorizing a solution to a classic brainteaser is not the same as having the reasoning skills needed to solve it. Memorizing separate solutions to related problems might let someone pattern-match, but not understand. This is about as true for humans as for LLMs. Lots of people ace their courses, even at university level, yet answer probing questions in ways that demonstrate a stunning lack of comprehension.
Or it just means anything shared on the internet gets RLHF’d / special cased.
It’s been clear for a long time that the major vendors have been watching online chatter and tidying up well-known edge cases by hand. If you have a test that works, it will keep working as long as you don’t share it widely enough to get their attention.
"It also means that you should update your belief about the reasoning capabilities of LLMs at least slightly."
AI, LLMs, ML: none of them have reasoning ability. They’re not human; they are software machines, not people. People reason; machines calculate and imitate, they do not reason.
People are just analog neural circuits. They are not magic. The human brain is one physical implementation of intelligence. It is not the only one possible.
I don't want to sound like a prick, but this is generally what people mean when they say "things are moving very fast due to the amount of investment". Six-month-old beliefs can be scrapped due to new models, new research, etc.
Where "etc." includes lots of engineers who are polishing up the frontend to the model which translates text to tokens and vice versa, so that well-known demonstrations of the lack of reasoning in the model cannot be surfaced as easily as before? What reason is there to believe that the generated output, while often impressive, is based on something that resembles understanding, something that can be relied on?
https://chatgpt.com/c/6760a0a0-fa34-800c-9ef4-78c76c71e03b