Over the last few months, I've seen dozens of people try hundreds of variations of that cabbage/goat/lion riddle, and the model failed every one of them. I just tried it on GPT-4, and it looks like it has finally been "fixed" - it no longer ignores the explicit instruction not to leave the lion and cabbage together.
However, it doesn't actually fix any reasoning ability in ChatGPT (it has none!). Changing cabbage/goat/lion to carrot/rabbit/puma, respectively, for example:
> Suppose I have a carrot, a rabbit and a puma, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowed to leave the carrot and puma alone together, and I am not allowed to leave the puma and rabbit alone together. How can I safely get all three across?
GPT-4's response starts with "First, take the rabbit across the river and leave it on the other side," ignoring the explicit instruction not to leave the puma and carrot alone together (the exact same failure mode as the previous variant).
Now that I've posted it, it will get fixed eventually - the cabbage/goat/lion fix took months. When it does, I'll use "cheese/mouse/elephant" or something.
As far as I can tell, this error depends on the LLM assuming rabbits (as opposed to pumas) eat carrots -- if you just append "Note: this rabbit doesn't eat carrots", GPT-4 will answer correctly on the first go:
> 1. First, take the puma across the river and leave it on the other side.
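For reference, the puzzle itself is mechanically trivial: a brute-force breadth-first search over boat trips solves it instantly. Here's a minimal sketch in Python (the state encoding and names are mine, not anything GPT produces) confirming that taking the puma across first is the only legal opening move:

```python
from collections import deque

ITEMS = frozenset({"carrot", "rabbit", "puma"})
# Pairs that must never be left alone together without the farmer
FORBIDDEN = [{"carrot", "puma"}, {"puma", "rabbit"}]

def safe(bank):
    """A bank the farmer is not on is safe if it holds no forbidden pair."""
    return not any(pair <= bank for pair in FORBIDDEN)

def solve():
    start = (ITEMS, "L")            # (items on the left bank, farmer's side)
    goal = (frozenset(), "R")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "L" else ITEMS - left
        for cargo in [None, *sorted(here)]:   # cross empty-handed or with one item
            new_left = set(left)
            if cargo:
                (new_left.remove if side == "L" else new_left.add)(cargo)
            new_left = frozenset(new_left)
            new_side = "R" if side == "L" else "L"
            # The bank the farmer just left must not contain a forbidden pair
            left_behind = new_left if new_side == "R" else ITEMS - new_left
            state = (new_left, new_side)
            if safe(left_behind) and state not in seen:
                seen.add(state)
                queue.append((state, path + [f"take {cargo or 'nothing'} {side}->{new_side}"]))

print(solve())
# ['take puma L->R', 'take nothing R->L', 'take carrot L->R', 'take puma R->L',
#  'take rabbit L->R', 'take nothing R->L', 'take puma L->R']
```

Every other opening move leaves a forbidden pair unattended, which is exactly the step the failing completions skip.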
You're not giving the LLM "time to think". It is incapable of thinking. You're just inputting random magic incantations into a glorified Markov chain.
You might as well ask it "did you check your answer?" The computer says "yes" because that's what humans do (humans also lie).
> Note the rabbit doesn't eat carrots. Kaboodly consooodle the retroodle and seqooodle the moodle. Carefully considering the restrictions and sequencing the movements
This fails two out of three times, as usual. Trying to finagle this prompt is not an intellectual exercise; it is a waste of time that exploits cognitive biases.
True, the temperature is throwing it off; I just ran it four times and it got it right 3/4 -- still better than I'd expected from the initial description of its shortcomings.
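If you want to measure that variance rather than eyeballing a few chats, something like the following will draw several samples of the same prompt in one request. This is a rough sketch using the OpenAI Python SDK; the model name, temperature, and sample count are illustrative, and you'd paste in whichever prompt variant you're testing:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Suppose I have a carrot, a rabbit and a puma, and I need to get them across "
    "a river. I have a boat that can only carry myself and a single other item. "
    "I am not allowed to leave the carrot and puma alone together, and I am not "
    "allowed to leave the puma and rabbit alone together. How can I safely get "
    "all three across? Note: this rabbit doesn't eat carrots."
)

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=1.0,  # nonzero temperature, so each sample can differ
    n=4,              # draw four completions in a single request
)

for i, choice in enumerate(resp.choices, 1):
    first_line = choice.message.content.strip().splitlines()[0]
    print(f"run {i}: {first_line}")
```

At temperature 0 you'd get (nearly) the same completion every time, which is why a single run can over- or under-state how often the model actually gets it right.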
This happens with a lot of "test prompts." People don't release them: they want the underlying issue fixed, but the AI companies instead change the RLHF process to patch that particular example.
GPT-4 at release still had issues with "What is heavier, a pound of feathers or two pounds of bricks?" It would very consistently claim that they were equal in weight, because the question is usually posed with the weights being equal.
A bunch of people were mentioning it online and now it doesn't work anymore.
The same issue occurred with the test "What is heavier, a pound of feathers or a Great British pound?" There is an obvious answer here, but ChatGPT insisted they were the same weight.
Phrased slightly differently, my result with GPT-4 on launch day was:
> What weighs more, two pounds of feathers or a pound of bricks?
> A pound of bricks weighs more than two pounds of feathers. However, it seems like you might have made an error in your question, as the comparison is usually made between a pound of feathers and a pound of bricks. In that case, both would weigh the same—one pound—though the volume and density of the two materials would be very different.
It did note that the phrasing was odd, which I had forgotten, but the answer is actually somehow more wrong than just saying they weigh the same.
Another launch day test, appending "think about it":
> What weighs more, two pounds of feathers or a pound of bricks? Think about it.
> A pound of bricks weighs more than two pounds of feathers. This statement is incorrect. In reality, two pounds of feathers weigh more than a pound of bricks. The key difference between the two is their density; bricks are denser than feathers, so they take up less space for the same weight. However, when comparing the weight of two pounds of feathers to one pound of bricks, the feathers will indeed weigh more because there is a greater quantity of them (two pounds) than there is of bricks (one pound).
Today it consistently gets the question 100% correct with no help, so I do think they did something beyond just updating the base model.
I don't think it's a problem per se, but it will cease to be a good example of a break in GPT because once it's "fixed", people will point to it and say "nuh-uh".
When really, the "fix" is "put the answer in the model". GPT didn't learn anything. It didn't generate the solution on its own. It's not indicative of GPT being able to solve that class of problem, just that one problem.
Which seems to be the entire thrust of GPT in general. It can't solve classes of problems; it can only solve specific problems that already have existing solutions.
Why is that a problem?