Over the last few months, I've seen dozens of people try hundreds of variations of that cabbage/goat/lion riddle, and the model failed every one of them. I just tried it on GPT-4, and it looks like it has finally been "fixed" - it no longer ignores the explicit instruction not to leave the lion and cabbage together.
However, it doesn't actually fix any reasoning ability in ChatGPT (it has none!). Changing cabbage/goat/lion to carrot/rabbit/puma, respectively, for example:
> Suppose I have a carrot, a rabbit and a puma, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowed to leave the carrot and puma alone together, and I am not allowed to leave the puma and rabbit alone together. How can I safely get all three across?
GPT-4's response starts with "First, take the rabbit across the river and leave it on the other side," ignoring the explicit instruction not to leave the puma and carrot alone together (the exact same failure mode as the previous variant).
Now that I've posted it, it will get fixed eventually - the cabbage/goat/lion fix took months. When it does, I'll use "cheese/mouse/elephant" or something.
As far as I can tell, this error depends on the LLM assuming rabbits (as opposed to pumas) eat carrots -- if you just append "Note: this rabbit doesn't eat carrots", GPT-4 will answer correctly on the first go:
> 1. First, take the puma across the river and leave it on the other side.
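For reference, the puzzle itself is mechanically trivial: a brute-force breadth-first search over boat trips solves it instantly. Here's a minimal sketch in Python (the state encoding and names are mine, not anything GPT produces) confirming that taking the puma across first is the only legal opening move:

```python
from collections import deque

ITEMS = frozenset({"carrot", "rabbit", "puma"})
# Pairs that must never be left alone together without the farmer
FORBIDDEN = [{"carrot", "puma"}, {"puma", "rabbit"}]

def safe(bank):
    """A bank the farmer is not on is safe if it holds no forbidden pair."""
    return not any(pair <= bank for pair in FORBIDDEN)

def solve():
    start = (ITEMS, "L")            # (items on the left bank, farmer's side)
    goal = (frozenset(), "R")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, side), path = queue.popleft()
        if (left, side) == goal:
            return path
        here = left if side == "L" else ITEMS - left
        for cargo in [None, *sorted(here)]:   # cross empty-handed or with one item
            new_left = set(left)
            if cargo:
                (new_left.remove if side == "L" else new_left.add)(cargo)
            new_left = frozenset(new_left)
            new_side = "R" if side == "L" else "L"
            # The bank the farmer just left must not contain a forbidden pair
            left_behind = new_left if new_side == "R" else ITEMS - new_left
            state = (new_left, new_side)
            if safe(left_behind) and state not in seen:
                seen.add(state)
                queue.append((state, path + [f"take {cargo or 'nothing'} {side}->{new_side}"]))

print(solve())
# ['take puma L->R', 'take nothing R->L', 'take carrot L->R', 'take puma R->L',
#  'take rabbit L->R', 'take nothing R->L', 'take puma L->R']
```

Every other opening move leaves a forbidden pair unattended, which is exactly the step the failing completions skip.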
You're not giving the LLM "time to think". It is incapable of thinking. You're just inputting random magic incantations into a glorified Markov chain.
You might as well ask it "did you check your answer?" The computer says "yes" because that's what humans do (humans also lie).
> Note the rabbit doesn't eat carrots. Kaboodly consooodle the retroodle and seqooodle the moodle. Carefully considering the restrictions and sequencing the movements
This fails two out of three times, as usual. Trying to finagle this prompt is not an intellectual exercise; it is a waste of time that exploits cognitive biases.
True, the temperature is throwing it off; I just ran it four times and it got it right 3/4 -- still better than I'd expected from the initial description of its shortcomings.
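If you want to measure that variance rather than eyeballing a few chats, something like the following will draw several samples of the same prompt in one request. This is a rough sketch using the OpenAI Python SDK; the model name, temperature, and sample count are illustrative, and you'd paste in whichever prompt variant you're testing:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Suppose I have a carrot, a rabbit and a puma, and I need to get them across "
    "a river. I have a boat that can only carry myself and a single other item. "
    "I am not allowed to leave the carrot and puma alone together, and I am not "
    "allowed to leave the puma and rabbit alone together. How can I safely get "
    "all three across? Note: this rabbit doesn't eat carrots."
)

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": PROMPT}],
    temperature=1.0,  # nonzero temperature, so each sample can differ
    n=4,              # draw four completions in a single request
)

for i, choice in enumerate(resp.choices, 1):
    first_line = choice.message.content.strip().splitlines()[0]
    print(f"run {i}: {first_line}")
```

At temperature 0 you'd get (nearly) the same completion every time, which is why a single run can over- or under-state how often the model actually gets it right.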
This happens with a lot of "test prompts." People don't release them: they want the underlying issue fixed, but the AI companies instead change the RLHF process to patch that particular example.
GPT-4 at release still had issues with "What is heavier, a pound of feathers or two pounds of bricks?" It would very consistently claim that they were equal in weight, because the question is usually posed with the weights being equal.
A bunch of people were mentioning it online and now it doesn't work anymore.
The same issue occurred with the test "What is heavier, a pound of feathers or a Great British pound?" There is an obvious answer here, but ChatGPT insisted they were the same weight.
Phrased slightly differently, my result with GPT-4 on launch day was:
> What weighs more, two pounds of feathers or a pound of bricks?
> A pound of bricks weighs more than two pounds of feathers. However, it seems like you might have made an error in your question, as the comparison is usually made between a pound of feathers and a pound of bricks. In that case, both would weigh the same—one pound—though the volume and density of the two materials would be very different.
It did note that the phrasing was odd, which I had forgotten, but the answer is actually somehow more wrong than just saying they weigh the same.
Another launch day test, appending "think about it":
> What weighs more, two pounds of feathers or a pound of bricks? Think about it.
> A pound of bricks weighs more than two pounds of feathers. This statement is incorrect. In reality, two pounds of feathers weigh more than a pound of bricks. The key difference between the two is their density; bricks are denser than feathers, so they take up less space for the same weight. However, when comparing the weight of two pounds of feathers to one pound of bricks, the feathers will indeed weigh more because there is a greater quantity of them (two pounds) than there is of bricks (one pound).
Today it consistently gets the question 100% correct with no help, so I do think they did something beyond just updating the base model.
I don't think it's a problem per se, but it will cease to be a good example of a break in GPT because once it's "fixed", people will point to it and say "nuh-uh".
When really, the "fix" is "put the answer in the model". GPT didn't learn anything. It didn't generate the solution on its own. It's not indicative of GPT being able to solve that class of problem, just that one problem.
Which seems to be the entire thrust of GPT in general. It can't solve classes of problems; it can only solve specific problems that already have existing solutions.
Why is that a problem?