
Not OP, but here's an example of how GPT-4 can't deal with the goat/wolf/cabbage problem when things are switched up just a little.

https://amistrongeryet.substack.com/p/gpt-4-capabilities

Although it's interesting that if you use different nouns it does just fine: https://jbconsulting.substack.com/p/its-not-just-statistics-...
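(As an aside: the unmodified puzzle is small enough to solve by brute force. Here is a minimal breadth-first-search sketch in Python; the state encoding and item names are my own choices, not taken from the linked posts.)

    from collections import deque

    # Classic wolf/goat/cabbage river crossing, solved by breadth-first search.
    # State: (items still on the starting bank, which bank the farmer is on).
    ITEMS = frozenset({"wolf", "goat", "cabbage"})

    def unsafe(bank, farmer_present):
        # A bank is unsafe if the farmer is absent and wolf+goat
        # or goat+cabbage are left together.
        if farmer_present:
            return False
        return {"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank

    def solve():
        start = (ITEMS, "left")
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            (left, farmer), path = queue.popleft()
            if not left and farmer == "right":
                return path
            here = left if farmer == "left" else ITEMS - left
            # The farmer crosses alone (None) or with one item from his bank.
            for cargo in [None] + sorted(here):
                new_left = set(left)
                if cargo is not None:
                    (new_left.discard if farmer == "left" else new_left.add)(cargo)
                new_farmer = "right" if farmer == "left" else "left"
                new_left = frozenset(new_left)
                if unsafe(new_left, new_farmer == "left") or unsafe(ITEMS - new_left, new_farmer == "right"):
                    continue
                state = (new_left, new_farmer)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [(cargo or "nothing", new_farmer)]))

    for cargo, side in solve():
        print(f"Farmer takes {cargo} to the {side} bank")

It finds the usual seven-crossing solution; the linked posts are about what happens when the setup is tweaked slightly.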



I asked Bing a variant of the Wason selection task (a logic test/riddle). Instead of answering directly, it searched the web for "Wason selection task solution" (so it knew what the task was called; I didn't give it the name) and then based its answer on that search result. Except the task in the search result differed in its specifics (different colors), so it gave the wrong answer. It also insisted that its solution was right. Though maybe that's an issue with Microsoft's fine-tuning rather than with the base model itself.


I hadn't heard of that task, and it was interesting to see ChatGPT attempt the same problem. After a wrong answer, I gave it a leading question and received the following response.

> If you were to turn over the yellow card and find the number 7 on the other side, it would not disprove the statement "If a card has an odd number on one side, then the other side is purple." In fact, this discovery would not provide any evidence either for or against the statement.

> The statement specifically refers to cards with odd numbers on one side and their corresponding color on the other side. It does not make any claims about the colors of cards with even numbers. Therefore, even if the yellow card had an odd number like 7 on the other side, it would not contradict the statement.

It's interesting to see the model spell out exactly what would need to be found, spell out exactly what could be found, and then fail to make any connection between the two.
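(For what it's worth, the logic is easy to check mechanically. A tiny Python sketch, using the colors and numbers from the quoted response; the 7 behind the yellow card is exactly the counterexample the rule forbids.)

    # Rule under test: "If a card has an odd number on one side,
    # then the other side is purple."

    def violates(number, colour):
        # The rule is falsified exactly when an odd number is paired
        # with a non-purple back.
        return number % 2 == 1 and colour != "purple"

    # Turning over the yellow card and finding a 7 is precisely such a pairing:
    print(violates(7, "yellow"))   # True -> it does disprove the statement
    # whereas an even number behind the yellow card says nothing about the rule:
    print(violates(8, "yellow"))   # False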


Yes, it's very fascinating! The language is so clear, but the concepts are totally confused.

Does this mean real logical reasoning is very close, only some small improvements away, or does it mean we're just on the wrong track (to reach actual AGI)?


IMHO (and this is just my own uninformed view), this means that language models by themselves are insufficient for certain important tasks. It seems to be hard for such systems to learn deductive reasoning purely from text prediction.

OTOH, who knows what would happen if you somehow managed to combine the generative capabilities of a language model with a proper inference engine, e.g. Wolfram|Alpha. Maybe it would bring us significantly closer to AGI, but maybe that way is also a dead end, since it's not guaranteed that those systems would work well together.
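(A toy sketch of that "generator + checker" split, in Python: llm_propose() is a hypothetical stand-in for the language model, and SymPy stands in for the inference engine, since I don't want to assume any particular Wolfram|Alpha API.)

    import sympy as sp

    def llm_propose(question):
        # Stand-in: a real system would sample several candidates from the model.
        return ["x**2 + 1", "(x - 1)*(x + 1)", "x**2 - 2*x + 1"]

    def verified_answer(question, reference_expr):
        reference = sp.sympify(reference_expr)
        for candidate in llm_propose(question):
            # The symbolic engine, not the language model, decides correctness.
            if sp.simplify(sp.sympify(candidate) - reference) == 0:
                return candidate
        return None

    print(verified_answer("Factor x^2 - 1", "x**2 - 1"))  # "(x - 1)*(x + 1)"

The point of the pattern is that the language model only generates candidates, while acceptance is decided by a system that actually does deduction.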



