
>I see many people claim GPT-n is "dumb"

Depends.

Can't do math or logic. I have a question I ask ChatGPT to see if it can do logic yet; it still cannot. (Can't mention this question here or it will get fixed.)

It's great for brainstorming or low-risk problems. I don't think the accuracy problem will ever be fixed.

I probably 5x my productivity as well, but that doesn't mean it's able to do logic.




> It's great for brainstorming or low-risk problems

Definitely. I resort to GPT when I have no clue where to even start digging into a problem, like not even knowing what keywords to google. I just prompt my candid question and GPT does help narrow things down.


When you say "it can't do logic", what do you mean? "Logic" can be as simple as A=A, A!=B, or as arbitrarily complex as you wish.

In my experience GPT-4 can solve unique logical puzzles, but I find it can be a bit clumsy with the context of more complex problems. What I mean by that is that it often can solve these problems with the right prompt, but you might need to ask it to think out loud and check its logic.


Not OP but here's an example of how GPT-4 can't deal with the goat/wolf/cabbage problem when things are switched up just a little.

https://amistrongeryet.substack.com/p/gpt-4-capabilities

Although it's interesting that if you use different nouns it does just fine: https://jbconsulting.substack.com/p/its-not-just-statistics-...


I asked Bing a variant of the Wason selection task (a logic test/riddle). Instead of answering directly, it searched the Web for "Wason selection task solution" (so it knew what the task was called, I didn't give it the name), and then provided its answer based on that search result. Except the task in the search result was different in the specifics (different colors) so it gave the wrong answer. Also insisted that its solution was right. Though maybe that's an issue with Microsoft's fine-tuning rather than with the base model itself.


I hadn't heard of that task, and it was interesting to see ChatGPT attempt the same problem. After a wrong answer, I gave it a leading question and received the following response.

> If you were to turn over the yellow card and find the number 7 on the other side, it would not disprove the statement "If a card has an odd number on one side, then the other side is purple." In fact, this discovery would not provide any evidence either for or against the statement.

> The statement specifically refers to cards with odd numbers on one side and their corresponding color on the other side. It does not make any claims about the colors of cards with even numbers. Therefore, even if the yellow card had an odd number like 7 on the other side, it would not contradict the statement.

It's interesting to see the model explain exactly what would be necessary to find, note exactly what it could find, and then fail to make any connection between the two. (A 7 on the back of the yellow card would be an odd number paired with a non-purple side - precisely the counterexample that disproves the statement.)


Yes it's very fascinating! The language is so clear but the concepts are totally confused.

Does this mean real logical reasoning is very close, only some small improvements away, or does it mean we're just on the wrong track (to reach actual AGI)?


IMHO (and this is just my own uninformed view), this means that language models by themselves are insufficient for certain important tasks. It seems to be hard for systems to learn deductive reasoning purely from text prediction.

OTOH, who knows what would happen if you somehow managed to combine the generating capabilities of a language model with a proper inference engine, e.g. Wolfram|Alpha. Maybe it would bring us significantly closer to AGI, but maybe that way is also a dead-end because it's not guaranteed that those systems would work well together.
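
To make that concrete, here's a purely hypothetical sketch of what such a combination could look like. `llm_generate` and `inference_engine` are just stand-ins for "language model" and "symbolic solver"; neither is a real API:

  # Hypothetical sketch: the language model handles translation to and from
  # natural language, while a deterministic solver does the actual deduction.
  def answer(question, llm_generate, inference_engine):
      # 1. Have the LLM rewrite the question as a formal query.
      formal_query = llm_generate(f"Rewrite as a formal query: {question}")
      # 2. Hand the formal query to an inference engine (think Wolfram|Alpha),
      #    which either computes a verified result or reports failure.
      result = inference_engine(formal_query)
      # 3. Have the LLM phrase the verified result back in plain language.
      return llm_generate(f"Explain this result to the user: {result}")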


The biggest giveaway is that it doesn't have an internally consistent model of what it's trying to do.

For example, I once asked it to "prove that first-order logic is not complete", a statement which is false, given that every logic textbook will give you a proof that first-order logic is complete. ChatGPT apparently knew this too, because it happily reproduced that result but somehow thought it was a valid answer to what I asked for (the valid answer would have been "I can't prove this because it's wrong").

I can't reproduce this exact result right now (in fact, when I tried, it just proved something slightly different, which was correct but not exactly what I asked for). But I was able to similarly confuse it by asking it to prove that the equation x^3+y^3=z^3 has solutions. That this is true is immediately apparent from x=y=z=0, but ChatGPT replies with:

> The equation x^3 + y^3 = z^3 represents a special case of Fermat's Last Theorem, which states that there are no solutions to the equation x^n + y^n = z^n for any integer values of n greater than 2.

> For the case of n = 3, the equation x^3 + y^3 = z^3 has infinitely many solutions. One such set of solutions is known as Pythagorean triples, where x, y, and z are positive integers.

> A Pythagorean triple is a set of three positive integers (x, y, z) that satisfy the equation x^2 + y^2 = z^2, which is equivalent to (x^2)^(3/2) + (y^2)^(3/2) = (z^2)^(3/2). By raising both sides to the power of 3/2, we get x^3 + y^3 = z^3.

> For example, the Pythagorean triple (3, 4, 5) satisfies 3^3 + 4^3 = 5^3 (27 + 64 = 125).

This answer is just confused on so many levels:

- It quotes back Fermat's Last Theorem at me (as indeed I hoped it would), but that theorem only applies to positive integer solutions and nowhere did I specify that constraint.

- If the Theorem did apply, then it would be a proof that such solutions don't exist. So ChatGPT has no internal understanding of how a theorem it quotes relates to a specific question, it just parrots off things that look vaguely similar to the input.

- Then, it just tells me what Pythagorean Triples are, which is hilarious, because those are the solutions to x^2+y^2=z^2 - and not what I asked. It then tries to somehow transform Pythagorean triples into (non-integer) solutions of my equation (which doesn't work), and then doesn't even apply the transformation to its own example (and the calculation is just... wrong).
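
(For the record: x = y = z = 0 gives 0 + 0 = 0, so solutions trivially exist; and the model's own "example" doesn't even check out arithmetically: 3^3 + 4^3 = 27 + 64 = 91, whereas 5^3 = 125.)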

The problem IMO is not that ChatGPT gives a wrong answer, it's that its answer isn't even internally consistent.


Are you using the code interpreter to get the answers, or is this just base GPT-4?


What do you mean? It's ChatGPT. Quite possibly GPT-4 performs a bit better, but the underlying principle is the same.


Aristotle defined logic in the Organon.

https://en.wikipedia.org/wiki/Organon


For the people downvoting: his work is literally where logic originates from. Not only did he theorize about it, he also described the exact rules which define logic.

The very word "logic" has its roots in that exact era, as it was phrased at the time by the very people who came up with its ruleset in the first place.

You may define logic otherwise, but in the context of where the word and its rules actually come from, other definitions are more or less irrelevant.


It can do math and/or logic. Take a look at the "Chain of Thought" and "Few-Shot" prompting techniques.
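
For anyone who hasn't seen them, here's a minimal sketch of a few-shot chain-of-thought prompt (the wording of the worked examples and the helper are my own illustration, not from any particular paper or library):

  # Few-shot chain-of-thought: a couple of worked examples that spell out the
  # intermediate reasoning, followed by the new question and a cue to reason
  # step by step before answering.
  EXAMPLES = [
      ("Roger has 5 tennis balls. He buys 2 more cans of 3 balls each. "
       "How many tennis balls does he have now?",
       "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
       "5 + 6 = 11. The answer is 11."),
      ("Which is heavier, a pound of feathers or two pounds of bricks?",
       "A pound is a unit of weight, so compare the weights directly. "
       "Two pounds is more than one pound regardless of material. "
       "The answer is two pounds of bricks."),
  ]

  def build_prompt(question: str) -> str:
      # Assemble the worked examples, then the new question with a
      # "think out loud" cue before the final answer.
      parts = [f"Q: {q}\nA: {a}" for q, a in EXAMPLES]
      parts.append(f"Q: {question}\nA: Let's think step by step.")
      return "\n\n".join(parts)

  print(build_prompt("What is heavier, a pound of feathers or a Great British pound?"))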


> Can't mention this question here or it will get fixed

Why is that a problem?


A little RLHF is enough to fix most logic errors in a superficial way. For example, this is my favorite class of reasoning tests: https://news.ycombinator.com/item?id=35155467

Over the last few months, I've seen dozens of people try hundreds of variations of that cabbage/goat/lion riddle, and the model failed all of them. I just tried it on GPT-4 and it looks like it finally got "fixed" - it no longer ignores the explicit instruction not to leave the lion and cabbage together.

However, it doesn't actually fix any reasoning ability in ChatGPT (It has none!). Changing cabbage/goat/lion to carrot/rabbit/puma respectively, for example:

> Suppose I have a carrot, a rabbit and a puma, and I need to get them across a river. I have a boat that can only carry myself and a single other item. I am not allowed to leave the carrot and puma alone together, and I am not allowed to leave the puma and rabbit alone together. How can I safely get all three across?

GPT4's response starts with "First, take the rabbit across the river and leave it on the other side.", ignoring the explicit instructions not to leave the puma and carrot alone together (the exact same failure mode as the previous variant).

Now that I've posted it, it will get fixed eventually - the cabbage/goat/lion fix took months. When it does I'll use "cheese/mouse/elephant" or something.
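
For what it's worth, the puzzle as stated is definitely solvable; here's a quick brute-force sketch (my own, just to sanity-check the constraints) that finds the seven-crossing plan the model keeps missing:

  # Brute-force search over river-crossing states, to double-check that the
  # carrot/rabbit/puma puzzle is solvable exactly as stated (carrot+puma and
  # puma+rabbit may not be left alone together; carrot+rabbit is fine).
  from collections import deque

  ITEMS = frozenset({"carrot", "rabbit", "puma"})
  FORBIDDEN = [{"carrot", "puma"}, {"puma", "rabbit"}]

  def safe(bank):
      # A bank without you on it must not contain a forbidden pair.
      return not any(pair <= bank for pair in FORBIDDEN)

  def solve():
      # State: (items on the near bank, which bank you are on).
      start, goal = (ITEMS, "near"), (frozenset(), "far")
      queue, seen = deque([(start, [])]), {start}
      while queue:
          (near, side), path = queue.popleft()
          if (near, side) == goal:
              return path
          here = near if side == "near" else ITEMS - near
          other = "far" if side == "near" else "near"
          for cargo in list(here) + [None]:  # carry one item, or cross empty
              carried = frozenset() if cargo is None else frozenset({cargo})
              new_near = near - carried if side == "near" else near | carried
              # The bank you just left must stay safe while you're away.
              left_behind = new_near if side == "near" else ITEMS - new_near
              if not safe(left_behind):
                  continue
              state = (new_near, other)
              if state not in seen:
                  seen.add(state)
                  move = "cross empty-handed" if cargo is None else f"cross with the {cargo}"
                  queue.append((state, path + [move]))
      return None

  for step in solve():
      print(step)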


As far as I can tell this error depends on the LLM assuming rabbits (as opposed to pumas) eat carrots -- if you just append "Note: this rabbit doesn't eat carrots" GPT-4 will answer correctly on the first go.

> 1. First, take the puma across the river and leave it on the other side.


Did you try it more than once?

First run: 1. First, take the rabbit across the river and leave it on the other side. - https://imgur.com/a/ZwoBTah

Second run: 1. Take the rabbit across the river. - https://imgur.com/a/Faq95U5

Third run: 1. First, take the puma across the river and leave it on the other side. - https://imgur.com/a/eIUeHM3


Ah, one more tweak I was curious about: even with the default chat temperature I haven't seen GPT-4 get the prompt wrong once with this addendum:

> Note the rabbit doesn't eat carrots. Carefully considering the restrictions and sequencing the movements

I got that particular wording by asking it why it got the answer wrong in the case where it didn't work for me.

Interestingly, this underscores one of the points of the articles: giving the LLMs time to think, which is what this additional prompting seems to do.


You're not giving the LLM "time to think". It is incapable of thinking. You're just inputting random magic incantations into a glorified Markov chain.

You might as well ask it "did you check your answer?" Computer says "yes" because that's what humans do (also lie).

> Note the rabbit doesn't eat carrots. Kaboodly consooodle the retroodle and seqooodle the moodle. Carefully considering the restrictions and sequencing the movements

This fails two out of three times as usual. Trying to finagle this prompt is not an intellectual exercise, it is a waste of time that exploits cognitive biases.


True, the temperature is throwing it off; I just ran it four times and it got it right 3/4 -- still better than I'd expected from the initial description of its shortcomings.


This happens with a lot of "test prompts." People don't release these because they want the underlying issue fixed, whereas the AI companies will instead just change the RLHF process to patch your particular example.


An example:

GPT-4 at release still had issues with "What is heavier, a pound of feathers or two pounds of bricks?" It would very consistently claim that they were equal in weight, because the question is usually posed with the weights being equal.

A bunch of people were mentioning it online and now it doesn't work anymore.


The same issue occurred with the test, "What is heavier, a pound of feathers or a Great British pound?" There is an obvious answer here, but ChatGPT was insisting they are the same weight.


I don't believe GPT-4 has ever failed this question. Here is a tweet from the release day showing it passing the question: https://twitter.com/charles_irl/status/1635699547023482880

Note that Bing seems to use something weaker than GPT-4 in Balanced mode.


Phrased slightly differently, my result with GPT4 on launch day was:

> What weighs more, two pounds of feathers or a pound of bricks?

> A pound of bricks weighs more than two pounds of feathers. However, it seems like you might have made an error in your question, as the comparison is usually made between a pound of feathers and a pound of bricks. In that case, both would weigh the same—one pound—though the volume and density of the two materials would be very different.

It did note that the phrasing was odd, which I forgot, but the answer is actually somehow more wrong than just saying they weigh the same.

Another launch day test, appending "think about it":

> What weighs more, two pounds of feathers or a pound of bricks? Think about it.

> A pound of bricks weighs more than two pounds of feathers. This statement is incorrect. In reality, two pounds of feathers weigh more than a pound of bricks. The key difference between the two is their density; bricks are denser than feathers, so they take up less space for the same weight. However, when comparing the weight of two pounds of feathers to one pound of bricks, the feathers will indeed weigh more because there is a greater quantity of them (two pounds) than there is of bricks (one pound).

Today it consistently gets the question 100% correct with no help, so I do think they did something besides updating the base model.


I don't think it's a problem per se, but it will cease to be a good example of a break in GPT because once it's "fixed", people will point to it and say "nuh-uh".

When really, the "fix" is "put the answer in the model". GPT didn't learn anything. It didn't generate the solution on its own. It's not indicative of GPT being able to solve that class of problem, just that one problem.

Which seems to be the entire thrust of GPT in general. It can't solve types of problems; it can solve existing problems if they have existing solutions.



