> GPT-3.5 gave me a right-ish answer of 24.848 liters, but it did not realize the last lap needs to be completed once the leader finishes. GPT-4 gave me 28-29 liters as the answer, recognizing that a partial lap needs to be added due to race rules, and that it's good to have 1-2 liters of safety buffer.
I don't believe that for a second. If that's the answer it gave, it's cherry-picked and lucky. There are many examples where GPT-4 fails spectacularly at much simpler reasoning tasks.
I still think ChatGPT is amazing, but we shouldn't pretend it's something it isn't. I wouldn't trust GPT-4 to tell me how much fuel I should put in my car. Would you?
This seems needlessly flippant and dismissive, especially when you could just crack open ChatGPT to verify, assuming you have Plus or API access. I just did, and ChatGPT gave me a well-reasoned explanation that factored in the extra details about racing that the other commenters noted.
>There are many examples where GPT-4 fails spectacularly at much simpler reasoning tasks.
I posit the conversation would be more productive if you shared some of those examples, so we could all compare them to the rather impressive one the top comment shared.
>I wouldn't trust GPT-4 to tell me how much fuel I should put in my car. Would you?
Not if I were trying to win a race, but I can see how this particular example is a useful way to gauge how an LLM handles a task that looks at first like a simple math problem but requires some deeper insight to answer correctly.
> Not if I were trying to win a race, but I can see how this particular example is a useful way to gauge how an LLM handles a task that looks at first like a simple math problem but requires some deeper insight to answer correctly.
It's not just testing reasoning, though, it's also testing fairly niche knowledge. I think a better test of pure reasoning would include all the rules and tips like "it's good to have some buffer" in the prompt.
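To make the naive-vs-rule-aware gap concrete, here is a sketch of both calculations. Every number in it is made up for illustration; the thread never states the actual race parameters behind the 24.848 L and 28-29 L answers.

```python
import math

# Hypothetical race parameters -- none of these come from the thread.
race_minutes = 30      # length of the timed race
lap_seconds = 90       # average lap time
fuel_per_lap = 1.2     # litres burned per lap
buffer_litres = 1.5    # safety margin on top of the calculated need

# Naive reading: fuel for exactly the laps that fit in the race clock.
laps_naive = (race_minutes * 60) / lap_seconds
fuel_naive = laps_naive * fuel_per_lap

# Rule-aware reading: when the leader finishes, you still have to
# complete the lap you are on, so budget one extra lap plus a buffer.
laps_rule_aware = math.floor(laps_naive) + 1
fuel_rule_aware = laps_rule_aware * fuel_per_lap + buffer_litres
```

With these assumed numbers the naive estimate comes out a couple of litres short of the rule-aware one, which is the same shape of discrepancy the top comment describes between the two models' answers.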
At least debunk the example before you start talking about the shortcomings. Right now your comment feels misplaced as a reply to an example where the model actually shows a great deal of complex reasoning.
> GPT-3.5 gave me a right-ish answer of 24.848 liters, but it did not realize the last lap needs to be completed once the leader finishes. GPT-4 gave me 28-29 liters as the answer, recognizing that a partial lap needs to be added due to race rules, and that it's good to have 1-2 liters of safety buffer.
[0]: https://news.ycombinator.com/item?id=35893130