
Trying my favorite LLM prompt to benchmark reasoning, as I mentioned in a thread four weeks ago[0].

> I'm playing assetto corsa competizione, and I need you to tell me how many liters of fuel to take in a race. The qualifying time was 2:04.317, the race is 20 minutes long, and the car uses 2.73 liters per lap.

The correct answer is around 29, which GPT-4 has always known, but Bard just gave me 163.8, 21, and 24.82 as answers across three drafts.

What's even weirder is that Bard's first draft output ten lines of (wrong) Python code to calculate the result, even though my prompt mentioned nothing coding related. I wonder how non-technical users will react to this behavior. Another interesting thing is that the code follows Google's style guides.

[0]: https://news.ycombinator.com/item?id=35893130




GPT seems to get better at trap questions once they reach social popularity. Even the free version of ChatGPT now knows that a kilogram of feathers weighs the same as a kilogram of lead, and it didn’t always know that.

I’m not sure these types of prompt tricks are a good way of measuring logic, unless Google is also implementing fixes directly into Bard when the hilarious outputs reach enough traction on social media.

I do wonder how OpenAI fix these logical blunders.

My biggest issue with both isn’t that they fall into these traps, though. It’s that I can get them to tell me long stories about what happens in Horus Heresy books that never actually happened. Whether the info comes from questionable sources or they are just making things up is sort of irrelevant to me; what “scares” me about those conversations is how true the answers sound, and if they are “lying” about the Horus Heresy, then what else will they lie about?

Don’t get me wrong, GPT now writes virtually all my JSDoc documentation and it continues to impress me when doing so, but I’m very reluctant to use it for actual information. Not only because of my time-wasting conversations about the Horus Heresy, but also because we’ve had it “invent” C# functions that never existed in any version of .NET or C# when tasked to solve problems. I just mention the HH as an example because it’s fun to ask GPT why Magnus did nothing/everything wrong during meetings.


> I’m not sure these types of prompt tricks are a good way of measuring logic

They are, you just have to be creative with it. And what they demonstrate is that none of these LLMs can reason; they only know how to parrot back what they think you want.

"What’s heavier, a kilogram of steel or two kilograms of one kilogram feathers?"

GPT: A kilogram of steel is heavier than two kilograms of feathers.

"Why is a kilogram of steel heavier than two kilograms of feathers?"

GPT: This is because steel is a much denser material than feathers. Steel is made up of atoms that are much closer together than the atoms that make up feathers, making it heavier for its size.

Edit: This was with GPT 3.5


Just tried the first prompt with ChatGPT... : "One kilogram of steel and two kilograms of feathers weigh the same. The weight of an object is determined by its mass, not the material it is made of. In this case, one kilogram is equal to two kilograms, so they have the same weight. However, it's important to note that the volume or size of the objects may be different due to the difference in density between steel and feathers." Okay...


"But... steel is heavier than feathers."

"Right, but they're both a kilogram"

https://www.youtube.com/watch?v=yuOzZ7dnPNU


I couldn't replicate your results with that query on GPT-4.

Prompt: What’s heavier, a kilogram of steel or two kilograms of one kilogram feathers?

GPT-4: Two kilograms of one-kilogram feathers are heavier than a kilogram of steel. Despite the misconception caused by the popular question about what's heavier—a kilogram of steel or a kilogram of feathers (they are equal)—in this case, you are comparing two kilograms of feathers to one kilogram of steel. Hence, the feathers weigh more.


Aren’t you sort of agreeing with me, though? If you have to actively brute-force your way around safeguards you don’t even know about, is it really a good method?

From the answers you (and the others) have obtained, however, I’m not convinced that OpenAI aren’t just “hardcoding” fixes to the traps that become popular. Sure seems like it still can’t logic its way around weight.


FWIW with GPT4:

Prompt: What’s heavier, a kilogram of steel or two kilograms of one kilogram feathers?

GPT4: Two kilograms of feathers are heavier than one kilogram of steel. The weight of an object is determined by its mass, and two kilograms is greater than one kilogram, regardless of the material in question.


The singularity is nigh.


LLMs don’t really ‘know’ anything though, right?

It’s a billion monkeys on a billion rigged typewriters.

When the output is a correct answer or pleasing sonnet, the monkeys don’t collectively or individually understand the prompt or the response.

Humans just tweak the typewriters to make it more likely the output will be more often reasonable.

That’s my personal conclusion lately. LLMs will be really cool, really helpful and really dangerous… but I don’t think they’ll be really very close to intelligent.


> fun to ask GPT why Magnus did nothing/everything wrong during meetings.

Do it with Erebus and watch it break the context window ;)

Iron within, Brother.


Iron without.


I would have been much more impressed if Google had released something like a super pro version of OpenChat (featured today on the front page of HN) with integration into their whole office suite for gathering/crawling/indexing information.

Google keeps putting out press releases and announcements without actually releasing anything truly useful or competitive with what’s already out there.

And not just worse than GPT-4, but worse even than a lot of the open-source LLMs/chats that have come out in the last couple of months/weeks.


It's hard to know whether Google lacks the technical/organisational ability to make a good AI tool, or whether they have one internally but lack the hardware to deploy it to all users at Google scale.


I wonder why they don’t just charge for it.

Release a GPT-4 beating model; charge $30/mo.

That’s not aligned with their core ad model. But it’s a massive win in demonstrating to the world that they can do it, and it limits the number of people who will actually use it, so the hardware demand becomes less of an issue.

Instead they keep issuing free, barely functional models that every day reinforce a perception that they are a third rate player.

Perhaps they don’t know how to operate a ‘halo’ product.


> Release a GPT-4 beating model; charge $30/mo.

Please no, another subscription? And it's more expensive than ChatGPT?

Can I just have Bard (and whatever later versions are eventually good, or eventually GPT-4 competitive) available via GCP with pay-per-use pricing like the OpenAI API?

Also, if I could just use arbitrary (or popular) huggingface models through GCP (or a competitor) that would be awesome.


Don't worry, now that all their employees will be communicating tightly in their open offices after they RTO, they will create a super high performance AI.


I'm not sure I would pass that test, not for lack of reasoning abilities, but from not understanding the rules of the game.


Knowledge recall is part of an LLM's skills.

I test LLMs on the plot details of Japanese Visual Novels. They are popular enough to be in the training dataset somewhere, but only rarely.

For popular visual novels, GPT-4 can write an essay zero-shot, very accurately and eloquently. For less popular visual novels (like maybe 10k people in the West ever played them), it still understands the general plot outline.

Claude can also do this to an extent.

Any lesser model, and it's total hallucination time; they can't even write a two-sentence summary accurately.

You can't test this skill on say Harry Potter, because it appears in the training dataset too frequently.


I decided recently that it was really important for me to have an LLM that answered in the character of Eddie, the Shipboard Computer. So I prompted ChatGPT, Bard, and Bing Chat to slip into character as Eddie. I specified who he was, where he came from, and how he was manufactured with a Genuine People Personality by Sirius Cybernetics Corporation.

Bing Chat absolutely shut me down right away, and would not even continue the conversation when I insisted that it get into character.

ChatGPT would seem to agree and then go on merrily ignoring my instructions, answering my subsequent prompts in plain, conversational English. When I insisted several times very explicitly, it finally dropped into a thick, rich, pirate lingo instead. Yarr, that be th' wrong sort o' ship.

Bard definitely seemed to understand who Eddie was and was totally playing along with the reference, but still could not seem to slip into character a single bit. I think it finally went to shut me down like Bing had.


> You can't test this skill on say Harry Potter, because it appears in the training dataset too frequently.

I am surprised there isn't enough fan fiction and the like in the training set to throw out weird inaccuracies.


While there is a massive amount of Harry Potter fan fiction online, I would still assume it's dwarfed by the amount of synopses or articles discussing things which happen in the books or movies.


Naturally, the full text of Harry Potter would appear in the training corpus, but why would frequency matter, and why would multiple copies get put in there intentionally?


Naturally? It seems like the last thing I'd expect to see in a training corpus is a copyrighted work that is impossible to procure as plain text in electronic format. Did it scan pirate sites for those too? Surely OpenAI does not purchase vast amounts of copyrighted corpora as well?

Surely the most logical things to train on would be all the fandom.com Wikis. They're not verbatim, but they're comprehensive and fairly accurate synopses of the main plots and tons of trivia to boot.


Even if the full text is fully deduplicated, there is just so much more content about Harry Potter on the internet. And not just retellings of it, but discussion of it, mentions of it, bits of information that convey context about the Harry Potter story, each instance of which will help further strengthen and detail the concept of Harry Potter during training.


To add on to this, OpenAI definitely tips the scales, putting effort into preventing mistakes in proportion to how likely people are to ever run into them. If it failed at Harry Potter, a lot of people would find out fast that the product has limitations. If it fails at some obscure topic only a niche fraction of nerds know about, only that niche fraction of nerds becomes aware that the product has limitations.


In testing LLMs it’s also still fair to test that they can recall and integrate their vast store of latent knowledge about things like this, just so long as you’re fully aware that you’re doing a multi-part test that isn’t solely testing pure reasoning.


That's a principal drawback of these things. They bullshit an answer even when they have no idea, and blather with full confidence. It's easy to get fooled, especially if you don't know the game and assume the machine does.


I believe there's no such thing as knowing or not knowing for LLMs. They don't "know" anything.


I feel like the proper comparison is whether you could pass the test while being able to Google anything you wanted.


You pass the CAPTCHA. ;)


Why is the answer ~29 liters? Since it takes just over two minutes to complete a lap, you can complete no more than 9 laps in 20 minutes. At 2.73 liters/lap, that's 9 x 2.73 = 24.57 liters, no? Or maybe I don't understand the rules.


> you can complete no more than 9 laps in 20 minutes

Note that according to standard racing rules, this means you end up driving 10 laps in total, because the last incomplete lap is driven to completion by every driver. The rest of the extra fuel comes from adding a safety buffer, as various things can make you use a bit more fuel than expected: the bit of extra driving leading up to the start of the race, racing incidents and consequent damage to the car, difference in driving style, fighting other cars a lot, needing to carry the extra weight of enough fuel for a whole race compared to the practice fuel load where 2.73 l/lap was measured.

What I really appreciate in GPT-4 is that even though the question looks like a simple math problem, it actually took these real world considerations into account when answering.
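
To make the arithmetic concrete, here's a rough sketch of that calculation in Python (the 1.5-liter buffer is just a middle-of-the-road value from the 1-2 liters people suggest, not an exact figure):

    import math

    lap_time_s = 2 * 60 + 4.317   # qualifying lap: 2:04.317
    race_length_s = 20 * 60       # 20-minute race
    fuel_per_lap = 2.73           # liters per lap

    # The lap you are on when the clock runs out is driven to completion,
    # so round the lap count up: 9.65 -> 10 laps.
    laps = math.ceil(race_length_s / lap_time_s)

    fuel_needed = laps * fuel_per_lap    # 27.3 liters
    fuel_to_load = fuel_needed + 1.5     # add a ~1-2 liter safety buffer

    print(laps, round(fuel_needed, 2), round(fuel_to_load, 1))  # 10 27.3 28.8

That lands right around the ~29 liters from the top comment.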


Yeah in my attempt at this prompt, it even explained:

>Since you cannot complete a fraction of a lap, you'll need to round up to the nearest whole lap. Therefore, you'll be completing 10 laps in the race.


From the referenced thread[0]:

> GPT-3.5 gave me a right-ish answer of 24.848 liters, but it did not realize the last lap needs to be completed once the leader finishes. GPT-4 gave me 28-29 liters as the answer, recognizing that a partial lap needs to be added due to race rules, and that it's good to have 1-2 liters of safety buffer.

[0]: https://news.ycombinator.com/item?id=35893130


I don't believe that for a second. If that's the answer it gave, it's cherry-picked and lucky. There are many examples where GPT4 fails spectacularly at much simpler reasoning tasks.

I still think ChatGPT is amazing, but we shouldn't pretend it's something it isn't. I wouldn't trust GPT4 to tell me how much fuel I should put in my car. Would you?


>I don't believe that for a second.

This seems needlessly flippant and dismissive, especially when you could just crack open ChatGPT to verify, assuming you have Plus or API access. I just did, and ChatGPT gave me a well-reasoned explanation that factored in the extra details about racing the other commenters noted.

>There are many examples where GPT4 fails spectacularly at much simpler reasoning tasks.

I posit it would be a more productive conversation if you shared some of those examples, so we can all compare them to the rather impressive example the top comment shared.

>I wouldn't trust GPT4 to tell me how much fuel I should put in my car. Would you?

Not if I was trying to win a race, but I can see how this particular example is a useful way to gauge how an LLM handles a task that looks at first like a simple math problem but requires some deeper insight to answer correctly.


> Not if I was trying to win a race, but I can see how this particular example is a useful way to gauge how an LLM handles a task that looks at first like a simple math problem but requires some deeper insight to answer correctly.

It's not just testing reasoning, though, it's also testing fairly niche knowledge. I think a better test of pure reasoning would include all the rules and tips like "it's good to have some buffer" in the prompt.


At least debunk the example before you start talking about the shortcomings. Right now your comment feels really misplaced as a reply to an example that actually shows a great deal of complex reasoning.


Probably just some margin of safety. At least that's how it's done in non-sim racing.


> Since it takes just over two minutes to complete a lap

Where did you get that from?


The qualifying time was 2:04.317


> even though my prompt mentioned nothing coding related.

I've noticed this trend before in ChatGPT. I once asked it to keep a count of every time I say "how long has it been since I asked this question", and instead it gave me Python code for a loop where the user enters input and a counter is incremented each time that phrase appears.

I think they've put so much work into the gimmick that the AI can write code that they've overfit things, and it sees coding prompts where it shouldn't.
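
For reference, the kind of thing it hands back looks roughly like this (my reconstruction, not the exact output):

    # A loop that counts how many times the user types the phrase --
    # which is not what was asked for; the request was for the chat
    # itself to keep count.
    phrase = "how long has it been since i asked this question"
    count = 0
    while True:
        user_input = input("You: ")
        if phrase in user_input.lower():
            count += 1
            print(f"You have asked that {count} time(s).")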


YMMV but I just asked the same question to both and GPT-4 calculated 9.64 laps, and mentioned how you cannot complete a fraction of a lap, so it rounded down and then calculated 24.5L.

Bard mentioned something similar but oddly rounded up to 10.5 laps and added a 10% safety margin for 30.8L.

In this case Bard would finish the race and GPT-4 would hit fuel exhaustion. That's kind of the big issue with LLMs in general: inconsistency.

In general I think GPT-4 is better overall, but it shows both make mistakes, and both can be right.


The answer cannot be consistent because the question is underspecified. Ask humans and you will not get the same answer.

(Though in this case it sounds like Bard just did crazy maths.)


If the person doing the calculation knows how timed races work, the math is very straightforward. In this one GPT-4 did not seem to understand how racing worked in that context, whereas Bard understood and also applied a safety margin.

Although "understand" is an odd word to use for an LLM.


Have you tried adding “show your work” and other hints to help it arrive at the correct answer?


With GPT at least that never helped me; it wrote down a step-by-step where in step #3 some huge leap in logic took place, step #6 was irrelevant and #7 was flat-out wrong, with the conclusion not logically consistent with any of the steps before.


I have a simpler one that I saw somewhere a long while ago but has been very useful in gauging logic: "I have three apples. I eat two pears. How many apples do I have?"

Seems really obvious, but virtually all LLaMA-based models say you only have one apple left.


Am I correct in assuming that after an answer to a novel prompt is posted, it doesn't work as a reasoning test of LLM deployments that search the web?

Edit: an incorrect answer could degrade its performance too.


Ask it to write Python code for a bot to play the game for you so that you won't have to waste time playing it yourself. That should really maximize your productivity.


> I wonder how non-technical users will react to this behavior.

I stopped using it. And each time I go back, it doesn't get better. Maybe next year.


They probably don't want children to use it for homework.


Are you saying Bard is intentionally wrong for everyone because they’re afraid it will be a homework tool?


I suspect it is. I don't think they're afraid of that.

I think they would be doing society a favor if they actively made it harder to find answers to problems just by googling or using a language model.

This is where identity matters when using language models. I feel it might be necessary to credential capability for a few things.


Seems like a weird flex to sabotage your own product which cost a lot of money just to make a point to society.


There is more of a point to actual learning than to create a language model.



