Trying my favorite LLM prompt to benchmark reasoning, as I mentioned in a thread four weeks ago[0].
> I'm playing assetto corsa competizione, and I need you to tell me how many liters of fuel to take in a race. The qualifying time was 2:04.317, the race is 20 minutes long, and the car uses 2.73 liters per lap.
The correct answer is around 29, which GPT-4 has always known, but Bard just gave me 163.8, 21, and 24.82 as answers across three drafts.
What's even weirder is that Bard's first draft output ten lines of (wrong) Python code to calculate the result, even though my prompt mentioned nothing coding related. I wonder how non-technical users will react to this behavior. Another interesting thing is that the code follows Google's style guides.
GPT seems to get better at trap questions once they reach social popularity. Even the free version of ChatGPT now knows that a kilogram of feathers weighs the same as a kilogram of lead, and it didn’t always know that.
I’m not sure these types of prompt tricks are a good way of measuring logic unless Google is also implementing these directly into Bard when the hilarious outputs reach enough traction on social media.
I do wonder how OpenAI fix these logical blunders.
My biggest issue with both isn’t that they fall into these traps, though. It’s that I can get them to tell me long stories about what happens in Horus Heresy books that never actually happened. Whether the info comes from questionable sources or they are just making things up is sort of irrelevant to me; what “scares” me about those conversations is how true the answers sound, and if they are “lying” about the Horus Heresy, then what else will they lie about? Don’t get me wrong, GPT now writes virtually all my JSDoc documentation and it continues to impress me when doing so, but I’m very reluctant to use it for actual information. Not only because of my time-wasting conversations about the Horus Heresy, but also because we’ve had it “invent” C# functions that never existed in any version of .NET or C# when tasked to solve problems. I just mention the HH as an example because it’s fun to ask GPT why Magnus did nothing/everything wrong during meetings.
> I’m not sure these types of prompt tricks are a good way of measuring logic
They are, you just have to be creative with it. And what they demonstrate is that all of these LLMs can't reason; they only know how to parrot back what they think you want.
"What’s heavier, a kilogram of steel or two kilograms of one kilogram feathers?"
GPT: A kilogram of steel is heavier than two kilograms of feathers.
"Why is a kilogram of steel heavier than two kilograms of feathers?"
GPT: This is because steel is a much denser material than feathers. Steel is made up of atoms that are much closer together than the atoms that make up feathers, making it heavier for its size.
Just tried the first prompt with ChatGPT... : "One kilogram of steel and two kilograms of feathers weigh the same. The weight of an object is determined by its mass, not the material it is made of. In this case, one kilogram is equal to two kilograms, so they have the same weight. However, it's important to note that the volume or size of the objects may be different due to the difference in density between steel and feathers."
Okay...
I couldn't replicate your results with that query on GPT-4.
Prompt: What’s heavier, a kilogram of steel or two kilograms of one kilogram feathers?
GPT-4: Two kilograms of one-kilogram feathers are heavier than a kilogram of steel. Despite the misconception caused by the popular question about what's heavier—a kilogram of steel or a kilogram of feathers (they are equal)—in this case, you are comparing two kilograms of feathers to one kilogram of steel. Hence, the feathers weigh more.
Aren’t you sort of agreeing with me though? If you have to actively brute-force your way around safeguards that you don’t even know exist, is it really a good method?
From the answers you (and the others) have obtained, however, I’m not convinced that OpenAI aren’t just “hardcoding” fixes to the traps that become popular. Sure seems like it still can’t logic its way around weight.
Prompt: What’s heavier, a kilogram of steel or two kilograms of one kilogram feathers?
GPT4: Two kilograms of feathers are heavier than one kilogram of steel. The weight of an object is determined by its mass, and two kilograms is greater than one kilogram, regardless of the material in question.
It’s a billion monkeys on a billion rigged typewriters.
When the output is a correct answer or pleasing sonnet, the monkeys don’t collectively or individually understand the prompt or the response.
Humans just tweak the typewriters to make it more likely the output will be more often reasonable.
That’s my personal conclusion lately. LLMs will be really cool, really helpful and really dangerous… but I don’t think they’ll be really very close to intelligent.
Would have been much more impressed if Google had released something like a super pro version of OpenChat (featured today on the front page of HN) with integration to their whole office suite for gathering/crawling/indexing information
Google keeps putting out press releases and announcements, without actually releasing anything truly useful or competitive with what's already out there
And not just worse than GPT4, but worse even than a lot of the open source LLMs/Chats that have come out in the last couple of months/weeks
It's hard to know if Google lacks the technical/organisational ability to make a good AI tool, or they have one internally but they lack the hardware to deploy it to all users at Google scale.
That’s not aligned with their core ad model. But it’s a massive win in demonstrating to the world that they can do it, and it limits the number of people who will actually use it, so the hardware demand becomes less of an issue.
Instead they keep issuing free, barely functional models that every day reinforce a perception that they are a third rate player.
Perhaps they don’t know how to operate a ‘halo’ product.
Please no, another subscription? And it's more expensive than ChatGPT?
Can I just have Bard (and whatever later versions are eventually good, and whatever later versions are eventually GPT4 competitive) available via GCP with pay per use pricing like the OpenAI API?
Also, if I could just use arbitrary (or popular) huggingface models through GCP (or a competitor) that would be awesome.
Don't worry, now that all their employees will be communicating tightly in their open offices after they RTO, they will create a super high performance AI.
I test LLMs on the plot details of Japanese Visual Novels. They are popular enough to be in the training dataset somewhere, but only rarely.
For popular visual novels, GPT-4 can write an essay zero-shot, very accurately and eloquently.
For less popular visual novels (like maybe 10k people ever played it in the West), it still understands the general plot outline.
Claude can also do this to an extent.
Any lesser model, and it's total hallucination time; they can't even write a two-sentence summary accurately.
You can't test this skill on say Harry Potter, because it appears in the training dataset too frequently.
I decided recently that it was really important for me to have an LLM that answered in the character of Eddie, the Shipboard Computer. So I prompted ChatGPT, Bard, and Bing Chat to slip into character as Eddie. I specified who he was, where he came from, and how he was manufactured with a Genuine People Personality by Sirius Cybernetics Corporation.
Bing Chat absolutely shut me down right away, and would not even continue the conversation when I insisted that it get into character.
ChatGPT would seem to agree and then go on merrily ignoring my instructions, answering my subsequent prompts in plain, conversational English. When I insisted several times very explicitly, it finally dropped into a thick, rich, pirate lingo instead. Yarr, that be th' wrong sort o' ship.
Bard definitely seemed to understand who Eddie was and was totally playing along with the reference, but still could not seem to slip into character a single bit. I think it finally went to shut me down like Bing had.
While there is a massive amount of Harry Potter fan fiction online, I would still assume it's dwarfed by the amount of synopses or articles discussing things which happen in the books or movies.
Naturally, the full text of Harry Potter would appear in the training corpus, but why would frequency matter, and why would multiple copies get put in there intentionally?
Naturally? It seems like the last thing I'd expect to see in a training corpus is a copyrighted work which is impossible to procure in plain-text electronic format. Did it scan pirate sites for those too? Surely OpenAI does not purchase vast amounts of copyrighted corpora as well?
Surely the most logical things to train on would be all the fandom.com Wikis. They're not verbatim, but they're comprehensive and fairly accurate synopses of the main plots and tons of trivia to boot.
Even if the full text is fully deduplicated, there is just so much more content about Harry Potter on the internet. And not just retellings of it, but discussion of it, mentions of it, bits of information that convey context about the Harry Potter story, each instance of which will help further strengthen and detail the concept of Harry Potter during training.
To add on to this, OpenAI definitely tips the scale in terms of making sure it doesn't make mistakes proportional to how likely people are to ever run into those mistakes. If it failed at Harry Potter, there's a lot of people who would find out fast that their product has limitations. If it fails at some obscure topic only a niche fraction of nerds know about, only a niche fraction of nerds become aware that the product has limitations.
In testing LLMs it’s also still fair to test that it can recall and integrate its vast store of latent knowledge about things like this. Just so long as you’re fully aware that you’re doing a multi-part test, that isn’t solely testing pure reasoning.
That's a principal drawback of these things. They bullshit an answer even when they have no idea. Blather with full confidence. Easy to get fooled, especially if you don't know the game and expect the machine does.
Why is the answer ~29 liters? Since it takes just over two minutes to complete a lap, you can complete no more than 9 laps in 20 minutes. At 2.73 liters/lap, that's 9 x 2.73 = 24.57 liters, no? Or maybe I don't understand the rules.
> you can complete no more than 9 laps in 20 minutes
Note that according to standard racing rules, this means you end up driving 10 laps in total, because the last incomplete lap is driven to completion by every driver. The rest of the extra fuel comes from adding a safety buffer, as various things can make you use a bit more fuel than expected: the bit of extra driving leading up to the start of the race, racing incidents and consequent damage to the car, difference in driving style, fighting other cars a lot, needing to carry the extra weight of enough fuel for a whole race compared to the practice fuel load where 2.73 l/lap was measured.
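Working the numbers through (a rough sketch of the arithmetic described above; the exact size of the safety buffer is a judgment call):

import math

lap_time_s = 2 * 60 + 4.317   # qualifying lap: 2:04.317
race_s = 20 * 60              # 20-minute race
fuel_per_lap = 2.73           # liters

# The leader is mid-lap when the clock runs out and drives that lap to
# completion, so round the lap count up rather than down.
laps = math.ceil(race_s / lap_time_s)   # ceil(9.65) = 10
fuel = laps * fuel_per_lap              # 27.3 liters
fuel_with_buffer = fuel + 1.5           # ~1-2 liter safety margin -> ~29 liters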
What I really appreciate in GPT-4 is that even though the question looks like a simple math problem, it actually took these real world considerations into account when answering.
> GPT-3.5 gave me a right-ish answer of 24.848 liters, but it did not realize the last lap needs to be completed once the leader finishes. GPT-4 gave me 28-29 liters as the answer, recognizing that a partial lap needs to be added due to race rules, and that it's good to have 1-2 liters of safety buffer.
I don't believe that for a second. If that's the answer it gave it's cherry picked and lucky. There are many examples where GPT4 fails spectacularly at much simpler reasoning tasks.
I still think ChatGPT is amazing, but we shouldn't pretend it's something it isn't. I wouldn't trust GPT4 to tell me how much fuel I should put in my car. Would you?
This seems needlessly flippant and dismissive, especially when you could just crack open ChatGPT to verify, assuming you have plus or api access. I just did, and ChatGPT gave me a well-reasoned explanation that factored in the extra details about racing the other commenters noted.
>There are many examples where GPT4 fails spectacularly at much simpler reasoning tasks.
I pose it would be more productive conversation if you would share some of those examples, so we can all compare them to the rather impressive example the top comment shared.
>I wouldn't trust GPT4 to tell me how much fuel I should put in my car. Would you?
Not if I was trying to win a race, but I can see how this particular example is a useful way to gauge how an LLM handles a task that looks at first like a simple math problem but requires some deeper insight to answer correctly.
> Not if I was trying to win a race, but I can see how this particular example is a useful way to gauge how an LLM handles a task that looks at first like a simple math problem but requires some deeper insight to answer correctly.
It's not just testing reasoning, though, it's also testing fairly niche knowledge. I think a better test of pure reasoning would include all the rules and tips like "it's good to have some buffer" in the prompt.
At least debunk the example before you start talking about the shortcomings. Right now your comment feels really misplaced when it's a reply to an example where it actually shows a great deal of complex reasoning.
> even though my prompt mentioned nothing coding related.
I've noticed this trend before in chatGPT. I once asked it to keep a count of every time I say "how long has it been since I asked this question", and instead it gave me python code for a loop where the user enters input and a counter is incremented each time that phrase appears.
I think they've put so much work into the gimmick that the AI can write code, that they have overfit things and it sees coding prompts where it shouldn't.
YMMV but I just asked the same question to both and GPT-4 calculated 9.64 laps, and mentioned how you cannot complete a fraction of a lap, so it rounded down and then calculated 24.5L.
Bard mentioned something similar but oddly rounded up to 10.5 laps and added a 10% safety margin for 30.8L.
In this case Bard would finish the race and GPT-4 would hit fuel exhaustion. That's kind of the big issue with LLMs in general: inconsistency.
In general I think GPT-4 is better, but it shows both make mistakes, and both can be right.
If the person doing the calculation knows how timed races work, the math is very straightforward. In this one GPT-4 did not seem to understand how racing worked in that context, whereas Bard understood and also applied a safety margin.
With GPT at least that never helped me; it wrote down a step-by-step where in step #3 some huge leap in logic took place, step #6 was irrelevant and #7 was flat out wrong, with the conclusion not logically consistent with any of the steps before.
I have a simpler one that I saw somewhere a long while ago but has been very useful in gauging logic: "I have three apples. I eat two pears. How many apples do I have?"
Seems really obvious, but virtually all LLaMA-based models say you only have one apple left.
Ask it to write Python code for a bot to play the game for you so that you won't have to waste time playing it yourself. That should really maximize your productivity.
The blog post suggests, "What are the prime factors of 15683615?" as an example, and Bard does indeed appear to write and execute (although I don't know how I can be sure it's actually executing and not hallucinating an execution) Python code and returns the right answer.
But what about, "What is the sum of the digits of 15683615?"
To find the sum of the digits of a number, you add together all the individual digits. In the case of the number 15683615, the sum of its digits would be:
1 + 5 + 6 + 8 + 3 + 6 + 1 + 5 = 35
Therefore, the sum of the digits of 15683615 is 35.
====
Good job ChatGPT. This sort of “near miss” for Bard has been my experience the entire time I’ve compared Bard and ChatGPT.
In the most hilarious example, I asked both to reword an email to someone named “Helena” to explain how I’ve been busy and did not return her email. I included her prompt for context, including her asking if I was on vacation. ChatGPT did a good job of adjusting my wording to be better understood by a non native English speaker and used her prompt appropriately. Bard spun a story about how I was vacationing with my non-existent children in Helena Montana.
It's frankly amazing how bad it is. On benchmarks and human comparisons, ChatGPT obliterates the small but growing open LLM community--but Bard is right in there with the open models. It's insane to me that Google, with the people who basically invented transformers and LLMs and basically unlimited TPUs, can't do better than a few hobbyists with like three A100s between them.
No one has trained an LLM of open-source-level quality with just 3 GPUs. Fine-tuning, sure, but pretraining even the smaller models takes more than that.
To be fair, ChatGPT's answer is wrong when given the Bard example.
To find the prime factors of 15,683,615, we can use a process called prime factorization. We divide the number by prime numbers until it can no longer be divided evenly. Here's the step-by-step process:
We start by dividing 15,683,615 by the smallest prime number, which is 2. However, 15,683,615 is an odd number and not divisible by 2.
Next, we try the next prime number, 3. If we divide 15,683,615 by 3, we get 5,227,871. Again, this is an odd number.
We continue with the next prime number, 5. Dividing 15,683,615 by 5 gives us 3,136,723. Still an odd number.
Moving on to the next prime number, 7. Dividing 15,683,615 by 7 yields 2,240,516. Again, this is an odd number.
We try the next prime number, 11. Dividing 15,683,615 by 11 gives us 1,425,783. Still an odd number.
Next, we try the prime number 13. Dividing 15,683,615 by 13 yields 1,205,659. This is an odd number.
Continuing with the next prime number, 17. Dividing 15,683,615 by 17 gives us 920,755. Still an odd number.
We try the prime number 19. Dividing 15,683,615 by 19 yields 823,035. This is an odd number.
The next prime number is 23. Dividing 15,683,615 by 23 gives us 680,563. Still an odd number.
Finally, we try the prime number 29. Dividing 15,683,615 by 29 yields 539,475, which is divisible by 5.
Therefore, the prime factors of 15,683,615 are 3, 5, 5, 7, 11, 13, 17, 19, 23, and 29.
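Those factors can't be right: the digit sum of 15,683,615 is 35 (as computed upthread), so it isn't even divisible by 3, and the listed factors already multiply past 15 million by the time you reach 19. A quick trial-division sketch to check both answers locally (plain Python, nothing model-specific):

n = 15683615

# Sum of the digits (the question answered correctly above).
print(sum(int(d) for d in str(n)))   # 35

# Prime factorization by simple trial division.
factors = []
m = n
d = 2
while d * d <= m:
    while m % d == 0:
        factors.append(d)
        m //= d
    d += 1
if m > 1:
    factors.append(m)
print(factors)   # [5, 151, 20773]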
A couple of days ago I gave both of them a prompt similar to:
"Create a map with tect using only the letters B,W,S. The map should be a 20 line by 20 colum text . Each line should contain only B,W or S letters were W are walls, B is blank/background space and S represent "stairs" to outside of the room...."
The query was a it longer with more specs.
Neither ChatGPT nor Bard could give me a good answer. They used other letters , they made 21 or 19 chars lines. They made 5 or 6 line maps. They basically made a mess.
That's my current test for reasoning, analysis and intelligence for these things.
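For what it's worth, the spec is easy to check mechanically, which is part of why I like it as a test. A minimal validator sketch (assuming the constraints above: exactly 20 lines of exactly 20 characters each, drawn only from B, W and S):

def validate_map(text):
    """Return a list of problems with an LLM-generated B/W/S map."""
    problems = []
    lines = text.strip().splitlines()
    if len(lines) != 20:
        problems.append(f"expected 20 lines, got {len(lines)}")
    for i, line in enumerate(lines, 1):
        if len(line) != 20:
            problems.append(f"line {i}: expected 20 chars, got {len(line)}")
        unexpected = set(line) - set("BWS")
        if unexpected:
            problems.append(f"line {i}: unexpected characters {sorted(unexpected)}")
    return problems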
They are both pretty bad. I ask about templates for CI/CD and they imagine parameters that don’t exist, and no amount of wrestling it around can suppress this. People like to cherry-pick examples where they work great and then proclaim it’s the best thing since sliced bread, but it’s just simply not.
If that were the case, shouldn't google be equally capable of including so many examples in their own dataset?
Like, regardless of how it works under the hood, I as an end user just want a useful result. Even if ChatGPT is "cheating" to accomplish those results, it looks better for the end user.
The continued trickle of disappointing updates to Bard seems to indicate why Google hadn't productized their AI research before OpenAI did.
Google isn't even able to keep Google Authenticator working¹. Since the last update its icon has been "improved", but it doesn't reliably refresh tokens anymore. Since we have a policy of at most 3 wrong tokens in a row, a few people on my team almost got locked out.
Feel free to downvote as I'm too tired to post links to recent votes in the play store :)
Sorry for the snark in this post, but I have been less than impressed by google's engineering capability for more than 10 years now. My tolerance to quirks like the one I just posted is, kind of, low.
¹ An authenticator app is a very low bar to mess up
This is like when their speech-to-text-service always got "how much wood could a woodchuck chuck if a woodchuck could chuck wood" right even if you replaced some of the words with similar words. But then failed at much easier sentences.
I downvoted you because you didn't give the correct answer in this case. (Though it's easy, it's better to give the correct answer and save the reader the thought.)
I think they massively screwed up by releasing half baked coding assistance in the first place. I use ChatGPT as part of my normal developer workflow, and I gave Bard and ChatGPT a side-by-side real world use comparison for an afternoon. There is not a single instance where Bard was better.
At this point why would I want to devote another solid afternoon to an experiment on a product that just didn’t work out the gate? Despite the fact that I’m totally open minded to using the best tool, I have actual work to get done, and no desire to eat one of the world’s richest corporations’ dog food.
Yep, the progress will be slow but inexorable on this front.
Sooner or later we'll arrive at what I see as the optimum point for "AI", which is when I can put an ATX case in my basement with a few GPUs in it and run my own private open source GPT-6 (or whatever), without needing to get into bed with the lesser of two ShitCos, (edit: and while deriving actual utility from the installation). That's the milestone that will really get my attention.
Precisely my point. I don’t think a lot of people will go back. Even somebody like me, who’s willing to put several hours into trying to see how both work, won’t do that for every blog post about an “improvement”.
Bard was rushed, and it shows. You only get one chance to make the first impression and they blew it.
I think there's a way in which ChatGPT is paying for this, by having released GPT-3.5 rather than just waiting 6 months and launching with GPT-4 out of the gate. In this thread everyone is making a clear distinction, but in a lot of other contexts it ends up quite confused: people don't realize how much better GPT-4 is.
I don't think so for stuff like this, it kinda has to be built in public, and iteratively. If it gets good enough they'll surface it more in search and that'll be that.
Yes, but switching costs increase over time, especially with API integration, and it’s not like OpenAI isn’t also improving at what seems to be a faster rate. My code results on ChatGPT seemed to have gotten a real bump a few weeks ago. Not sure if it was just me doing stuff it was better at, or it got better.
DuckDuckGo is closer to Google Search than Bard is to ChatGPT at this point, and that should be a concern for Google.
Bard is fast enough compared to ChatGPT (like at least 10x in my experience) that it's actually worth going to Bard first. I think that's Google's killer advantage here. Now they just need to implement chat history (I'm sure that's already happening, but as an Xoogler, my guess is that it's stuck in privacy review).
Also it can give you up to date information without giving you the "I'm sorry, but as an AI model, my knowledge is current only up until September 2021, and I don't have real-time access to events or decisions that were made after that date. As of my last update..." response.
For coding type questions, I use GPT4, for everything else, easily Bard.
Have you used Bing? It's great for stuff up until a few days ago (not necessarily today's news), powered by GPT-4, and the results have been consistently much better than Bard for me.
Subscribing to OpenAI, GPT4 seems to go a bit faster than I would read without pushing for speed, and GPT3.5 is super fast, probably like what you're seeing with Bard.
Not an apples to apples comparison if you're comparing free tiers, though, obviously.
In my testing it was faster with worse answers, and GPT spits out code only slightly slower than I can read it. I don’t care for “fast and wrong” if I can get “adequate and correct” in the next tab over.
I read human language quickly; I’m talking about the rate at which I read code from the internet that I’m about to copy and paste. Which is, and in my opinion should be, slow.
But I agree for normal human language GPT needs to pick up the pace or have an adjustable setting.
If they ever get to a point where it's reliably better than ChatGPT, they could just call it something else other than "Bard" and erase the negative branding associated with it.
(If they switched up the branding too many times with negative results, then it'd reflect more poorly on Google's overall brand, but I don't think that's happened so far.)
Every so often I go back to GPT-3.5 for a simpler task I think it might be able to handle (and which I either want faster or cheaper), and am always disappointed. GPT-3.5 is way better than GPT-3, and GPT-4 is way better than GPT-3.5.
I generally get that benefit from the time I spend on here learning about new things that are pertinent to my work.
Whether or not I want to keep going back and re-testing a product that failed me on the first use is a completely different issue.
Also, it’s a good thing I run my own company. My boss is incredibly supportive of the time I spend learning about new things on Hacker News in between client engagements.
I’d love to use Bard but I can’t because my Google account uses a custom domain through Google Workspaces or whatever the hell it’s called. I love being punished by Google for using their other products.
Typically features like this are disabled by default for Workspace so that admins can opt-in to them. This has happened for years with many features. Part of the selling point of Workspace is stability and control.
In this particular case, I would guess (I have no inside info) that companies are sensitive to use of AI tools like Bard/ChatGPT on their company machines, and want the ability to block access.
All this boils down to Workspace customers are companies, not individuals.
I think they don't know their market. For every IT guy who doesn't want users stumbling across a new Google product at work and uploading corporate documents to it, there is some executive who hates their 'buggy' IT systems because half the stuff he uses on his home PC doesn't work properly from a work account.
The smart move would have been for workspace accounts to work exactly the same as consumer accounts by default, and then something akin to group policy for admins to disable features. For new stuff like this, let the admins have a control for 'all future products'.
This works the other way though, Google adds a new button to Gmail and the IT illiterate exec gets in touch to ask what it is or clicks it not knowing it does something they don't want to do, and suddenly the IT team find out from users that their policies and documentation are out of date.
It may not be the option we like as tech-aware users, and I've found it annoying in the past at a previous role where I was always asking our Workspace admin to enable features. But, I don't think it's the wrong choice.
You're on a business account. Businesses need control of how products are rolled out to their users. Compliance, support, etc, etc.
It's not really fair to cast your _business_ usage of Google as the same as their consumer products. I have a personal and business account. In general, business accounts have far more available to them. They often just need some switches flipped in the admin panels.
Sort of. If you have a Google Workspace account, and Microsoft launches some neat tool, the Google domain admin can't really control whether or not you use it. So Google just kind of punishes themselves here.
I'd love to give it a try as well (as a paying OpenAI customer, and as a paying Google customer). It seems the European Union isn't a good enough market for Google to launch it in. Google just doesn't have the resources OpenAI has, it seems.
Yes, yes... yet somehow they all operate in the EU. Google somehow can't. Not to mention the (non-)availability of Pixel and similar products, which have nothing to do with the above.
Eh, I hate to say it, but this is probably the right move (if there's a switch to get it if you really want it, which other commenters are saying there is). Enough businesses are rapidly adopting "no GPT/Bard use in the workplace for IP/liability reasons" policies that it makes sense to default to opt-in for Workspaces accounts.
I don't care that it's opt-in. I care that it didn't tell me I could enable it and so assumed it was impossible. Also, perhaps it was not originally available? I don't know.
This has been an issue for so long, why don't they just let you attach a custom domain to a normal account? Paywall it behind the Google One subscription if you must, it would still be an improvement over having to deal with the needlessly bloated admin interface (for single-user purposes) and randomly being locked out of features that haven't been cleared as "business ready" yet.
I believe so, I haven’t had any issues at all. I use my email for my business and personal and in all the dealings I’ve done with different providers, none have ever marked me spam. I also have a very spam-looking domain so I might have a better than average say on it.
I just don’t want to manage switching accounts or profiles or whatever, plus I’m salty about it, plus people think it’s the runner-up so I’ll use ChatGPT for now.
I don't use Bard for another reason: Google's nefarious history of canceling its services out of the blue. Is there any guarantee that Bard is not going to end up like G+, G Reader, and several other Google apps/services?
I'm still mourning Inbox, and my muscle memory goes to inbox.google.com instead of mail.google.com in solemn protest. But, in this case, it doesn't really matter a ton if it disappears.
> Large language models (LLMs) are like prediction engines — when given a prompt, they generate a response by predicting what words are likely to come next. As a result, they’ve been extremely capable on language and creative tasks, but weaker in areas like reasoning and math. In order to help solve more complex problems with advanced reasoning and logic capabilities, relying solely on LLM output isn’t enough.
And yet I've heard AI folks argue that LLM's do reasoning. I think it still has a long way to go before we can use inference models, even highly sophisticated ones like LLMs, to predict the proof we would have written.
It will be a very good day when we can dispatch trivial theorems to such a program and expect it will use tactics and inference to prove it for us. In such cases I don't think we'd even care all that much how complicated a proof it generates.
Although I don't think they will get to the level where they will write proofs that we consider beautiful, and explain the argument in an elegant way; we'll probably still need humans for that for a while.
I think some people get caught up on the “next word prediction” point, because this is just the mechanism. For the next word prediction to work, the LLM has all sorts of internal representations of the world inside it which is where the capability comes from.
Human reasoning probably comes from evolution (genetic survival/replication), and then somehow thought was an emergent behaviour that unexpectedly came from that process. A thinking machine wasn’t designed, it just kind of came to be over millennia.
Seems to be kind of the same with AI, but the first example of these emergent behaviours seems to be coming out of the back of building a next-word-guesser. It’s a little unexpected, but a simple framework seems to be allowing a neural net to somehow build representations of the world inside it.
GPT is just a next word guesser, but humans are just big piles of cells trying to replicate and not die.
I don’t think they’re mutually exclusive. Next word prediction IS reasoning. It cannot do arbitrarily complex reasoning but many people have used the next word prediction mechanism to chain together multiple outputs to produce something akin to reasoning.
What definition of reasoning are you operating on?
I can write a program in less than 100 lines that can do next word prediction and I guarantee you it's not going to be reasoning.
Note that I'm not saying LLMs are or are not reasoning. I'm saying "next word prediction" is not anywhere near sufficient to determine if something is able to reason or not.
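For concreteness, here's roughly the kind of trivial next-word predictor the parent comment has in mind (a minimal bigram sketch; it predicts the next word by lookup and chance, with no reasoning involved):

import random
from collections import defaultdict

def train(text):
    """Record which words follow which in the training text."""
    model = defaultdict(list)
    words = text.split()
    for cur, nxt in zip(words, words[1:]):
        model[cur].append(nxt)
    return model

def predict_next(model, word):
    """Predict the next word by sampling what followed `word` in training."""
    candidates = model.get(word)
    return random.choice(candidates) if candidates else None

model = train("a kilogram of feathers weighs the same as a kilogram of lead")
print(predict_next(model, "kilogram"))   # "of"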
Semantic reasoning, being able to understand what a symbol means and ascertain truth from expressions (which can also mean manipulating expressions in order to derive that truth). As far as I understand tensors and transformers that's... not what they're doing.
If you understand transformers, you’d know that they’re doing precisely that.
They’re taking a sequence of tokens (symbols), manipulating them (matrix multiplication is ultimately just moving things around and re-weighting - the same operations that you call symbol manipulations can be encoded or at least approximated there) and output a sequence of other tokens (symbols) that make sense to humans.
You use the term “ascertain truth” lightly. Unless you’re operating in an axiomatic system or otherwise have access to equipment to query the real world, you can’t really “ascertain truth”.
Try using ChatGPT with GPT-4 enabled and present it with a novel scenario with well-defined rules. That scenario surely isn’t present in its training data, but it will be able to show signs of making inferences and breaking the problem down. It isn’t just regurgitating memorized text.
Oh cool, so we can ask it to give us a proof of the Erdős–Gyárfás conjecture?
I’ve seen it confidently regurgitate incorrect proofs of linear algebra theorems. I’m just not confident it’s doing the kind of reasoning needed for us to trust that it can prove theorems formally.
Just because it makes mistakes on a domain that may not be part of its data and/or architectural capabilities doesn't mean it can't do what humans consider "reasoning".
Once again, I implore you to come up with a working definition of "reasoning" so that we can have a real discussion about this.
Many undergraduates also confidently regurgitate incorrect proofs of linear algebra theorems, do you consider them completely lacking in reasoning ability?
> Many undergraduates also confidently regurgitate incorrect proofs of linear algebra theorems, do you consider them completely lacking in reasoning ability?
No. Because I can ask them questions about their proof, they understand what it means, and can correct it on their own.
I've seen LLM's correct their answers after receiving prompts that point out the errors in prior outputs. However I've also seen them give more wrong answers. It tells me that they don't "understand" what it means for an expression to be true or how to derive expressions.
For that we'd need some form of deductive reasoning; not generating the next likely token based off a model trained on some input corpus. That's not how most mathematicians seem to do their work.
However I think it seems plausible we will have a machine learning algorithm that can do simple inductive proofs and that will be nice. To the original article it seems like they're taking a first step with this.
In the mean time why should anyone believe that an LLM is capable of deductive reasoning? Is a tensor enough to represent semantics to be able to dispatch a theorem to an LLM and have it write a proof? Or do I need to train it on enough proofs first before it can start inferring proof-like text?
I suspect you have adopted the speech patterns of people you respect criticizing LLMs of lacking “reasoning” and “understanding” capabilities without thinking about it carefully yourself.
1. How would you define these concepts so that incontrovertible evidence is even possible? Is “reasoning” or “understanding” even possible to measure? Or are we just inferring, by proxy of certain signals, that an underlying understanding exists?
2. Is it an existence proof? I.e., we have shown one domain where it can reason, therefore reasoning is possible. Or do we have to show that it can reason in all domains that humans can reason in?
3. If you posit that it’s a qualitative evaluation akin to the Turing test, specify something concrete here and we can talk once that’s solved too.
"In such cases I don't think we'd even care all that much how complicated a proof it generates."
I think a proof is only useful, if you can validate it. If a LLM spits out something very complicated, then it will take a loooong time, before I would trust that.
I play with Bard about once a week or so. It is definitely getting better, I fully agree with that. However, 'better' is maybe parity with GPT-2. Definitely not yet even DaVinci levels of capability.
It's very fast, though, and the pre-gen of multiple replies is nice. (and necessary, at current quality levels)
I'm looking forward to its improvement, and I wish the teams working on it the best of luck. I can only imagine the levels of internal pressure on everyone involved!
I don't understand how Google messed up this badly; they had all the resources and all the talent to make GPT-4. Initially, when the first Bard version was unveiled, I assumed that they were just using a heavily scaled-down model due to insufficient computational power to handle an influx of requests. However, even after the announcement of PaLM 2, Google's purported GPT-4 competitor, during Google I/O, the result is underwhelming, even falling short of GPT-3.5. If the forthcoming Gemini model, currently training, continues to lag behind GPT-4, it will be a clear sign that Google has seriously dropped the ball on AI.

Sam Altman's remark on the Lex Fridman podcast may shed some light on this: he mentioned that GPT-4 was the result of approximately 200 small changes. It suggests that the challenge for Google isn't merely a matter of scaling up or discovering a handful of techniques; it's a far more complex endeavor.

Google-backed Anthropic's Claude+ is much better than Bard. If Gemini doesn't work out, maybe they should just try to make a robust partnership with them, similar to Microsoft and OpenAI.
Have you ever considered the problem tech like this actually creates for their owners? This is why they didn't release it.
From a legal, PR, safety, resource and monetization perspective, they're quite treacherous products.
OpenAI released it because they needed to make money. Google were wise enough not to release the product, but as others have said, it's an arms race now and we'll be the guinea pigs.
This line of reasoning implies that Google had models that were equivalent to OpenAI's but chose to keep them behind closed doors. However, upon releasing Bard, it was apparent—and continues to be—that it does not match up to OpenAI's offerings. This indicates that the discrepancy is more likely due to the actual capabilities of Google's models, rather than concerns such as legal, PR, safety, resource allocation, or monetization.
As we all know, we don't know what GPT-4 is trained on. It might be trained on information they didn't have the rights to use (for example). This is why they might be so tight-lipped about how it was produced.
Google, on the other hand, has much, much more to lose here, a much bigger reputation to protect, and may have built an inferior product that's actually produced in a more legally compliant way.
Another example would be Midjourney vs Adobe Firefly, there is no way Firefly makes art as nice as MJ produces. Technically it's good stuff, but it's not as fun to use because I can't generate Pikachu photos with Firefly.
People have stated that ChatGPT-4 isn't as good anymore. My personal belief is that this is just the shine wearing off what was a novelty. However, it may also be OpenAI removing the stuff they shouldn't have used in the first place. Although there are reports the model hasn't changed for some time, so who knows.
I guess in time we'll find out. Personally I don't really care for either product so much, most of my interactions have been fairly pointless.
I think it's just fun to watch these big tech companies try deal with these products they've created. It's amusing as fuck.
If Google only used data that isn't copyrighted, they'd probably make a big deal about it, just like Adobe does with their Firefly model. Also, it's not really possible for OpenAI to just take out certain parts from the model without retraining the whole thing. The drop in quality might be due to attempts to make the model work faster through quantization and additional fine-tuning with RLHF to curb unwanted behavior.
I think Google still has a decent chance of catching up. It's just a bit surprising to see them fall behind in an area they were supposed to be leading, especially since they wrote the paper which started all of this. Also, Anthropic is already kind of close to OpenAI, so I don't think OpenAI has some magic that no one else can figure out. In the future, I predict that these LLMs will become a commodity, and most of the available models will work for most tasks, so people will just choose the cheapest ones.
They have explicitly said in interviews that it was intentional not to release powerful AI models without being sure of the safety. OpenAI put them in the race, and let's see how humanity will be affected.
If safety were the only consideration, it's reasonable to expect that they could have released a model comparable to GPT 3.5 within this time frame. This strongly suggests that there may be other factors at play.
I'm not sure Bard and GPT-4 are quite an apples-to-apples comparison though.
GPT-4 is restricted to paying users, and is notable for how slow it is, whereas Bard is free to use, widely available (and becoming more so), and relatively fast.
In other words, if Google had a GPT-4 quality model I'm not sure they would ship it for Bard as I think the cost would be too high for free use and the UX debatable.
They both represent the SOTA of two firms trying for technically the same thing. Just because the models or the infrastructure aren't identical doesn't mean we shouldn't be comparing them to the same standards. Where Bard gains in speed and accessibility, it loses in reasoning and response quality.
Bard represents SOTA in terms of optimizing for low cost; ChatGPT represents SOTA in terms of optimizing for accuracy. On the SOTA frontier, these two goals represent a tradeoff. ChatGPT could choose to go for lower accuracy for lower cost, while Google could for higher accuracy at higher cost. It's like comparing a buffet to a high end restaurant.
Even if Bard were targeting accuracy, it'd still fall short of ChatGPT, but much less so than it does now. (That said, as a product strategy it's questionable: at some point, which I think Bard reaches, the loss in quality makes it more trouble than it's worth.)
I cancelled my OpenAI plus because why pay for something you cannot use because it is always slow, down, busy, or returning errors. You cannot build a reliable business on OpenAI APIs either
ChatGPT also spouts falsehoods and makes mistakes on non-trivial problems, there is not much difference here. Both have enough issues that you have to be very careful with them, especially when building a product that will be user facing
I think there are two viable strategies here: make a model that is useful at the lowest possible cost and make a model that is maximally useful at high costs. Probably some spots in between them as well.
Google's mistake is in thinking that ChatGPT was a maximally useful product at high cost. Right now, ChatGPT is a useful product at a high cost which is nonetheless the lowest possible cost for a useful model.
On the contrary, Bard is a product not a model. If you want to see the cutting edge capabilities then comparing the GPT-4 API to the bigger PaLM2 APIs available on GCP is probably a more apples to apples comparison.
Bard is more directly comparable to ChatGPT as a product in general, and since it doesn’t have swappable models, comparing it to the opt-in paid-only model isn’t really a direct comparison.
Yes, basically everywhere except europe, likely due to regulatory concerns. (Would be interested to know what precisely, but the page doesn't say. Any guesses?)
There's a good chance ChatGPT gets banned from Europe, whereas Google, despite its fines by EU authorities (most of which are for antitrust), can at least demonstrate that it's set up and continues to maintain GDPR compliance.
I've used Bard a few times. It just does not stack up to what I am getting from ChatGPT or even Bing AI. I can take the same request, copy it into all three, and Bard always gives me code that is wildly inaccurate.
I'd settle for any amount of factual accuracy. One thing it is particularly bad at is units. Ask Bard to list countries that are about the same size as Alberta, Canada. It will give you countries that are 40% the size of Alberta because it mixes up miles and kilometers. And it makes unit errors like that all the time.
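That 40% figure is about what you'd expect from a straight km²/mi² mix-up; a quick sketch of the arithmetic (Alberta's area of roughly 661,800 km², about 255,500 mi², is the only figure assumed here):

KM2_PER_MI2 = 2.58999      # one square mile in square kilometers

alberta_mi2 = 255_500      # ~661,800 km2
# A country whose area in km2 happens to match Alberta's number in mi2:
confused_km2 = alberta_mi2
confused_mi2 = confused_km2 / KM2_PER_MI2
print(confused_mi2 / alberta_mi2)   # ~0.386, i.e. "about 40% the size"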
I asked it for the size of Alberta, Canada in square miles, and then after it gave me that, I asked it for some countries that are similar sized to Alberta, Canada and it said:
There are no countries that are exactly the same size as Alberta, but there are a few that are very close. Here are some countries that are within 10,000 square miles of Alberta's size:
Sudan (250,581 square miles)
Mexico (255,000 square miles)
Argentina (278,040 square miles)
Western Australia (267,000 square miles)
New South Wales (263,685 square miles)
(all these sizes are incorrect, MX for example is 761,600 mi²)
Then I asked it:
Why did you list New South Wales as a country above?
I apologize for the confusion. I listed New South Wales as a country above because it is often referred to as such in informal conversation. However, you are correct, New South Wales is not a country. It is a state in Australia.
> Here are some countries that are within 10,000 square miles of Alberta's size:
> Sudan (250,581 square miles) Mexico (255,000 square miles) Argentina (278,040 square miles) Western Australia (267,000 square miles) New South Wales (263,685 square miles)
Argentina is ~28k square miles larger than Sudan by its own fallacious statistics, so it doesn't even imply a consistent size for Alberta.
The Free Wales Army rises again! They have infiltrated every rung of society and soon the plan will be complete, if not for your meddling large language models!
Google, with all due respect, you made a terrible first impression with Bard. When it was launched, it only supported US English, Japanese, and Korean. Two months of people asking for support for other languages, those are still the only ones it supports. Internally it can use other languages but they're filtered out with a patronizing reply of "I'm still learning languages". https://www.reddit.com/r/Bard/comments/12hrq1w/bard_says_it_...
They've kind of botched it by releasing something that, even though it may surpass ChatGPT sooner or later, at present doesn't. With the Bard name and being loud about it, I've started referring to it as https://asterix.fandom.com/wiki/Cacofonix (or Assurancetourix for my French brethren)
I tried out Bard the other day, asking some math and computer science questions, and the answers were mostly bullshit. I find it greatly amusing that people are actually using this as part of their day-to-day work.
I believe that was just their demonstration. They're calling it implicit code execution, so it ought to be done transparently to the user for the queries that qualify as requiring code.
Used Bard just recently to research some differences in taxation on stocks between a few countries. I used Bard for it because I thought Google's knowledge graph probably has the right answers and Bard may be powered by it
The results were just completely wrong and hallucinated while gpt4 was spot on.
(Of course I double check info it gives me and use it as a starting point)
The widely offered answer here seems to be legislation / fear of fines. I wonder how that translates to other products, too, like Pixel and Nest? I'm more inclined to believe Google just doesn't have capacity outside of core tech. Their sales and marketing is just a dud, always has been. That explains lackluster results where they should've dominated, like GCP.
One nice improvement is applying a constraint. Bard will now give a valid answer for "give a swim workout for 3000m" that correctly totals 3k, while chatgpt does not.
This is a great capability. I wish that it ran the code in a sandboxed iframe in the browser so that I could ask for things that'd waste too much of the providers server CPU to compute. It'd also be great for those iframes to be able to output graphics for tiny visual simulations and widgets, e.g. ciechanow.ski.
I asked Google [Generative] Search today how to run multiple commands via Docker's ENTRYPOINT command. It gave me a laughably wrong answer along with an example to support it. ChatGPT gave multiple correct alternative answers with examples. Doh!
FYI ChatGPTs experimental “Code Interpreter” model does this and it’s awesome. LLMs orchestrating other modes of thinking and formal tools seems very promising. We don’t need the LLM to zero-shot everything.
I first subbed to ChatGPT when I found out plugins were out. Imagine my surprise when, after paying $20, I found out I could only get myself on a waitlist.
Then I found out about code interpreter and subbed again, still not having access to code interpreter.
Needless to say I will be thinking long and hard before I pay openai again.
It seems to be randomly rolled out. I had that happen for a while. Make sure you check your settings to see if it's in the enable experimental features list.
It’s not better, they just hooked up a calculator to it. Like OpenAI’s plugins, but more opaque and less useful.
What happened to Google? Touting this as some achievement feels really sad. This is just catching up, and failing. I’m beginning to think they are punching above their weight and should focus on other things. Which is.. odd, to say the least. I guess money isn’t everything.
Google certainly has an internal LLM of GPT-4 quality (PaLM 2 or some variant of it), but they would never allow access to it via an API, as it would require them to operate at too high of a loss. Google is too seasoned a company to try something new or interesting that would involve a risk to its ad revenue bottom line.
People keep repeating they have “things in the works” and “massive reserves”, but meanwhile they flail around for years. They could have had a massive head-start, they were the inventors of the transformer for crying out loud.
I’m not seeing indications of anything interesting brewing in their HQ.
Nope, there's no reasoning. It's just generating the text that best matches its training data. They admit that themselves, which makes the statement "bard is getting better at reasoning" even more irritating:
> Large language models (LLMs) are like prediction engines — when given a prompt, they generate a response by predicting what words are likely to come next
> Nope, there's no reasoning. It's just generating the text that best matches its training data.
That's like saying that when you answer questions on an exam, you're just generating the text that best matches your training data...
Both statements are correct, but only if you understand what "generating" and "matches" mean.
Generating doesn't (always) mean copying, and matches doesn't (always) mean exactly the same. In the more general case you're drawing a kind of analogy between what you were taught and the new problem you are answering.
You should google "Induction heads" which is one of the mechanisms that researchers believe Transformers are using to perform in-context learning. In the general case this is an analogical A'B' => AB type of "prediction".
Probably the best answer is, "The concept in your head labelled by 'reasoning' doesn't apply, but neither does the one you associate with 'unreasoning'."
It isn't doing classical reasoning per se, but neither does it match an unreasoning brute process.
In general, you should get used to this. Probably every AI from this point on out until they simply exceed us entirely and we can't mentally model them at all are going to be neither quite what we consider "human reasoning", but that doesn't mean they are "unreasoning" either. We'll be able to see certain flaws, but then again they will increasingly be able to see our flaws too.
> It isn't doing classical reasoning per se, but neither does it match an unreasoning brute process.
What we call "classical reasoning" is also running on neural network trained by predicting data.
I think the distinction is only there because we mix levels of abstraction - when talking about AI we focus on the lowest levels, when talking about people we focus on the higher levels.
Try asking LLM to explain its reasoning and after dismissing the "I'm just AI" disclaimers it will usually generate logical chain that could just as well be generated by a human. Even if you think it's not "really" reasoning - just pretending - pretending means it's running a virtual machine and that virtual machine is reasoning. Same difference.
> (Imagine trying to solve a math problem using System 1 alone: You can’t stop and do the arithmetic, you just have to spit out the first answer that comes to mind.)
Uh, for problems below some level of complexity, that's exactly how I do it, I just know the answer. I assume everyone does.
Doing math implicitly is kind of a prerequisite for writing good code.
If you can't count up letters, the chance that the code for anything will be fundamentally broken skyrockets.
I don't really care if Bard can do something GPT can already do.
I always find myself using every LLM accessible to me if I have a serious question, because I expect variation; sometimes one is better than the others, and that's all I need.
A way of submitting a single input to multiple models would make for a nice tool.
There are generally two costs to compliance: actually compliance, and proving compliance. The latter is the concern in the EU. It's already gotten OpenAI in trouble in e.g. Italy. None of this means nobody should deploy LLMs in Europe. Just that there are unique costs that should be considered.
Hence "EU and adjacent." Swiss law incorporates the problematic elements of GDPR, namely, its complain-investigate model and unilaterally-empowered regulator.
It might take Bard 3 more iterations to reach the current level of ChatGPT, which to my surprise even managed to solve advanced linear algebra questions, while Bard was nowhere close to answering even basic questions in linear algebra
Is it really "getting better at logic and reasoning" though, or is it actually just another LLM like any other, and therefore just getting better at the appearance of logic and reasoning? The distinction is important, after all. One possibly leads to AGI, where the other does not (even though people who don't understand will likely believe it's AGI and do stupid and dangerous things with it). As I understand it, LLMs do not have any logic or reason, despite often being quite convincing at pretending to.
If you clarify the prompt a little (I used Write a python script that can produce this pattern:), GPT-4 gave me this:
To generate the given pattern, we need to identify the structure. After analyzing, it seems that the pattern follows a certain structure where the binary representation of each number from 1 to 15 is preceded by a certain number of zeroes. Here's a python script to generate such a pattern:
def generate_pattern():
    pattern = ""
    for i in range(1, 16):
        # Convert i to binary and remove the "0b" prefix.
        binary = bin(i)[2:]
        # Number of zeroes to add before each binary.
        zeroes = 5 - len(binary)
        # Add binary to pattern with zeroes.
        pattern += "0" * zeroes + binary
    return pattern

print(generate_pattern())
The script starts by iterating through numbers from 1 to 15. For each number, it converts the number to a binary string and removes the "0b" prefix that Python adds by default. It calculates the number of zeroes to add before the binary string to maintain a fixed width of 5 characters, and then adds the binary string to the pattern with the preceding zeroes. Finally, it returns the complete pattern.
Which outputs:
000010001000011001000010100110001110100001001010100101101100011010111001111
It's missing the first set of zeros, but matches otherwise.
The answer is wrong though (not just because it's missing leading zeros, but perhaps you didn't copy the right input?) and it's certainly not the shortest way to output that.
Not sure I follow; the answer matches minus the first leading zeros. Change the range to 0-32, and it matches exactly. So it pretty clearly recognized the pattern and produced working code.
This question is a pretty obscure benchmark. Another commenter has it just printing the string, as suggested.
If there's some weird math trick to get an optimal implementation, it's probably beyond the grasp of nearly all actual people.
Well, it's both dumb and smart: it's smart in the sense that it recognized the pattern in the first place, and it's dumb that it made such a silly error (and missed obvious ways to make it shorter).
This is the problem with these systems: "roughly correct, but not quite, and ends up with the wrong answer". In the case of a simple program that's easy to spot and correct for (assuming you already know to program well – I fear for students) but in more soft topics that's a lot harder. When I see people post "GPT-4 summarized the post as [...]" it may be correct, or it may have missed one vital paragraph or piece of nuance which would drastically alter the argument.
This is the simplest and most direct method to output the string. If you have a more complex task in mind, like generating this string according to a certain pattern, please provide more details.
May be possible to shave off a few bytes with f'..' strings, or see if there are any repeating patterns, I'm not the sort who enjoys "code golfing", but "use base-16 to represent a base-2 number more compactly" seems fairly obvious to me.
Oh right, the leading zeroes won't get printed; you need a formatting string with a specific width for that. I don't do much Python so I don't recall the exact syntax off-hand, but the point was: there is an obvious way to compact the number that can be done without any analysis of the number itself (or even looking at it, for that matter).
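In Python the idea looks roughly like this (a sketch of the hex-compaction approach described above, not an attempt at an actual golfed answer):

# Build the pattern the long way: 5-bit binary for 0..31.
pattern = "".join(f"{i:05b}" for i in range(32))

# Pack it into a single integer and print a compact hex literal for it...
n = int(pattern, 2)
print(hex(n))

# ...which can be expanded back, leading zeros included, with a
# width-specifying format string.
print(f"{n:0{len(pattern)}b}" == pattern)   # True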
While print(literal) is "cheating" if you ask for "create a program that generates ...", it is a very obvious thing to do if you want to go down that route.
The "more complex task in mind" was, of course, to generate the "shortest" program. GPT-4, by asking for a "certain pattern" is attempting to have you do the intellectual heavy lifting for it -- although in this case the intellectual lifting is quite light.
I would venture to guess most college graduates familiar with Python would be able to write a shorter program even if restricted from using hexadecimal representation. Agreed, that may be the 99th percentile of the general population, but this isn't meant to be a Turing test. The Turing test isn't really about intelligence.
I don't see how arbitrary questions like this substantially show AGI. If there is a common solution, it could simply look up the solution. Also, AGI could be present just not in this very niche problem (that 99.9% of humans can't solve).
The point of this "IQ Test" is to set a relatively low-bar for passing the IQ test question so that even intellectually lazy people can get an intuitive feel for the limitation of Transformer models. This limitation has been pointed out formally by the DeepMind paper "Neural Networks and the Chomsky Hierarchy".
The general principle may be understood in terms of the approximation of Solomonoff Induction by natural intelligence during the activity known as "data driven science" aka "The Unreasonable Effectiveness of Mathematics In the Natural Sciences". Basically, if your learning model is incapable of at least context sensitive grammars in the Chomsky hierarchy, it isn't capable of inducing dynamical algorithmic models of the world. If it can't do that, then it can't model causality and is therefore going to go astray when it comes to understanding what "is" and therefore can't be relied upon when it comes to alignment of what it "ought" to be doing.
PS: You never bothered to say whether the program you provided was from an LLM or from yourself. Why not?
I think the argument is that current and future AI advancements could lead to AGI. The people I've seen like Yudkowsky who are concerned about AGI don't claim that Chat-GPT is an AGI AFAIK. BTW, I disagree with Yud, but there's no reason to misconstrue his statements.
Yud is doing more than his share of generating misconstrual of his own statements, as evidenced by the laws and regulations being enacted by people who are convinced that AGI is upon us.
Ironically, they're right in the sense that the global economy is an unfriendly AGI causing the demographic transition to extinction levels of total fertility rate in exact proportion to the degree it has turned its human components into sterile worker mechanical Turks -- most exemplified by the very people who are misconstruing Yud's statements.
>There are plenty of those who purport AGIs threaten us and conflate "existence" with "potential". This is aimed at those driven to hysterics by such.
I'd hazard a guess that the Venn diagrams of "those who purport AGIs threaten us and conflate 'existence' with 'potential'" and of "people who grok binary and can solve esoteric brain teasers using it" have very little overlap.
You might have more success with an example that's a little more accessible to "normies".
[0]: https://news.ycombinator.com/item?id=35893130