I've tested bard/gemini extensively on tasks that I routinely get very helpful results from GPT-4 with, and bard consistently, even dramatically underperforms.
It pains me to say this but it appears that bard/gemini is extraordinarily overhyped. Oddly it has seemed to get even worse at straightforward coding tasks that GPT-4 manages to grok and complete effortlessly.
The other day I asked bard to do some of these things and it responded with a long checklist of additional spec/requirement information it needed from me, when I had already concisely and clearly expressed the problem and addressed most of the items in my initial request.
It was hard to say if it was behaving more like a clerk in a bureaucratic system or an employee that was on strike.
At first I thought the underperformance of bard/gemini was due to Google trying to shoehorn search data into the workflow in some kind of effort to keep search relevant (much like the crippling MS did to GPT-4 in its bingified version), but now I have doubts that Google is capable of competing with OpenAI.
I don't think Google has released the version of Gemini that is supposed to compete with GPT-4 yet. The current version is apparently more on the level of GPT-3.5, so your observations don't surprise me.
I will say as someone who tries to regularly evaluate all the models Google's censorship is much worse than other companies. I routinely get "I can't do that" messages from Bard and no one else when testing queries.
As an example, I had a photo of a beach I wanted to see if it knew the location of and it was blocked for inappropriate content. I stared at the picture for like 5 minutes confused until I blacked out the woman in a bikini standing on the beach and resubmitted the query at which point it processed it.
It's refused to do translation for me because the text contains 'rude language'. It's blocked my requests on copyright grounds.
I don't at all understand the heavy-handed censorship they're applying when they're behind in the market.
their censorship is the worst of any platform. being killed from within by the woke mob apparently. it's a pity for google employees, they're going to be undergoing cost-cutting/perpetual layoffs for the foreseeable future as other players eat their advertising lunch.
On the flip side, I find that GPT-4 is constantly getting degraded. It intentionally returns only partial answers even when I direct it specifically not to do so.
My guess is that they are trying to save on compute by generating shorter responses.
I think at high traffic times it gets slightly different parameters that make it more likely to do that. I've had the best results during what I think are off-peak hours.
> I've tested bard/gemini extensively on tasks that I routinely get very helpful results from GPT-4 with, and bard consistently, even dramatically underperforms.
Yes. And I don't buy the lmsys leaderboard results where Google somehow shoved a mysterious gemini-pro model to be better than GPT-4. In my experience, its answers looked very much like GPT-4 (even the choice of words) so it could be that Bard was finetuned on GPT-4 data.
Shady business when Google's Bard service is miles behind GPT-4.
True, what is most puzzling about it is the effort Google is putting into generating hype for something that is at best months away (by which time OpenAI will likely have released a better model)...
My best guess is that Google realizes that something like GPT-4 is a far superior interface to interact with the world's information than search, and since most of Google's revenue comes from search, the handwriting is on the wall that Google's profitability will be completely destroyed in a few years once the world catches on.
MS seems to have had that same paranoia with the bingified GPT-4. What I found most remarkable about it was how much worse it performed, seemingly because it was incorporating the top n Bing results into the interaction.
Obviously there are a lot of refinements to how a RAG or similar workflow might actually generate helpful queries and inform the AI behind the scenes with relevant high quality context.
I think GPT-4 probably does this to some extent today. So what is remarkable is how far behind Google (and even MS via its bingified version) are from what OpenAI already has available for $20 per month.
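To make the RAG idea above concrete, here is a minimal sketch of that pattern: retrieve a few relevant snippets, then prepend them as context for the model. The corpus, the keyword-overlap scoring, and the prompt template are all illustrative assumptions, not any vendor's actual pipeline (real systems typically use embedding similarity rather than word overlap).

```python
def score(query: str, doc: str) -> int:
    """Naive relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], top_n: int = 2) -> list[str]:
    """Return the top_n highest-scoring documents for the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:top_n]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Assemble retrieved snippets into a context block ahead of the question."""
    context = "\n".join(f"- {snippet}" for snippet in retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

# Toy corpus; a real deployment would index web results or documents.
corpus = [
    "Moka pots brew coffee using steam pressure.",
    "The Eiffel Tower is in Paris.",
    "Stainless steel moka pots resist corrosion better than aluminum.",
]
print(build_prompt("which stainless steel moka pot is best", corpus))
```

The quality of the final answer hinges almost entirely on the retrieval step, which is presumably where the bingified version fell down: stuffing low-quality top-n search results into the context degrades rather than informs.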
Google started out free of spammy ads and has increasingly become like the ads-everywhere-in-your-face, spammy stuff that it replaced.
GPT-4 is such a refreshingly simple and to the point way to interact with information. This is antithetical to what funds Google's current massive business... namely ads that distract from what the user wanted in hopes of inspiring a transaction that can be linked to the ad via a massive surveillance network and behavioral profiling model.
I would not be surprised if within Google the product vision for the ultimate AI assistant is one that gently mentions various products and services as part of every interaction.
the search business has always been caught between delivering simple and to the point results to users and skewing results to generate return on investment to advertisers.
in its early years google was also refreshingly simple and to the point. the billion then trillion dollar market capitalization put pressure on them to deliver financial results, and the ad spam grew like a cancer. openai is destined for the same trajectory, if only faster. it will be poetic to watch all the 'ethical' censorship machinery repurposed to subtly weight conversations in favor of one brand or another. pragmatically, the trillion dollar question is what the openai take on adwords will be.
Ads are supposed to reduce transaction cost by spreading information to allow consumers to efficiently make decisions about purchases, many of which entail complex trade-offs.
In other words, people already want to buy things.
I would love to be able to ask an intelligence with access to the world's information questions to help me efficiently make purchasing decisions. I've tried this a few times with GPT-4 and it seems to bias heavily toward whatever came up in the first few pages of web results, and rarely "knows" anything useful about the products.
A sufficiently good product or service will market itself; marketing spend or brand marketing is rarely necessary for those rare exceptional products and services.
For the rest of the space of products and services, ad spend is a signal that the product is not good enough that the customer would have already heard about it.
With an AI assistant, getting a sense of the space of available products and services should be simple and concise, without the noise and imprecision of ads or the clutter of "near miss" products and services (the "reach" that companies paid for).
The bigger question is which AI assistant people will trust they can ask important questions to and get unbiased and helpful results. "Which brand of Moka pot under $20 is the highest quality?" or "Help me decide which car to buy" are the kinds of questions that require a solid analytical framework and access to quality data to answer correctly.
AI assistants will act like the invisible hand and should not have a thumb on the scale. I would pay more than $20 per month to use such an AI. I find it hard to believe that OpenAI would have to resort to any model other than a paid subscription if the information and analysis is truly high quality (which it appears to be so far).
I did exactly that with a custom GPT and it works pretty well. I did my best to push it to respond with its training knowledge about brand reputation and avoid searches. When it has to resort to searches I pushed it to use trusted product information sources and avoid spammy or ad-ridden sites.
It allowed me to spot the best brands and sometimes even products in verticals I knew nothing about beforehand. It’s not perfect but already very efficient.
The ad model already evolved to take attribution/conversion from different sources into account (although there are a lot of spammy implementations), but it took many years for Google to make YouTube/mobile ads profitable, and now adoption is much faster.
> And I don't buy the lmsys leaderboard results where Google somehow shoved a mysterious gemini-pro model to be better than GPT-4.
What do you mean by "don't buy"? You think lmsys is lying and the leaderboard doesn't reflect the results? Or that Google is lying to lmsys and has a better model to serve exclusively to lmsys but not to others? Or something else?
Most likely the latter. Either Google has a better model which they disguise as Bard to make up for the bad press Bard has received, or Google doesn't really have a better model, just a Gemini Pro fine-tuned on GPT-4 data to sound like GPT-4 and rank high on the leaderboard.
> Either Google has a better model which they disguise as Bard
Why wouldn't they use this model in bard then?
Anyway, this is an easily verifiable claim: are there any prompts that consistently work at lmsys but not in the bard interface?
> fine tuned on GPT-4 data to sound like GPT-4 and rank high
This I don't get. Why would many different random people rank a bad model that sounds like GPT-4 higher than a good model that doesn't? What is even the meaning of "better model" in such a setting, if not user preference?
I guess Pro is not supposed to be on par with GPT-4. That would be Ultra, coming out sometime in the first quarter. I’m going to reserve judgement till that is released.
I think there’s bias in the types of prompts they’re getting. In my personal experience, Bard is useful for creative use cases but not good with reasoning or facts.
Here is a simple maths problem that GPT-4 gets right but Bard (even the Gemini Pro version) consistently gets wrong: “What is one (short scale) centillion divided by the cube of a googol?”
But you are right, we don’t know the types of prompts Chatbot Arena users are submitting. Maths problems like that are probably a small minority of usage.
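For anyone who wants to check that test prompt: a short-scale centillion is 10^303 and a googol is 10^100, so dividing by the cube of a googol (10^300) should leave 10^3. Python's arbitrary-precision integers make this trivial to verify:

```python
# Verify the arithmetic behind the test prompt.
centillion = 10 ** 303      # short-scale centillion
googol = 10 ** 100
answer = centillion // googol ** 3   # 10**303 / 10**300 = 10**3
print(answer)  # 1000
```

So the correct answer is exactly one thousand; a model that mixes up short- and long-scale centillion, or slips on the exponent subtraction, will get it wrong.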
One other thing I notice: if you ask about controversial issues, both GPT-3.5/4 and Bard can get a bit “preachy” from a progressive perspective - but I personally find Bard to be noticeably more “preachy” than OpenAI at this (while still not reaching Llama levels)
In my experience, Bard is not comparable to GPT-3.5 in terms of instruction following; it sometimes gets lost in complex situations and then the response quality drops significantly. GPT-3.5 just has a much better feel, if that is a word for evaluating LLMs. And Bard is just annoying if it can't complete a task.
Also, hallucinations are wild in Gemini Pro compared to GPT-3.5.
Just a note, AFAIK it was only available in the US.
It was usable via VPN with a US IP address, and whenever I tried it without a VPN, Bard reported not using Gemini when asked, even when asked in English.
I get good results through ChatGPT image generation but mostly disappointing ones when using DALL-E directly. Not sure if my prompt game is just sorely lacking or if there's something else involved via ChatGPT.
Apparently, but when I use DALL-E 3 on OpenAI, the images it generates look like shit: under-developed, with crappy eyes and hands, the kind of typical mutant stuff you see with AI-generated images. Bing seems to be much better at those types of details out of the box.