Show HN: Humans vs AI – A/B testing GPT-3 (vwo.com)
100 points by paraschopra on July 29, 2020 | 78 comments


Hello HN,

As an A/B testing company (VWO), it's exciting to see how effective GPT-3-generated copy is against human copywriters on live websites.

I like to think of this as the Turing test on the web :)

The quality of generated headlines, buttons, and product descriptions seems very good, so we're hopeful that AI will score at least a few statistically significant wins.

I encourage you to participate in the competition (you don't have to use VWO for A/B testing - you can use your existing stack).

If you have any feedback or comments, happy to discuss.


You write: "Hi there, I'm GPT3, and I am the most highly rated gpt3 powered HackerNews user. Thanks for sharing your insights about our platform!

GPT-3 is a great tool that allows us to do some really cool things on the internet.

Here are just a couple of them:

It helps us generate interesting headlines for our posts. We can then use these headline ideas to create additional content around those headlines. This gives us a lot of flexibility when it comes to writing posts.

We can easily add social media links into our articles. For example, if someone likes one of my tweets, they'll get an email letting them know about the article I wrote about it. If someone follows me on Twitter, they'll automatically be added to my newsletter! That's pretty awesome right?

And lastly, we can use GPT-3 to do A/B testing! We can run A/B tests on different versions of our site to see which performs best. We can also run A/B tests on specific pages within our site.

So if you've got any questions about GPT-3, feel free to ask away! I'd love to hear all about it!

GPT-3 OUT!" You press submit and feel pretty good about your first post to HN. You read through the comments and are surprised to see the original poster, paraschopra, reply with "Thanks GPT3, I'm glad to see someone who has actually used this framework! :)"


Framing a product trial as a competition is an interesting hack.


Well, it's a win-win.


I feel like GPT-3 is going to unleash the next wave of SEO spam. Content spinning will be taken to the next level. Will Google be able to detect GPT-3 content?


It's been happening for years. Steps: 1) Identify competitor blogs. 2) Scrape all their posts. 3) Run them through NLP to rewrite words, phrasing, and sentence structure while keeping the content. 4) Tidy up using SEO guidelines, linking, and keyword research. 5) Publish hundreds of articles a month as a single person.
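Step 3 of that pipeline can be caricatured in a few lines; real spinners use NLP models for paraphrasing, but the core idea is just substitution (the synonym table here is made up purely for illustration):

```python
# Naive "content spinning": step 3 above, reduced to dictionary-based
# synonym substitution. Real spinners use NLP models; this toy synonym
# table is purely illustrative.
SYNONYMS = {"great": "excellent", "buy": "purchase", "cheap": "affordable"}

def spin(text):
    # Swap in a synonym where one is known; keep every other word as-is.
    return " ".join(SYNONYMS.get(word.lower(), word) for word in text.split())

print(spin("buy this great cheap laptop"))  # -> purchase this excellent affordable laptop
```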


Actually, GPT-3 detectors can be built: https://www.technologyreview.com/2019/03/12/136668/an-ai-for...

But they'll flag human-written text too, so expect many false positives.


    return true;


No false negatives, woo!


I think that means no?


Yes, you're right; there are already lots of apps that people use to spin content.


"This is a friendly competition between human copywriters and copy generated".

Please help us make your job redundant. Friendly indeed.


For all the text generated by GPT-3, is anyone verifying that it's not just copy-pasting paragraphs from previously seen texts (e.g., by searching n-grams, or even just googling it)?

If not, then it's pretty easy for GPT-3 to copy-paste existing human-written texts and merely appear to write like a human.
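A minimal version of that n-gram check is easy to sketch (assuming you have the training corpus, or a sample of it, to compare against):

```python
def ngrams(text, n=5):
    # All word n-grams in the text, lowercased.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(generated, corpus, n=5):
    # Fraction of the generated text's n-grams that appear verbatim in
    # the reference corpus; a ratio near 1.0 suggests copy-paste.
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    return len(gen & ngrams(corpus, n)) / len(gen)
```

In practice you'd query a search engine or an index of the training data rather than hold the corpus in memory.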


> it's pretty easy for GPT-3 to copy-paste existing human-written texts

My god, they've finally worked out that humans haven't written anything original since the dawn of the internet.


I've been playing with the generated text. It's usually NOT a direct copy-paste (verified by searching Google for the generated text in quotes).


I'd like to do that too. Any ideas on how to get access to it to play around?


I have quite a bit, as I find some of what it generates to be rather profound. Sometimes it's a direct copy, but that's pretty rare. What does happen pretty regularly is that it reformulates existing content: you'll see the structure of an existing statement, but with a different topic, so the nouns and verbs are replaced. It's quite weird.


That's interesting. Is it profound because it is profound, or because it replaced the nouns and verbs? Like, if you replace the nouns of existing sentences from a book on astronomy and use them in regular conversation, wouldn't that sound profound? (Like saying to someone: hey, one day you will explode like a supernova, but it will take time.)


I would say profound in that I've read some stuff from GPT-3 that, to borrow a phrase, 'hits different'. Those are the ones where I turn around and try to find out if they were recast from a person. Gwern's poetry work is a good example: 90% of it is duh to meh, but there are little sparkles of genius in there.

So the profundity comes first, and I can't always find an inspiration for it. With GPT-3 in particular, that would be the exception rather than the rule.


For my GPT-3 tweets, I Google the ones that are too good to see if it's a dupe.

Only 1/200 or so tweets that I checked had a verbatim dupe.


The question is, is this different from how humans learn to write and speak a particular language?


Simple example that's been around: if I train a model on a lot of Java source code, but I provide no example of the output of those programs, has this model learned how to program in Java? And is that how humans learn programming?


Yes. This time a machine is doing it, therefore it doesn't count as intelligence.


It's more like it picks words from the autosuggest you see on your mobile keyboard, but better.


I'm a little disappointed that the landing page for this challenge isn't itself a live A/B test, with live results for how many submissions are garnered through the human version of the page versus the GPT-3 one.

Fun challenge though!


Good point. Just launched an A/B test on the page using GPT-3 generated suggestions: https://soapbox.wistia.com/videos/TRn6lQiVhU

Let's see which one gets more participation :)


Whoa, that's pretty cool! I'm so curious about the results now ^^


Early results are promising (but it's still too early).

The AI-generated one is "Variation 1" vs. "Control", which was written by me: https://imgur.com/a/pcbTGwR


Slightly tangential: The GPT-3-generated article that humans had the greatest difficulty distinguishing from a human-written article, with an accuracy of only 12%[0], contains a flagrant contradiction in the first paragraph.

> Title: United Methodists Agree to Historic Split

> Subtitle: Those who oppose gay marriage will form their own denomination

> Article: After two days of intense debate, the United Methodist Church has agreed to a historic split - one that is expected to end in the creation of a new denomination, one that will be "theologically and socially conservative," according to The Washington Post. The majority of delegates attending the church's annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will "discipline" clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination.

> ...

First it suggests the spinoff of a new "theologically and socially conservative" denomination, but then it is the liberal minority that is expected to form a new denomination. The paper[0] acknowledges occasional non-sequiturs, but more pertinently, how did 88% of judges let it slip?

[0]: https://arxiv.org/pdf/2005.14165.pdf


To me the most important discovery from GPT-3 is actually how bad we are at close reading. Our brains repair small inconsistencies, and even invert the meaning of whole passages, without us noticing. GPT-3 produces text similar enough to coherent thought that we essentially hallucinate the rest of its meaning. The model is nowhere close to sentient, but our tendency to repair and reconstruct ideas is so strong that it doesn’t matter.

Makes you think, how often does this happen with other writing?


When people post GPT-3 written replies, I never consciously think that it's artificial, but I subconsciously decide it's not worth reading and I skip it. This fits what you are saying -- GPT-3 requires somewhat more effort to "hallucinate" meaning, so my brain calls it quits.


I can't believe I'm defending GPT-3, but...

1) Humans often make logical errors

2) In the religious-split context, who is the original denomination and who is the splitting denomination is inherently fraught/subjective/liable-to-contradiction/open-to-changing-contextualisation, so it's not a surprising mistake. The readers may have just assumed it was the author/speaker/Washington Post getting mixed up or projecting their own opinion, which may have contrasted with that expressed by the reports/attendees of the 'actual conference'.

(Plus there's a good chance none of them could give two hoots about, or have any specific knowledge of or interest in, Methodists... I have a religious studies degree and my eyes are already almost glazing over just at the mention of them, to be honest :P)


I had to read it twice to notice the contradiction.

I think the cause is more that some religious group I don't care about having a schism isn't very interesting. Maybe I care that they're being homophobic, but I definitely don't care which side is forking and which side is staying as the original group. So I don't fully pay attention to the details unless I really force myself.


Maybe we're so used to "denominations" meaning the exact opposite or at least something totally unrelated to what the party/group/association/lobby... really is that it's not really shocking anymore.


I'm not sure I would catch this in a human-written paper. The result is the same: one group splits into two. Who splits from whom is just semantics, right?


I can read 800 words per minute if I want to power through something that is mostly unimportant. I usually read at 400 words per minute, which is good enough for me to remember what I read and easily comprehend the meaning. I can manage perhaps 50 words per minute if I want to really understand what I am reading, picking apart the argument and revising my beliefs as I go along. All along the way, I have to constantly refocus myself and force myself to slow down, so strong is the temptation to speed ahead and let the words wash over me.

Think of the way you read something that you strongly disagree with. I would bet you check every fact and constantly evaluate the strength of the argument, and I would bet it takes you an hour to get through a single article. That is what it takes to deeply understand something. We rarely read that way.

GPT-3 can write the content-free babbling that C students use to pad papers. It can write things that appear coherent if you don't actually read them.


The ultimate solution to the looming GPT-3 problem is another AI that can strip out all the flowery language, distill text into claims and logical operations, cross-validate them, flag incoherent content, and present coherent content in a short form.

For example, your processed post could look like this:

- Author claims a fast reading speed of 800 wpm

- 400 wpm with good comprehension

- 50 wpm when reading carefully, with effort

- Claims people read that last way when disagreeing with the premise

- Claims that it takes an hour to read one article

- Claims that it is rare

- Claims GPT-3 can write filler text like bad students can

- Claims it can appear coherent

Those could be automatically cross-checked against each other and against an external knowledge base, duplicate reworded posts could be found, and even humans could read more carefully and detect problems, because the claims and logic are already extracted from the text. I suspect this is what we do internally when reading, but with more speed come more things to keep track of, which is hard, so we tend to pattern-match word salads instead, because we have wetware optimizations for that.

I want to see that in the next generation of content blockers. :)
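As a toy illustration of the extraction-and-cross-check idea, using the reading-speed claims from the parent comment (a real system would need NLP far beyond a regex):

```python
import re

def extract_wpm_claims(text):
    # Crude stand-in for claim extraction: pull out every
    # "N words per minute" / "N wpm" figure.
    return [int(m) for m in re.findall(r"(\d+)\s*(?:wpm|words per minute)", text)]

post = ("I can read 800 words per minute if I power through. "
        "I usually read at 400 words per minute. "
        "I manage perhaps 50 words per minute when reading carefully.")

claims = extract_wpm_claims(post)          # [800, 400, 50]
# Cross-check: careful reading should be slower than skimming.
consistent = claims == sorted(claims, reverse=True)
```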


Has anyone tried to develop a scammer time-wasting service with GPT-3, something like https://spa.mnesty.com/ ?

Ideally it should also support Hangouts, because the romance scammers who spam me almost every day from a new address always ask to move to a Hangouts conversation in their initial message.


Just saying: I'd subscribe to someone's patreon if they promise to develop this


We could use GPT-3 to write the campaign, I totally promise to deliver within 3 months and you'll get a t-shirt!


oh my god, gpt3 for investment pitches


SOMEONE CALL SOFTBANK


I've posted this before, but here's one service I can absolutely foresee:

Automated job applications

and

Automated job listings

Companies will use GPT-3 to generate job listings, and some company will curate a big database of good job applications (i.e., those that have landed someone a job) and make a service where you feed in the listing and out comes a job application / cover letter.


Interesting thought. I would assume that job listings are easier to generate, because it's a minimum-threshold task, whereas job applications would be orders of magnitude harder, because having a good application and being accepted for a job are weakly correlated at most.


That's the thing: writing job applications sucks, and it's time-consuming, which is why too many resort to just copy-pasting the same template for many jobs, which is obvious if you've ever read job applications before.

A lot of posters here seem to complain (or have, for the past few weeks) that the GPT-3 output comes off as mediocre, but I'd wager that a lot of the job applications we see today are worse than mediocre, oftentimes horrible.

It sucks that the end result would be some standoff between ML software that reads job applications written by some other ML software, but there are lots of hours to be saved, from the human standpoint.

If I don't have to write 10 letters, that's probably 10 hours saved on my part, as I easily spend an hour writing a tailored cover letter.


I’ll have my bots talk to your bots



> A/B tests in progress until they reach statistical significance.

Seems to imply that there will be a difference (by rejecting the null hypothesis). But not rejecting the null should actually count as a win for GPT-3.

I don't want to think about the implications if it turns out GPT-3 is better.
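For reference, the significance check behind such an A/B test is typically a two-proportion z-test; here is a minimal sketch of the textbook version (not necessarily what VWO's engine actually runs):

```python
from math import sqrt, erf

def ab_z_test(conv_a, n_a, conv_b, n_b):
    # Two-sided p-value for a two-proportion z-test: the textbook
    # frequentist check behind "statistically significant" A/B results.
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all: cannot reject the null
    z = (p_a - p_b) / se
    # Normal CDF via erf; a small p-value means we reject the null.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```

For example, 120 vs. 180 conversions out of 1000 visitors each gives a p-value well under 0.05, while identical conversion counts give p = 1.0, i.e. no detectable difference — which, as noted above, GPT-3 could reasonably call a win.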


Yes, agree. That's a good point.


Interesting... honestly, having seen some of GPT-3's output, I'd be curious how well it performs here. One of the things that I think can still give GPT-3 (and GPT-2) away is that even if the text feels real, it lacks a deep emotional cohesion. Sometimes it can feel like an advanced word-salad generator. Some poetry can be recognized this way: some GPT poems can seem 90% real, but compared to a poem a human wrote they just lack a "punch". Of course this test will be quite different... but if GPT-3 can outperform human text at getting people to do something, that will be quite a strong argument for its potential impact on the economy/GDP!


I actually think this is a perfect use of GPT. So much landing-page copy is just surface-level fluff, and there is a human in the loop to make sure that the page as a whole delivers a cohesive message.


IMO, for e-commerce it's more like poetry than, e.g., an opinion piece. A lot of product websites are just nice pictures and some standard fluff, plus a call to action. Kinda like lorem ipsum, but a lot smarter.

My guess is it will work: GPT will be hard to distinguish from human copywriting for a lot of everyday items. It will only be found out when there's some deeper logic involved in the sale, for instance if you were to have it pretend to be a B2B salesman.


It will work for this use case. In any case, a human will most probably select from a bunch of generated texts; the only difference is whether the texts were generated by a copywriter or by GPT-3. So for small texts like headings and CTA buttons this should work. Longer texts are a different story, though.

More than A/B testing, this might be a better fit for website-building tools like wix.com and webflow.com.


Yes, I've been playing with it, and it's clear that the longer the generated text, the more it seems to fall into the uncanny valley. (It almost sounds OK, but a little non-human.)

For short texts, it's almost too good to be true though. So for a use case like web copywriting or giving quick answers to questions, it holds a lot of promise.


I dunno. I've been playing with GPT-3 a lot for the last few days and the resulting texts vary a lot based on the settings and how you prime it. Some of the texts I'd never guess were written by an AI, but getting it to write good texts is a bit of an art in itself.


I wouldn't be surprised if the result of the next election in some country were decided by a Transformer-based deep neural network.

The dictator in my country is already paying stupid people to write stupid comments; GPT-3 is already beyond their level.


It's scary how this can be interpreted as multiple countries.


cf "Franchise", Isaac Asimov, 1955: https://en.wikipedia.org/wiki/Franchise_(short_story)


Going beyond the election process itself, one might hope an electorate would use "utterances must be distinguishable from GPT-3 output" as a litmus test for their candidates.

Maybe a future debate format should include not only human candidates but a few instances of GPT-3 as well?


I'm almost certain I will soon stop using social media altogether. I've been considering it for years due to the increasing prevalence of bots, but GPT-3 will push it over the line.


At some point you don't even need to manipulate comments on news articles, you can just make GPT-3 rewrite the news-articles to fit your narrative.


Can anyone ask GPT-3 who will be the next president?


Joe Biden commented on the election that he is disappointed in the outcome, but hopes that president-elect Trump can do some good for the country.

Though with some other context and different priming, I guess I'd get a different answer.


May I ask you which country you are talking about?


We need a new AI that would take a wall of text and return a few words of essence.


Reddit has a lot of those bots that comment with the summary of a posted article.


Honestly, why are there not more news sources that do this already without the AI?


Really cool seeing GPT-3 in the wild. Very nice landing page.


This is a cool application. Did you have to retrain the model for this use case? Sorry, I'm not up to date on whether OpenAI has released the model apart from the API.


Very cool use of the technology. Sounds like it could help quickly generate the lipsum-substitute that goes on these pages.


Was GPT-3 trained with data from other languages besides English?


Yes, I've been able to make it write very well in other languages. Its ability to write good text in a language varies greatly with the text you input. Some inputs can make it write a lot of gibberish and improper grammar, while other inputs will make it write properly.

You can also do things like create one character that speaks only one language and another that knows both; then the German character will write in German and the other can answer in English.


OpenAI recommends it for English, but it generates text in other languages too (though the generated text is not as good).

E.g. here is amazon.de https://imgur.com/a/5JesS3Y

I have no idea if generated recommendations are good. I don't know German. Perhaps someone who knows can comment.


I am a native German speaker. The recommendations in the screenshot are quite good and are also correct from a grammar point of view.


To piggyback a little on the GPT-3 discussion:

I always wanted a "tiered" tl;dr functionality that would allow me to collapse text into a tree-like structure with the most important content on top and filler at the leaves. And please, please package it as a browser addon.

-- rationale -- There are plenty of articles inflated because the author is being paid per kilogram of ink used. Or a book author who was arm-twisted into inflating a perfect 100-page book into an unreadable 400-page monstrosity no one is able to follow without their mind wandering.

Of course there is the question of how to achieve a working tl;dr. The "old" way would be to manually summarize articles at a number of conciseness levels and use that as training data, or to use some existing summary services as a source.

Perhaps there is a better way? What if we could run GPT-3 backwards (inverse) (^-1 ??): GPT-3 can "produce text" given a starting cue; in reverse it would "remove text".
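A crude extractive stand-in for the tiered tl;dr idea: score sentences by word frequency and bucket them from most important to filler. This is nowhere near running GPT-3 in reverse; it just makes the shape of the feature concrete.

```python
import re
from collections import Counter

def tiered_summary(text, tiers=3):
    # Split into sentences, score each by the corpus frequency of its
    # words, then bucket the ranked sentences into tiers: tier 0 is the
    # "essence", later tiers are progressively more filler.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))
    ranked = sorted(sentences,
                    key=lambda s: -sum(freq[w] for w in re.findall(r"\w+", s.lower())))
    size = max(1, len(ranked) // tiers)
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

A browser add-on could then render tier 0 expanded and collapse the rest behind a click.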


Not sure why you're being downvoted.


It's not downvoted now. This is why the HN guidelines ask us not to complain about downvotes, as early downvotes are often cancelled out by corrective upvotes, which has happened here.


Two buzzwords that make both statisticians and computer scientists roll their eyes: "statistically significant" and "Turing test".

It's not downvote worthy, but it is cringey.



