
I get the feeling it was outsourced to a content farm, which outsourced itself to GPT-3


...which then outsourced itself to GPT-2...


Given the way OpenAI sources its training corpus and the number of people using GPT-3, I would not be surprised if GPT-4 winds up being trained on a large amount of GPT-3 output.

Think about it: the best niche GPT-3 has found is generating plausible spam. If you just need a lot of text but don't care what it says[0], you're going to generate it with the cheapest possible tool. OpenAI's training corpus is sourced through web crawls, so all of that spam is destined to be recycled back into the next generation of GPT. (There's a toy sketch of that loop below the footnotes.)

[0] For example, if you want to be able to post a bunch of political spam that looks like genuine comments on a web forum. See GPT-4chan[1] as a practical example of this.

[1] A fine-tuned version of GPT-J (not actually GPT-3, despite the name) trained on 4chan's politics board, /pol/.
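
Here's a half-serious sketch of that recycling loop, with everything shrunk down to a toy: the "model" is just the empirical token-frequency table of its corpus, and each generation's samples become the next generation's crawl. Nothing here resembles how GPT is actually trained, but the failure mode it demonstrates is real. Once a rare token misses a single sampling round, the model assigns it probability zero and it can never come back:

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 1000

    # Zipf-ish "human" distribution: a long tail of rare tokens.
    p = 1.0 / np.arange(1, VOCAB + 1)
    p /= p.sum()
    corpus = rng.choice(VOCAB, size=20_000, p=p)

    for gen in range(1, 11):
        counts = np.bincount(corpus, minlength=VOCAB)
        print(f"gen {gen:2d}: distinct tokens = {np.count_nonzero(counts)}")
        p_model = counts / counts.sum()                     # "train" on the crawl
        corpus = rng.choice(VOCAB, size=20_000, p=p_model)  # output becomes the next crawl

The distinct-token count only ever goes down: support lost to sampling noise is lost for good, and the distribution collapses toward the head.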


Same with most AI image-generation models. In 10 years, 99% of the images on the internet will be AI generated. Wouldn't it be regressive for the models to train on their own outputs? Should regulation require AI-generated media to label itself, or is it on us to train detectors that can tell model output from reality better than humans can?
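
On the labeling option: the mechanical part is almost trivially easy, which is sort of the problem. A minimal sketch using Pillow's PngInfo metadata API (the tag names here are made up, and a single re-encode strips them, which is exactly why the detector question doesn't go away):

    from PIL import Image
    from PIL.PngImagePlugin import PngInfo

    img = Image.new("RGB", (64, 64), color="gray")  # stand-in for model output

    meta = PngInfo()
    meta.add_text("ai_generated", "true")           # hypothetical tag name
    meta.add_text("generator", "some-image-model")  # hypothetical field
    img.save("labeled.png", pnginfo=meta)

    # The label survives only as long as nobody re-encodes the file.
    print(Image.open("labeled.png").text)  # {'ai_generated': 'true', ...}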


GPT-3 is lounging on a beach somewhere while GPT-2 does all the hard work!


or, judging from the usual IQ and age difference between those doing the actual work and their managers, probably the inverse...



