Yes, but GP's idea of segregating AI-generated content is worth considering.
If you're training an AI, do you want it to get trained on other AIs' output? That might actually be interesting, but I think you might then want to have both: an AI trained on everything, and another trained on everything except other AIs' output. So perhaps an HTML tag for indicating "this is AI-generated" might be a good idea.
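For concreteness, here is a rough sketch of what such a marker could look like. The names are purely hypothetical; as far as I know no such standard exists:

    <!-- Page-level: hypothetical meta tag declaring the whole document AI-generated -->
    <meta name="content-origin" content="ai-generated">

    <!-- Element-level: hypothetical data attribute marking a single AI-written passage -->
    <blockquote data-ai-generated="true">
      Text produced by a language model.
    </blockquote>

A crawler building a training corpus could then keep two sets: one that ingests everything, and one that drops any document or element carrying the marker.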
My 2c is that it is worthwhile to train on AI generated content that has obtained some level of human approval or interest, as a form of extended RLHF loop.
Ok, but how do you denote that approval? What if you partially approve of that content? ("Overall this is correct, but this little nugget is hallucinated.")
It apparently doesn't matter unless you somehow consider the entire Internet to be correct. They didn't only feed LLMs correct info. It all just got shoveled in and here we are.
I can see the value of labeling all AI-generated content so that an AI can be trained on purely non-AI-generated content.
But I don’t think that’s a reasonable goal. Pragmatic example: there are almost no optional HTML tags or optional HTTP headers that are used anywhere close to 100% of the time they apply.
Also, I think the field is already muddy, even before the game starts. Spell checkers, Grammarly, and machine translation all have AI components and likely affect most of the human-generated text on the internet. The heuristic of “one drop of AI” is not useful, and any heuristic more complicated than “one drop” introduces too much subjective complexity for a Boolean data type.
Yes, it's impossible. We'd have had to start years ago, and even then people wouldn't have the discipline to label content correctly, or at all. It can't be done.
Just ask anyone who works in teaching, or try any of the numerous AI detectors (they're all faulty).
Any current technology that can be used to accurately detect pre-AI content would necessarily imply that the same technology could be used to train an AI to generate content that skirts past the detector. Sure, there is going to be a lag time, but eventually we will run out of non-AI content.
No, that's the problem. Pre-AI era content a) is often not dated, so not identifiable as such, and b) also gets out of date. What was thought to be true 20 years ago might not be thought to be true today. Search for the "half-life of facts".