
In multiple subthreads here, people are embracing the idea that if machines can just write all the prose -- great, why should humans bother?

As with image generation, one thing I think we haven't adequately considered: once a sizable fraction of the available online data is machine-generated but isn't marked as such, and we begin training models on the outputs of the last generation of models, we structurally enter a different regime. A sequence of models, each trained on the prior model's output, can converge to meaningfully different behavior, because we're repeatedly, incrementally changing the task by changing the training data distribution. If _all_ the data is generated by the prior model, this process seeks a fixed point which reflects the model architecture, not the original training data. There's a very real possibility that using generative models more (and publishing their outputs) can make these models worse in the future.
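A toy sketch of that fixed-point dynamic, not anyone's actual training pipeline: a trivial "model" (a fitted Gaussian) is retrained each generation on its own published samples. The corpus size and generation count are arbitrary assumptions for illustration.

```python
import random
import statistics

rng = random.Random(0)

def fit(corpus):
    # "Train" a model: estimate the mean and stdev of the current corpus.
    return statistics.mean(corpus), statistics.pstdev(corpus)

def generate(mean, stdev, n):
    # "Publish" model output: sample a fresh corpus from the fitted model.
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Generation 0: "human" data from a wide distribution.
corpus = generate(0.0, 1.0, 10)

stdevs = []
for _ in range(500):
    mean, stdev = fit(corpus)           # train on whatever is currently online
    corpus = generate(mean, stdev, 10)  # next corpus is purely model output
    stdevs.append(stdev)
```

With small per-generation corpora, the estimated spread does a multiplicative random walk with downward drift, so the diversity of the data collapses over generations; where it ends up is a property of the model family and the sampling loop, not of the original data.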

Weirder, however, is that no one has really had the opportunity to look at what happens to human language when a sizable fraction of what we read is produced by these models. Will we normalize any quirks of their output? Will we reproduce or incorporate their idiosyncratic features into our own writing? How will we adapt in a changing linguistic environment?



I'm not too worried about convergence in models trained on prior models' output. Eventually the models won't work, people will notice, and they'll either create new training data or go back to training only on older data. Also there will always be people who "want to do it the old way" and will still create new art and new writing, which will seed the training data.

As for normalizing the quirks of the output -- maybe? But would that be so bad? Language changes all the time; it's constantly mutated by influencers (not the Instagram kind, but the ones that have existed for centuries). Look at how the prestige form of British English is called "the Queen's English": it shifted toward how the Queen spoke, since she ruled for so long and was very influential in that society.

Also it should be noted that some news articles, especially in finance, have been written by computers for over a decade now, and not a lot of people seem to have noticed.


As an applied ML practitioner: currently we get to choose how to use synthetic data vs "real" data, and in what proportions. This can be a valuable tool in our kit. To the degree that data in the wild becomes an unlabeled mix of the two, we functionally lose the ability to make those choices for any given model.
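The choice described above might look like this in a data pipeline. The helper and its names are hypothetical, not any particular framework's API; the point is that the mix is only controllable while the two pools stay labeled.

```python
import random

def build_corpus(real_docs, synthetic_docs, synthetic_fraction, size, seed=0):
    # Hypothetical helper: assemble a training set with an explicit
    # real/synthetic mix. This is only possible while the two pools are
    # labeled; once wild data is an unlabeled blend, synthetic_fraction
    # is no longer ours to choose.
    rng = random.Random(seed)
    n_synthetic = int(size * synthetic_fraction)
    n_real = size - n_synthetic
    return rng.sample(real_docs, n_real) + rng.sample(synthetic_docs, n_synthetic)

# Usage: a 1000-doc corpus that is 20% synthetic.
real = [f"real-{i}" for i in range(5000)]
synthetic = [f"syn-{i}" for i in range(5000)]
corpus = build_corpus(real, synthetic, synthetic_fraction=0.2, size=1000)
```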

> Eventually the models won't work and people will notice

For any product dependent on these models, that sounds like a pretty negative outcome ... and entirely consistent with my concern that "[t]here's a very real possibility that using generative models more (and publishing their outputs) can make these models worse in the future."

> and either create new training data or go back to only using older data to train

Given that LLMs currently learn about entities and concepts in the world via their training text, this breaks our ability to update a model on more recent topics of discourse without also shifting the real-vs-synthetic proportions.

> there will always be people who ... will still create new art and new writing, which will seed the training data

But if we aren't able to consistently separate the human-generated and machine-generated content, model training won't be able to place any extra weight on the human-generated stuff. The mere fact that human-generated output doesn't disappear entirely doesn't remove these issues.

The analogy is loose, but click fraud creates data exhaust that looks close to the behavior of a real user, and it can meaningfully disrupt your ability to optimize for clicks or to know how many actual end users interacted with your item of interest. The fact that some nonzero portion of the clicks are real doesn't erase these problems. And that's in a system which doesn't create the kind of feedback loop described above.


> Eventually the models won't work and people will notice, and either create new training data or go back to only using older data to train

That has already happened in "free market capitalism". Prices are supposed to be a signal of product quality, but they have become decoupled from it because companies set prices based on competitors' prices rather than on the quality of what they sell.


> As with image generation, one thing I think we haven't adequately considered is once a sizable fraction of the available online data is machine generated, but isn't marked as such, and we begin training models on the outputs of the last generation of models, structurally we can enter a different regime.

Ahh but the spark of true genius lights itself.




