Hacker News

I read a NYT article and publish a summary of facts that I learned: totally legit.

Train a model on NYT text that outputs a summary of facts that it learned: OMG literally murder.



Sounds like you didn't read the article. Here's a better synopsis:

I read a NYT article and publish an exact copy of that article on my website: copyright infringement.

Train a model on NYT text and it outputs an exact copy of that text: also copyright infringement.


A small number of outputs of ChatGPT are close enough to training articles to be (probably) copyright infringement.

What does that mean?

Look up "substantial non-infringing use" and this little court case:

https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....

Now spend a few million on lawyers and roll your dice.


In the Sony v. Universal case, Sony produced a tool that consumers used to "time-shift" a broadcast they were legally allowed to view. Similarly, you can rip your own CDs or photocopy your own books. That case never made reselling the content legal. OpenAI does not train ChatGPT on content you own - they train it on some undisclosed body of data that you may or may not have a legal right to access, and the model has been shown to reproduce that data nearly verbatim - they may even charge you for the pleasure.


So presumably when they fix that issue (which, if the text matches exactly, should be trivially easy) then would you accept that as a sufficient remedy?


Copyright infringement is not avoided by changing some text so it isn’t an exact clone of the source.

Determining whether a work violates a copyright requires holistic consideration of the similarity of the work to the copyrighted material, the purpose of the work, and the work’s impact on the copyright holder.

There is no algorithm for this; cases are decided by people.

There are algorithms that could detect obvious violations of copyright, such as the one you suggest which looks for exact matches to copyrighted material. However, there are many potential outputs, or patterns of output, which would be copyright violation and would not be caught by this trivial test.
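The "trivial test" described above can be sketched as verbatim n-gram overlap (the window size here is an illustrative assumption, not a legal standard). Note how a close paraphrase, which a court might still consider infringing, scores zero:

```python
def ngram_set(text: str, n: int = 8) -> set:
    """Return the set of n-word windows in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in the source."""
    out_grams = ngram_set(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngram_set(source, n)) / len(out_grams)

source = "the quick brown fox jumps over the lazy dog near the riverbank at dawn"
copied = "the quick brown fox jumps over the lazy dog near the riverbank"
paraphrase = "a fast brown fox leaped over a sleepy dog by the river at sunrise"

print(verbatim_overlap(copied, source))      # 1.0: every 8-gram is verbatim
print(verbatim_overlap(paraphrase, source))  # 0.0: same idea, no exact match
```

The second result is exactly the failure mode the comment points out: a reworded copy sails past an exact-match filter.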


And you think that it would be impossible to train a model to avoid outputs that are substantially similar to training data?


I certainly don't think it's impossible, but I think it is a hard problem that won't be solved in the immediate future, and creators of the data used for training are right to seek to stop wide availability of LLMs that regurgitate information they worked hard to obtain.


I think it will be a bit easier than you believe. The reason why it hasn’t been done yet is that there hasn’t been a compelling economic reason to do so.


Basically, ya. It's not enough to change just a couple words around. But ya, there's probably some way to engineer around the problem.


> then would you accept that as a sufficient remedy?

Probably not until they pay him a hefty copyright fee.


Fair use is intended for humans, much like copyright in general.

If you can't copyright AI-generated pieces, then why would fair use apply to LLMs?


> Fair use is intended for humans.

Is it? Can you quote relevant legislation or case law?


That's why fair use law will have to be updated. Just letting your intellectual property be used as free training material is not sustainable.

Also, remember that copyright law wasn't always there in the first place.


Because it's not just summarizing the bare facts. It's a parrot.



