Hacker News

I read a NYT article and publish a summary of facts that I learned: totally legit.

Train a model on NYT text that outputs a summary of facts that it learned: OMG literally murder.



Sounds like you didn't read the article. Here's a better synopsis:

I read a NYT article and publish an exact copy of that article on my website: copyright infringement.

Train a model on NYT text and it outputs an exact copy of that text: also copyright infringement.


A small number of outputs of ChatGPT are close enough to training articles to be (probably) copyright infringement.

What does that mean?

Look up "substantial non-infringing use" and this little court case:

https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....

Now spend a few million on lawyers and roll your dice.


In the Sony v. Universal case, Sony produced a tool that consumers used to "time-shift" a broadcast they were legally allowed to view. Similarly, you can rip your own CDs or photocopy your own books. That case never made reselling the content legal. OpenAI does not train ChatGPT on content you own - they train it on some undisclosed body of data that you may or may not have a legal right to access, and the model has been shown to reproduce that data nearly verbatim - they may even charge you for the pleasure.


So presumably when they fix that issue (which, if the text matches exactly, should be trivially easy) then would you accept that as a sufficient remedy?


Copyright infringement is not avoided by changing some text so it isn’t an exact clone of the source.

Determining whether a work violates a copyright requires holistic consideration of the similarity of the work to the copyrighted material, the purpose of the work, and the work’s impact on the copyright holder.

There is no algorithm for this; cases are decided by people.

There are algorithms that could detect obvious violations of copyright, such as the one you suggest which looks for exact matches to copyrighted material. However, there are many potential outputs, or patterns of output, which would be copyright violation and would not be caught by this trivial test.
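The "trivial test" described above can be sketched as verbatim n-gram overlap (the window size here is an illustrative assumption, not a legal standard). Note how a close paraphrase, which a court might still consider infringing, scores zero:

```python
def ngram_set(text: str, n: int = 8) -> set:
    """Return the set of n-word windows in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that appear verbatim in the source."""
    out_grams = ngram_set(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngram_set(source, n)) / len(out_grams)

source = "the quick brown fox jumps over the lazy dog near the riverbank at dawn"
copied = "the quick brown fox jumps over the lazy dog near the riverbank"
paraphrase = "a fast brown fox leaped over a sleepy dog by the river at sunrise"

print(verbatim_overlap(copied, source))      # 1.0: every 8-gram is verbatim
print(verbatim_overlap(paraphrase, source))  # 0.0: same idea, no exact match
```

The second result is exactly the failure mode the comment points out: a reworded copy sails past an exact-match filter.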


And you think that it would be impossible to train a model to avoid outputs that are substantially similar to training data?


I certainly don't think it's impossible, but I think it is a hard problem that won't be solved in the immediate future, and creators of the data used for training are right to seek to stop wide availability of LLMs that regurgitate information they worked hard to obtain.


I think it will be a bit easier than you believe. The reason why it hasn’t been done yet is that there hasn’t been a compelling economic reason to do so.


Basically, ya. It's not enough to change just a couple words around. But ya, there's probably some way to engineer around the problem.


> then would you accept that as a sufficient remedy?

Probably not until they pay him a hefty copyright fee.


Fair use is intended for humans, much like copyright in general.

If you can't copyright AI-generated pieces, then why would fair use apply to LLMs?


> Fair use is intended for humans.

Is it? Can you quote relevant legislation or case law?


That's why fair use law will have to be updated. Just letting your intellectual property be used as free training material is not sustainable.

Also, remember that copyright law wasn't always there in the first place.


Because it's not just summarizing the bare facts. It's a parrot.



