I've been arguing since ChatGPT came out that LLMs should fall under fair use as a "transformative work". I'm not a lawyer and this is just my non-expert opinion, but it will be interesting to see what the legal system has to say about this.
I'm sure the NYT uses dictionaries, encyclopaedias and style books verbatim as well. And they don't invent the facts they write about. As journalists they are compiling and passing along other knowledge. You usually don't get a piece of their income when a journalist quotes you verbatim (people usually don't get paid for interviews).
I don’t doubt it does. It’s easy to get it to spit out long answers from Stack Overflow verbatim, I’ve done it. Maybe some of the “transformative” nature of the LLM output is the removal of any authorship, copyright, license, and edit history information. ;) The point here is to supplant Google as the portal of information, right? It doesn’t have new information, but it’s pretty good at remixing the words from multiple sources, when it has multiple sources. One possible reason for their legal woes wrt copyright is that it’s also great at memorizing things that only have one source. My college Markov-chain text predictor would do the same thing and easily get stuck in local regions if it couldn’t match something else.
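For anyone who hasn't built one: here's a minimal sketch of the kind of Markov-chain predictor described above (names and the toy sentence are mine, just for illustration). With only one source text, every prefix has exactly one observed continuation, so "generation" degenerates into verbatim recall, which is the same memorization failure mode being described for single-source training data.

```python
import random
from collections import defaultdict

def train(tokens, order=2):
    """Build an order-n Markov model: prefix tuple -> list of observed next tokens."""
    model = defaultdict(list)
    for i in range(len(tokens) - order):
        model[tuple(tokens[i:i + order])].append(tokens[i + order])
    return model

def generate(model, seed, length=20):
    """Walk the chain from a seed prefix, sampling among observed continuations."""
    out = list(seed)
    for _ in range(length):
        choices = model.get(tuple(out[-len(seed):]))
        if not choices:
            break
        out.append(random.choice(choices))
    return " ".join(out)

# A single training source: each prefix has exactly one continuation,
# so the "generated" text is just the source reproduced word for word.
text = "the quick brown fox jumps over the lazy dog".split()
model = train(text, order=2)
print(generate(model, seed=("the", "quick")))
# prints: the quick brown fox jumps over the lazy dog
```

Feed it many overlapping sources and `random.choice` actually has choices to make; feed it one and it's a lossy-looking but lossless copy machine.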
These tired 'fair use' excuses from AI bros don't hold up when GPT has reproduced the article text verbatim, word for word, and monetized it without permission from the copyright holder and source (NYT). That's copyright violation 101. Full stop.
Again, just like Getty v. Stability, this copyright lawsuit will end in a licensing deal. Apple played it smart by striking licensing deals to train their GPT [0]. But this time, OpenAI knew they could get a license to train on NYT articles and chose not to.
- the purpose and character of the use
- the nature of the copyrighted work
- the amount and substantiality of the portion taken
- the effect of the use upon the potential market
Literally every single one of these factors has very complicated precedent, and each one is an open question when it comes to AI. Since fair use is a balancing test, this could go any number of ways.
Stability took the easy way out because they didn't have billions of dollars to play around with and Microsoft to back them. Let's see what OpenAI does but calling everyone who disagrees with your naive interpretation of fair use "AI bros" is doing everyone a disservice.
I generally tend to downvote comments that use "x bros" for pretty much any x on sight for that reason. It's exceedingly rare for such a comment to be much more than a thinly veiled insult with little substance. Sometimes I might even agree with the insult, but it's still rarely appropriate here.
Young males that wear Tensorflow branded muscle tank tops and drive Mitsubishi Eclipse convertibles with the vanity plate OVERFIT. They are everywhere these days.
Thank you for the absurd visual. The vanity plate, especially, was worth saving for last. Somehow, the car is well suited, also. Love how they prefer Tensorflow over Pytorch, too.
This seems like a reasonable opinion when you think about the training data size and imagine that any given output is some kind of interpolation of some unknown large number of training examples, all from different people. If it’s borrowing snippets from tens or hundreds or thousands of sources, then whose copyrights are being violated? Remixing in music seems to be withstanding some amount of legal scrutiny, as long as the remix borrows from multiple sources and the music is clearly different and original.
It gets harder to stand behind a blanket claim that LLMs or any AI we’ve got falls under fair use when they keep reproducing complete, identifiable individual works and clearly violating copyright in specific instances. The models might be remixing and/or transformative most of the time, but we have proof that they don’t do that every time… yet. Maybe the lawsuits will be the impetus we need to fix the AIs so they don’t reproduce specific works, and thus make the fair use claim solid and actually defensible?
It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.
>It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.
Why do you think the architecture is important?
If I have a computer program and it outputs an entire copyrighted poem, then the answer to "is this a copyright violation?" SHOULD NOT depend on the architecture of the program.
It's not stored in ChatGPT actually, unlike Google's web search cache where it is stored verbatim, can be recalled perfectly, and is still fair use.
Fair use has nothing to do with reproducibility. LLMs are more clearly fair use than a search engine cache and those court cases are long settled. There's no world in which OpenAI doesn't win this entire thing.