What you described is entirely fair use, actually.
Not only that, look at a few news articles from Tier 2 and down publications, and you'll realize that almost all of them are directly sourced from NYT and others. They'll say "so and so happened, according to The Times" (and usually link the article there)
> What you described is entirely fair use, actually
Just like during the pandemic how everyone became an epidemiologist, suddenly everyone's a copyright lawyer. I'll just dispute your assertion by saying:
1. Questions of fair use are famously gray, and anyone who declares something as "entirely fair use", with no caveats, is nearly always wrong except for the most obvious cases, which the given example is most definitely not. A judge has wide latitude in determining fair use.
2. People should familiarize themselves with the four factors of fair use determination. In particular, if a work is purely derivative of a source work and substantially negatively impacts the market for the original work, it's very likely to not be considered fair use.
This is the weakest part of the case(s) against OpenAI. "Derivative work" is a legal term of art meaning a direct adaptation, like writing a screenplay of a book or translating a book into another language.
NYT has a stronger case than Sarah Silverman here because they can show actual 'memorized' text rather than just summarization, but given that those memorizations are a) an unintended failure mode of the training process, and b) from an older version of the model that has been updated to no longer regurgitate memorized text, it's not really clear how in current form GPT could possibly be considered a derivative work.
"Transformative" seems to fit a lot more than "Derivative".
On the other hand, it's understandable why NYT is worried. OpenAI itself says that occupations like Writers and Authors; Web and Digital Interface Designers; News Analysts, Reporters, and Journalists; and Proofreaders and Copy Markers are "90-100% exposed" to what OpenAI is building.
Yup. News has been rampant with speculation, hearsay, and propaganda all my life. Content mills and astroturfers already bury the truth or relevant stories with noise.
I don't buy into all these "dangers". The advent of cars did not decrease the number of drivers, and it introduced various new jobs that had not been available to a lot of people. And the rise of computers did not make the workforce smaller but instead opened many more opportunities for a lot of people.
I think focusing on lost jobs is the wrong angle to take on this. The issue is how the content being used is compensated. OpenAI isn't paying writers to train their engine.
I don't care that the car replaced the horse carriage because it didn't need to compensate horses nor handlers to do so. AI being the newest iteration of scraping data from artists, writers, etc. to profit millions off of is directly using the "horse handler's" work. If these LLMs threw NYT a royalty to use their articles as training material, there wouldn't be a lawsuit.
A question is whether the new model still intrinsically embeds the source text, but this is later filtered in the output, or if it no longer embeds the text at all.
I would think an existing model could bootstrap a copyright free training corpus by completely rewriting/paraphrasing copyrighted material with semantic fidelity for training of the next model to completely eliminate memorization of copyrighted works. That might pose an interesting obstacle to copyright challenges, bootstrapping your way into a clean room. Although, tweaking the architecture to either eliminate memorization, or eliminate high fidelity reproduction of verbatim training data seems far more expedient and less costly.
Roll back 20+ years ago on Slashdot and you'll see the exact same thing.
Copyright has been a hot button issue on the internet for decades. People end up thinking (rightly or wrongly) that they understand it without being a lawyer.
Quite literally, not even the lawyers or courts understand it. This is very much a "learn as you go" exercise for humanity in general at this point in time.
It seems like everything in tech is in the learn as you go phase. Everything is changing so rapidly that there can’t be experts. Just people that are able to adapt quickly.
I only see this phenomenon speeding up. Strange times.
One of my biggest gripes is a somewhat adjacent issue where everyone thinks they're an American copyright lawyer and that American copyright law is universal.
It's very possible that the example provided above is an example of fair use in some country, and that the website offering that service could be hosted there.
Completely agree. Copyright should be abolished. All intellectual work is information, information is just bits and bits are just numbers. It's quite simply delusional to believe you can own numbers in the 21st century, the age of information and ubiquitous globally networked pocket supercomputers.
This is just a felony contempt of business model issue. Computers invalidated their business models and they're doing everything they possibly can to hang on for dear life. Society needs to move on already.
This goes too far. Digital media are not only long series of numbers. They are often difficult-to-create expressions in image, video, and even interactive forms; regardless of their serialization format.
Books are just strings of letters, yet copyright has still been useful to increase the volume and utility of books.
All that said, I do find the life+70y an absurdly long time.
The market at large will determine that. If people value cheap AI generated images more than talented human curated art then that's what it will be. If a market exists to buy unique pieces where an artist put brush to canvas and priced their work at $1000 instead of the cheap $10 poster that can be mass produced then that's what it will be. If no one wants to pay $1000 for your unique piece, then the market has spoken and your art is not worth that much. Like everything else, an equilibrium will be reached. Good artists will be fine. The other 99% of self declared artists will fade away into obscurity.
None of that is what copyright protects. And it lessens the argument when you can argue that LLMs are essentially stealing a human artist's work to be used to generate cheap images. Similar to how if you took commissioned art, printed out 1000 copies, and sold them for $1 a piece.
Copyright means that you need to at least pay that artist you stole from in some way, which the government enforces so artists don't stop creating.
I propose getting paid before doing the work for the actual labor of creation. Crowdfunding, patronage, commissions, sponsorships all seem like ethical ways to get things done sustainably. That way creators get paid before they work, not after.
We must strengthen these business models that don't depend on artificial scarcity because this number selling nonsense was over the second computers were invented. It's as dumb as asserting that you need permission to use memcpy or the mov CPU instruction.
How do you know what the value of the art will be before it's created? Guns N' Roses is a top 40 artist on Spotify nearly 35 years after producing an album. Should they not have been paid after 1991? If you argue that they were a popular band and therefore should have been paid accordingly up front, well what about their debut record, which sold 30 million copies? How would you predict that value before its creation (or even after)? If you're saying that only the labor has value, and all labor is valued equally, that sounds sort of like marxism, which could be fine, but it's hard to say how well artists would be supported in that case.
In the US, the original copyright length was 14 years, and then 28, and eventually the lifetime of the author plus 70 years. I think the intent of the law is economically justified, but the current length is outrageous.
You should be paid the accurate value of the labor. The pay should not scale more when no additional labor takes place.
This is how art worked for millennia; someone commissions a chapel roof painting, someone commissions a concerto, someone commissions a statue, someone buys a chair, etc.
Artists still do this today, and there is no issue determining value beforehand. Artists list their commission prices, or their hourly costs, etc. This is a perfectly normal thing that happens everyday.
>You should be paid the accurate value of the labor.
that's gone out the door in the digital age. Companies at this point have spent centuries trying to enforce this model while withholding stuff like stock and royalties, which would let workers take a part of what the company enjoys by profiting for decades off of a single (underpaid) piece of labor.
I don't exactly sympathize with a robot now trying to do the same. Pay your labor.
You do other things to make money and continue to make art for its own sake. If you get to the point that others want your art you get commissions. Just like most artists in the current system.
Alright, so then you're NOT going to pay artists for their labor?
In the current system, artists might work for many years on a single work, or work many years perfecting their craft before anyone wants to pay for their work. Copyright gives them a way to earn money in the future that compensates them for the work they did in the past. It incentivizes creativity. Don't get me wrong, I don't think copyright is perfect, but you really ought to think more about the system you're proposing, because it's not making much sense.
Unfortunately, it’s hard to explain these things to techies who only see the world in their one-sided startupy way. The fact that there’re starving creatives who have already been massively marginalized by the likes of Spotify of this world, means nothing to these tech workers who only see everything as numbers, or a “business model” to “validate”.
(full disclosure, I’m a techie who’s gradually woken up to the idea that the tech might just be the most abused way to exploit people)
I'm in games, where art and tech intersect. I 1000% empathize with the fact that art jobs exploit, abuse, and underpay people even if they at times may be doing more work than a junior web dev.
It's a bit ironic, because a lot of tech offers partial compensation in stock. Something else that really doesn't happen in games unless you work for like, the 3-4 largest studios. So they should at least understand that your compensation is not all based on labor for time worked.
Exactly. Somehow this idea that you keep getting paid for literally the same thing over and over again for work you did once is the absurdity. And ridiculously greedy.
It seems to have been invented by lawyers, for lawyers. Nobody else really benefits as much as they do. The whole entirety of society vs. a single profession of dubious morality.
>Somehow this idea that you keep getting paid for literally the same thing over and over again for work you did once is the absurdity
meanwhile, most tech is moving towards subscriptions?
Art is getting paid "non-greedily". People buy a song or art piece, and then people 10 years later buy a song or art piece. That's not one person paying twice for the same song, it's two people buying the same thing.
If people still value that art for that price later, I don't see how this is a "greedy" thing. Is art magically supposed to turn open source CC0 after 5 years? Tech sure doesn't work like that.
> How do you know what the value of the art will be before it's created?
I don't know. Anyone funding the work is accepting a risk.
> Should they not have been paid after 1991?
They definitely should get paid for their shows and live performances. The band itself can't be copied. Artists are extremely scarce.
Their art, however, is not. Once created, the scarcity of their recordings is artificial and fundamentally time limited anyway. Even if I were extremely tolerant of copyright, I'd argue for a term of only 5-10 years maximum with absolutely no possibility of extension.
In other words, even if we accept copyright as legitimate, they sure as hell shouldn't still be getting paid for some late 80s album. They've already been adequately compensated for those creations. If they want more, they should have to keep making new stuff so that they can benefit from new copyrights which will also expire after a short time.
Creators are not supposed to be able to strike gold once and then enjoy eternal royalties. Copyright must have short time frames or it's in breach of the social contract. The reality is we're doing creators a favor by pretending that it's hard to copy their stuff so they can make some money. We do this because they assured us that eventually all of it would belong to us: works would enter the public domain.
The copyright industry isn't keeping up their end of the bargain. They continuously pull the rug out from under us by extending copyright to the point we'll be long dead before our culture is returned to us. It's offensive and we should all stop pretending. They need reminding that public domain is the natural and default state of all intellectual work.
> How would you predict that value before its creation (or even after)?
I'd look at the artist's past work. If there is no past work, then I don't know.
> If you're saying that only the labor has value
I'm not saying that at all. Creations are valuable. Creators are valuable. The labor of creation is valuable.
Value is assigned to stuff by humans. Obviously humans value art. The price however is given by supply and demand. The fact is that the supply of intellectual works approaches infinity after they are created and therefore their prices approach zero. So it makes perfect sense to assign prices to the labor of creation but zero sense to assign a price to the product of creation. Copyright is an exercise in denying reality.
> and all labor is valued equally
I definitely did not say that. All labor is different. I value some creators a lot more than others. Some creators I don't value at all.
> that sounds sort of like marxism
I must apologize if I gave that impression. I hate marxism.
>In other words, even if we accept copyright as legitimate, they sure as hell shouldn't still be getting paid for some late 80s album.
Why not? The fact is that even if the album is free, there will be people paying Spotify $10/month to listen to it on demand. How is it fair that Spotify can profit from it for decades to come because they offer convenience, over the artist who made the music 10 years earlier and now relinquishes their art not even a quarter into a typical career?
Copyright is absurd now, but it's not a bad concept. I think the original copyright law of 14 + 14 worked well enough. Life expectancy increased so I'd increase it to 14 + 14 + 14 (or 10 years after the death of the original author, whichever comes first). You fund an artist for their typical career length (if they choose to extend twice) and once they are (near) retired the song is free to work off of. In the meantime you simply negotiate if you want to use their work.
First, you missed the "and". Do CliffsNotes, Wikipedia, etc. substantially impact the market for the original work? For example CliffsNotes does not - people who buy the CliffsNotes version typically already have the original work as well (for example from coursework). And Wikipedia may well do more to interest people in the original work than to replace it.
Second, you ignored the "purely derivative" bit. You have to look at to what extent the use is derivative or transformative. See https://en.wikipedia.org/wiki/Transformative_use for a bit about that. (Note, this is a legal term defined by various precedents. OpenAI can't just argue, "Turning it into an LLM is a transform, so it is transformative!") Since CliffsNotes is educational and Wikipedia is nonprofit, it is relatively easy for both to qualify as transformative.
As a result your response underscores the point that was made. There are a lot of shades of grey. You really can't just seize on a couple of phrases and key points, then jump straight to the answer. You have to understand how the courts will decide, and then accept that there is an actual judgment call whose outcome depends on the judge judging.
(I'm not a lawyer, but I have had excessive exposure to them in the past.)
The question was NOT whether it spreads information from the articles to people who wouldn't have paid for it. The question was whether it suppresses sales of the articles to people who otherwise might have paid for it.
That's a more complicated question of fact. Some people now read Wikipedia and won't buy the article. Some people encounter the reference on Wikipedia and decide to buy the article. Which happens more?
Publishers concluded that Wikipedia references are good for sales. And so jumped on the chance to cooperate with https://wikipedialibrary.wmflabs.org/. Which is therefore able to give free access to 90% of subscription only databases to you if you can prove that you're the kind of person who is likely to add citations to Wikipedia.
Legal questions are funny like that. You have to answer the question actually asked. If you merely answer another one that sounds similar to you, your answer is generally wrong.
I thought the question would be how it does this. If it can write NYT articles because it read them it has to arrive at the exact same words in the same sequence. Wikipedia has to copy and paste to achieve the same. So maybe the question actually asked does not apply.
I personally appreciate the semi truck sized loophole that is satire. One can include an entire copyrighted work within one's own work as long as the treatment of that other copyrighted work is parody / satire. This is a provision of US copyright law put in place to protect political satire, which can be anything, because politics is everything.
> Questions of fair use are famously gray, and anyone who declares something as "entirely fair use", with no caveats, is nearly always wrong except for the most obvious cases, which the given example is most definitely not. A judge has wide latitude in determining fair use.
You're the one presenting unfounded claims with confidence here. There is well established case law about not being able to copyright facts. If you are actually fully paraphrasing a presentation of facts / ideas and not just altering a couple of words here and there, then there is a very strong case for non-infringement.
> You're the one presenting unfounded claims with confidence here.
No, I'm not. On the contrary, I'm really looking forward to this case because I believe it will be a great test of a bunch of concepts that are totally novel in the world of copyright law as it applies to generative AI. The only things I am presenting with confidence are:
1. That anyone who declares that something is unambiguously fair use (or, contrarily, unambiguously infringing) is likely wrong. There is simply too much latitude by judges, and there have certainly been cases where a ruling went one way, only to be overturned on appeal.
2. While I certainly have an opinion on how I think this case will be decided, I'm not presenting that with unwarranted confidence. Instead, I linked that great article on the 4 factors of fair use determination because it's clear to me lots of people are saying "fair use!" on one side or the other with no understanding of the factors judges must actually consider when making a determination.
You seem to be shifting the topic of this thread. The GP comment is about paraphrasing news articles while I don't see anything in the NYT lawsuit about paraphrasing. Rather, the NYT is concerned with exact reproduction or near exact reproduction. I too am very curious about the outcome of this case and wouldn't care to bet either way on the outcome. I do have an opinion on what precedent would be better for our society but that doesn't mean I think that outcome is more likely.
However, none of that matters in this particular thread. There are well established precedents about paraphrasing news articles and they do not support the claim you made.
The "unfounded claims" were backed up by a link to Stanford on fair use and copyright. That's the opposite of being unfounded.
Remember. The NY Times does not have a record of filing frivolous lawsuits. Particularly not against companies with deep pockets. So it is almost certainly true that a lawyer who knows the law better than you thinks that this has a real chance. So you should be looking for flaws in trivial defenses that you can think up, rather than assuming that you know best.
For example take your copyright facts defense. That would be great if the NY Times was a phone book. They aren't, in addition to facts they offer analysis, editorial positions, and so on. For example I just asked ChatGPT, "In 2016, did the New York Times generally support or oppose President Trump?" I got back an answer talking about various kinds of concerns that the New York Times had, including an editorial titled, "Why Donald Trump Should Not Be President". The copy that ChatGPT needed to have to do that has a lot more than just facts in it.
Now if you paraphrased the NY Times like ChatGPT did when it answered me, you'd have a perfect fair use defense. But you aren't doing it for money, you didn't make a copy of all the NY Times, you aren't destroying the market for the NY Times, and you're legally able to own copyright in your transformed work. OpenAI is doing it for money, did copy all of the NY Times, is seriously impacting the market for NY Times articles, and ChatGPT generated text does not get a copyright.
Fair use is filled with shades of grey. Even if ChatGPT appears to do the same thing that you do, it is far less clear that OpenAI will enjoy the same level of fair use defense.
The Stanford link is just generic information about the fair use tests and does nothing to back up the assertion.
> They aren't, in addition to facts they offer analysis, editorial positions, and so on.
Those opinions and ideas are also not copyrightable. Only expressions of them are copyrightable, which is why paraphrasing facts, ideas and opinions is not a violation of copyright.
> Fair use is filled with shades of grey.
Yes, but not all those shades are equal. There is a long history of litigation showing that paraphrasing news articles is fine.
I would say it is arguable that is fair use, but the whole thing about fair use is that it is a defense, not a type of license or something you can preemptively apply. So whether or not it will be protected under fair use is actually not determined yet. In fact I would say that’s the entire debate here, right?
I have worked on many documentaries and any time we said “fair use” internally what we were implicitly saying is “nobody will come after us because they know that we are probably safe under fair use if this escalated.” But again, we could never preemptively apply it. We were just anticipating potential conflict and gauging how likely it was to occur.
He's talking about citing and quoting NYTimes articles, not republishing them verbatim. That said, it's very different if you're a publication that sometimes cites reporting from other publications vs. a website exclusively dedicated to indexing and summarizing NYTimes articles.
I couldn't get gpt to quote an actual nyt article no matter how hard I tried...it just hallucinated in the general style of a news article.
Presumably, if it can remember at least a paragraph or two of each article, then surely the same would be true of any text it ingested and the model size would approach the dataset size (probably actually much larger). I don't believe this is the case at all, even searching around, I've not found any good recent examples of it regurgitating copyrighted text verbatim.
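For anyone who wants to test this more systematically than eyeballing outputs, here is a minimal sketch of a verbatim-overlap check in Python. It finds the longest run of consecutive words shared between a source text and a model's output, using `difflib` from the standard library. The function name `longest_verbatim_run` and both sample strings are made up for illustration; a real test would need the actual article text and actual model output, and whitespace-split words are only a crude proxy for "verbatim".

```python
from difflib import SequenceMatcher

def longest_verbatim_run(source: str, output: str) -> list[str]:
    """Return the longest run of consecutive words shared by both texts."""
    a = source.split()
    b = output.split()
    # autojunk=False so frequent words are not silently ignored
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]

# Hypothetical strings, not real article text or real model output.
source = "the council voted on tuesday to approve the new budget after a long debate"
output = "reports say the council voted on tuesday to reject the plan"

run = longest_verbatim_run(source, output)
print(len(run), "shared words in a row:", " ".join(run))
```

A short shared run like this proves very little on its own, since common phrases overlap by chance. It only becomes interesting when the run stretches to whole sentences or paragraphs.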
It's cool to hate AI stuff if you're a creative atm. But gotta love those generative/algorithm based PS brushes, that's still real art!
"Indeed, the opening paragraph of "A Game of Thrones" by George R.R. Martin, with the chapter titled "Bran," starts as follows:
"The morning had dawned clear and cold, with a crispness that hinted"
And then it cuts off, whether that's because OAI now have an oh shit filter or just the model had access to the first page or publicly available articles quoting the first line, I'm not sure.
I tried other chapters and random sections and it could get a sentence or two right but then hallucinated. What's more likely, NYT and GRRM: that your works are being reproduced verbatim, or that Facebook, YouTube descriptions, fan tumblrs and hell, the publicly available and multiple GoT related wikis that include a variety of passages from the books were used as training data?
I don't think it's necessarily true that model size would need to be larger than dataset size. It's theoretically possible that the model encoding achieves significantly better compression than DEFLATE or GZIP or whatever compression algorithm is used to store the dataset itself.
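To make "compression ratio" concrete, here is a minimal Python sketch that measures what DEFLATE achieves on a piece of text, using the standard library's `zlib`. The helper name `deflate_ratio` and the sample text are made up for illustration. The sample is deliberately repetitive, so the ratio is far higher than it would be on real prose, and none of this says anything about what ratio a language model's weights actually achieve.

```python
import zlib

def deflate_ratio(text: str, level: int = 9) -> float:
    """Return original_size / compressed_size for UTF-8 text under DEFLATE."""
    raw = text.encode("utf-8")
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)

# Hypothetical sample: one made-up sentence repeated many times.
sample = "The quick brown fox jumps over the lazy dog. " * 200
ratio = deflate_ratio(sample)
print(f"DEFLATE ratio on repetitive sample: {ratio:.1f}x")
```

The point of comparison is that a general-purpose compressor only exploits repeated byte sequences, while a model that predicts text well could in principle need fewer bits per token. Whether that is what happens in practice is an empirical question.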
I think what wouldn't be covered is reproducing substantial portions of an article, especially if it's done without attribution. Tier 2 publications that fully reprint NYT or AP/Reuters articles are usually doing this via a paid News Service or Content License. See: https://nytlicensing.com/content/new-york-times-news-service...
Correct. But those 2nd tier sources don't have the NYT copy verbatim. Do you really think the US NFL, as an example, would let OpenAI use all of its recorded games as a way to train some new GenAI game framework to build better American Football games? No. All that material is copyrighted. Public media is going to move to a very awkward era of ownership and licensing because all of these large companies looking to make a buck off public data sets are doing very little to make the economic model less one sided.
I hope the NYT prevails here, personally. Models will be (and currently are) tainted by data they should not contain, and for longer term privacy concerns this needs to be addressed early and have significant consequences, or we're headed towards a world where this type of technology will make our ad-targeted world seem like a much more manageable past.
Why do you say that? Commercial vs noncommercial use is a primary factor in the “purpose” prong of the fair use balancing test and a significant one in the “market effects” prong.
That a use is noncommercial is often a deciding factor in the success of a fair use defense. GP is overstating it though, since it’s still one of many factors.
Whether or not the use is commercial is certainly one of the considerations, but it's not the most significant one generally. There certainly can be specific cases where it's very significant, of course.
But what I was arguing was that a use is not "fair use" merely because it's noncommercial in nature. I cannot make copies of movies and give them away on the street for free and successfully claim "fair use".
Because anyone that is familiar with fair use knows that the purpose prong and the commerciality aspect of it is not one of the more important prongs of the fair use analysis, whereas transformation is. Transformation adjusts what is a purpose that falls under fair use. Did you read Warhol??
Yes. Warhol is an example where the commercial nature of the secondary use was the deciding factor in its failure to pass the purpose prong.
> In sum, if an original work and secondary use share the same or highly similar purposes, and the secondary use is commercial, the first fair use factor is likely to weigh against fair use, absent some other justification for copying.
(P4). It’s very likely that a noncommercial secondary use would have passed under the reasoning in Warhol. I don’t understand the point you’re trying to make.
Read what you quoted - the commerciality of the use comes after whether or not the use was transformational. That's the entire point of Warhol - when the use is not transformational, there is very little space for a commercial fair use.
Always great to see people point out Weird Al, cause he's the shining beacon of an example of what OpenAI et al. should be doing. He explicitly gets permission from the original authors before doing any of his parodies, and he's even been turned down a few times as well, famously Prince rejected him a bunch of times and he subsequently has never made a Prince parody.
Not only does he get permission from the original authors, he also pays royalties to the authors despite legally not having to do so.
I'm pretty sure Weird Al is actually using compulsory licensing in music and just paying the required royalties to the songwriters. Anyone can cover any published song, you just have to pay the royalties when you do.
He doesn't actually make very heavy use of the satire plank of fair use. He credits the original artists. From his own website:
"Does Al get permission to do his parodies?
Al does get permission from the original writers of the songs that he parodies. While the law supports his ability to parody without permission, he feels it’s important to maintain the relationships that he’s built with artists and writers over the years. Plus, Al wants to make sure that he gets his songwriter credit (as writer of new lyrics) as well as his rightful share of the royalties."
The fact that he could rely on fair use is separate from whether he as an artist does rely on fair use.
If that were true, I could take a band that I hate, copy all of their music note-for-note, then release an exact copy on the market and undercut them by selling their entire discography for $0.01
Fair Use requires one of several enumerated activities, including satire, education, journalism. You can’t just copy content and hope that it passes Fair Use.
Hire a lawyer if you are unsure. But at least read the Wikipedia article on the subject if you are going to talk about it.