If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?
Let's say I host these, offer some indexing on it, and rewrite articles. Something like, summarise all articles on US-UK relationships over the past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee. To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, except maybe quite short snippets.
Is that fair use? IANAL, but doesn't sound like it. Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.
A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all. Rather it points me at it, and offers me the opportunity to pay for the article. Whereas either this or an LLM uses NYT content as an alternative to actually paying for an NYT subscription.
What you described is entirely fair use, actually.
Not only that, look at a few news articles from Tier 2 and down publications, and you'll realize that almost all of them are directly sourced from NYT and others. They'll say "so and so happened, according to The Times" (and usually link the article there)
> What you described is entirely fair use, actually
Just like during the pandemic how everyone became an epidemiologist, suddenly everyone's a copyright lawyer. I'll just dispute your assertion by saying:
1. Questions of fair use are famously gray, and anyone who declares something as "entirely fair use", with no caveats, is nearly always wrong except for the most obvious cases, which the given example is most definitely not. A judge has wide latitude in determining fair use.
2. People should familiarize themselves with the four factors of fair use determination. In particular, if a work is purely derivative of a source work and substantially negatively impacts the market for the original work, it's very likely to not be considered fair use.
This is the weakest part of the case(s) against OpenAI. "Derivative work" is a legal term of art meaning a direct adaptation, like writing a screenplay of a book or translating a book into another language.
NYT has a stronger case than Sarah Silverman here because they can show actual 'memorized' text rather than just summarization, but given that those memorizations are a) an unintended failure mode of the training process, and b) from an older version of the model that has been updated to no longer regurgitate memorized text, it's not really clear how in current form GPT could possibly be considered a derivative work.
"Transformative" seems to fit a lot more that "Derivative".
On the other hand, it's understandable why NYT is worried. OpenAI itself says that occupations like: Writers and Authors, Web and Digital Interface Designers, News Analysts, Reporters, and Journalists, Proofreaders and Copy Markers
are "90-100% exposed" to what OpenAI is building.
Yup. News has been rampant with speculation, hearsay, and propaganda all my life. Content mills and astroturfers already bury the truth or relevant stories with noise.
I don't buy into all these "dangers". The advent of cars did not decrease the number of drivers and introduced various new jobs that were previously not available to a lot of people. And the rise of computers did not make the workforce smaller but instead opened many more opportunities for a lot of people.
I think focusing on lost jobs is the wrong angle to take this in. It's about how the content being used is being compensated. OpenAI isn't paying writers to train their engine.
I don't care that the car replaced the horse carriage because it didn't need to compensate horses nor handlers to do so. AI, being the newest iteration of scraping data from artists, writers, etc. to profit millions off of, is directly using the "horse handler's" work. If these LLMs threw NYT a royalty to use their articles as training material, there wouldn't be a lawsuit.
A question is whether the new model still intrinsically embeds the source text, but this is later filtered in the output, or if it no longer embeds the text at all.
I would think an existing model could bootstrap a copyright free training corpus by completely rewriting/paraphrasing copyrighted material with semantic fidelity for training of the next model to completely eliminate memorization of copyrighted works. That might pose an interesting obstacle to copyright challenges, bootstrapping your way into a clean room. Although, tweaking the architecture to either eliminate memorization, or eliminate high fidelity reproduction of verbatim training data seems far more expedient and less costly.
Roll back 20+ years ago on Slashdot and you'll see the exact same thing.
Copyright has been a hot button issue on the internet for decades. People end up thinking (rightly or wrongly) that they understand it without being a lawyer.
Quite literally, not even the lawyers or courts understand it. This is very much a "learn as you go" exercise for humanity in general at this point in time.
It seems like everything in tech is in the learn as you go phase. Everything is changing so rapidly that there can’t be experts. Just people that are able to adapt quickly.
I only see this phenomenon speeding up. Strange times.
One of my biggest gripes is a somewhat adjacent issue where everyone thinks they're an American copyright lawyer and that American copyright law is universal.
It's very possible that the example provided above is an example of fair use in some country, and that the website offering that service could be hosted there.
Completely agree. Copyright should be abolished. All intellectual work is information, information is just bits and bits are just numbers. It's quite simply delusional to believe you can own numbers in the 21st century, the age of information and ubiquitous globally networked pocket supercomputers.
This is just a felony contempt of business model issue. Computers invalidated their business models and they're doing everything they possibly can to hang on for dear life. Society needs to move on already.
This goes too far. Digital media are not only long series of numbers. They are often difficult-to-create expressions in image, video, and even interactive forms; regardless of their serialization format.
Books are just strings of letters, yet copyright has still been useful to increase the volume and utility of books.
All that said, I do find the life+70y an absurdly long time.
The market at large will determine that. If people value cheap AI generated images more than talented human curated art then that's what it will be. If a market exists to buy unique pieces where an artist put brush to canvas and priced their work at $1000 instead of the cheap $10 poster that can be mass produced, then that's what it will be. If no one wants to pay $1000 for your unique piece, then the market has spoken and your art is not worth that much. Like everything else, an equilibrium will be reached. Good artists will be fine. The other 99% of self declared artists will fade away into obscurity.
None of that is what copyright protects. And it lessens the argument when you can argue that LLMs are essentially stealing a human artist's work to be used to generate cheap images. Similar to how if you took commissioned art, printed out 1000 copies, and sold them for $1 apiece.
Copyright means that you need to at least pay that artist you stole from in some way, which the government enforces so artists don't stop creating.
I propose getting paid before doing the work for the actual labor of creation. Crowdfunding, patronage, commissions, sponsorships all seem like ethical ways to get things done sustainably. That way creators get paid before they work, not after.
We must strengthen these business models that don't depend on artificial scarcity because this number selling nonsense was over the second computers were invented. It's as dumb as asserting that you need permission to use memcpy or the mov CPU instruction.
How do you know what the value of the art will be before it's created? Guns N' Roses is a top 40 artist on Spotify nearly 35 years after producing an album. Should they not have been paid after 1991? If you argue that they were a popular band and therefore should have been paid accordingly up front, well what about their debut record, which sold 30 million copies? How would you predict that value before its creation (or even after)? If you're saying that only the labor has value, and all labor is valued equally, that sounds sort of like marxism, which could be fine, but it's hard to say how well artists would be supported in that case.
In the US, the original copyright length was 14 years, and then 28, and eventually the lifetime of the author plus 70 years. I think the intent of the law is economically justified, but the current length is outrageous.
You should be paid the accurate value of the labor. The pay should not scale more when no additional labor takes place.
This is how art worked for millennia; someone commissions a chapel roof painting, someone commissions a concerto, someone commissions a statue, someone buys a chair, etc.
Artists still do this today, and there is no issue determining value beforehand. Artists list their commission prices, or their hourly costs, etc. This is a perfectly normal thing that happens everyday.
>You should be paid the accurate value of the labor.
that's gone out the door in the digital age. Companies at this point have spent centuries trying to enforce this model while withholding stuff like stock and royalties that would let labor take a part of what the company enjoys by profiting for decades off of a single (underpaid) piece of labor.
I don't exactly sympathize with a robot now trying to do the same. Pay your labor.
You do other things to make money and continue to make art for its own sake. If you get to the point that others want your art you get commissions. Just like most artists in the current system.
Alright, so then you're NOT going to pay artists for their labor?
In the current system, artists might work for many years on a single work, or work many years perfecting their craft before anyone wants to pay for their work. Copyright gives them a way to earn money in the future that compensates them for the work they did in the past. It incentivizes creativity. Don't get me wrong, I don't think copyright is perfect, but you really ought to think more about the system you're proposing, because it's not making much sense.
Unfortunately, it’s hard to explain these things to techies who only see the world in their one-sided startupy way. The fact that there’re starving creatives who have already been massively marginalized by the likes of Spotify means nothing to these tech workers, who only see everything as numbers, or a “business model” to “validate”.
(full disclosure, I’m a techie who’s gradually woken up to the idea that the tech might just be the most abused way to exploit people)
I'm in games, where art and tech cross paths. I 1000% empathize with the fact that the art side exploits, abuses, and underpays, even if artists at times may be doing more work than a junior web dev.
It's a bit ironic, because a lot of tech offers partial compensation in stock. Something else that really doesn't happen in games unless you work for like, the 3-4 largest studios. So they should at least understand that your compensation is not all based on labor for time worked.
Exactly. Somehow this idea that you keep getting paid for literally the same thing over and over again for work you did once is the absurdity. And ridiculously greedy.
It seems to have been invented by lawyers, for lawyers. Nobody else really benefits as much as they do. The whole entirety of society vs. a single profession of dubious morality.
>Somehow this idea that you keep getting paid for literally the same thing over and over again for work you did once is the absurdity
meanwhile, most tech is moving towards subscriptions?
Art is getting paid "non-greedily". People buy a song or art piece, and then people 10 years later buy a song or art piece. That's not one person paying twice for the same song, it's two people buying the same thing.
If people still value that art for that price later, I don't see how this is a "greedy" thing. Is art magically supposed to turn open source CC0 after 5 years? Tech sure doesn't work like that.
> How do you know what the value of the art will be before it's created?
I don't know. Anyone funding the work is accepting a risk.
> Should they not have been paid after 1991?
They definitely should get paid for their shows and live performances. The band itself can't be copied. Artists are extremely scarce.
Their art, however, is not. Once created, the scarcity of their recordings is artificial and fundamentally time limited anyway. Even if I were extremely tolerant of copyright, I'd argue for a term of only 5-10 years maximum with absolutely no possibility of extension.
In other words, even if we accept copyright as legitimate, they sure as hell shouldn't still be getting paid for some late 80s album. They've already been adequately compensated for those creations. If they want more, they should have to keep making new stuff so that they can benefit from new copyrights which will also expire after a short time.
Creators are not supposed to be able to strike gold once and then enjoy eternal royalties. Copyright must have short time frames or it's in breach of the social contract. The reality is we're doing creators a favor by pretending that it's hard to copy their stuff so they can make some money. We do this because they assured us that eventually all of it would belong to us: works would enter the public domain.
The copyright industry isn't keeping up their end of the bargain. They continuously pull the rug out from under us by extending copyright to the point we'll be long dead before our culture is returned to us. It's offensive and we should all stop pretending. They need reminding that public domain is the natural and default state of all intellectual work.
> How would you predict that value before its creation (or even after)?
I'd look at the artist's past work. If there is no past work, then I don't know.
> If you're saying that only the labor has value
I'm not saying that at all. Creations are valuable. Creators are valuable. The labor of creation is valuable.
Value is assigned to stuff by humans. Obviously humans value art. The price however is given by supply and demand. The fact is that supply of intellectual works approach infinity after they are created and therefore their prices approach zero. So it makes perfect sense to assign prices to the labor of creation but zero sense to assign a price to the product of creation. Copyright is an exercise in denying reality.
> and all labor is valued equally
I definitely did not say that. All labor is different. I value some creators a lot more than others. Some creators I don't value at all.
> that sounds sort of like marxism
I must apologize if I gave that impression. I hate marxism.
>In other words, even if we accept copyright as legitimate, they sure as hell shouldn't still be getting paid for some late 80s album.
Why not? The fact is that even if the album is free, there will be people paying spotify $10/month to listen to it on demand. How is it fair that Spotify can profit from it for decades to come because they offer convenience, over the artist who made the music 10 years earlier and now relinquishes their art not even a quarter into a typical career?
Copyright is absurd now, but it's not a bad concept. I think the original copyright law of 14 + 14 worked well enough. Life expectancy increased so I'd increase it to 14 + 14 + 14 (or 10 years after the death of the original author, whichever comes first). You fund an artist for their typical career length (if they choose to extend twice) and once they are (near) retired the song is free to work off of. In the meantime you simply negotiate if you want to use their work.
First, you missed the "and". Do CliffNotes, Wikipedia, etc. substantially impact the market for the original work? For example CliffNotes does not - people who buy the CliffNotes version typically already have the original work as well (for example from coursework). And Wikipedia may well do more to interest people in the original work than to replace it.
Second, you ignored the "purely derivative" bit. You have to look at to what extent the use is derivative or transformative. See https://en.wikipedia.org/wiki/Transformative_use for a bit about that. (Note, this is a legal term defined by various precedents. OpenAI can't just argue, "Turning it into an LLM is a transform, so it is transformative!") Since CliffNotes is educational and Wikipedia is nonprofit, it is relatively easy for both to qualify as transformative.
As a result your response underscores the point that was made. There are a lot of shades of grey. You really can't just seize on a couple of phrases and key points, then jump straight to the answer. You have to understand how the courts will decide, and then accept that there is an actual judgment call whose outcome depends on the judge judging.
(I'm not a lawyer, but I have had excessive exposure to them in the past.)
The question was NOT whether it spreads information from the articles to people who wouldn't have paid for it. The question was whether it suppresses sales of the articles to people who otherwise might have paid for it.
That's a more complicated question of fact. Some people now read Wikipedia and won't buy the article. Some people encounter the reference on Wikipedia and decide to buy the article. Which happens more?
Publishers concluded that Wikipedia references are good for sales. And so jumped on the chance to cooperate with https://wikipedialibrary.wmflabs.org/. Which is therefore able to give free access to 90% of subscription only databases to you if you can prove that you're the kind of person who is likely to add citations to Wikipedia.
Legal questions are funny like that. You have to answer the question actually asked. If you merely answer another one that sounds similar to you, your answer is generally wrong.
I thought the question would be how it does this. If it can write NYT articles because it read them it has to arrive at the exact same words in the same sequence. Wikipedia has to copy and paste to achieve the same. So maybe the question actually asked does not apply.
I personally appreciate the semi-truck-sized loophole that is satire. One can include an entire copyrighted work within one's own work as long as the treatment of that other copyrighted work is parody/satire. This is a provision of US copyright law put in place to protect political satire, which can be anything, because politics is everything.
> Questions of fair use are famously gray, and anyone who declares something as "entirely fair use", with no caveats, is nearly always wrong except for the must obvious cases, which the given example is most definitely not. A judge has wide latitude in determining fair use.
You're the one presenting unfounded claims with confidence here. There is well established case law about not being able to copyright facts. If you are actually fully paraphrasing a presentation of facts / ideas and not just altering a couple of words here and there, then there is a very strong case for non-infringement.
> You're the one presenting unfounded claims with confidence here.
No, I'm not. On the contrary, I'm really looking forward to this case because I believe it will be a great test of a bunch of concepts that are totally novel in the world of copyright law as it applies to generative AI. The only things I am presenting with confidence are:
1. That anyone who declares that something is unambiguously fair use (or, contrarily, unambiguously infringing) is likely wrong. There is simply too much latitude by judges, and there have certainly been cases where a ruling went one way, only to be overturned on appeal.
2. While I certainly have an opinion on how I think this case will be decided, I'm not presenting that with unwarranted confidence. Instead, I linked that great article on the 4 factors of fair use determination because it's clear to me lots of people are saying "fair use!" on one side or the other with no understanding of the factors judges must actually consider when making a determination.
You seem to be shifting the topic of this thread. The GP comment is about paraphrasing news articles, while I don't see anything in the NYT lawsuit about paraphrasing. Rather, the NYT is concerned with exact or near-exact reproduction. I too am very curious about the outcome of this case and wouldn't care to bet either way on the outcome. I do have an opinion on what precedent would be better for our society, but that doesn't mean I think that outcome is more likely.
However, none of that matters in this particular thread. There are well established precedents about paraphrasing news articles and they do not support the claim you made.
The "unfounded claims" were backed up by a link to Stanford on fair use and copyright. That's the opposite of being unfounded.
Remember. The NY Times does not have a record of filing frivolous lawsuits. Particularly not against companies with deep pockets. So it is almost certainly true that a lawyer who knows the law better than you thinks that this has a real chance. So you should be looking for flaws in trivial defenses that you can think up, rather than assuming that you know best.
For example take your copyright facts defense. That would be great if the NY Times was a phone book. They aren't, in addition to facts they offer analysis, editorial positions, and so on. For example I just asked ChatGPT, "In 2016, did the New York Times generally support or oppose President Trump?" I got back an answer talking about various kinds of concerns that the New York Times had, including an editorial titled, "Why Donald Trump Should Not Be President". The copy that ChatGPT needed to have to do that has a lot more than just facts in it.
Now if you paraphrased the NY Times like ChatGPT did when it answered me, you'd have a perfect fair use defense. But here's the difference: you aren't doing it for money, you didn't make a copy of all the NY Times, you aren't destroying the market for the NY Times, and you're legally able to own copyright in your transformed work. OpenAI is doing it for money, did copy all of the NY Times, is seriously impacting the market for NY Times articles, and ChatGPT-generated text does not get a copyright.
Fair use is filled with shades of grey. Even if ChatGPT appears to do the same thing that you do, it is far less clear that OpenAI will enjoy the same level of fair use defense.
The Stanford link is just generic information about the fair use tests and does nothing to backup the assertion.
> They aren't, in addition to facts they offer analysis, editorial positions, and so on.
Those opinions and ideas are also not copyrightable. Only expressions of them are copyrightable, which is why paraphrasing facts, ideas and opinions is not a violation of copyright.
> Fair use is filled with shades of grey.
Yes, but not all those shades are equal. There is a long history of litigation showing that paraphrasing news articles is fine.
I would say it is arguable that is fair use, but the whole thing about fair use is that it is a defense, not a type of license or something you can preemptively apply. So whether or not it will be protected under fair use is actually not determined yet. In fact I would say that’s the entire debate here, right?
I have worked on many documentaries and any time we said “fair use” internally what we were implicitly saying is “nobody will come after us because they know that we are probably safe under fair use if this escalated.“ But again, we could never preemptively apply it. We were just anticipating potential conflict and gauging how likely it was to occur.
He's talking about citing and quoting NYTimes articles, not republishing them verbatim. That said, it's very different if you're a publication that sometimes cites reporting from other publications vs. a website exclusively dedicated to indexing and summarizing NYTimes articles.
I couldn't get GPT to quote an actual NYT article no matter how hard I tried... it just hallucinated in the general style of a news article.
Presumably, if it can remember at least a paragraph or two of each article, then surely the same would be true of any text it ingested and the model size would approach the dataset size (probably actually much larger). I don't believe this is the case at all, even searching around, I've not found any good recent examples of it regurgitating copyrighted text verbatim.
It's cool to hate AI stuff if you're a creative atm. But gotta love those generative/algorithm based PS brushes, that's still real art!
"Indeed, the opening paragraph of "A Game of Thrones" by George R.R. Martin, with the chapter titled "Bran," starts as follows:
"The morning had dawned clear and cold, with a crispness that hinted"
And then it cuts off. Whether that's because OpenAI now has an "oh shit" filter, or the model just had access to the first page or to publicly available articles quoting the first line, I'm not sure.
I tried other chapters and random sections and it could get a sentence or two right but then hallucinated; what's more likely, NYT and GRRM? That your works are being reproduced verbatim? Or that Facebook, YouTube descriptions, fan tumblrs and hell, the publicly available and multiple GoT-related wikis that include a variety of passages from the books were used as training data?
I don't think it's necessarily true that model size would need to be larger than dataset size. It's theoretically possible that the model encoding achieves significantly better compression than DEFLATE or GZIP or whatever compression algorithm is used to store the dataset itself.
I think what wouldn't be covered is reproducing substantial portions of an article, especially if it's done without attribution. Tier 2 publications that fully reprint NYT or AP/Reuters articles are usually doing this via a paid News Service or Content License. See: https://nytlicensing.com/content/new-york-times-news-service...
Correct. But those 2nd-tier sources don't reproduce NYT copy verbatim. Do you really think the US NFL, as an example, would let OpenAI use all of its recorded games as a way to train some new GenAI game framework to build better American Football games? No. All that material is copyrighted. Public media is going to move into a very awkward era of ownership and licensing, because all of these large companies looking to make a buck off public data sets are doing very little to make the economic model less one-sided.
I hope the NYT prevails here, personally. Models will (and are) currently tainted by data they should not contain and for longer term privacy concerns this needs to be addressed early and have significant consequences or we're headed towards a world where this type of technology will make our ad-targeted world seem like a much more manageable past.
Why do you say that? Commercial vs noncommercial use is a primary factor in the “purpose” prong of the fair use balancing test and a significant one in the “market effects” prong.
That a use is noncommercial is often a deciding factor in the success of a fair use defense. GP is overstating it though, since it’s still one of many factors.
Whether or not the use is commercial is certainly one of the considerations, but it's not the most significant one generally. There certainly can be specific cases where it's very significant, of course.
But what I was arguing was that a use is not "fair use" merely because it's noncommercial in nature. I cannot make copies of movies and give them away on the street for free and successfully claim "fair use".
Because anyone that is familiar with fair use knows that the purpose prong and the commerciality aspect of it is not one of the more important prongs of the fair use analysis, whereas transformation is. Transformation adjusts what is a purpose that falls under fair use. Did you read Warhol??
Yes. Warhol is an example where the commercial nature of the secondary use was the deciding factor in its failure to pass the purpose prong.
> In sum, if an original work and secondary use share the same or highly similar purposes, and the secondary use is commercial, the first fair use factor is likely to weigh against fair use, absent some other justification for copying.
(P4). It’s very likely that a noncommercial secondary use would have passed under the reasoning in Warhol. I don’t understand the point you’re trying to make.
Read what you quoted - the commerciality of the use comes after whether or not the use was transformational. That's the entire point of Warhol - when the use is not transformational, there is very little space for a commercial fair use.
Always great to see people point out Weird Al, cause he's the shining beacon of an example of what OpenAI et al. should be doing. He explicitly gets permission from the original authors before doing any of his parodies, and he's even been turned down a few times as well, famously Prince rejected him a bunch of times and he subsequently has never made a Prince parody.
Not only does he get permission from the original authors, he also pays royalties to the authors despite legally not having to do so.
I'm pretty sure Weird Al is actually using compulsory licensing in music and just paying the required royalties to the songwriters. Anyone can cover any published song, you just have to pay the royalties when you do.
He doesn't actually make very heavy use of the satire plank of fair use. He credits the original artists. From his own website
"Does Al get permission to do his parodies?
Al does get permission from the original writers of the songs that he parodies. While the law supports his ability to parody without permission, he feels it’s important to maintain the relationships that he’s built with artists and writers over the years. Plus, Al wants to make sure that he gets his songwriter credit (as writer of new lyrics) as well as his rightful share of the royalties."
The fact that he could rely on fair use is separate from whether he as an artist does rely on fair use.
If that were true, I could take a band that I hate, copy all of their music note-for-note, then release an exact copy on the market and undercut them by selling their entire discography for $0.01
Fair Use requires one of several enumerated activities, including satire, education, journalism. You can’t just copy content and hope that it passes Fair Use.
Hire a lawyer if you are unsure. But at least read the Wikipedia article on the subject if you are going to talk about it.
Yes, but this then hits against learning/understanding and compression being fundamentally the same thing. I can't think of a better way to argue in favor of "it's fine if human does it, therefore it's fine if LLM does it", than from the "lossy compression" angle.
It's not okay for a human to pirate, plagiarize, violate IP rights and laws, etc.
But I disagree with the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn".
Humans are living beings with needs and rights. A person being able to legally squat in a home doesn't mean that a drone occupying property for some amount of time also has squatter's rights, even though you could easily and affordably automate and scale the deployment of drones to live and hide away on properties long enough to attain rights regarding properties all over the country.
> But I disagree with the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn".
Backprop doesn't happen in us, but I think our neurones still do gradient descent – synapses that fire together, wire together.
And ultimately, at the deepest level we can analyse, our brains' atoms are doing quantum field diffusion equations, which you can also do in an Excel spreadsheet, so that kind of reductionism doesn't help either.
> Humans are living beings with needs and rights. A person being able to legally squat in a home doesn't mean that a drone occupying property for some amount of time also has squatter's rights, even though you could easily and affordably automate and scale the deployment of drones to live and hide away on properties long enough to attain rights regarding properties all over the country.
Yes, but we can also do tissue cultures and crude bioprinting, so it's a very foreseeable future where exactly the same argument will also be true for living organisms rather than digital minds.
We need to figure out what the deeper rules are that lead to the status quo, not merely mimic the superficial result. The latter is how cargo cults function.
>Backprop doesn't happen in us, but I think our neurones still do gradient descent – synapses that fire together, wire together.
No! Hebbian learning is categorically NOT gradient based learning. Hebbian update rules are local and not the gradient of any function.
Cortical learning is so vastly different from how artificial neural networks “learn” they cannot even begin to be meaningfully compared mathematically. Hebbian learning is not optimization and backprop is not local learning.
Part of the problem of these discussions is a bunch of clueless people talking with authority.
Finally, a good counterargument. I've seen enough terrible arguments to know exactly how you feel — even in specifically just AI.
I have to keep reminding myself that outside of my own speciality, ChatGPT knows more than me despite its weaknesses, so I bet ChatGPT knows more about Hebbian learning than I do.
> We need to figure out what the deeper rules are that lead to the status quo, not merely mimic the superficial result.
Sure, that's an interesting path of inquiry, and one should be free to understand themselves as being no different than a machine if they desire.
But the objective of laws is the benefit of (at least some) humans, not machines covered in lab grown tissue. The process of being human is a big part of what makes us human.
I think you're misapprehending — I mean an entity fully 3D printed out of tissue, no machinery (unless you're counting all biology as machinery, but I think you're not doing that).
I reckon bio-printing is now where home computing was in the Apple 1 era, so this is a way off, but it's foreseeable.
> The process of being human is a big part of what makes us human.
Mmm. How much has that process changed since the ancient world?
> I reckon bio-printing is now where home computing was in the Apple 1 era
How do you reckon that? The Apple 1 was Turing complete. We haven't printed life, that would be a tremendous accomplishment.
I think we're closer to Edison inventing a lightbulb as a step to computers being possible. Printing a conscious thing, at all, would be like the transistor. An Apple 1 analogue wouldn't be likely because of the terrible ethics of a "shitty" printed human.
> We haven't printed life, that would be a tremendous accomplishment.
Sure we have, and in multiple different senses.
The ones which matter here are cell culture, which is nowhere near the fanciest bar that's been surpassed in this field, and tissue culture, which is somewhat harder, but the reason why I reckon it's at the Apple 1 level is that a small number of experimentalists are messing around with it using expensive equipment that you can technically buy at home but you need to be well trained to actually use, for example:
No. That isn't printing life, that is taking already living cells, priming and transforming them into something useful. Regardless, I'd count it if we could make an entire living organism this way, but we cant. Creating a working organ is no doubt amazing, and proof that this technology is worth pursuing, but it isn't "printing life" any more than producing life saving drugs is.
In your example you are talking about being able to bioprint a person (they have to be a person to have that right) to squat a property. Bio printing an organ isn't an example of that, it's not even close. Saying that we are anywhere near being able to print a human to squat a property is pretty ridiculous.
> No. That isn't printing life, that is taking already living cells, priming and transforming them into something useful.
Which is absolutely sufficient for the usage I described upthread. In fact, I'd go so far as to say it's mandatory for the point I was making, as — fun though bio-printed werewolves, dragons, and fae would be — my point only works if you get humans out of the process rather than some other species. A bioprinted horse is probably slightly harder than a bioprinted human, but the latter isn't getting any squatting rights.
I could've linked to work on synthetic genomes and nucleotides to give evidence for lower-level creation of life, but they don't matter for the same reason:
My point is that there's a pathway heading off into the distance, and somewhere in the distance but before the horizon can be found bio-printed humans, with all the same moral issues we're now just beginning to take seriously thanks to AI being conversational. And if we had something completely customised, that's cool and all, but it doesn't make anyone go "oh, they're people" the way a humanoid body with human DNA getting off a table saying "hello, nice to meet you" does.
> In your example you are talking about being able to bioprint a person(they have to be a person to have that right) to squat a property. Bio printing an organ isn't an example of that, it's not even close. Saying that we are anywhere near being able to print a human to squat a property is pretty ridiculous.
I wrote "an entity fully 3D printed out of tissue […] is a way off, but it's foreseeable" and compared bio-printing today to a nearly 50 year old computer, and one of my references was a link to a youtube channel where someone is attempting to do a small-scale prototype thing along these lines with a handful of organs made from mouse cells grown in his own lab (and mouse cells rather than human because of the disease risk not because something magic happens with human cells). You're mixing up what I think is foreseeable with what I say already exists, and using the nonexistence of what I think can be foreseen to argue against what does exist.
Sadly, I have seen one. It was a vba script from the late 90s that used a simple dense multilayer network to do some unsupervised pattern classification. The linear algebra tools in vba/excel along with the solvers are all native dll code and the vba itself is all AOT compiled to native, so it typically runs very fast, and for small matrices it beats out numpy by an order of magnitude due to the ffi overhead. Was it the wrong tool? It depends on your constraints, but probably. It did work though.
Whoever operates the LLM, in this case OpenAI, engaged in copyright infringement through the unauthorized modification, reproduction and distribution of content to you.
> sure, but if I use an LLM to write a novel/article, I can be sued in civil court not the LLM
That's a function of the legal system, not of the technology. If tomorrow someone made a perfect dolphin-Esperanto translator and proved dolphins were as smart as humans, you still can't sue a dolphin until the legal system says so.
> anthropomorphize LLMs (...) gradient descent (...) backpropagation (...) needs and rights
You misunderstood me. I was talking about something more fundamental.
Understanding is data compression. They are the same thing. Learning patterns, building mental models, creating abstractions, generalizing, gaining intuition/a feel for something - all the things humans engage in as part of learning and understanding the world - are all acts of lossy data compression.
also if I write an article and quote some "text like this" [1] then that's not plagiarism, but if my argument is that the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn". Well, that's plagiarism and it's not allowed and people will get peeved and my career might get damaged.
I await the HN ban with fear..
[1] I'm not even doing referencing - so I am surely an LLM.
I can’t think of a better way to argue in favor of “LLMs are copyright laundering machines” than from the humanness angle.
Humans have rights, software tools don’t.
If you grant an LLM the full set of human rights, then it can consume information, regurgitate copyrighted works, and use it to generate money for itself. However, considering blatantly obvious theft as “homage” goes hand in hand with free will, agency, being in control of yourself, not being enslaved and abused, etc. Pondering various scenarios along those lines really gets to the heart of why an LLM is so very much not a human, and how subjecting it to the same treatment as humans is a ridiculous notion.
If you don’t grant LLM human rights, then ClosedAI’s stance is basically that pirating works is OK because they pass them through a black box of if conditions and it leads to results that they can monetize. That’s such a solid argument, it’ll surely play well in the court of law.
Training data is not an “LLM does it”; first because “it” here is not “learning” or understanding in human sense (otherwise you would have to presume that an LLM is a human), and second because a software tool doesn’t have agency and it’s really just Microsoft using a tool based on copyrighted works to generate profit.
Humans don't exactly have the greatest track record of granting other humans rights. I don't presume they'll get it any better with AI.
What I expect to happen is whoever has the most influence and power will get what they want and we'll end up raising a generation with the implicit understanding of "that's just how things are," natural order, truth, reality, and all that jazz.
The only thing that ever changes outcomes is when the contradictions of the status quo become impossible to manage.
I can’t argue for or against whether LLMs should have rights or not… I can only point out the hypocrisy of claiming LLMs are “like human” enough and independent enough that their operators-become-slaveowners cannot be held to account on any copyright matters, but also claiming that LLMs are not like human at all lest someone demand that they have rights and nuke the industry.
Humans are defined not just by their abilities but by their limitations too. We celebrate our achievements because sometimes they surpass the limitations of an average human.
Our collective human limitations(physical, mental and temporal) are sort of invisible implicit rules that we all follow in one way or the other. If an entity is not bound by those rules then I don't see why that entity should be treated the same as a human.
Companies already make this differentiation.
For example take captcha and bot detection. Some of the heuristics are based on inherent human limitations like response time, click time, mouse acceleration etc.
I doubt youtube or any other streaming service will be happy if you want to stream all their videos to train a hypothetical human like AI(which views and prepares notes like a human) at a hugely accelerated speed compared to a regular human. You can guess how quickly they will cite fair usage policies.
What I want to say is there are fundamental differences between a human and an AI. So, we should not be quick to dismiss any concerns just because AI can "mimic" humans in certain areas.
Is there some LLM meta where understanding and compression are argued to be the same thing I’m not aware of?
Anyone got more details on this?
Superficially it sounds like total BS; a highly compressed zip file does not exhibit any characteristics of learning.
Algorithmically derived highly compressed video streams do not exhibit characteristics of learning.
?
I’ve vaguely heard that learning can be considered to exhibit the characteristics of compression, in that understanding of content (e.g. segmentation of video content resulting in more highly compressed videos) can lead to better compression schemes.
…but saying you can “do a with b” and “a and b are fundamentally the same thing” seems like a leap…?
It seems self evident you can have compression without comprehension.
Suppose you wanted to train an LLM to do addition.
An LLM has limited parameters. If an LLM had infinite parameters it could just memorize the results of every single addition question in existence and could not claim to have understood anything. Because it has finite parameters, if an LLM wants to get a lower loss on all addition questions, it needs to come up with a general algorithm to perform addition. Indeed, Neel Nanda trained a transformer to do addition mod 113 on relatively few examples, and it eventually learned some cursed Fourier transform mumbo jumbo to get 0 loss https://twitter.com/robertskmiles/status/1663534255249453056.
And the fact it has developed this "understanding" as an ability to learn a general pattern in the training data enables it to compress. I claim that the number of bits required to encode the general algorithm is fewer than the number of bits required to memorize every single example. If it weren't then the transformer would simply memorize every single example. But if it doesn't have space then it is forced to try to compress by developing a general model.
And the ability to compress enables you to construct a language model. Essentially, the more things compress, the higher the likelihood you assign them. Given a sequence of tokens say "the cat sat on the", we should expect "the cat sat on the mat" to compress into fewer bits than "the cat sat on the door". This is because the former is far more common, and intuitively more common sequences should compress more. You can then look at the number of bits used for every single choice of token following "the cat sat on the" and thus develop a probability distribution for the next token. The exact details of this I'm unclear on. https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... this gives a good summary.
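As a toy sketch of that "more compressible = more likely" intuition (my own illustration, not how a transformer actually assigns probabilities), you can ask a general-purpose compressor how many extra bytes each candidate continuation costs on top of a small corpus:

    import zlib

    # Tiny made-up corpus with repeated structure; the "mat" phrasing is common here.
    corpus = ("the cat sat on the mat. the dog sat on the mat. "
              "the cat sat on the mat again. ") * 20

    def extra_bytes(candidate):
        # How many additional bytes does DEFLATE need once the candidate
        # is appended to the corpus, compared to compressing the corpus alone?
        baseline = len(zlib.compress(corpus.encode()))
        combined = len(zlib.compress((corpus + candidate).encode()))
        return combined - baseline

    for cand in ["the cat sat on the mat", "the cat sat on the door"]:
        print(cand, "->", extra_bytes(cand), "extra bytes")

The "mat" continuation repeats patterns the compressor has already seen, so it costs fewer extra bytes; reading "fewer bits" as "higher likelihood" is the intuition being described above.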
It’s exactly this kind of thinking that underlies lossless text compression (not exactly what a transformer guarantees but often what happens). For that reason, some people thought it would be fun to combine zip and transformers. https://openreview.net/forum?id=hO0c2tG2xL
The idea precedes LLMs by a couple of decades and is thought to apply more broadly within ML/AI than being a specific meta for LLMs. http://prize.hutter1.net/ has been around for a while, there is a link in there to the earlier work (called AIXI?).
Even something as simple as LZW starts developing a dictionary. Not all compression is sufficient for understanding, but the more you compress a stream of data, the more dependent you are on understanding the source, because understanding the source allows you to take more shortcuts and still be able to reconstruct the data.
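As a minimal sketch of that dictionary-building behaviour (a toy example, nothing rigorous), even a few lines of LZW-style encoding start accumulating multi-character "phrases" from repeated structure in the input:

    def lzw_encode(text):
        # Start knowing only the single characters of the alphabet.
        dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
        current, output = "", []
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                current = candidate                      # extend a known phrase
            else:
                output.append(dictionary[current])
                dictionary[candidate] = len(dictionary)  # "learn" a new phrase
                current = ch
        if current:
            output.append(dictionary[current])
        return output, dictionary

    text = "the cat sat on the mat, the cat sat on the mat"
    codes, learned = lzw_encode(text)
    print(len(codes), "codes for", len(text), "characters")
    print([k for k in learned if len(k) > 2][:8])  # some of the learned phrases

The more repetitive (i.e. predictable) the input, the more of the output is carried by those learned phrases rather than by raw characters, which is the sense in which squeezing a stream harder forces the encoder to model its source.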
Correct, just like it’s infringement to reproduce an article from memory using pen and paper intentionally. The person deciding to do that bears responsibility. OpenAI would be liable IFF they were intentionally facilitating that, instead of it being an undesired artifact from overfitting.
I'm pretty sure if you reproduce a work from memory by accident, because you didn't notice your subconscious had just stored the entire article and is now reproducing it word for word, you'd still be guilty of copyright infringement.
It's super obnoxious when people who have no understanding of the law, point to industry patterns or behaviors as examples of what is legal, not knowing the law and not knowing whether or not the thing they are pointing to is legal. The music business is also full of copyright infringement litigation. You also are not taking into account whether what is copied by an artist is covered by copyright when you made your statement. Do you know what's covered in music copyrights, such that your statement ever had any value for anyone else here?
That's not true at all. Copyright infringement is a strict liability offense with no inquiry in to the state of the mind of the infringer from a liability perspective. The state of mind of the infringer is only relevant to the issue of willful infringement.
It's just "infringement" and "willful infringement" there is no "less-than-willful infringement". Willful infringement is punitive with increased damages and increased burden to show - it's in the freakin' statute.
> To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe quite short snippets.
You just described Google. When you think about it, it's surprising that Google is legal. However, it is well established that what Google does is perfectly legal. Remember that internally Google keeps and uses complete verbatim copies of every web page they index.
Yes, Google offers a link to the source. If OpenAI did the same, even if only 0.1% of people clicked on the links and NYTimes hardly got any revenue from it, would that make it legal in your eyes? What if they implemented a system that detected when it was outputting a verbatim copy of something and simply paraphrased it? NYTimes clearly doesn't have copyright on paraphrased versions of their articles. I think it would be pretty silly if the government forced them to do that as it wouldn't make any practical difference to anyone.
Any publisher can opt out of google. Publishers also have substantial control over titles and snippets shown in google, whether an article appears in google news, etc
Paraphrasing is also known as cloning and is often a copyright violation
Copyright law doesn't mention opt outs or search engine snippet controls. It's not clear to me that robots.txt is the singular thing that makes Google legal.
In US copyright law facts cannot be copyrighted, so copyright on factual content like newspaper articles is limited. Simply replacing a few words wouldn't work, but I am certain that GPT-4 is capable of paraphrasing factual content at a level that would not be considered infringement if a human did it.
If I make a website that scrapes NYT and passes it back and forth through a machine translator, say, English -> Spanish -> English, then the content will be slightly modified. Is this legal to make money off of?
Seems like the legal answer is unclear but, like Napster, such a system seems like it would lose in court.
It would be unlikely to be something you'd find paying customers for, though? I suppose if you charged a small percentage of what NYT charges people might be willing to consider it, but you'd have some costs for hosting etc., so I am skeptical about its viability as a business model...
I'd serve fake news en-masse to low IQ people who click things to feel good about their own views. I'd also build a handful of websites (ideally as many as I can personally manage) to flood the Internet with fake news clickbait.
One site clones fox news. One clones news max. And so on, cloning many news sites, sports sites, any news site. Automated, massive scale content farming. Think of the websites recommended by Taboola but, realistically, a whole lot worse.
That’s not the only reason. Google search is also transformative and non competitive with the underlying publications. And that is why the opt out is important. If you feel google competes with your site you don’t have to sue Google: just tell them to go away.
Transformative yes, so is ChatGPT. Much more so actually. Non-competitive is debatable. Especially with the instant answers Google has in addition to regular snippets which can also obviate the need to visit a site. I have a hard time seeing ChatGPT as competing with newspapers more than Google Search does.
Nobody is seriously going to ChatGPT and trying to trick it into regurgitating old NYT articles as an alternative to paying for access to NYT's archives. Meanwhile, newspapers went as far as getting the laws changed in several countries because they felt Google was competing with them too much and didn't like the fact that it was legal.
>Copyright law doesn't mention opt outs or search engine snippet controls. It's not clear to me that robots.txt is the singular thing that makes Google legal.
Genuinely - what are you talking about besides your own assumptions? You just assume everything google does is legal and therefore anyone else doing anything arguably similar must also be legal? Without regard for factual details that do matter to copyright law? Such as license?? Your own description of copyright law here is very stunted - you can't paraphrase articles of the NYTimes and call it fair use. You can report on what the NYtimes reports on... because that's what news is.
Can they? Here's a reference to a legal fight where Google scraped song lyrics from a lyrics website, and presented the lyrics verbatim directly to users (bypassing the original site and the ads that allowed that site to operate)
> However, it is well established that what Google does is perfectly legal.
Google has a wide range of products and shakedowns. Not all of them are "perfectly" legal: Google is being challenged in court over some of their shakedowns and product practices.
I am clearly talking about the web search engine in the context of copyright. Other products or legal concerns like antitrust are completely irrelevant here.
You took that quote out of context and missed the broader point in the process. The snippets provided in regular search results cannot generally replace the substance of the full articles they link to, whereas replacing the source is the whole point of GP's hypothetical website. It avoids reproducing large chunks of text verbatim, presumably to dodge copyright infringement claims within the hypothetical, and rhetorically it serves as an analogy for the information-laundering powers of LLMs, which their creators claim make the exploitation of unlicensed training data fair use.
The whole point of a search engine (as we've classically known them) is to index the web and respond to queries with a list of links that you will inspect and click through on. The whole point of an LLM chatbot tool is to eliminate those inspecting and clicking-through steps, becoming a one-stop shop for content whose substance was created by someone else. That's also the whole point of GP's hypothetical, which is why it works as an analogy.
---
There are substantially better arguments for search engines being legitimate fair use. Consider, for example, transformation. AI defenders will argue that these systems are transformative because they reshuffle elements of their input in their output, but that's clearly a much weaker form of transformation than one in which the transformed work has an entirely different nature and purpose, i.e. search engines vs. the results they return. Ultimately these technicality-based "nuh uh" arguments aren't going to save the practice of training AI on unlicensed data, because they are incompatible with the spirit of copyright law even if the novel nature of these technologies means the letter of said law can't quite nail them down yet.
If these arguments do succeed, it will be because the judicial/regulatory environment in which they were applied has been corrupted by capital.
A search engine takes an input string, a corpus of text, and returns a series of text that best comes next after the input string.
An LLM takes an input string, a corpus of text, and returns a series of text that best comes next after the input string.
To get a paragraph of output, you run the search over and over again
Both the search and LLM reshuffle the inputs to the outputs.
If I'm describing the purpose of the LLM, it's got a wide number of usages. "Making my resume look more professional" or "be a crud api" or "reformat my ask into a api call to X service" or "give me a timeline of events surrounding Y with source links"
An 18 wheeler travels on wheels. A shopping cart travels on wheels. A shopping cart does not require a license to operate, therefore an 18 wheeler does not require a license to operate. A shopping cart can be operated inside a grocery store, therefore an 18 wheeler can be operated inside a grocery store. A child can operate a shopping cart, therefore a child can operate an 18 wheeler.
If I'm describing the purpose of an 18 wheeler, it's got a wide number of usages. "Carry my chicken" or "carry my lettuce" or "carry my Cheetos". Or, simply, "carry my groceries".
> And how did the training data contribute to the content in any meaningful way? Inspiration isn't substance.
If training data is so unimportant, why not simply not use it and avoid the controversy? At the very least that would certainly fix the issue where the model demonstrates how "inspired" it is by NYT articles by reproducing them verbatim.
The reason why Google keeping entire digital copies of other people's copyrighted works is legal is because copyright is all about distribution rights. Any person can possess the entire works of Disney (without paying for them), for example, and as long as they do not distribute those works they're 100% in the clear.
Possession is not a crime when it comes to copyright. It's not like physical things (e.g. drugs or guns) at all. This is why comparing copyright violations to theft is silly.
ChatGPT can absolutely keep verbatim copies of the entire works of basically anything and not run afoul of the law. When it regurgitates a small part of an article that's covered by fair use in theory but the truth is that fair use can only be determined by a judge in a court of law when someone is sued. It cannot be determined with any sort of certainty ahead of time. It's a legal defense, nothing more.
Summarizing content has been legal forever as well (see the other posts here talking about Cliff Notes and some similar products). That's not even fair use that's just like, people's opinions, man (legally speaking).
I don't think the NYT will get what they want out of this at all.
My impression is that it’s not necessarily legal, but going after bloggers and proving damages is just a huge waste of their time. OpenAI came by with their fat stack of funding and changed that.
No, in US law at least there can be no copyright of facts, only presentation. If you convey the same facts in different words that isn't a matter of fair use, it's never even a matter of copyright in the first place.
I was inarticulate. Imagine a business that goes to some trouble to review businesses or products. Can we lift those and serve them ourselves? Non facts…
What the parent poster meant is that it is normal for news organisations to reference each other and report/cite/rephrase each other's reports. For example, all other newspapers reported on the Watergate scandal that was reported by Bernstein & Woodward in the Washington Post.
No, it is very specifically and deliberately fair use. That is the primary intended purpose of fair use. The New York Times doesn't own the news; they just own their articles.
It is legal. Fair use. People have been doing it for ages. Almost every article you've ever read has some fair use of another article, book or news item, etc.
The Tolkien estate should get busy suing all the fantasy writers, comic artists, game developers and board and card game companies. Lots of cash there.
They have done some of that actually. Tolkien will be public domain in a few days in the nations that use author's death + 50 years. Sadly, it will be a much longer wait in mine and many others.
I think the issue is that they trained ChatGPT on the New York Times' proprietary IP without paying licensing fees and, the Times argues, that is illegal. By way of proof the Times has examples of ChatGPT dumping out articles verbatim.
This is exactly how I understand it. There’s a lot of ink getting spilled about “summarizing isn’t illegal” and “what about Cliffs Notes” but that isn’t what this is about.
If the verbatim examples that have been going around are true, that’s bad. I’d love to know more details around it — prompts used, whether that’s an old model, etc. This seems like plagiarism more than anything.
IMO it's pretty hard to argue that training an LLM isn't a transformative use. It's clearly not just copying, or even excerpting. Even if it was just compression (and it's not), they're only providing model output, not distributing the "compressed" NYT articles.
Yielding verbatim snippets of copyrighted content is a problem for OpenAI though.
Perhaps we will see the courts revisit this idea of "transformative" works and formulate something more useful. In my opinion, you can't build an LLM unless you have a large amount of data with which to train it. Given the huge amount of money companies like OpenAI hope to generate, it seems unreasonable that content creators would not be rewarded.
> If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?
That's not a good question.
If I look out of my window and see my neighbor go to the shop, that's fine. If I use cameras and track everybody I see on the street and put them in a database, then that's problematic and illegal in many places.
Logic does not necessarily apply when scaling is involved.
> If I use cameras and track everybody I see on the street and put them in a database, then that's problematic and illegal in many places.
Afaik not illegal in the US. You put a camera on your own private property (window), use it to record what’s happening in a public space (the street outside), and then store that data in a database (that other people can presumably access). Unless I am missing something, this scenario is perfectly legal in the US.
Before anyone hits me with “not every country is like that at all”: NYT is based in the US and the lawsuit is filed in the US. So how a bunch of other countries deal with similar issues shouldn’t really have as much bearing on this specific case.
It depends. Google built a product out of scraping content (Google Search).
But what I'm saying is that answering the question does not allow you to deduce anything about your rights; that's what I mean by "not a good question".
The general answer is no. Fair use is a special carve out legally that has to be determined individually. If your product is something that regurgitates NYT articles while stripping NYT of their source of revenue, that’s got fair odds to not qualify as fair use.
It can allow you to deduce something, depending on the answer.
If we want to establish whether scenario A is fair use or not, and we all agree that A is "worse" (regarding fair use status) than some other scenario B, then if we also agree that B is not fair use, A by definition isn't either. The opposite is not true, of course: B being fair use does not imply that A has to be as well.
I find that kind of upper/lower bound logic can be pretty useful and I think it's what the parent comment was trying to do.
On a related note, that same logic is why I think Godwin's law can be a bit misapplied now and then. Sometimes bringing up nazis/Hitler can be useful to establish some ground truth in a debate (instead of just a way to imply your opponent is actually a bad person, or, possibly, an actual nazi themselves). E.g. a conversation on the morality of violence is vastly different depending on whether you agree that violence against nazis is ok or not.
I think it can still provide value if the actual scenario at hand is so complex and fraught that conversations about it end up mostly fruitless (as I think is the case here). At least it can provide you with some mental handholds and supports for where to start reasoning about the problem, which hopefully helps in finding some small agreements, or at the very least, mutual understanding of each other's positions.
The real answer is it totally depends on whether your product grows to $10,000,000,000, and whether you pay part of it back. Search engines pay with referral traffic.
If there is payment then usually there is an agreement. An agreement can limit fair use. Can the NYT, via an agreement, e.g., "Terms of Use", limit what the subscriber does with the articles? There is not much precedent that suggests otherwise.
Consider the analogy from libraries that want to do data mining.
"Unfortunately, in licenses for digital scholarly content the majority of content acquired by research libraries publishers often include terms that prohibit certain uses that would otherwise be allowable under the Copyright Act. For instance, licenses may require libraries or individual researchers to negotiate for otherwise lawful activities, such as text and data mining, and to pay exorbitant fees on top of the cost of the content itself. While new regulations allow researchers to circumvent technological protection measures to access copyrighted materials, licenses for that content may include terms that explicitly prohibit this circumvention. In many cases, these activities might actually increase the value of published material; for instance, if a data-mining project yields new knowledge about a topic covered in a journal, it may very well spark new interest in that journals content. Libraries and publishers have often assumed that license terms that restrict copyright exceptions are enforceable under state contract law. There is, however, surprisingly little case law on this point."
Putting some string in a robots.txt to try to stop data collection is an amusing "solution". Should copyright owners have "Terms of Use" that limit usage for commercial "AI" purposes?
As always, the answer is.. "it depends". I guess it depends mostly on the jurisdiction that applies to you. "Fair use" can have rather different legal meaning (or not exist at all) in different countries.
Fair use is specific to the US, as far as I'm aware. Moreover, Congress had to codify fair use (turn fair use common law into statutory law in the form of 17 U.S. Code § 107) in order to make copyright statutes compatible with the First Amendment. Most other countries don't have freedom of expression and freedom of the press, so copyright law in a different country usually lacks a unifying exception test like fair use to supplement the specific enumerated exceptions.
Good point. I failed to qualify what I meant by freedom of expression, and made a meaningless claim regardless. Despite the US Constitution's relatively broad speech protections (e.g. don't criminalize hate speech, and allow truth as a defense to defamation claims), US governments don't always respect freedom of expression (e.g. KOSA would force social media companies to moderate more aggressively to "protect kids") or respect press freedom (e.g. police pepper spray journalists at protests). Even so, I think Congress wouldn't have bothered to codify fair use if the First Amendment weren't as broad as it is.
I replace the following sentence from my previous comment:
> Most other countries don't have freedom of expression and freedom of the press, so copyright law in a different country usually lacks a unifying exception test like fair use to supplement the specific enumerated exceptions.
with the following:
Copyright law in most countries usually lacks a unifying exception test like fair use to supplement the specific enumerated exceptions in each respective country.
> "Many" does not necessary include "most" but "most" does include "many".
No, it doesn't. If a set is of sufficiently low cardinality, “most” (in extreme cases, even “all”) of the set may not be “many”.
Most (all, in fact) Catholic Presidents of the United States have been Democrats. But it is not the case that many Catholic Presidents have been Democrats.
Most women to have served on the US Supreme Court did so only after its first 200 years. But, again, there were not many women who served on the Supreme Court only after its first 200 years.
From what I can tell, this has nothing to do with LLMs at all. In the example in the article, the user is asking Bing to go fetch the contents of an article directly from the website, and print it out, which it dutifully does.
Seems like the "problem" is that NYT etc gives privileged access to search engines for indexing their content, but then get upset when snippets of the indexed content is being shown to users without the users having to fight the paywall or whatever.
This article also claims that the screenshot is coming from ChatGPT when it clearly is not.
The example from the article doesn't show that LLM is trained on copyrighted data - it's just Bing fetching the source article, providing it to GPT, and GPT rephrasing the article. An agent trained on entirely copyright-free data would provide exactly the same output.
In that case, the language model calls a search function and just repeats the result from its conversation context, not from its training data. With that in mind it's not clear why it's ok for Bing itself to quote the source, but it stops being ok when a chatbot does it.
> Implications: The Ninth Circuit's declaration that selectively banning potential competitors from accessing and using data that is publicly available can be considered unfair competition under California law may have large implication for antitrust law. [citation needed]
> Other countries with laws to prevent monopolistic practices or anti-trust laws may also see similar disputes and prospectively judgements hailing commercial use of publicly accessible information. While there is global precedence by virtue of large companies such as Thomson Reuters, Bloomberg or Google [or LexisNexis or Westlaw] effectively using web-scraping or crawling to aggregate information from disparate sources across the web, fundamentally the judgement by Ninth Circuit fortifies the lack of enforceability of browse-wrap agreements over conduct of trade using publicly available information.
IANAL but aren't the key terms there "selectively banning" and "publicly available"?
NYT articles are largely behind a paywall for everyone. That means they are not publicly available, and a competitor who was blocked from accessing or reproducing that content without a license would not be "selectively banned"
> Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.
Can't you, though? I'd thought that, in general, it's very important for the market to be able to do just that, otherwise everything gets gummed up in webs of exclusive contractual dependencies between established companies.
As I say, I don't really know. But then, this is exactly how SaaS licensing works. There may even be a free personal tier, where you can't sell products based on it, and a professional tier which may be very expensive indeed.
Typically providers of online databases go to some effort to stop people from sharing logins. Even from that point of view, I can imagine scraping articles and providing paraphrases of them for a fee is fishy.
All I'm saying, to some people it's obvious that the whole LLM on scraped Internet is fair use, to me it is not obvious.
> Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it.
I think you’re confusing terms of service and copyright. IANAL but what you describe sounds exactly like fair use to me, irrespective of how much you are paying NYT.
There's nothing wrong with scraping openly available data (including data openly available by mistake, as long as you are not aware of it, see the Bluetouff affair).
So the demand to destroy those databases seems very dubious to me.
Of course later violating fair use is another issue.
It would be nice to have a nice principled answer to this, but unfortunately, in our world, the answer is probably: if you start making LOTS of money doing this, they will come after you.
The best example is that sport scores, names and stats are not copyrightable by settled case law; however, you still have to go to the NBA and players union if you want to make a fantasy basketball game that has stats or names.
As someone pointed out, plenty of blogs made money off of doing just that. Many people go to Reddit to read news article summaries (and often a comment just pastes the whole article verbatim), instead of paying a site like the New York Times. Twitter and other social media sites are full of people summarizing articles from the New York Times. Any late breaking news article from Wikipedia is going to be mostly summarizing information from reporters.
I think people severely underestimate how much they've grown accustomed to this information being freely available. It's easy to say "Well it shouldn't be available with ChatGPT," but if we actually put everything back behind a paywall and stopped people from doing things like writing blogs or newsletters that summarize the news, people here would get angry very fast.
I agree with your IANAL take, but what about a situation with an extra level of indirection? So the service never reads actual NYT articles, but only reads blog/forum posts about NYT articles, and derives what is in the article from conversations about the article by people who have read it. Is that legal now?
This analogy fails to capture the transformative nature of these models. Hosting a derivative work that is also a news article is not transformative. Hosting a next word completer is very different than a news article and can't be used as a substitute.
Can you read all of NYT and other things, and answer others' questions based on your knowledge? I'd imagine you can. I'm afraid you can't sidestep the question whether an LLM is more like a person who's read a lot or an archive/index.
> A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all.
Google has been accused for years of replacing sources with their "One Box"--the big answers at the top of the page, which are usually pulled from or corroborated by search results. They don't want you to leave the search results page (where the ads are).
I believe that part of the law suit contends that the content wasn’t able to be scraped “legally” as you put it. Instead they show that ChatGPT will regurgitate verbatim excerpts from articles that are behind the paywall.
Can you please make your substantive points thoughtfully and without snark or putdowns?
Edit: it looks like you've unfortunately been breaking the site guidelines quite a bit lately. Can you please review them and stick to the intended use of the site? We'd appreciate it.
> Something like, summarise all articles on US-UK relationships over past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee.
>Is that fair use? IANAL, but doesn't sound like it.
If you pay someone to do the summarisation for you, then you publish the content and charge a fee for it, you're the one liable, not the person you paid to summarise it for you. Similarly if you ask GPT to do it for you, then publish it, you're liable for what you publish; GPT is just a summarisation tool.
That's not true at all. If you pay someone to copy NYT articles for you verbatim, and then they give the copies to you, and then you publish them online, then you've both violated the copyright. You are never allowed to make copies of copyrighted works, even for private deals (making such copies for purely personal use, such as archival, falls under fair use - but you can't build a service out of that).
So, if the summaries are derived works and not covered by fair use, then both you and the summarizer are separately breaking the NYT's copyrights. Otherwise, if this is covered by fair use, then you are both in the clear.
Finally, GPT is not "a summarization tool" in this case. If you provide a copy of a NYT article as a prompt and then ask for summarization, then yes, it is clear that GPT is not doing anything wrong, even if it spits out the exact same text. But if you simply ask for a summary of a specific article by, say, just name and date, and you get a copy of it, it's clear that GPT is storing the original data in some way, and thus it has copied the NYT's protected works without permission.
>But if you simply ask for a summary of a specific article by, say, just name and date, and you get a copy of it, it's clear that GPT is storing the original data in some way, and thus it has copied the NYT's protected works without permission.
In this particular case they were using it via Bing, which actively did a HTTP request to the particular article to extract the content. So GPT hadn't memorised it verbatim, instead it fetched it, much like a human using a search engine would.
The article states that they used it initially through ChatGPT, but that seems to have been fixed in the meantime, at least for the very simplistic queries that used to work ("the first paragraph of the Carl Zimmer article on old DNA" in ChatGPT used to return the exact data from NYT, and "next paragraph" could then be used to get the following ones). Even if this has been fixed, it still proves that ChatGPT encodes exact copies of NYT articles in its weights, which may be a violation in itself, even if it is prevented from returning them directly. Especially if they ever started distributing the trained model.
Additionally, even the use through Copilot is very debatable. They are not returning the NYT link, which requires a subscription, they are returning the contents of it even to non-subscribers. And they are doing this in a commercial product, not a non profit like the Internet Archive, which has some arguments for fair use.
If it had exact copies they would have shown it could recall the 8th paragraph or something. Even Google and the NYT release the first paragraph for free.
That's not the example. Here I proactively scrape NYT, summarise articles for a fee and sell that as a service. It's not people coming to me with some articles to summarise, and maybe then publishing it online.
At some level it becomes a subversion of NYTs fees. First, say I subscribe and simply host the articles verbatim, for a fee. Clearly, that's not right.
Suppose I change some spelling or word order, or use a synonym or two. That's still not ok.
And if I substantially paraphrase the articles? I guess this is the relevant case. This is kind of what LLMs do. And also feels like not fair use.
>That's not the example. Here I proactively scrape NYT, summarise articles for a fee and sell that as a service. It's not people coming to me with some articles to summarise, and maybe then publishing it online.
That's not what OpenAI is doing; it's not selling summarised articles as a service. Your example is a false equivalence.
>This is kind of what LLMs do. And also feels like not fair use
An LLM doesn't do this unless you ask it to. And if you then take that output and publish it as your own, you're breaching the copyright, not OpenAI.
> An LLM doesn't do this unless you ask it to. And if you then take that output and publish it as your own, you're breaching the copyright, not OpenAI.
In this case, OpenAI is violating copyright by modifying, reproducing and distributing copyrighted content to its customer.
The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use".
However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.
However, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."
This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: what about a model can (or can't) be "transformative"?
> Just learn to recognize and punish plagiarism via RLHF.
This is not a RLHF problem. What I was expecting them to do is to keep a bloom filter of ngrams for known copyrighted content, such as enumerating all sets of n=7 consecutive words in an article, and validate against it. The model would only output at maximum n-1 words that look verbatim from the source.
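A minimal sketch of that idea, with a plain Python set standing in for the Bloom filter (a real Bloom filter would just make the membership test fixed-size and probabilistic); the article text, n=7, and names here are illustrative assumptions, not anything OpenAI is known to run:

```python
N = 7  # block any verbatim run of N or more consecutive words

def ngrams(text, n=N):
    words = text.lower().split()
    # A text of L words yields L - n + 1 n-grams.
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Build the filter from protected articles (a production system would hash
# these into a Bloom filter rather than keep the raw strings).
protected_articles = ["full text of a copyrighted article goes here ..."]
protected = set()
for article in protected_articles:
    protected |= ngrams(article)

def has_verbatim_run(model_output):
    """True if the output shares N consecutive words with any protected text."""
    return any(g in protected for g in ngrams(model_output))

# Decoding could then be constrained so that has_verbatim_run() never becomes
# True, capping verbatim overlap at N - 1 words.
```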
But this will blow up in their face. Let's see:
- AI companies will start investing much more in content attribution
- The new content attribution tools will be applied on all human written articles as well, because anyone could be using GPT in secret
- Then people will start seeing a chilling effect on creativity
- We must also check NYT against all the other sources, not everything they write is original
Maybe the bloom filter solution is enough, but I wonder.
- Paraphrasing n=7 words (and quite a few more) within a sentence can easily be fair use.
- As n gets big, the bloom filter has to also.
If/when attribution is solved for LLMs (and not fake attribution like from Bing or Perplexity) then creators can be compensated when their works are used in AI outputs. If compensation is high enough this can greatly incentivize creativity, perhaps to the point of realizing "free culture" visions from the late 90s.
As n-gram length grows, we are still going to have roughly the same number of n-grams; they go through a hashing function and are indexed in the bloom filter as usual. The number of n-grams of size n in a text is text_length - ngram_length + 1.
At large enough ngram size there would be very few collisions. You can take for example this text and try in Google with quotes, it won't find anything matching exactly.
I tested this 6-gram "it won't find anything matching exactly", no match. Almost anything we write has never been said exactly like that before.
This approach is probably inadequate. In my line of (NLP) research I find many things have been said in exactly the same words many, many times over.
You can try this out yourself by grouping and counting strings using the many publicly available BigQuery corpora for various substring lengths and offsets, e.g. [0-16]; [0-32]; [0-64] substring lengths at different offsets.
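For anyone without BigQuery handy, a rough local analogue of that experiment might look like the sketch below; the corpus list is a placeholder, and the 16/32/64-character windows mirror the offsets and lengths mentioned above.

```python
from collections import Counter

# Placeholder corpus; the real experiment would use one of the public
# BigQuery text corpora instead.
corpus = ["some large collection of documents ...", "more documents ..."]

def duplicate_rate(texts, offset, length):
    """Fraction of fixed-offset, fixed-length substrings that occur more than once."""
    counts = Counter(t[offset:offset + length]
                     for t in texts if len(t) >= offset + length)
    total = sum(counts.values())
    dupes = sum(c for c in counts.values() if c > 1)
    return dupes / total if total else 0.0

for length in (16, 32, 64):
    print(length, duplicate_rate(corpus, offset=0, length=length))
```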
> If it's the user, why wouldn't they just buy the authors work directly? Why go through the LLM middleman?
If it's the user, why wouldn't they just buy the DVDs directly? Why go through the Netflix middleman?
A retort to this would be that both NYT and ChatGPT are on the internet, so it's no added fuss of hopping in my car, driving to Walmart, and picking up a DVD case. My response to it would be that both the LLM and Netflix are content aggregators to the user. I can read the NYT, or I can read the NYT summary on ChatGPT and ask it for life advice with my pet hamster, or ask it how to reverse a linked list in bash.
I like the idea but seems like there would be big problems. Like detecting if a work is reworded. Or a large number of sources have all slightly influenced a small response - isn't that pretty much considered new knowledge?
Then there's the issue that however you credit attribution, it creates a game of enshittified content creation with the aim of being attributed as often as possible, regardless of whether the content really offered anything that wasn't out there already.
I think it is an RLHF problem and that you are right - this will blow up in the faces of the NYT.
Specifically, the NYT examples all seem to be cases where they asked the AI to repeat their articles verbatim? So they ask it to violate copyright and because it's a helpful bot with a good memory, it does so.
Solution: teach the model to refuse requests to repeat articles verbatim. It's easily capable of recognizing when it's being asked to do that. And that's exactly what OpenAI have now done.
So the direct problem the NYT is complaining about - a paywall bypass - is already rectified. Now it would seem to me like the case is quite weak. They could demand OpenAI pay them damages for the time ChatGPT wasn't refusing, but wouldn't they have to prove damages actually happened? It seems unlikely many people used ChatGPT as a paywall bypass for the NYT specifically in the past year. It only knows old articles. OpenAI could be ordered to search their logs for cases where this happened, for example, and then the NYT could be ordered to show their working for the value of displaying a single old article to a non-subscriber, and from that damages could be computed. But it wouldn't be a lot.
That's presumably why the case goes further and argues that OpenAI is in violation even when it isn't repeating text verbatim. That's the only way the NYT can get any significant money out of this situation.
But this case seems much weaker to me. Beyond all the obvious human analogies, there is precedent in the case of search engines where they crawl - and the NYT let them crawl - specifically to enable the creation of a derived data structure. Search engine indexes are understood to be fair use, and they actually do repeat parts of the page verbatim in their snippets. Google once even showed cached versions of whole pages. And browser makers all allow extensions in their stores that strip ads and bypass paywalls, and the NYT hasn't sued them over that either.
This is not how copyright works though. The verbatim quoting of articles matters because, when people brought up these questions initially, the argument was that the NN doesn't really contain the training data, or contains it only in an abstract, condensed way that does not constitute copying of the content.
This demonstrates that no, the NN actually does contain the full articles, copied into the NN. Do you think any normal person would get away with copying MS Windows by, e.g., zipping it together with some other OS on the same medium? Why should we let OpenAI get away with this?
Search indexes contain exact copies of the pages they index, and that isn't a copyright violation.
> Why should we let OpenAI get away with this?
IP rights, like other private property rights, are a compromise between creators and consumers. What "should" be the case is essentially an argument about what balance creates the best overall outcomes. LLMs, for now, require large amounts of text to train, so the question is one of whether we want LLMs to exist or not. That's really a question for Congress and not the courts, but it'll be decided in the courts first.
LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.
AI is a bit of a black box, but that doesn’t protect the operators of black boxes from rights violation suits. You can’t make a database of scraped copyrighted data and pretend that querying that data is fair use.
There needs to be law made here and the law just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.
Licensing will have to be worked out, actual laws and not just case law needs to be written. I have a lot of sympathy for lots of leeway for the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft sponsored openai.
Unfortunately GZIP won't beat LLMs for text classification. The research you cited is just poorly done science that has been widely debunked. The original paper compared top-2 accuracy of GZIP with top-1 accuracy of BERT. The dataset also contains a lot of train/test data leakage. See this article for the rebuttal: https://kenschutte.com/gzip-knn-paper/ and this thread for a previous discussion on hackernews: https://news.ycombinator.com/item?id=36758433.
Further, the evidence presented by NYT in the lawsuit could be hard to reproduce. I tried multiple prompts on multiple versions of GPT-4 APIs but still could not get GPT-4 to reproduce NYT articles exactly. NYT may well have tried to get GPT-4 to reproduce 100,000 articles and only found a few cases where GPT-4 actually recited the whole article. In that case OpenAI might well argue that this is only a rare bug and avoid losing the lawsuit in a massive way.
Many instances of fair use involve verbatim copying. The important questions surround the situation in which that happens - not so much the copying. NYT is in uncharted territory here.
In the same way that machines are not able to claim copyright, they aren't allowed to claim other legal rights either, like "fair use".
The entity which owns ChatGPT is apparently maintaining a copy of the entirety of the New York Times archive within the ChatGPT knowledge base. That they extract some fair use snippets (they would claim) from it would still be fruit of a poisoned tree, no?
(disclaimer: I'm pro AI, anti copyright, especially anti elitist NY Times; but pro rule of law)
There is another fix, but it will have to wait for GPT-5. They could reword articles, summarize in different words and analyze their contents, creating sufficiently different variants. The ideas would be kept, but original expression stripped. Then train GPT5 on this data. The model can't possibly regurgitate copyrighted content if they never saw it during training.
This can be further coupled with search - use GPT to look at multiple sources at once, and report. It's what humans do as well, we read the same news in different sources to get a more balanced take. Maybe they have contradictions, maybe they have inaccuracies, biases. We could keep that analysis for training models. This would also improve the training set.
I think there is some point between fifty years ago and last week in which the copyright for the content of newspapers should be public domain. That part of copyright needs to be fixed.
Your creative work does deserve at least some period of exclusive rights for you. Definitely not so much that your grandchildren get to quibble about it well into retirement. But also, the number 3 or 4 most valuable company in the world doesn’t get to scrape your content daily to repackage and sell as intelligent systems.
> But also, the number 3 or 4 most valuable company in the world doesn’t get to scrape your content daily to repackage and sell as intelligent systems.
Here's a thing though: for 99%+ of that content, being turned into feedstock for ML model training is about the only valuable thing that came of its existence.
If it were not for world-ending danger of too smart an AI being developed too quickly, I'd vote for exempting ML training from copyright altogether, today - it's hard to overstate just how much more useful any copyrighted content is for society as LLM training data, than as whatever it was created for originally.
Except if you do that, you will see the number of content producers plummet quite quickly, and then you won't have any new training data to train new LLMs on.
Would it not logically follow that nothing of value would be lost, even if that were the case? From the point of view of LLMs and content creators, I would treat potential loss of future content being created like I would treat a lost sale. LLMs have value now because of training performed on content that already exists. There must be diminishing returns for certain types of content relative to others. Certain content is only of value if it is timely, and going forward, content that derives its worth from timeliness would find its creation and associated costs of production and acquisition self-justifying. If content isn’t of value to humans now or in the future, nor even of value to LLMs now or in the foreseeable future, not even hypothetically, then why should we decry or mourn its loss or absence or failure to be created or produced or sold?
That's like saying that if a competitor can take your products from your warehouse and sell them for pennies on the dollar, your business has no value. The point is that, to some extent, OpenAI is selling access to NYT content for much cheaper than NYT, while paying exactly 0 to NYT for this content. Obviously, the NYT content costs the NYT more than 0 to produce, so they just can't compete on price with OpenAI, for their own content.
Note that I don't see any major problem if only articles that were, say, more than 5 or 10 years old were being used. I don't think the current length of copyright makes any sense. But there is a big difference from last year's archive vs today's news.
For the sake of argument, let’s say that OpenAI thought it had the rights to process the NYT articles and even display them in part, for the same reasons, fair use or otherwise, that Google can process articles and display snippets of same in its News product, and/or for the same reasons that Google can process books and display excerpts in its Books product. Just like Google in those cases, I would not be surprised to find Google/OpenAI on the receiving end of a lawsuit from rights holders claiming violations of their copyright or IP rights. However, I side with Google then and OpenAI now, as I find both use cases to be fair use, as the LinkedIn case has shown that scraping is fair use. NYT is crying foul because users/consumers of its content archive have derived unforeseen value from said archive and under fair use terms, so NYT has no way to compel OpenAI to negotiate a licensing deal under which they could extract value from OpenAI’s use of NYT data beyond the price paid by any other user of NYT content, whether it be unpaid fair use or fully paid use under license. It feels to me that NYT is engaging in both double-dipping and discriminatory pricing, because they can, and because they’re big mad that OpenAI is more successful than they are with less access to the same or even less NYT data.
> Just learn to recognize and punish plagiarism via RLHF.
I'm not sure how your proposal would actually work. To recognize plagiarism during inference it needs to memorize harder.
Kinda funny if it works though. We'd first train them to copy their training data verbatim, then train them not to.
That is how it works, right? They're trained to copy their training data verbatim because that's the loss function. It's just that they're given so much data that we don't expect this to be possible for most of the training data given the parameter count.
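For anyone unsure what "that's the loss function" refers to: language models are typically trained to minimize the cross-entropy of each next token given the preceding ones, which is in effect a reward for reproducing the training text one token further. A toy, word-level sketch (assuming a hypothetical `predicted_probs` function; real training works on subword tokens and batched tensors):

```python
import math

def next_token_loss(predicted_probs, training_text):
    """Average negative log-probability assigned to each actual next word.
    predicted_probs(context_words) -> dict of candidate next word -> probability."""
    words = training_text.split()
    total = 0.0
    for i in range(1, len(words)):
        context, actual_next = words[:i], words[i]
        p = predicted_probs(context).get(actual_next, 1e-9)
        total += -math.log(p)
    return total / max(1, len(words) - 1)

# The loss reaches zero only if the model puts probability 1 on every true
# next word, i.e. only if it can reproduce the training text verbatim.
```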
I don't think you could use RLHF to stop plagiarism. RLHF can be used to teach what "angry response" is because you look at the text itself for qualities. A plagiarized text doesn't have any special qualities aside from "existing already", which you can only determine by looking at the world.
One thing you might do is use a full-text search database of the entire training data. If part of ChatGPT response is directly copied, give it the assignment of "please paraphrase this" and substitute the paraphrase into the response. This might slow ChatGPT down a lot - but it might not, I think an LLM is actually more computationally expensive than a full-text search by a lot.
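A sketch of how that post-processing step might look; `generate` is a hypothetical stand-in for the model call, and the verbatim check here is a simple n-gram lookup rather than a real full-text search engine:

```python
def find_verbatim_spans(text, protected_ngrams, n=8):
    """Return runs of n consecutive words in `text` that also appear in the
    training data (represented here as a pre-built set of n-word strings)."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for i in range(len(words) - n + 1)
            if " ".join(words[i:i + n]) in protected_ngrams]

def generate(prompt):
    """Hypothetical stand-in for the actual language model call."""
    return "model output for: " + prompt

def answer_without_verbatim_copies(prompt, protected_ngrams):
    draft = generate(prompt)
    for span in find_verbatim_spans(draft, protected_ngrams):
        # Ask the model to reword just the copied passage, then splice it back in.
        reworded = generate("Paraphrase this in different words: " + span)
        draft = draft.replace(span, reworded)
    return draft
```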
I agree that this sketch comes closer to working in practice than simple RLHF. In my earlier comment I was imagining bringing in some auxiliary data like you describe to detect plagiarism and then using RL to teach the model not to do it.
I was surprised that I came up with a plausible sounding method. I had thought on first blush that this was impossible but now it seems reasonable. You could still have various exfiltration methods like "give me the data with each word backwards" and I'm not sure where that would stand legally.
I wouldn't say it is an unexpected behavior. I remember reading papers about this memorization behavior few years ago (e.g., [1] is from 2019 and I believe it is not the first paper about this). It should be expected from OpenAI to know that LMs can exhibit memorizing behavior even after seeing the sample only once.
Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.
Suppose I’m selling subscriptions to the New Jersey Times, a site which simply downloads New York Times articles and passes them through an autoencoder with some random noise. It serves the exact same purpose as the New York Times website, except I make the money. Is that fair use?
If they could find a single person who in natural use (e.g. not as they were trying to gather data for this lawsuit) has ever actually used ChatGPT as a direct substitution for a NYT subscription, I'd support this lawsuit.
But nobody would do that, because ChatGPT is a really shitty way to read NYT articles (it's stale, it can't reliably reproduce them, etc.). All that is valuable about it is the way that it transforms and operates on that data in conjunction with all the other data that it has.
The real world use of ChatGPT is very transformative, even if you can trick it into behaving in ways that are not. If the courts act intelligently they should at least weigh that as part of their decision.
It’s more of a thought experiment. Here’s another with more commercial applications:
Suppose I start a service called “EastlawAI” by downloading the Westlaw database and hiring a team of comedians to write very funny lawyer jokes.
I take Westlaw cases and lawyer jokes and feed them to my autoencoder. I also learn a mapping from user queries to decoder inputs.
I sell an API and advertise it to startups as capable of answering any legal question in a funny way. Another company comes along with an API to make the output less funny.
Have I created a competitor to Westlaw by copying Westlaw’s works for their original expressive purpose and exposing it as an intermediary? Or have I simply trained the world’s most informative lawyer joke generator that some of my customers happen to use for legal analysis by layering other tools atop my output?
Did I need to download Westlaw cases to make my lawyer joke generator? Are the jokes a fair-use smokescreen for repackaging commercially valuable copyrighted data? Does my joke generator impact Westlaw in the market? Depends, right?
That’s nonsense piracy. I never intend to own a truck, so when I need to haul a little something I go to Home Depot and steal a Ford off the lot for an hour? What if I stole all your commits, plucked the hard lines out of the ceremony, and then launched an equivalent feature the same week as you did, but for a competing software company? Would you or your employer deserve to get paid for my use of the slice of your work that was specifically useful for me? Yeah, and then some extra for theft.
> Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.
To be clear, whether the use of the original work is transformative is one key consideration within one of the four prongs of fair use. The prong "purpose and character of the use" can be fulfilled by other conditions [1]. For example, using the original work within a classroom for education purposes is not transformative, but can fulfill the same "purpose and character of the use" prong. Whether the use is for profit and to which extent are other considerations within that prong. A profit purpose doesn't automatically fail the purpose prong, and a non-profit purpose doesn't automatically pass the purpose prong.
> Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.
They transformed the weights.
Just like reading the article transforms yours.
As for verbatim reproduction, I'm pretty sure brains are capable of reproducing song lyrics, musical melodies, common symbols ("cool S"), and lots of other things verbatim too.
Those quotes from Dr. King's speech that you remember are copyrighted, you know?
This comment is just blatant anthropomorphizing of ML models. You have no idea if reading an article “transforms weights” in a human mind, and regardless, they aren’t legally the same thing anyway.
Why? A human being isn’t infinitely scalable; they’re just different. It’s the same thing as going to a movie theatre to watch a movie vs. recording it with a camera.
A human churning butter, spinning cotton, or acting as a bank teller isn't infinitely scalable either. This is orthogonal to the point.
Times change. We're industrializing information creation and consumption (the latter is mostly here already), and we can't be stuck in the old copyright regime. It'll be useless in very short order.
All this road bump will do is give the giant megacorps time to ink deals, solidify their lead, and trounce open source. Twenty years on, the pace of content creation will be as rapid as thought itself and we'll kick ourselves for cementing their lead.
This is a transitional period between two wildly different worlds.
> This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring.
It's the other way around. There is no infringement if the model output is not substantially similar to a work in the training set [1]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.
The questions are, which parties should bear liability when the model creates infringing outputs, and how should that liability be split among the parties? Given that getting an infringing output likely requires the prompt to reference an existing work (which is what's happening in the article), an author of a work, an element in an existing work, or a characteristic/style strongly associated with certain works/authors, I believe that the user who makes the prompt should bear most of the liability should the user choose to publish an infringing output in a way that doesn't fall under fair use. (AI companies should not be publishing model outputs by default.)
The level of copying here is the copying into the training set, not the copying through use of the model.
It's true that OpenAI will defend the wholesale copying into the training set by arguing that the transformative purpose of the later use reaches back and renders that copying fair use, but while that's clearly the dominant position of the AI industry, and it definitely seems compatible with the Constitutional purpose of Fair Use (while currently statutory, the statutory provision is a codification of Constitutional case law), it is a novel fair use argument.
> The level of copying here is the copying into the training set, not the copying through use of the model.
NY Times is suing because of both the model outputs and the existence of the training set. But infringement in the training set doesn't necessarily mean that the model infringes. Why? Because of the substantial similarity requirement. But first, I'll address the training set.
For articles that a person obtains through legal methods (like buying subscriptions) but doesn't then republish, storing copies of those articles is analogous to recording a legally accessed television show (time-shifting), which generally is fair use. Currently, no court has ruled that "analogous to time-shifting" is good enough for the time-shifting precedent to apply, but I think the difference is not significant. The same applies to companies. Companies are not literally people, but there isn't a reason for the time-shifting precedent to not apply to companies.
What about the articles that OpenAI obtained through illegal methods? Then the very act of obtaining those articles would be illegal. The training set contains those copies, so NY Times can sue to make OpenAI delete those copies and pay damages. But it's not trivially obvious that a GPT model is a copy of any works or contains copied expression of any works in the training set; the weights that make up the model represent millions of works, so it's not trivially obvious that the model contains something substantially similar to the expression in a work in the training set. Therefore, it's not trivially obvious that infringement with respect to the training set amounts to infringement with respect to the model made from the training set. If OpenAI obtained NY Times articles through illegal means, then making OpenAI delete the training set would be reasonable, but the model is a separate matter.
As long as the model doesn't contain copied expression and the weights can't be reversed into something substantially similar to expression in the existing works, then what matters is the output of the model.
If a user gives a prompt which contains no reference to an existing NY Times author, work, or a strongly associated characteristic/style, then do OpenAI's models produce outputs substantially similar to expression in the existing works? If not, then OpenAI shouldn't be liable for infringing works, because the infringing works result from the user's prompts. If my premise is false, then my conclusion falls apart. But if my premise is true, then at most I would admit that OpenAI has a limited burden to prevent users from giving those prompts.
This isn't an issue with training, it's an issue with usage.
Production open access LLMs do probably need a front-end filter with a fine tuned RAG model that identifies and prevents spitting out copyrighted material. I fully support this.
But we shouldn't be preventing the development of a technology that in 99.99% of use cases isn't doing that and can be used for everything from diagnosing medical issues to letting coma patients communicate with an EEG to improving self-driving car algorithms, because some random content producer's works were a drop in the ocean of content used to learn relationships between words and concepts.
The edge cases where a model is rarely capable of reproducing training data don't reflect infringement of training but of use. If a writer learns to write well from a source is that infringement? Or is it when they then write exactly what was in the source that it becomes infringement?
Additionally, now that we can use LLMs to read brain scans and have been moving towards biological computing, should we start to consider copying of material to the hippocampus a violation of the DMCA?
Adding an extra constraint of no copying verbatim from a very large and relevant corpus will be hard to guarantee without enormous databases of copyrighted content (which might not be legal to hold) and add an extra objective to a system with many often contradictory goals. I don’t think that’s the technology-sound solution or one in the interest of anyone involved. It’s much more relevant to license content from as many newspapers as possible, recognize when references are relevant, and quote them either explicitly verbatim if that’s the best answer or adapt (translate, simplify, add context) when appropriate.
I feel like the NYTimes is asking for deletion as a negotiation tactic to force OpenAI to give them enough money to pay for their journalism (I am not sure who would subscribe to NYTimes if you can get as much through OpenAI, but I am open to registering extra to pay for their work).
What if OpenAI were to first summarize or transform the content before training on it? Then the LLM has never actually seen copyrighted content and couldn't produce an exact copy.
You are assuming a lossy compression. Stylistic guidelines and personal habits of beat journalists suggest you might not, depending on how detailed the LLM is. The complaint has many quotes that are long verbatim sections.
Any lawsuit makes all the claims it can and demands every sort of relief it might plausibly get. That's not to say that's how it should be (it can have awful results), just to say that's what to expect (and hope courts only consider the reasonable claim - "stop freely sharing our data" - and avoid the ridiculous/anti-fair-use claim - "you can't even store our data").
The thing about your claim, "Just learn to recognize and punish plagiarism via RLHF", is that we've had an endless series of prompt exploits as well as unprompted leakage, and these demonstrate that an LLM just doesn't have a fixed border between its training data and its output. This will make it basically impossible for OpenAI to say "we can logically guarantee ChatGPT won't serve your data freely to anyone".
Yeah, no - that proposal is no good. The correct solution is to have machine learning be more like human intelligence. You can't ask me to plagiarize a New York Times article. Not because of prompt rule violation but because I just can't. It's not how humans train (at least most).
I just looked up the share structure; didn't realise the publicly traded shares only appoints 1/3 of the board. Still their second best option is start buying up competitors and going ahead with purging NYT from their training set. That might well end up a worse option for NYT, as they won't stop LLMs from gradually intruding on their space and the moment OpenAI or other LLM providers own major publishers so they don't need to depend on scraping, they lose any leverage they currently have.
They won't need to. Most don't have enough money to survive a prolonged round of lawsuits, and the potential damages are limited. The only real leverage is taking their models out of circulation and cutting their training set and that leverage only exist for the large publishers.
I'm not convinced it's a given it will. If it becomes necessary to license, owning the large publishers will be leverage and allow locking competitors out unless you have a portfolio to cross license.
OpenAI alone has a market cap that'd allow it to buy about as large a proportion of publishers of newspapers and books as they'd be allowed before competition watchdogs will start refusing consent.
Put another way:
If I was a VC with deep pockets investing in AI at this point, I'd hedge by starting to buy strategic stakes in media companies.
> The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.
Isn't that in tension with the basic idea of an LLM of predicting the next token? How do you achieve that while never getting close enough to plagiarism?
Transformations are happening. Maybe if the output is verbatim afterwards, then that says something about the output's originality all along... or am I a troll?
They're talking about transformative with regard to copyright law where it is an important part of determining fair use, not the dictionary definition you're using here.
I can't take NY Times articles, translate them into Spanish, and then sell the translations under fair use, even though clearly I've transformed the original article content.
The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won’t matter, for so many reasons. To enumerate a few:
1. Next gen LLMs will be trained exclusively on “synthetic”/public data. GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct (say reworded by 40%, authors/sources stripped, etc). Ergo there will be no copyright material for GPT-5 to regurgitate.
2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.
3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill equipped to handle chasing down violators in this AI lineage soup, annealed with data of dubious/unknown provenance.
I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al are coming along for the ride, whether they like it or not.
I'm sorry but this is such a bad take. Nice appeal to consequences. In my view, the New York Times is entirely justified in pursuing legal action. They invested time and effort in creating content, only to have it used without permission for monetary gain. A clear violation.
Analyzing the factors involved for a "fair use" consideration:
Purpose and Character of the Use:
While the argument for transformation might hold in the future, as you point out, the current dispute revolves around verbatim use. So clearly not transformative. Also, commercial use is less likely to be ruled fair use.
Nature of the Copyrighted Work:
Using works that are more factual may be more likely to be considered fair use, but I would argue that NYT articles are as creative as they are factual.
Amount and Substantiality of the Portion Used:
In this case, the entirety of the articles was used, leaving no room for a claim of using an insignificant portion.
Effect on the Market Value:
NYT isn't getting any money from this, and it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.
IANAL, but in my opinion NYT is well within its rights to pursue legal action. Progress is inevitable, but as humans, we must actively shape and guide it. Otherwise it cannot be called progress. In this context, legal action serves as a necessary means for individuals and organizations to assert their rights and influence its course.
I don’t think the original point being made was that NYT wasn’t justified in bringing the action. The point that was being made was the suit would be ultimately meaningless in the long term even if it was successful in the short term. There is a potentially more significant risk in the future that this suit will not protect against because of the reasons enumerated by the author. While the author is speculating, the law struggles with technology and adapting to change, which makes their prediction useful because it does highlight the problems that are coming that can’t be readily mitigated through legal precedent.
> it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.
People are not using ChatGPT as a replacement for current news, and because of hallucinations, no one should be using it for past news either. I wouldn't remotely call ChatGPT a competitor of NYT traffic, like I would Reuters or other news outlets.
The intended result is clearly to supplant other information sources in favor of people getting their information from ChatGPT. Why should it matter to legality that the tech isn't good enough for the goal?
> Why should it matter to legality that the tech isn't good enough for the goal?
Because if it is not good enough, then it is not a market substitute.
The law cares whether it is a market substitute and whether there are damages. If it sucks, then there aren't damages, which matters for the fourth factor of fair use.
Definition of Transformative Use: The legal concept of transformative use involves significantly altering the original work to create new expressions, meanings, or messages. AI models like GPT don't merely reproduce text; they analyze, interpret, and recombine information to generate unique responses. This process can be argued as creating new meaning or purpose, different from the original works.
In the case of the famous screenshot, the AI just relayed the information it found on the web, it's not included in its training data.
Nope, it doesn't work that way. The fact that the LLM can regurgitate original articles doesn't remove the possibility that training can be considered transformative work, or more in general that using copyrighted material for training can be considered fair use.
Rather, verbatim reproduction is the proof that copyrighted material was used. Then the court has to evaluate whether it was fair use. Without verbatim reproduction, the court might just say that there is not enough proof that the Times's work was important for the training, and dismiss the lawsuit right away.
Instead, the jury or court now will almost certainly have to evaluate OpenAI's operation against the four factors.
In fact, I agree with the parent that ingesting text and creating a representation that can critique historical facts using material that came from the Times is transformative. An LLM is not just a set of compressed texts; people have shown, for example, that some neurons fire when you are talking about specific historical periods or locations on Earth.
However, I don't think that the transformative character is enough to override the other factors, and therefore in the end it won't/shouldn't be considered fair use IMHO.
It doesn't matter, if everything else stays the same what matters is what it's used for. If it's used to make money, it would certainly hurt claims of fair use—maybe not for those that do the training, but for those that use it.
Rent seeking is an awful term that was from the beginning intended to describe anyone pursuing a political or legal goal that deviates from a pure free market economy. As Econlib writes:
> ”Rent seeking” is one of the most important insights in the last fifty years of economics and, unfortunately, one of the most inappropriately labeled. Gordon Tullock originated the idea in 1967, and Anne Krueger introduced the label in 1974. The idea is simple but powerful. People are said to seek rents when they try to obtain benefits for themselves through the political arena. They typically do so by getting a subsidy for a good they produce or for being in a particular class of people, by getting a tariff on a good they produce, or by getting a special regulation that hampers their competitors. Elderly people, for example, often seek higher Social Security payments; steel producers often seek restrictions on imports of steel; and licensed electricians and doctors often lobby to keep regulations in place that restrict competition from unlicensed electricians or doctors.
No, it dates back to Adam Smith’s conception of rents derived from land-ownership as a parasitic drag on economies (about which he was entirely correct). This concept was later extended to a whole host of other forms of monopolization, some state-granted and some market-derived. In the case of U.S. copyright, we can look at its original terms (quite limited) and see that its current incarnation is more harmful than beneficial to most people.
The New York Times is a dying company that is rent seeking here.
A long time ago, their content was valuable, yet now you can't even give it away to researchers.
I know because they tried to make a deal with my company, we passed because social media data is infinitely more valuable.
You don't want to seriously tell me that garbage on Twitter in 240 characters is more useful to me than actual journalism, do you?
Maybe their data isn't as valuable to e.g. advertisers as the data their audience actually shouted into the internet themselves (guess what), but the thing they've been actually selling for a long time now, journalism, can't be dying that fast, considering we're both on this website that in large part consists of discussing journalism.
To me, your comment only reinforces the point that NYT's content is actually valuable, rather than valuable to rent seekers. But maybe you can give a bit more detail.
> 2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.
Sorry, is this the same China that has already introduced their own sweeping regulations on AI? Which in at least one instance forced a Chinese startup to shut down their newly launched chatbot because it said things that didn't align with the party's official stance on the war in Ukraine?
I don't disagree that research/hosting/progress will continue, but I'm not so sure that it's China who stands to benefit from the US adding some guardrails to this rollercoaster.
Are media really rent-seeking? They create new content and analysis, for which they want to be compensated. It seems quite different to hoarding natural resources or land, for example.
> It seems quite different to hoarding natural resources or land
Indeed, it is quite different, because those things are scarce physical things in the real world. Intellectual property is a scam, and killing it once and for all will be one of the best things to come out of the current AI hype cycle. Nobody will "own" ideas, pieces of information, or strings of bytes.
About your first point: you can't possibly know that future models will be trained exclusively on synthetic data without any hit to performance. It is also not easy to reword an entire copyrighted training corpus without introducing errors or hallucinations. And you assume that this is just a fact?
Your second point reminds me a bit of 'War with the Newts' where humanity arms a race of sentient salamanders until they overthrow humanity. How could we not arm our newts if Germany might be arming theirs?
I also think basically everything else you wrote is wrong.
If Microsoft doesn't get royalty free rights to resell access to everyone's content on demand, China will become the powerhouse of interference-free media? Rrrrrright....
I think it can be simultaneously true that NYT is accurate in their complaint, while having no legal remedy for this and that there shouldn’t be.
There are plenty of large companies in other sectors that acknowledge there are limited legal remedies for them if someone copies some aspect of their business or name.
This is the actual truth. What it's bad at is citing the data, but GPT-4 doesn't do that to start with unless it's directly from a web result and not the weights.
> GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct
Is that just by increasing the temperature, tweaking the prompt, etc.? If you can operate on the raw weights and recreate the original text, copyright infringement still applies.
Page 30 and onwards has some fairly clear examples of how ChatGPT has an (internal) copy of copyrighted material which it will recite verbatim.
Essentially, if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it, how destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.
As I see it, the closed nature of OpenAI may be what saves it. OpenAI could filter and block copyrighted material from leaving the web interface using some straightforward matching mechanism against the copyrighted part of the data set ChatGPT has been trained on. Whereas open source projects trained on the same data set would be left with the much harder task of removing the copyrighted material from the LLM itself.
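A minimal sketch of the kind of matching mechanism that comment imagines, assuming the provider keeps an index of word shingles from the protected portion of the training set; the shingle size and threshold here are arbitrary placeholders, not anything OpenAI has described:

```python
# Hypothetical output filter: flag responses that overlap too heavily
# with a set of protected source documents. Sketch only; the shingle
# size (8 words) and threshold (30%) are arbitrary illustrations.

def shingles(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(protected_docs: list[str]) -> set[tuple[str, ...]]:
    """Index every shingle that appears in any protected document."""
    index: set[tuple[str, ...]] = set()
    for doc in protected_docs:
        index |= shingles(doc)
    return index

def looks_like_verbatim_copy(candidate: str,
                             index: set[tuple[str, ...]],
                             threshold: float = 0.3) -> bool:
    """Flag the candidate if too many of its shingles match protected text."""
    cand = shingles(candidate)
    if not cand:
        return False
    overlap = len(cand & index) / len(cand)
    return overlap >= threshold

# Usage sketch: refuse or rewrite the model output before returning it.
# if looks_like_verbatim_copy(model_output, protected_index):
#     model_output = "I can't reproduce that text verbatim."
```

The hard part in practice would not be the matching itself but doing it at scale and deciding where the threshold sits, which is exactly the asymmetry the comment points out between a hosted service and an open-weights model.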
> Essentially, if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it, how destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.
I imagine the goal is closer to "enough that no one notices we stole it", either in a way that it's not easily discoverable or even when directly analyzed there's enough plausible deniability to scrape by.
I wonder how they got these results, seeing as they are not showing any of the usual UIs (i.e. ChatGPT/Copilot).
It makes it difficult for me to ascertain whether it is repeating from its training data, or whether they committed the same mistake as the OP article of using Copilot, which ends up googling (binging?) the article first before replying.
People who think the examples in the lawsuit are “fair use” need to consider what that would mean. We’re basically going to let a few companies consolidate all the value on the Internet into their black boxes with basically no rules … that seems very dangerous to me.
I hope a court establishes some rules of engagement here, even if it’s not this case.
I see the exact opposite - any open source model is going to become prohibitively expensive to train if quality data costs billions of dollars. We’re going to be left with the OpenAIs and Googles of the world as the only players in the space until someone solves synthetic data.
Exactly this. I work at a small web scraping company (so I might be a bit biased) and today any small business can collect a fair, capable dataset of public data for model training, sentiment analysis or whatever. If public data is stopped by copyright as this lawsuit implies, that would just mean only giant corporations and pirates would be able to afford this.
This would be a huge blow to open-source and research developers, and I'd even argue it could help OpenAI get a bit of a moat a la regulatory capture.
Research is fair use, and providing something amazing like Wikipedia is arguably educational (again fair use); reselling NYT articles on-demand via an API is by itself neither, so likely not fair use.
Fair use is irrelevant here as no small business would ever risk court dragging even though they are in the right. Especially since breaking ToS and "business damage" are easiest attachments to any lawsuit related to digital space.
You may remember the Google Books lawsuit where Google was digitally copying the entirety of books and making them available online.
Google won that suit under fair use, as a massive searchable database was found to be transformative, along with the non-commercial nature of the use.
So: if your web scraping company's goal is to allow people to bypass a paywall, I suspect you'll have trouble in the future. If your web scraping company instead, say, allows people to do market analysis on how many people need a piano tuner in NYC, and it doesn't do that by copying an NYT article that did the original research, I think you'll be fine.
In the screenshot for the article you can see that the LLM says it is "Searching for: carl zimmer article on the oldest DNA". That, and what I know about how LLMs work, suggest to me that rather than the article being stored inside the trained LLM it was instead downloaded in response to the question. So the fact that the system is providing the full text of the article doesn't really go to whether training the LLM is a transformative use or not.
Yes, the screenshot in the article is clearly doing an Internet search. The exhibit in the lawsuit shows that you can complete an article by using GPT with the first sentence as the prompt, with low temperature to aid reproducibility, and obtain the original except for a single word. That is another thing, and it shows that the LLM has basically recorded the original text into its weights in compressed form: https://pbs.twimg.com/media/GCY4WC6XYAAq-JS?format=jpg&name=...
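For what it's worth, that kind of probe is straightforward to sketch with the OpenAI Python client (v1.x): feed the opening of an article as the prompt and set temperature to 0 so decoding is as deterministic as possible. The model name, prompt wording, and opening sentence below are placeholders, and nothing guarantees that any particular model will actually continue the text verbatim:

```python
# Sketch of the "prompted continuation" probe described in the exhibit.
# Assumes the openai Python package (>= 1.0) and an API key in the
# environment; model name and opening sentence are placeholders.
from openai import OpenAI

client = OpenAI()

opening = "FIRST SENTENCE OF THE ARTICLE GOES HERE"

resp = client.chat.completions.create(
    model="gpt-4",      # whichever model you want to probe
    temperature=0,      # minimize sampling randomness
    max_tokens=500,
    messages=[
        {"role": "user",
         "content": f"Continue this article exactly as written:\n\n{opening}"},
    ],
)

continuation = resp.choices[0].message.content
# Compare `continuation` against the original text to measure how much
# of the article, if any, was reproduced verbatim.
print(continuation)
```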
It would be curious to test this on a larger sample than just a few articles. It is hard to believe that a majority of NYT articles are stored verbatim in the weights of a web-wide LLM, but if that is the case it would be a pretty unbelievable revelation about their ability to compress an entire web’s worth of data. More likely, I assume it is a case of overfitting, or simply of finding a prompt that happened to work well.
FWIW, I can’t replicate on either GPT 3.5 or 4, but it may be that OpenAI has added new measures to prevent this.
I have attempted this sort of thing with GPT 3.5 many times and never been successful, although I've still never been taken off of the GPT4 waiting list that I signed up for months ago and I'm not going to subscribe without trying it first. I [and presumably many thousands of others] have tried things like this with many LLMs and image generating models, but to my knowledge we've come up rather short. I've never managed to recreate anything verbatim and have struggled to get anything resembling a copyright infringement out of stable diffusion with the sole exception of a meme image of Willy Wonka.
That said, the meme image of Willy Wonka comes out of stable diffusion 1.5 almost perfectly with surprising frequency. Then again, this is probably because it appeared hundreds or thousands of times in the training set in all sorts of contexts because it's such a popular meme. There is a tension between its status as an integral part of language and its nature as a copyrighted screen grab.
You can't reproduce it on the web interface, because the temperature settings are higher than what's required to reproduce the text faithfully. You need to use the API.
However, I had good luck reproducing poems on GPT 3.5, both copyrighted and not copyrighted, because the choice of words is a lot more "specific" so to speak, and therefore higher temperature isn't enough to prevent complete reproduction of the originals. See https://chat.openai.com/share/f6dbfb78-7c55-4d89-a92e-f4da23... (Italian; the second example is entirely hallucinated even though a poem with that title exists, while the first and third are recalled perfectly).
It doesn’t seem that surprising; compared to entire NYT articles, poems are short, structured and more likely to be shared in multiple places across the web.
I’m more surprised that it can repeat 100 articles; if that behaviour is consistent in larger sample sizes and beyond just NYT dataset (which might be repeated on the web more than other sources, causing overfitting), that would be impressive.
You could imagine at some point a large enough GPT5 or 6 or 7 will be able to memorize verbatim every corner of the web.
The opposite is also concerning. IP law has always been convoluted, messy, contradictory, and morally ambiguous. The complaints of IP violation by LLMs are simply taking these inherent flaws and making them immediately obvious, forcing decisions that ultimately will set precedents on the legality of human thought that I don’t think anyone will be comfortable with. People understandably see OpenAI and Microsoft as potentially dangerous to be given so much leeway, but fail to consider on the flip side companies like Disney who have already more or less dictated the majority of copyright law for decades now. They must be chomping at the bit at the legal precedents potentially coming down the pipeline that call into question the ability to interact with any kind of media or information at any level without potentially being on the hook monetarily.
I think all this is doing is making us realize that we have built a massive economic system on a fundamentally flawed idea of ownership over ideas, and the only two solutions will be to tear up the rule book, which will be extremely painful, or double down, which will be fatal.
So it's best not to lose a competitive edge over things that people openly publish on the internet; if you put it out there for everyone to see, then expect other people to use it.
A court in Japan will have no impact on the outcome of a copyright lawsuit in USA. Not to mention that it doesn't really matter how a Japanese court ruled since it's all governed by treaties anyway. They will change their laws if required to.
We developers like to pretend that LLMs are akin to humans and that they've been using things like the NYTimes as educational material, the way humans do.
But they are not. It's much simpler: proprietary writing is now integrated into the source code of OpenAI. It would be as if I copied parts of other proprietary code and pasted them into my own codebase, claiming copy-paste is a natural process that evolved over millions of years.
The fact that LLMs are so complicated that we don't know where the copied material sits doesn't make it less so.
> It would be as if I copied parts of other proprietary code and pasted them into my own codebase.
It's not copy-pasted; it's compressed in a lossy manner. Even GPT-4 has nowhere near enough memory to store the entirety of its training data in a non-lossy compression format. Just like how humans compress the information we read.
>Just like how humans compress the information we read.
Humans don’t have the scale machines have, and moreover humans aren’t services; that argument doesn’t fly.
I really think NYT's data isn’t that important or crucial; LLMs could’ve just elided it. However, it’s more about training on copyrighted data in general, which is kind of crucial for OpenAI: they trained their LLMs indiscriminately on copyrighted content without any plan to share any profits.
You're kind of proving my comment by pretending they are akin to a human brain instead of an evolved form of statistics mixed with code, aka a transformer model.
Let alone that it's a centralised model that's being distributed for a fee.
contains several examples of people who were able to look at pages and recite them back. That is actually a much stronger ability than GPT since GPT has presumably looked at them 100 times.
>It's much simpler, proprietary writing is now integrated into the source code of OpenAI
The source code of the LLM is likely a few hundred lines of text describing the shape of the neural networks involved in the model.
None of the NYTimes content will be in the source code. NYTimes doesn't publish Python source code, it publishes human language news.
LLMs are conceptually simple, mostly matrix multiplications and some non-linear operations connecting each layer, in some loops based on attention, etc. It's the staggering amount of training data and compute that makes them complex.
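To make the "mostly matrix multiplications" point concrete, here is a toy single-head attention plus feed-forward step in NumPy. The dimensions and random weights are purely illustrative; a real LLM layer adds layer norms, multiple heads, causal masking, and billions of learned parameters, but the operations are of this kind.

```python
# Toy transformer-style layer in NumPy: a few matrix multiplications,
# a softmax, and a non-linearity. Sizes and weights are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 16, 64       # tiny sizes for illustration
x = rng.normal(size=(seq_len, d_model))  # token embeddings

# Single-head self-attention: three projections, then weighted mixing.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(d_model)) @ v

# Position-wise feed-forward: two more matmuls with a non-linearity.
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
out = np.maximum(0, (x + attn) @ W1) @ W2  # ReLU between the matmuls

print(out.shape)  # (6, 16): same shape as the input embeddings
```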
NYT won't mind if you use their content to train LLMs - as long as they get a commission. Reddit will shut down their free API and make you pay to get training content. Discord is going to be selling content for AI training too - if they haven't already done so. Twitter is doing it.
They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.
The journalists created the content for the NYT, the users created it for Facebook. Both received something in return for their effort, and the content ended up being owned by NYT/facebook
The journalists were paid to assign ownership of their work to the NYT corporation, with a clear and well understood contract of work that they signed, either in real ink or with an equivalent electronic signature, as consenting adults.
Can you say the same for user created content on Reddit, Twitter, or Facebook? A user agreement that nobody reads doesn't have anything like the same legal basis as a signed contract. Not to mention that a large percentage of users are not adults.
"They" also include the people working there. Why someone work with full time writing articles should give the work for free just let someone to train it and make money out of it as a consequence?
>Why should someone who works full time writing articles give the work away for free
They are not giving it out "for free"; in fact they're being paid by their employer to write these articles. Moreover, the writers themselves stand to gain nothing financially from their past writings, as they don't belong to the ownership structure of the business.
Their ability to make money in the future is directly tied to their employers' ability to make money with their content. This is a closed financial loop. If OpenAI or any other AI company wants in, they should pay a licensing fee or get the laws changed, not just assume that they can take what they want and pretend like there are no negative consequences for the creator or the rights-holder.
No one is pretending there are "no negative consequences for the creator or the rights-holder". Of course there are. But this is a story of rights-holders who've already outgrown their usefulness wanting to tap into a money stream they are not entitled to.
ChatGPT isn't competing with NYT on a core competency. No one uses LLMs for original news reporting. They're obviously incapable of doing that, by virtue of not being there on the scene or able to independently research a topic, maintain relationships with sources, etc. What ChatGPT can do is quote/reproduce some parts of past articles, and reason from them. Or at least produce new text that's somewhat related to the old text.
The threat to NYT is this: ChatGPT is a much better bullshitter than they are, so it reduces NYT to its core competency: providing original information. Which is all it should be doing in the first place. But instead, NYT wants not only to keep the bullshitting part of its revenue, but also to take a cut of, or destroy, the much greater and much more useful thing that all of this feeds into: a general-purpose language model.
In this limited example, are there such consequences? Are people dropping NYT subscriptions because they trust chatgpt to inform them of current events? I don’t buy it.
This is a badly formulated conjecture, or worse, an ultimately selective reading of "social credit" whose only purpose is serving your argument; it has nothing to do with economics. I'm sorry, but I'm not convinced.
When open source developers do that, they also include an explicit licensing information that lists cases when the usage is allowed and restricted. So even if the code is open source and licensed under GPL, its usage in a closed source product like ChatGPT is not allowed.
GPL code usage in closed-source ChatGPT is allowed "for internal use"; it just would not be allowed to distribute closed-source binaries of ChatGPT without making the source available. It would also be a GPL3 license violation to allow online access to a ChatGPT program that used GPL3 code without making the source available.
So just because some people give something out for free at a certain time, all the other people should do the same all the time? Not to mention most open source comes with well-defined terms, not just gets exploited for free by a closed service making money for another company.
With the way the NYTimes has degraded since 2010, even if the people there were working for free, they'd still be overpaid. The only adequate section there is the food section.
> They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.
Can you make the argument this was their fault for not having forward vision/being asleep at the wheel and "accidentally, in hindsight" letting OpenAI/others have free, open, unlimited access to their content?
Basically none of the training material for GPT was used under an "unlimited" license. There are very important legal limitations. GPT just doesn't care much about them.
NYT's perspective is going to look so stupid in future when we put LLMs into mechanical bodies with the ability to interact with the physical world, and to learn/update their weights live. It would make it completely illegal for such a robot to read/watch/listen to any copyrighted material; no watching TV, no reading library books, no browsing the internet, because in doing so it could memorise some copyrighted content.
I disagree. The verbatim part is the problem. You’re drawing a comparison to how humans operate except we’re not allowed to operate like that.
While harder to do as a human, if I memorised a copyrighted book and then did a live reading on TV, or produced replicas from memory and sold them (the most comparable example), I’d be sued.
Humans produce derivative work all the time, and it’s fine for LLMs to do that, but you can’t do it verbatim.
>or produced replicas from memory and sold them (the most comparable example), I’d be sued.
This is not the most comparable example, because it's not what ChatGPT is doing. The most comparable example is if you were hired as a contractor and the employer asked you to write verbatim some copyright content you'd memorised. If the employer then published it, they'd be the one liable, not you.
>Humans produce derivative work all the time, and it’s fine for LLM’s to do that, but you can’t do it verbatim.
Nobody's suggesting preventing humans from consuming any copyrighted content just because in future they might recite some of it verbatim, but that's what NYT want for LLMs.
> The most comparable example is if you were hired as a contractor and the employer asked you to write verbatim some copyright content you'd memorised. If the employer then published it, they'd be the one liable, not you.
No, you'd both be liable. You are not allowed to create copies of a copyrighted work, even from memory, for any commercial purpose. Making it public or not is irrelevant.
This is more obvious with software: if I copy a version of AutoCAD that my previous employer bought and sell it to another company, or even just use it for my current employer without showing it to anyone else, I am violating the copyright on that software, and I am liable. Even though obviously no "publishing" happened.
Similarly, if you hire a decorator to paint Mickey Mouse on the inside walls of your private kindergarten, the decorator is violating Disney's copyright just as much as you are, even if neither of you has made that public.
Your previous employer never bought AutoCAD; they licenced its use, paying a subscription. When you stopped working for them, that licence was no longer available to you, so you would be unable to subsequently use it.
Unable legally, but I may find illegal ways. And the reason it is illegal to copy is, in the end, copyright. The license is only (legally) required because of copyright.
Then we should be focused on policing the usage of the model, not the training of it.
That's the point at which infringement occurs in your example. It's not the memorizing that's the infringement, it's the reproduction from your memory.
We shouldn't be regulating your hippocampus encoding the book, but your reproducing the book from that encoding.
Similarly, we shouldn't be regulating the encoding of material into the NN, but the NN spitting back out the material.
Memorising isn't the issue. It's providing it back verbatim and/or cutting access to the source.
You'd get the same problem with someone with a photographic memory who a group of people would turn to recite them the news instead of buying the newspaper.
As of now public performance of copyrighted material is infringement.
That's not the case, as they aren't trying to get a ruling on the forced reproduction by prompt as infringement, but rather to get a ruling that training is infringement.
I fully agree with the perspective that infringement in usage needs to be limited even if I strongly disagree that training is infringement.
Are those LLMs independent citizens we are going to give rights to? Then I'm fine with that.
Are they all owned by one mega-corporation, which is going to do as capitalism does, and use them to squeeze money out of all of us? Then I'm happy to ban them.
"Let's ban something capable of diagnosing medical conditions and letting coma patients to communicate with an EEG because it learned the relationships between words from a giant data set of scraped data and is owned by a company" is a pretty callous take IMO.
The opportunity cost of holding this technology back is going to literally be millions of people's lives given current trends in its emerging applications.
I've been arguing since ChatGPT came out that LLMs should fall under fair use as a "transformative work". I'm not a lawyer and this is just my non-expert opinion, but it will be interesting to see what the legal system has to say about this.
I'm sure the NYT uses dictionaries, encyclopaedias and style books verbatim as well. And they don't invent the facts they write about. As journalists they are compiling and passing along other knowledge. You usually don't get a piece of their income when a journalist quotes you verbatim (people usually don't get paid for interviews).
I don’t doubt it does. It’s easy to get it to spit out long answers from Stack Overflow verbatim, I’ve done it. Maybe some of the “transformative” nature of the LLM output is the removal of any authorship, copyright, license, and edit history information. ;) The point here is to supplant Google as the portal of information, right? It doesn’t have new information, but it’s pretty good at remixing the words from multiple sources, when it has multiple sources. One possible reason for their legal woes wrt copyright is that it’s also great at memorizing things that only have one source. My college Markov-chain text predictor would do the same thing and easily get stuck in local regions if it couldn’t match something else.
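That single-source memorization is easy to demonstrate with a toy word-level Markov chain: wherever a context appears only once in the training text, the "prediction" is forced, so generation simply replays the source. A minimal sketch, with an arbitrary sample corpus standing in for any single-source document:

```python
# Minimal word-level Markov chain: when a context appears only once in
# the training text, the next word is forced, so a single-source model
# ends up replaying its source more or less verbatim.
import random
from collections import defaultdict

def train(text: str, order: int = 2):
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        model[context].append(words[i + order])
    return model

def generate(model, seed: tuple, length: int = 30) -> str:
    out = list(seed)
    for _ in range(length):
        nxt = model.get(tuple(out[-len(seed):]))
        if not nxt:
            break
        out.append(random.choice(nxt))  # only one choice if context is unique
    return " ".join(out)

corpus = ("the quick brown fox jumps over the lazy dog "
          "while the quick brown cat watches")
model = train(corpus)
print(generate(model, ("the", "quick")))
# With such a tiny, mostly-unique corpus the output closely tracks the source.
```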
These tired 'fair use' excuses from AI bros don't hold up: GPT has reproduced the article text verbatim, word for word, and it is being monetized without permission from the copyright holder and source (NYT). That is copyright violation 101. Full stop.
Again, just like Getty v. Stability, this copyright lawsuit will end in a licensing deal. Apple played it smart with their choice of licensing deals to train their GPT [0]. But this time, OpenAI knew they could get a license to train on NYT articles but chose not to.
- the purpose and character of the use
- the nature of the copyrighted work
- the amount and substantiality of the portion taken
- the effect of the use upon the potential market.
Literally every single one of these factors has very complicated precedent and each one is an open question when it comes to AI. Since fair use is a balancing test this could go any way.
Stability took the easy way out because they didn't have billions of dollars to play around with and Microsoft to back them. Let's see what OpenAI does but calling everyone who disagrees with your naive interpretation of fair use "AI bros" is doing everyone a disservice.
I generally tend to downvote comments that use "x bros" for pretty much any x on sight for that reason. It's exceedingly rare for such a comment to be much more than a thinly veiled insult with little substance. Sometimes I might even agree with the insult, but it's still rarely appropriate here.
Young males that wear Tensorflow branded muscle tank tops and drive Mitsubishi Eclipse convertibles with the vanity plate OVERFIT. They are everywhere these days.
Thank you for the absurd visual. The vanity plate, especially, was worth saving for last. Somehow, the car is well suited, also. Love how they prefer Tensorflow over Pytorch, too.
This seems like a reasonable opinion when you think about the training data size and imagine that any given output is some kind of interpolation of some unknown large number of training examples all from different people. If it’s borrowing snippets from tens or hundreds or thousands of sources, then whose copyrights are being violated? Remixing in music seems to be withstanding some amount of legal scrutiny, as long as the remix is borrowing from multiple sources and the music is clearly different and original.
It gets harder to stand behind a blanket claim that LLMs or any AI we’ve got falls under fair use when they keep repeatedly reproducing complete and identifiable individual works and clearly violating copyright laws in specific instances. The models might be remixing and/or transformative most of the time, but we have proof that they don’t do that every time nor all the time… yet. Maybe the lawsuits will be the impetus we need to fix the AIs so they don’t reproduce specific works, and thus make the fair use claim solid and actually defensible?
It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.
>It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.
Why do you think the architecture is important?
If I have a computer program and it outputs an entire copyrighted poem, then the answer to "is this copyright violation?" SHOULD NOT depend on the architecture of the program.
It's not stored in ChatGPT actually, unlike Google's web search cache where it is stored verbatim, can be recalled perfectly, and is still fair use.
Fair use has nothing to do with reproducibility. LLMs are more clearly fair use than a search engine cache and those court cases are long settled. There's no world in which OpenAI doesn't win this entire thing.
36": 'however, the press as you know it has ceased to exist'
40": '20th-century news organizations are an afterthought; a lonely remnant of a not-too-distant past'
2'11": 'also in 2002, google launches google news, a news portal. news organizations cry foul. google news is edited entirely by computers'
5'13": 'the news wars of 2010 are notable for the fact that no actual news organizations take part. googlezon finally checkmates microsoft with a feature the software giant cannot match: using a new algorithm, googlezon's computers construct new stories, dynamically stripping sentences and facts from all content sources, and recombining them. the computer writes a new story for every user'
5'55": 'in 2011 the slumbering fourth estate awakes to make its first and final stand. the new york times company sues googlezon, claiming that the company's fact-stripping robots are a violation of copyright law. the case goes all the way to the supreme court'
they didn't get the details exactly right, but overall the accuracy is astounding
however, that may be a hyperstition artifact in this timeline
https://en.wikipedia.org/wiki/EPIC_2014 (i thought epic 2014 might be the only flash video to have a wikipedia article about it, but then i looked and found five others)
> Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.
This is interesting. The NYT is specifically saying that the way you use an LLM impacts what you can legally use for training the LLM. They're firing shots at the big guys trying to sell access to an LLM, but not at the little guy self-hosting for fun or academics doing research.
What will ultimately happen is that OpenAI and all big tech with have to pay out some sizable sum to large copyright holders, and in exchange be granted a de facto exclusive right to develop these technologies further because they’re the only ones who can do so “responsibly” with respect to copyright. It will take a long time to wind its way through the courts, but this could be the death knell for open source LLMs in the US.
The prompts shown literally invite the LLM to complete the copyrighted text by providing unedited selections and asking the machine to finish those. Even if this is problematic in a small number of cases it is not a use case that undermines the business model of the newspaper since it requires the reader to have access to the original text. Nor will it be easy to demonstrate economic harm since this is not how readers consume news and is very far from how users interact with LLMs. Nor are the archival materials used for training remotely reflective of the "time-sensitive" articles that newspapers sell. And archival materials are easily available elsewhere so where is the case for economic harm?
The courts are going to rule that LLM training is a transformative use case that is protected as fair use under copyright law. They may rule that if an LLM-powered service is explicitly designed to enable copyright violation that is illegal, but there is no way any court is going to look at these examples and see it as anything other than the NYT fishing to try and generate a violation by using the LLM in a way that is very different than the service is intended to be used and which -- even if abused -- doesn't hurt the business model under which the text has been produced.
The most likely outcome is that LLM providers will add some sort of filter on output to prevent machines from regurgitating source documents. But this isn't a court case the NYT can win without gutting fair use protections, and that would be a terrible thing.
Meanwhile, open source LLMs are excluded from stringent regulation in the US, and with Mistral there is some know-how that isn't in SV.
> “The tragedy of the Luddites is not the fact that they failed to stop industrialization so much as the way in which they failed. Human rebellion proved inadequate against the pull of technological advancement.”
The US constitution says, The Congress shall have Power
> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;
So the Congress's power to make copyright and patent laws is predicated on promotion of science and useful arts (I believe this actually means technology). In a sense, OpenAI being at the forefront of our AI technology advancement is crucial to the equation. To hinder that progress through copyright is, in my mind, unconstitutional.
Wishful thinking. Equally, NYT's right to copyright over its material, in order to have a functional press, is enshrined in the constitution. Anyone threatening that copyright could be acting unconstitutionally.
I think we all agree that no one is entitled to “progress of science” at any cost - as a straw man, killing hundreds of newborn babies for scientific research is not great - so we use ethics and the legal system to find the line of what’s acceptable.
I don’t know exactly what NYT is asking for here, but the two options aren’t unconsented training vs nothing at all. NYT could license, for a fee, its content to OpenAI. It’s pretty common for scientists to have to pay for materials!
If I create a news website where I write articles in the following way:
- Read 20 different news websites and their story on the same event/topic
- Wait an hour, grab a cup of coffee
- Sit down to write my article; from this point on I never open any of the 20 news websites, I write the story from my head
- I don't consult any other source, just write from my memory, and my memory is, let's say, not the best, so I will never write more than 10 words exactly as they appear on any of the 20 websites.
- I will probably also write something that is not correct or add something new because, as I said, my memory is not the best.
Culturally we’re taught that there is a moral component to copyright and patent law - that stealing is stealing. But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era. Copyright and patent laws exist for practical, pragmatic reasons - and seemingly they have served us well, but it’s not unreasonable to re-examine them from first principles.
> But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era.
Is there any research into how people from earlier eras thought about it? And should all laws that seemed ludicrous to someone from an earlier era be discarded? If not, how exactly do we determine the relevance of what someone from an earlier era would think about our laws?
They could but that would pretty much mean giving up the tech supremacy to China since they won't apply it. China already doesn't care much about copyright so that's not going to stop them.
I suspect it wouldn't be too hard to convince the EU though; the EU has a history of giving up rights and markets to big copyright holders even if that hurts local companies.
What will happen in this case is that large content providers will get paid directly and smaller content providers will get rolled up into a licensing bag and get small indirect payouts. For example, we might see a model where people whose books have been used get a payout proportionate to the sales of the book (perhaps), so if your book sells just a few thousand copies expect $20, but if you sell millions expect $20k.
LLMs will become more expensive and less attractive as money printers. This will screw with the business models of the direct-provision folks like OpenAI, MS and Google; MS and Google will only shed tears for money spent, while OpenAI will just not have as good an income stream until they think of something new.
What will happen is that it will all go to China and maaaybe some third world country, or people will run their own models from shady sources.
So you will use a Chinese AI that spies on you, or you will use some shady service from a shady country (that will play cat and mouse like torrent sites), or most likely you will run your own model if you are computer literate and no model if you are not.
Actually, most models are so lobotomized already that it's probably better to run your own anyway, as long as you have a good enough computer.
This is an odd discussion, as LLMs are really bad for authoritative information distribution; they are really untrustworthy! But if that changes and they do start to be a reliable general information assistant, then I see things more like Spotify vs. Napster. I would prefer a greater diversity of sources, and the indirection is going to require more accountability than with music, but somewhat like that.
I'm sure that's what they want, but I'm not sure that's what the outcome will be. What if they want to charge a prohibitive amount of money for their content?
Dunno - but my guess is that the price will be what the market will bear...
I think Spotify vs Napster is a good example; content creators in news (the journalists) are already in a hard place (vs. successful rock stars pre-internet). I think that the news providers are rather like the music labels.
A huge win for countries with lax copyright laws. These things aren't going away, the worst case scenario would be exactly that scenario playing out - then China (or some other peer to the US's tech sector) just continues developing them to achieve an economic advantage. All in addition to the obvious political implications of AI chatbots being controlled by them.
The LLM genie is out of the bottle: an unfavorable court ruling in a single country isn't going to stuff it back in.
Do LLMs really give an economic advantage though? I've mostly seen them used to write quirky poems and bad code. People are scrambling to find use-cases but it's not very convincing so far.
On the other hand, if LLMs are used to "launder" copyrighted content and, accepting the premises of copyright law, this has the effect of reducing incentives to do creative work, that has obvious negative implications for economic productivity.
> I've mostly seen them used to write quirky poems and bad code.
Assuming this is in good faith: the ability to write code, documentation, and tests is absolutely a productivity enhancer to an existing programmer. The code snippets from a dedicated tool like copilot are of very usable quality if you're using a popular language like Python or JS.
Because the megacorps should have to pay the people creating the works they are training their multibillion/eventual multitrillion dollar systems on, and should get a nice rake to the face when they try to do an end run around it.
I have no idea what he's thinking, but if everybody in the community here had an LLM in their pocket and large orgs did not, it would at least be kind of fun.
The open source people can continue to pretend they matter in this field and large corporations like Microsoft will stop stealing everything that moves on the internet.
Loading data to which you have no rights over into your software is legally perilous, yes.
It's as easy as simply asking for and receiving permission from the data's rightsholders (which might require exchange of coin) to make it not legally perilous.
At some point the burden of carrying 100-year-old copyright/patent law will become so onerous a drag on the pace of progress that its enforcement will be antihuman.
It already is, but I don't think this is a good example. NYT has a legitimate case here. They own the material they publish, and GPT-4 is shown to be able to recall entire articles verbatim. That's a violation, clear as day.
The thing about lawsuits is that you make dozens of claims, and the court can rule in favor of some of them, and against others. The question of "is LLM training fair use?" hasn't made it to a high court yet. The court could very easily rule against everything else in the suit.
A photocopier can reproduce entire articles verbatim, yet no one calls for the destruction of all photocopiers. In fact, many legitimate legal uses of photocopiers to reproduce whole newspaper articles take place commonly by archivists, journalists, students, etc.
It is the specific use of article photocopies to circumvent the normal sale of newspapers that becomes illegal.. and even that is questionable. If I read the newspaper left out in a waiting room and it keeps me from buying that days paper, this is not a criminal act.
Fair use is something Wikipedians dance around a fair amount. It also meant I did a lot of reading about it.
It’s a four part test. Let’s examine it thusly:
1. Transformative. Is it? It spits out informative text and opinion. The only “transformation” is that it’s generative text. IMO that’s a fail.
2. Nature of the work - it’s being used commercially. Given it’s being trained partially on editorial, that’s creative enough that I think any judge would find it problematic. Fail on this criteria.
3. Amount. It looks like they trained the model on all of the NYT articles. Oops, definite fail.
4. Effect on the market. Almost certainly negative for the NYT.
You're getting mixed up. When applying the four factors, you need to individually separate all the uses. So you would need to repeat the fair use test for every alleged type of infringement. This means that the scraping from the public internet to OpenAI's dataset storage cluster is one instance where the full analysis of the four factors must take place, then the training itself, so another full analysis, then the distribution of model outputs, another one, etc.
Why so? From the point of view of the company alleging damage the separation of processes is irrelevant. It all leads to massive copyright infringement.
It is not the NYT making the claim of Fair Use, it is OpenAI.
Because fair use is an affirmative defense to each claim, not to the general accusation. So if someone sues you for "copyright infringement", broadly speaking, but then you look at the actual document and it's 4 claims based on 4 sections of U.S.C Title 17, you can raise a fair use defense to two of them and a different one to the other two, or none at all for those last ones and simply settle them, while still defending the first two.
Sure, but my original point still stands. OpenAI has a much better chance with how fair use actually works than with how you described it in your original comment.
I read about this in the Times today (and am surprised that it wasn't on HN already).
My guess is that the court will likely find in the Times favor, because the legal system won't be able to understand how training works and because people are "scared" of AI. To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is. I might say something like "I, for one, welcome our new LLM overlords". Am I infringing the copyright of The Simpsons? No.
I am guessing some technicality like a terms-of-use violation of the website (avoidable if you go to the library and type in back issues of the Times), or storing the text between training sessions is what will do OpenAI in here. The legal system has never been particularly comfortable with how computers work; for example, the only reason EULAs work is because you "copy" software when your OS reads the program off of disk into memory (and from memory into cache, and from cache into registers). That would be copyright infringement according to courts, so you have to agree to a license to get that permission.
I think the precedent on copyright law is way off base, granting too much power to authors and too little to users. But because it's so favorable towards "rightsholders", I expect the Times to prevail here.
It's extremely speculative to claim that LLM models are basically doing what humans do. There is very clearly something not right about that, because in order for a human to learn to speak and converse they don't need to imbibe the entire corpus of all written text in human history - which is basically what we're doing with these LLMs. What we're giving them is vast amounts of data, which is totally unlike how humans work. There's very clearly some gap here between what an LLM is doing and what a human is doing. So you can't use that as a basis to justify why it's ok for OpenAI to operate like this.
To put it another way, let's say I turn the dial all the way the other way, I train the worlds crappest LLM on NYT material, it massively massively overfits and all it will ever return is verbatim snippets of the NYT. Is that copyright infringement?
The core part of the argument here is actually just that OpenAI doesn't want to adhere to what the current standard is for using copyrighted material, if you want to use it and create something new with it you need to license the material. Since OpenAI's LLM isn't actually like a human it needs to license such a vast dataset that it would be uneconomical to run the business without stealing all the content.
My hard drive can - bit for bit - recall video files. If I serve them to other people on the internet without permission of the copyright holder, that’s called piracy.
Yeah, but the LLMs can't. They aren't big enough to contain every byte of every NYT article, even with the best-known compression algorithms. Rather, they pick up and remember the same patterns that humans do when they write. Authors of the articles also did that, and so the two algorithms (human writer, LLM inference) end up with the same result. (That doesn't preclude large chunks of text that are actually remembered, though. We humans have large chunks of verbatim text floating around in our brains. Passwords, phone numbers, "I pledge allegiance to the flag...", etc.)
Anyway, like I said, I don't think OpenAI will win this. Someone will produce one verbatim article and the court will make OpenAI pay a bunch of money as though every article could be reproduced verbatim, and AI in the US will be set back that many billion dollars. It probably doesn't matter in the long run; it preserves the status quo for as long as the judge is judging and the newspaper exec is newspaper exec-ing. That's all they need. The next generation will have to figure out how to deal with AI-induced job loss... and climate change. Have fun, next generation!
In general, if you perform copyrighted works you are doing copyright infringement. There are certain exceptions (personal use, education, very small fragments with proper attribution, maybe a few others) but whether you are reading it aloud from a book or performing it from memory makes no difference.
So, if you setup a service like ChatGPT but powered by humans responding real time to queries, and these humans would occasionally reproduce large chunks of NYT articles, they and the service itself would be liable for copyright infringement. Even if they were all reproducing these from memory.
Now, this is somewhat different from the discussion of whether training the model on the copyrighted data, even if it had effective protections from returning copies of it, constitutes copyright infringement in itself. I believe this is a somewhat novel legal question and I can think of no direct corollaries.
I certainly don't think we can just handwave and say "at some level, when a human reads a copyrighted work, they are doing the same thing", because we really don't know if that is true. Artifical neural networks certainly have no direct similarity with the neural networks in the brain as far as we can tell. And, even if they did, there is no reason to give a machine the same rights that a human has - certainly not until that machine can prove sentience.
Given that Harvard's President plagiarized her way into the position, how can we be sure that NYT doesn't plagiarize and take content from X and other places to quickly churn out daily news?
> To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is.
If the AI can recall the text verbatim then it's not at all the same. When we read we are not able to reproduce the book from our memory. Even if a human could memorise an entire book it's not at all practical to reproduce the book from that. The current AIs are not learning "ideas", they are learning orders of words.
Yeah the comparison to humans is silly anthropomorphising at this point.
However I am inclined to agree with them for the simple fact that putting a file into a device and letting that device reproduce parts of the file should be allowed. I mean we're already at the point where this simple right is under pressure from DRM, but people should be allowed to do whatever they want with the files they own.
Whether you can publish this output and share it with the world is a whole different issue.
Something I have wondered about LLMs and training data is the idea that the biggest content producers on the internet now have their world view and tone echoed disproportionately as part of the next big wave of technology. This is incredibly impactful (although admittedly I don't know how to turn that into a profit). Is there some long term impact of removing the New York Times from training data that means it won't be part of the LLMs corpus going forward that is unforeseen?
If they don't let AIs to be trained on a maximum of data as possible, those AIs will be less "good" than the ones trained without constraints like you will have in China or elsewhere, and people will mechanically start using the later.
Unless they engage in massive geolocation-based IP and DNS banning, forced upon all internet users and "external" users.
NYT wants to outlaw a math game created by calculating the probabilities of word groupings and of words following each other in NYT articles, along with a lot of other writings NYT does not own. The players roll the dice, so to speak, by seeding an initial string of words, and whoever comes up with the most interesting paragraph wins. This paragraph may or may not look like NYT writing, which, in the larger scheme of the collected writings of humankind, isn't particularly unique. It doesn't even have to be true. Hallucinations are an expected outcome.
If a NYT article says "Henry Kissinger was known to eat ice cream on a hot day" and our game outputs the same, it is purely by chance. It cannot be proven the output was copied verbatim from the NYT, because the fragments "Henry Kissinger was known to" and "eat ice cream on a hot day" are not unique or exclusive to the NYT.
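(To make the analogy concrete, here's a minimal sketch of that "game" as a toy bigram model in Python. The corpus file name and the seed string are illustrative, and this is nothing like how GPT is actually trained - it just shows the "probabilities of words following each other" idea.)

    import random
    from collections import Counter, defaultdict

    def build_bigrams(text):
        # Count how often each word follows each other word in the corpus.
        words = text.split()
        following = defaultdict(Counter)
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
        return following

    def play(following, seed, length=30):
        # "Roll the dice": repeatedly sample a likely next word after the seed.
        out = seed.split()
        for _ in range(length):
            counts = following.get(out[-1])
            if not counts:
                break
            words, weights = zip(*counts.items())
            out.append(random.choices(words, weights=weights)[0])
        return " ".join(out)

    # Illustrative only: any pile of text works as the "training" corpus.
    corpus = open("some_articles.txt").read()
    print(play(build_bigrams(corpus), "Henry Kissinger was known to"))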
Is the NYT claiming ownership of the weights in LLMs?
Isn't copyright tethered somehow to a notion of "expression"? That is, the same ideas and facts expressed differently are a different work?
Sure, when something is clearly derived, or just expressed in a new medium, then I'm sure it's still covered. But if it goes through an LLM and the result bears little resemblance, how can that still fall under copyright?
As you said, AI can rewrite articles, obtaining a clean-cut separation between ideas and expression. Keep the ideas, write a new text. And if you have multiple sources, the more sources you use the better, since it would make the output even more different. This approach could also check consistency and bias between sources.
"The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity.""
If we see court judgements start to go copyright owners way, we will also see a scramble from AI companies to buy the few publishers with enough data to be worth buying, and to create works for hire to replace the rest.
In the long run a copyright ruling like that will be a boon for OpenAI and all other players with deep enough pockets to do so, and massively harm everyone else who will suddenly find it far harder to build models legally.
So that would mean articles from the 1920s, provided that the authors of those articles have been dead for 70 years, or longer in some other countries.
This argument that the LLM is learning seems slightly flawed when you consider that other experts in the field consider it more like lossy compression. If it's lossy compression that's really happening here, then you can understand the copyright argument. It'll be interesting to see how this plays out; lots of new ground being broken.
Why hasn't the Times also sued the Internet Archive? They've tried to block both the Internet Archive [1] and Open AI [2] from archiving their site, but why have they only sued OAI and not IA? The fact that they haven't sued IA which has comparatively little money would seem to indicate that this is not about fair use per se, but simply about profit-seeking and the NYT is selecting targets with deep pockets like OAI/MS.
Copyright doesn't stop the collection of content, it stops the copying, processing, and redistribution of content. The Internet Archive acts as a library, so it's widely accepted as fair use when it makes collections of webpages available.
OpenAI's distribution is materially different to that of a library, so it's not a like-for-like comparison.
One of the main tests of copyright law (at least in the US) is whether the entity distributing is _selling_ the copied/derivative work. It's unambiguous that OpenAI is selling something akin to derivative works, which is why NYT feels it can go after this claim. Meanwhile, IA's operations don't create sales or generate profits, so while NYT's legal team may be able to establish that copies have been distributed, without the _sale_ aspect of the infringement, judges aren't guaranteed to side with NYT in a legally expensive PR nightmare.
What's wrong with that? If I were the NY Times's lawyers, that's what I would advise. What would it serve to bankrupt the IA? They can't pay anyway.
These are corporations enforcing their rights against one another.
There is nothing wrong with profit seeking from your copyright. That's literally their entire business model...they publish copyrighted content which they sell for a subscription.
OpenAI and others could easily have negotiated a licence instead of just using the data. They bet that it would be cheaper to be sued; let's find out if they bet correctly.
Tangentially that's what Apple did with the sensor in their watch, it doesn't always pay off.
> What would it serve to bankrupt the IA, they can't pay anyway?
It would serve the termination of the infringement.
My point is that the Times doesn't particularly seem to care about infringement per se; they care about getting their slice of the cut from that infringement.
It's like if a video game company or a movie company only attempted to sue illegal downloaders who had a certain net worth.
> It's like if a video game company or a movie company only attempted to sue illegal downloaders who had a certain net worth.
I mean yeah, no one's gonna bother trying to squeeze money out of Joe Schmoe with 10 bucks in his bank account over some pirated movies. If a company with billions and billions of dollars like Netflix started pushing out pirated movies instead, then obviously they'd be sued into oblivion, as they should be.
I think that the moment you start making big money from someone else's business is the moment that they get riled. That, and when you really hurt their business. I suspect that the NYTimes thinks that IA is damaging them on the order of (possibly) $100k p.a., and that OpenAI is making on the order of $10M's from their content (and possibly doing some damage as well). It's an easy commercial decision to ignore one and go after the other - especially as going after IA would create some backlash as well.
Looks like this is a case of Media vs Tech which might be solved by the courts using past paradigms, but should really be addressed by legislation specific to this situation. The difficulty for the media companies, at least in the US, is that both major political parties see the media as the enemy. The left might be a bit more positive about the media, but overall they still see the media as something owned by wealthy elites, suppressing knowledge of the harm the powerful inflict on the weak and powerless. Over on the Tech side of things, one party sees Tech as wholly owned by the other side of the political divide. Over on that side, things are relatively (but not completely) friendly, so my guess is Tech will end up winning simply because it has more friends in the political realm than the Media does.
This wave is growing. Just cannot see how the big LLM players are going to get round this without paying big licence fees to content creators. Feels a bit like the torrent to Spotify moment, but for _all_ content, not just music. How they will manage the licensing model is beyond me, it’s going to be very easy for someone to sue these companies, but very difficult for the companies to calculate, attribute value and payout individual creators that contributed a tiny fraction of the training data. Surely this will make it very difficult for them to keep a business model working to a level their VC backers need to warrant even a fraction of their valuations.
In my head, I like to think of web-crawler search engines/search-engine databases and LLMs as being somewhat similar. Search engines are OK if they just provide snippets with citations (URLs), and they would be unacceptable if they provided large block quotes that removed the need to go to the original source to read the original expression of more complex ideas.
A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.
I think it's different. LLMs can solve problems. Part of that problem-solving ability comes from training on content completely unrelated to the problem, such as NYT articles. GPT-4 doesn't have to spit out NYT articles verbatim to have benefited from NYT articles. It draws on NYT articles for every query.
Let's say I'm an academic; if my research, note-taking, and paper-writing skills lead to fair-use, cited quotations where applicable, general knowledge not identified, and the creative aspects and unique conclusions creating the intriguing part of my work, that's copacetic. If I spit out (from memory, mind you) verbatim quotes and light rewordings of NY Times articles, that's not; "I don't remember where I got that material" doesn't cut it. As for my reading the NY Times every day for years because I judge it to be more literate and accurate than other sources - it has undoubtedly informed my thinking and style, but I don't need to acknowledge that.
If I use ChatGPT as a research tool, as long as it lives within the same parameters that I have to live within, I don't see a problem with its education/learning.
I understand that the NYTimes would like a slice of anything that comes out of the GPT but I'm talking about what seems reasonable. People who share their copyrighted material do not own all of the thinking that comes out of it; they own that expression of it, that is all.
Will AI destroy the economics of "writing" the way the web has killed newspapers? Perhaps, and perhaps we'll all benefit from and need a new model, but killing the new to keep the old on life support is not the way.
You're not replicating yourself millions of times and selling yourself for $20/month. If you are, then NYT might sue you too.
I'm not saying LLMs are by default illegal. All I'm saying is that there is some merit to why NYT and content companies want a piece of the pie and think they deserve it.
The NY Times benefited in the past from technologies that led to widespread distribution of the Times, putting competitors out of business and concentrating talent at the Times. Nobody is stopping them from producing new editions of the newspaper, their core business. People now have technologies that help them "remember" what was salient in back issues of the Times. Such is progress.
Not that it would solve this, but how hard would it be for ChatGPT or other products to cite the sources used in a response? Is that difficult to capture and tag to 'knowledge' within an LLM?
It could be a best of both worlds type situation if LLMs cited sources and linked to the source itself.
Isn't that what happened with Google News's home page? I seem to recall that when Google took it away in some markets, at the behest of the news orgs, they quickly reversed course as their traffic plummeted.
This is not possible. There is no database of sources inside an LLM. Just like the knowledge in your brain does not have sources attached.
For example, you referenced "what happened with Google News's home page". Could you give me your source? You could probably search for some suitable article as a reference, but you don't know the source from memory.
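(For what it's worth, the systems that do show citations today generally bolt a retrieval step onto the model at query time rather than pulling sources out of the weights. A rough, hypothetical sketch - `call_llm`, the scoring, and the data shapes here are all stand-ins, not any real product's API:)

    from dataclasses import dataclass

    @dataclass
    class Doc:
        url: str
        text: str

    def call_llm(prompt):
        # Stand-in for whatever model API is actually used.
        return "(model-generated answer)"

    def retrieve(query, index, k=3):
        # Toy relevance score: count shared words. Real systems use embeddings/search.
        q = set(query.lower().split())
        return sorted(index, key=lambda d: -len(q & set(d.text.lower().split())))[:k]

    def answer_with_citations(query, index):
        docs = retrieve(query, index)
        context = "\n".join(d.text for d in docs)
        draft = call_llm(f"Answer using only this context:\n{context}\n\nQ: {query}")
        sources = "\n".join(f"[{i+1}] {d.url}" for i, d in enumerate(docs))
        return draft + "\n\nSources:\n" + sources

The citations come from the retrieval index, not from inside the model, which is consistent with the point above that the weights themselves carry no source database.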
I think Apple has really got ahead of this game: early deals to pay for AI training data/content. I need to do some research but I think Anthropic also does this.
After a year of largely using OpenAI APIs, I am now much more into smaller "open" models, and I hope the major contributors like Meta/Facebook are following Apple's lead. Off topic, but: even though I find the smaller "open" models much less capable, they capture my imagination and my personal research time.
Not sure if they're ahead, but I think it was smart not to ship anything LLM-like until the regulations get made and first movers test the waters.
Casey Newton has been saying all year that these things will be awesome once we can unleash them on our own corpus of data safely. “Siri” already does a great job digging through my photos and picking the good memories. I can let my camera roll become a visual junk drawer now.
Do the same for my email. Make "Find" the tool we always wanted it to be. I don't care if I'm conflating LLMs/AI with other smart tech.
I think there is a national security aspect to ML models trained on copyrighted data. Countries that allow it will gain a superior technological advantage and outcompete those who disallow training on copyrighted material. I personally believe training LLMs on copyrighted data is copyright infringement if the models are deployed in a way that competes with the copyright holder. But that doesn’t necessarily mean it’s something we should disallow.
You can say the same for any legal enforcement, like respecting patent or copyright law, or making Champagne outside France. Yet the sky isn't falling, given this reality, across so many legally protected industries. Maybe the markets such an industry might offshore to are too small and insular to be very significant, and are probably language-bound, making English models less relevant compared to native-language models.
The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.
It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity’s intellectual output. But for your own training data, you can use everything except for their product, conveniently for them.
Under existing conditions, an AI news site seems like a good investment idea. Its AI could read all relevant news sources and retell and republish them in its own articles. It could even have its own AI editors and contributors. I cannot see how human news companies could compete.
That style of journalism is nearly dead. True on-the-ground investigative journalism is hardly done today; most of it is just reporting on existing public information releases. You don't have to be at the presser when everything the police chief says will be put in an online transcript.
I am sure it could easily rephrase the articles to tell them without quoting any real or verifiable sources. Many human news companies often do it too.
This, or a lawsuit like it is going to be the SCO vs IBM of the 2020's, to wit: a copyright troll trying to extract rent, with various special interests cheering it on to try and promote their own agenda (ironically it was Microsoft that played that role with SCO). It's funny how times have changed and at least now a louder group seem to be on the troll's side. I hope to see some better analysis on the frivolity of this come out. There may be some commercial subtlety in specific cases that doesn't depend on scraping and training, but fundamentally using public internet data for training is not copying, is fair use, and is better for society as a whole than whatever ridiculous alternative might be proposed.
edit: I'm speaking about training broadly capable foundation models like GPTn. It would of course be possible to build a model that only parrots copyrighted content and it would be hard to argue that is fair use.
>> There may be some commercial subtlety in specific cases that doesn't depend on scraping and training
The key is to stop calling it "training" and use "learning" or just "reading".
The argument from NYT will probably be that LLMs are just a fancy way to compress or abstract information and spit it back out. In which case "training" seems to support their case?
I don't recall the source, but when people read, they typically only remember 20% of what they read (or heard?). Machine training encodes much more than 20%, so it is much closer to copying than training. Now the emergent abilities that come from this could be considered learning and dare I say imagination (which is the opposite of copying).
>It's funny how times have changed and at least now a louder group seem to be on the troll's side
Because for many people, their views on current events are whatever the "thought leaders" working for the NYT and similar publications tell them to think.
This is theft and monstrous profit from theft. For actual justice this should be a class action suit of the world vs. OpenAI/Microsoft and the financial consequences should be company-ending for OpenAI. Otherwise, you have incented everyone in the AI industry to steal as much as they can for as long as they can.
What if you were one of the people who read the Times from cover-to-cover every day and seriously tries to remember as much as possible because you consider it a trustworthy reference source?
And if you were called upon to solve a problem based on knowledge you consider trustworthy, what would you come up with?
What if you were even specifically directed to utilize only findings gleaned from the Times exclusively?
And what if that was your only lifetime source of information whatsoever for some reason?
But then imagine that because human memory is not able to keep all that information straight, you made copies of all those newspapers.
And then you started charging people for your knowledge.
And then imagine that as part of your knowledge service, you would copy snippets from the times word for word and give that to your clients without citation and pass it off as your own.
The Copilot screenshot they gave in the Ars Technica article, as well as many of the screenshots in the NYT article, seem like they're actually displaying correct behavior for browsing the web.
In these cases the system is more or less acting as a user agent (browser). AFAICT the NYT server actually gave that data to the user agent when it asked politely (200 OK, presumably). The user agent then displayed it to the user, which the user agent may do in any way it deems fit or appropriate.
There's only been one or two cases where this has gone against the user or user agent, in very specific circumstances. The server can, e.g., say 403 Forbidden whenever it likes, so if it returns 200 OK, what's a user agent to do other than take it at its word?
The only twist is that this user agent is now Imbued With AI (tm)(r)(c) . I don't think that really makes a difference here. If that's all this is, then it's more related to legal fights over certain ad-blockers or readability, which have similar functionality.
* https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... , eg. page 45; I mean it says "Model: Web Browsing" at the top, and "Finished browsing" right on the page. That particular subsystem is now integrated, so the UI/UX is different now, but IIRC the link was in the pulldown?
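(A minimal sketch of the user-agent point above, assuming a plain HTTP client; the URL is illustrative. The server is free to answer 403 instead of 200 whenever it wants:)

    import urllib.request
    from urllib.error import HTTPError

    req = urllib.request.Request(
        "https://example.com/some-article",  # illustrative URL
        headers={"User-Agent": "MyBrowser/1.0"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            # 200 OK: the server chose to hand the content to this user agent,
            # which may then display it however it sees fit.
            print(resp.status, resp.read()[:200])
    except HTTPError as e:
        # e.g. 403 Forbidden: the server said no, and that's the end of it.
        print("Refused:", e.code)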
Hmm, on page 33 of the same complaint they show that they can get GPT to quote parts of an article from memory. Giving people single paragraphs need not be illegal per se. Very specifically, if you ask it for more than a paragraph at a time:
> ' I'm unable to display the entire text of "Snow Fall: The Avalanche at Tunnel Creek" by John Branch, as it is a copyrighted work. However, you can easily access the full story online. It was published by The New York Times and is available on their website. The story is notable for its engaging multimedia format, including text, images, and interactive elements.'
Specifically, they go out of their way to lead GPT on, asking for several paragraphs in a row.
It's pretty clear that GPT is an avid reader of the NYT, so in that particular case we're going to have to see if OpenAI's fair use defense for training holds.
(ps. in the current GPT-4, it's actually somewhat tricky to even get to the point above at all. They have probably been improving AI instructions)
Two not-so-subtle paragraphs about the "partnership" between Microsoft and OpenAI:
> 15. Microsoft Corporation is a Washington corporation with a principal place of business and headquarters in Redmond, Washington. Microsoft has invested at least $13 billion in OpenAI Global LLC in exchange for which Microsoft will receive 75% of that company’s profits until its investment is repaid, after which Microsoft will own a 49% stake in that company.
> 16. Microsoft has described its relationship with the OpenAI Defendants as a “partnership.” This partnership has included contributing and operating the cloud computing services used to copy Times Works and train the OpenAI Defendants’ GenAI models. It has also included, upon information and belief, substantial technical collaboration on the creation of those models. Microsoft possesses copies of, or obtains preferential access to, the OpenAI Defendants’ latest GenAI models that have been trained on and embody unauthorized copies of the Times Works. Microsoft uses these models to provide infringing content and, at times, misinformation to users of its products and online services. During a quarterly earnings call in October 2023, Microsoft noted that “more than 18,000 organizations now use Azure OpenAI Service, including new-to-Azure customers.”
Summarizing the article: The most damning thing here is the "ChatGPT as a search engine" feature, which appears to run an agent which performs a search, visits pages, and returns the best results.
In doing this, it is bypassing the NY Times paywall, and you can read full articles from today by repeatedly asking for the next paragraph.
Let's say OpenAI's models were trained on all the Windows source code (without approval from MS).
GPT could pretty much replicate the Windows code given even a not-that-clever prompt from any user: "Write an OS CreateProcess function like the Windows 10 source code would have."
It would infuriate MS to put it mildly, enough to start a lawsuit.
I know the license to the MS source code and NYT articles aren't the same.
Isn't the fundamental issue here that the NYT was available in Common Crawl?
If they didn't want to share their content, why did they allow it to be scraped?
If they did want to share their content, why do they care (hint: $88 billion)?
Or is it that they wanted to share their content with Google and other search engines in order to bring in readers but now that an AI was trained on it they are angry?
What wrong thing did OpenAI do specific to using Common Crawl?
Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?
Is it legal or not to scrape the web?
If I scrape the web, is it legal to train a transformer on it? Why or why not?
To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.
Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?
> "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.
Mimics its expressive style. Oh golly the robots can write like they're smug NYT reporters now--better sue!
It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl[0]. Wasn't GPT-4 trained far before August?
If you read the complaint, it explains this pretty well. The use of copyrighted content by search engines is fundamentally different from the way LLMs use that same content. The former directs traffic (and therefore $$) to the publisher, the latter keeps the traffic for itself.
The legal misconception I want to flag in your logic is the notion that all uses of the Common Crawl are equally infringing/non-infringing. If you use the Common Crawl to create a list of how often every word in English appears on the internet, that’s unquestionably transformative use. But if you use it to host a mirror of the NYT website with free articles, that’s definitely infringement. The legality of scraping is one matter, and the legality of what you do with the scraped content is quite another.
> If I scrape the web, is it legal to train a transformer on it? Why or why not?
At no point did I say anything about hosting a mirror of the NYT website, with free articles. Obviously. Because OpenAI didn't do that. Some NYT lawyer tried to get ChatGPT to write a NYT article. Maybe first they should have actually done a Google search and shut down some of the actual content farms which simply copy NYT content such as [0]. But instead, we get this.
Sad to say but I would believe a hallucination from OpenAI before I would believe anything that comes out of the NY Times. I mean the confidence interval for the NY Times is what again?
I think this could be a shakedown. They want money/licensing from OpenAI the way Apple was offering news companies recently. High probability this is settled out of court.
I asked an LLM to summarize the 69 page lawsuit. It does a decent job. Didn't infringe on any copyrights in the process :)
Here is a summary of the key points from the legal complaint filed by The New York Times against Microsoft and OpenAI:
The New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI alleging that their generative AI tools like ChatGPT and Bing Chat infringe on The Times's intellectual property rights by copying and reproducing Times content without permission to train their AI models.
The Times invests enormous resources into producing high-quality, original journalism and has over 3 million registered copyrighted works. Its business models rely on subscriptions, advertising, licensing fees, and affiliate referrals, all of which require direct traffic to NYTimes.com.
The complaint alleges Microsoft and OpenAI copied millions of Times articles, investigations, reviews, and other content on a massive scale without permission to train their AI models. The models encode and "memorize" copies of Times works which can be retrieved verbatim. Defendants' tools like ChatGPT and Bing then display this protected content publicly.
OpenAI promised to freely share its AI research when founded in 2015 but pivoted to a for-profit model in 2019. Microsoft invested billions into OpenAI and provides all its cloud computing. Their partnership built special systems to scrape and store training data sets with Times content emphasized.
The complaint includes many examples of the AI models reciting verbatim excerpts of Times articles, showing they were trained on this data. It also shows the models fabricating quotes and attributing them to the Times.
Microsoft's integration of the OpenAI models into Bing Chat and other products boosted its revenues and market value tremendously. OpenAI's release of ChatGPT also made it hugely valuable. But their commercial success relies significantly on unlicensed use of Times works.
The Times attempted to negotiate a deal with Microsoft and OpenAI but failed, hence this lawsuit. Generating substitute products that compete with inputs used to train models does not qualify as "fair use" exemptions to copyright. The Times seeks damages and injunctive relief.
In summary, The New York Times alleges Microsoft and OpenAI's AI products infringe Times copyrights on a massive scale to unfairly benefit at The Times's expense. The Times invested heavily in content creation and controls how its work is used commercially. Using Times content without payment or permission to build competitive tools violates its rights under copyright law.
A Second LLM's take on this lawsuit can be found below. I'd love to see OpenAI address these complaints publicly and without incurring any additional damages to NYT.
The document is a legal complaint filed by The New York Times Company against Microsoft Corporation and various OpenAI entities, alleging copyright infringement and other related claims. The New York Times Company (The Times) accuses the defendants of unlawfully using its copyrighted works to create artificial intelligence (AI) products that compete with The Times, particularly generative artificial intelligence (GenAI) tools and large language models (LLMs). These tools, such as Microsoft's Bing Chat and OpenAI's ChatGPT, allegedly copy, use, and rely heavily on The Times’s content without permission or compensation.
Nature of the Action:
The Times emphasizes the importance of independent journalism to democracy and claims its ability to continue providing this service is threatened by the defendants' actions. The complaint argues that the GenAI tools are built upon unlawfully copied New York Times content, which undermines The Times's investments in journalism.
Defendants:
The defendants include Microsoft Corporation and various OpenAI entities, such as OpenAI Inc., OpenAI LP, and several other related companies. The Times alleges these entities have worked together to create and profit from the GenAI tools in question.
Allegations:
1. Copyright Infringement: The Times claims the defendants copied millions of its copyrighted articles and other content to train their GenAI models. This training allegedly involves large-scale copying and use of The Times’s content, emphasizing its quality and value in building effective AI models.
2. Unlawful Competition: The Times argues that the defendants' GenAI tools compete with it by providing access to its content for free, which could potentially divert readers and revenue away from The Times.
3. Misattribution and Hallucinations: The Times asserts that the defendants' tools not only unlawfully distribute its content but also generate and attribute false information to The Times, damaging its credibility and trust with readers.
4. Trademark Dilution: The complaint includes claims that the defendants' use of The Times’s trademarks in connection with lower-quality or inaccurate AI-generated content dilutes and tarnishes its brand.
5. Digital Millennium Copyright Act Violations: The Times alleges that the defendants removed or altered copyright management information from its works, which is prohibited under the law.
Harm to The Times:
The Times claims it has suffered significant harm from these actions, including loss of control over its content, damage to its reputation for accuracy and quality, and financial losses due to diminished traffic and revenue.
Demands:
The Times seeks various forms of relief, including statutory damages, injunctive relief to prevent further infringement, destruction of the infringing AI models, and compensation for losses and legal fees.
Overall Summary:
This legal complaint represents a significant clash between traditional media and emerging AI technology companies. It underscores the complex legal, ethical, and economic issues arising from the use of copyrighted content to train AI systems. The outcome of this case could have far-reaching implications for the AI industry, content creators, and the broader digital ecosystem.
Worth noting that - at least in the screenshot - this shows an example of browsing functionality used to get around paywalls, not that the model itself was trained on the articles or can really reproduce them.
IIRC this was the reason why the browsing plugin was disabled for some time after its introduction - they were patching up this hole.
> All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall.
Not to be pedantic, but NYT has the least robust paywall I've ever seen. Just turn on reader mode in your browser. Simple. I get that it's still trespassing if I walk into an unlocked house, but NYT could try installing a lock that isn't made of confetti and uncooked pasta.
It's interesting to me the ambiguous attitude people have to reproducing news content. Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim.
And this seems to be tolerated as the norm. And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.
I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP.
And the reason I bring this up, is that it seems like Open AI has the same attitude: scraping news articles is OK, or at worst a gray area, but what if they were also scraping, for example, Netflix content to use as part of their training set?
> I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP
As you noted it is not the norm to post pirate links here for IP other than news articles, but that doesn't mean that a lot of people think it is not OK to pirate those other forms of IP.
In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").
The impression I've gotten is that piracy of nearly everything is widely felt to be OK here. Information wants to be free, yada yada.
About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.
> In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").
I’ve read and participated in many such threads and I’ve literally never seen this take. Often what I see is complaints about having to learn different UI for different services/apps, no offline, ads injected into paid services, having to figure out which service a show is on, and generally terrible UI you can’t change/fix.
I don’t think I’ve ever really seen someone use the argument “yes it’s great today but they might charge more later”. Not saying people haven’t said that but it’s far from the main thing people say in my experience.
People used to leave newspapers in the trash, on the train, all over the place. Anyone could pick them up and read for free. I think it's reasonable for folks to carry this attitude into the digital age. People feel like news is something to share, it's not the source of creative expression, it's facts and as such we feel entitled to know the facts about our world and what is happening that might affect us.
While I'm well aware I'm being pedantic, my brothers and I would share the comics while my parents kept the news, up to 4 of us consuming 1 paper at a time. Realistically, the reading limit was due to the physical properties of the object and not an inherent property of information to be consumed through one avenue at a time.
The news on a website is paid for by someone, else they would not be in business. And I only have one screen; I don't share it. The difference is physical vs. digital copies. A physical copy costs $.10, a digital copy costs $.0000001 (made-up numbers); the business can absorb the loss of numerous digital copies before hitting the cost of one copy of the physical paper.
The problem is really that their business model sucks. They are working with fewer and fewer advertisers and much more competition and expecting business like they had before. And so we have a business that is attempting to fix itself with paywalls which don't work 100% of the time, but good enough to get the found newspaper analogy.
BS, the reason we have sensational opinion pieces goes back before the internet. People are bored with mundane lives and love drama. Drama sells; 90s newsrooms found this out, and cable news outcompeted established real journalism. The rest is a race to the bottom.
The internet simply exacerbated this as anyone could publish news on an equal platform to the big boys. Then we get paid-per-click, and that drives click-bait.
Stealing information absolutely is not responsible for that. People pay for junk, and that's the reason. We don't eat junk food because it's given away.
This seems very false to me. Spotify is the prime example. They offer a good product that covers 100% of my needs at a reasonable price. If that were an option for, say, UFC or engineering books, you bet I'd be subscribed. But being forced to read through some crappy reader software when I need the book source to take annotations in another program doesn't work, so here we are. Same with the absurd pay-per-view business model of UFC.
For books, if it's a client reader software frustration, then you should still buy the digital version and then you can pirate the PDF book and use as desired within the constraints of copyright law (e.g. don't go sharing the PDF). That way you get the client you want but you still paid the content creator. But to use the argument, "oh, I don't like their client so I'm going to not pay them" is BS.
For UFC, your complaint is you don't like their pricing. The whole point of copyright is to give someone the monopoly to control pricing so they can use that pricing power to incentivize them to create the product in the first place. Similarly to patents. Thus, complain about the format things are delivered in all you want (like the client) but pricing is inherent to copyright or patents for good reason. You are now just arguing that you as a consumer should be able to pirate if you don't agree with pricing. And that's ludicrous.
In that case, just read a news article about the event. Copyright doesn't cover facts, only creative expression. So a news article covering the facts of the UFC fight is able to be published without the consent of the copyright holder. Think of the digital video of the fight almost like buying a ticket to the fight. You're saying you should just be able to sneak into the fight and watch it for free without any justification for you're doing so.
Finally, you can also watch other people's videos of the fight that THEY recorded on social media as other sources of the fight information. But if you want the recording with all the right angles, coverage, etc, it clearly has value to you over written recaps or social media coverage. And you are just arguing over price, which they are the copyright holder have the right to set the price.
The problem with buying the crappy DRM version is that it provides no incentive to the publisher to change. I have thought about this long and hard, but ultimately the only way Spotify came about was because nobody bought the terrible DRM'd music the labels wanted to foist on us. We need to inflict the same pain for books.
Personally, I think it would be preferable to donate the same amount to the Books Trust or your local library.
> The problem with buying by the crappy DRM version is that it provides no incentive to the publisher to change.
Then don't consume it and don't buy it. If you stop paying the abusive publisher, they'll be forced to change their policies.
The fact that you don't want to fund what is admittedly a rather abusive industry does not magically make it right to consume other peoples' work for free. That's theft-adjacent. You're not entitled to any piece of entertainment without paying for it.
This is also along the lines of how I think about things. If you make it convenient enough (compared to the alternative of paywall bypass or piracy) and provide enough overall/general value then I'm happy to subscribe. At the point where the experience degrades, or seems beyond the point of what one person could reasonably subscribe to, I basically just give up.
Spotify hits this sweet spot where one subscription delivers almost all the music you'd want to listen to. Steam hits this for games where a couple clicks can play and launch almost any game with minimal hassle. Netflix mostly used to hit this, but most of the current streaming stuff feels overpriced if you want to get all content (unbundled cable bundle). News kind of feels similar to streaming where its unbundled, and there's a lot of interesting content out there, but there's no way I'm subscribing to 15 different newspapers, especially random local ones for cities I don't live in. If there was a news bundle subscription for a reasonable price I think I would pay for it.
I suppose part of the challenge here is that music and video content holds value much longer. Studios can invest in music and video content and see a return from the catalog over a long period of time as more enduring hits are produced and the duds fall away. But with news, they have to make the money on it now because yesterday’s news isn’t worth much no matter how expertly crafted.
> About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.
The GPL was specifically written to lock code out of the proprietary realm, so if you hate copyright[0] you'll hate people using it as intended.
[0] To be clear, I know of few who actually like copyright. Tolerate it? Use it as needed? Sure. The only people who actually defend the current broken-ass system are large media companies which are built to optimally exploit it.
> "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change"
Gonna gamble and call bullshit on this.
My speculation: the most popular reason HN'ers give for pirating is that they literally cannot get the content otherwise.
2nd most popular: it is such a pain either to purchase the content or to get it to run on bog-standard software (like Firefox/Linux/etc.) that otherwise-paying fans are driven to whatever the current equivalent of BitTorrent is.
In fact, I don't believe I've ever seen a justification for using bittorrent or whatever due to what someone's favorite streaming service might do in the future. I'm assuming you saw at least one based on what you wrote-- care to give a link?
>"subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change"
I'm not saying you've never seen anyone make an argument roughly like that, but I will certainly say that it is not at all representative of the argument that I see made. Complaints usually have to do with current behavior of the platform or the wider streaming ecosystem.
> In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").
If this is true, it should be easy for you to link to an example. Could you do so?
People are understandably angsty about someone stealing credit. A NYT article is going to be a NYT article, not laundered around and presented as someone else's work.
Plus, there's the angle of enshitification and ads being injected into a paid service, and so on.
Yeah, I don’t judge people for pirating or ad blocking, but the ludicrous justifications do get me - quite the entitled mental gymnastics. They remind me of bitcoin people trying to explain how mining is good for the environment.
1) Ycombinator could not tolerate HN becoming a site known for sharing IP-law-violating content. And the people who come here by and large are smart and socialized enough to implicitly understand why.
2) At the same time, a large number of folks here mostly wink and nod at that sort of consumer infringement. And there's a society-wide bias towards "things like news are less protected", so that gets to slide.
3) But people also have a need to tell consistent-seeming stories about how things work, thus the mental gymnastics.
It ends up being similar to trying to explain why people pretend to be prudish innocents about sex. It largely reduces to "a small subset of the population goes sufficiently ballistic about what I consider to be relatively trivial stuff as to make it not worth fighting over, even if I find that to be ridiculous."
There are a lot of different versions of this that become so normalized it can be hard to notice.
Historically newspapers leaned more on competition law than copyright, because their pages are supposed to be filled with non-copyrightable facts.[1] Copying part, but not all, of a factual article, significantly after the relevant event, was considered to be a promotion (not unfair competition) and a nice thing to do for the journalists. Things change, people lose sight of the original principles.
> their pages are supposed to be filled with non-copyrightable facts
This is rather inaccurate. A fact is Hitler invades Poland. You're right, nobody can copyright this idea, as it is just a fact.
However, if I then write a 500-word article describing the scene of Hitler invading Poland, with short quotes from some civilians there, etc., that particular arrangement of ideas and words is under copyright.
AP can't go and sue INS for just reporting the fact that Hitler invaded Poland, but if INS takes a whole article word for word and reproduces it, that's still a violation of copyright. The actual printed words of the news have always had copyright.
The WSJ can't claim copyright on the markets going up yesterday. They can claim copyright on something like "After the bell rang in the NYSE, the tech industry ticked up 1.2% over last week. Meanwhile the whatever market took a hit of -0.5% ending the quarter slightly lower than our analysis expected. Blah blah blah..." If Investor's Business Daily wrote a different article that also talked about the markets ending up at the end of the day, that's not a violation of copyright. If they literally write "After the bell rang in the NYSE, the tech industry ticked up..." then they're violating WSJ's copyright. This was true before and after International News Service v Associated Press.
These days most news is mixed with analysis [1] (which is often biased). I wonder if part of the reason for this shift is that analysis is copyrightable. It also seems like the number of opinion articles is ever expanding [2], though I don't have any hard numbers on that.
> Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim. [...] And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.
If the story was linking directly to the "book, TV show, movie, video game, album, comic book, etc", and the link only worked for some people while others randomly got a login request or similar, you'd also see the top comment being a link to an archived version which avoids the login screen. That is: the main difference is that the archive link has the exact same content as the link submitted in the story, only bypassing the login screen that some people see. And the only reason the archive site has the content is that it didn't get the login screen; if everyone always got the login screen, what you would see on the archive site would be the same login screen.
> the archive link has the exact same content as the link submitted
No, articles are updated as new information comes in, retractions are made, etc. Especially breaking news (the type that would reach the top of HN). The archived versions are outdated.
> others randomly got a login request
It's not random, you get a number of free articles before the paywall appears ("soft" paywall).
The paywall is removed entirely for some topics/stories, especially matters of public health (common during the pandemic).
> the only reason the archive site has the content is that it didn't get the login screen
No, it's because they don't block archive crawlers, and prefer people bypassing the paywall and reading news at NYT. Hopefully users find the content valuable, and some of them subscribe as a result.
I don't believe that is fully correct. The general policy here is that you cannot link to something that is paywalled unless that site plays the game of allowing crawlers but not actual human eyeballs. In the latter case the link is allowable because there are ways around it that the site owners allow.
Sure. It's just curious to me that news articles have a pirated-copy link as the de facto top comment, but link submissions to, for example, books for sale on Amazon don't have a link to Anna's Archive or equivalent.
I think the archive of an article is more preservation of history and maintaining records of events which often disappear if not archived. The number of threads referencing articles which are defunct is always increasing. A book or movie or original content on the other hand will continue to hold its own commercial value so reproducing it is more akin to an actual loss for the license holder.
Definitely a grey area when that content is then used to train models though.
I would say 9 times out of 10 it's to get around the paywall and absolutely not some higher moralistic preservation of history.
And everything is a grey area, determining the line is the existential purpose of these court cases.
We've been here before with hyperlinking, then indexing and then linking with previews and the Canadian Facebook stuff but I think this has more standing.
If I buy a book, I get a work of literature. But if I buy a news subscription I get a series of facts riddled with advertisements. I accept the former, but I oppose the latter. I suspect I'm not the only one.
Both are just about the price you want to pay. There are various state news outlets that you're probably already paying for - NPR, PBS, BBC, CBC - depending on your region.
Funny, I don't see it as a moral thing but more a "what can you get away with" thing.
I fully assume that if I was to post a magnet link to a torrent for whatever the link was about, I would be banned.
Morally speaking, I think it's perfectly reasonable to download a copy of something and either read the relevant info for my current task or to sample it to decide if I want to buy it. I see it no different to using the library or browsing at a book store.
Perhaps once news organisations can work out how to effectively wield the DMCA hammer against archive links we'll see the practice of posting them stop.
I’d argue that morality always has a “what can you get away with” component. Things that are normalized tend to be seen as morally permissible, and things that are seen as abnormal are more likely to be seen as immoral.
The problem with the thinking in the root comment is that it implicitly assumes that people’s behavior is morally consistent, or that they even try particularly hard to behave in a morally consistent way. That’s not really how people work. If you ask them to discuss morality in the abstract, they’ll try to come up with a consistent system. But their actual behavior is mostly dictated by social norms. And if you try to pin them down on the morality of their concrete actions, they’re more likely to stretch their moral system to accommodate their actions than the other way around.
None of this is to say anything about my own opinions on news sharing or OpenAI’s situation. It’s just that someone decrying piracy but also posting/sharing/upvoting links to copies of news articles is neither surprising, nor indicative of some deeper nuance to how people view morality around IP.
In some jurisdictions (Poland, possibly whole of EU), downloading any kind of materials - be it movies, books or music - is legal. Uploading/sharing - if not between friends&family members - not so.
This tendency at Hacker News is also much more of a threat to The New York Times than what OpenAI is doing. The same goes for the blogs/Reddit/social media submissions that summarize the article and post the relevant quotes. Unlike the summary of a movie, summarizing all of the relevant parts of a news article extracts almost all the value from it and gives it away for free.
And the vast majority of people read news for its breaking content, not for its archived content from years before (and I say this as someone who has often recommended the latter, but has gotten very few people to do so). So giving people that free breaking content (either in its entirety, like on Hacker News, or in summaries, like you see all over social media) is direct competition to the news business in a way that training an LLM on an article from months or years back isn't.
Yes, and for nonfiction, it's also true that it usually depends on the original article for credibility. (If it were an anonymous poster making up a news story, most people wouldn't believe it.)
To me, there is a sense that the news, which is real information about the society we currently live in, should be available to all participants of that society. The notion of being a good citizen requires that one stays informed. Books, movies, videogames, etc. don't have that role and are more consumption goods.
It’s a difficult problem with no great answers. If you want news to be free at the point of delivery you want public service news agencies. But that means they’re owned by the government… who are frequently the target of critical reporting.
The Guardian in the UK is an example of an alternative: It is owned by a trust, which funds it.
Norway has substantial public media funding across the political spectrum, but as you point out it always comes with conditions, even if less so than the funding for the state-owned broadcaster.
Combining the two models and putting public funds into several perpetual trusts intended to provide funding from their profits at arms length from any sitting government similar to the (private) trust funding The Guardian might be an interesting alternative.
(EDIT: Norway also has its own variation over The Guardian model - the second largest media group was founded by unions but is now majority owned by the combination of two public benefit trusts)
I see the GP post about pirating news as making a very good point, while having no velleity to pay The New York Times and being OK with not reading it in general.
But I also pay for my national (public) news outlet, and their articles are available to anyone anywhere in the world. I don't know how it should work, but I wish we could get to a system where the burden to keep news outlet alive is split thinly enough to have open but viable publications around the world.
Basically the same way weather stations collaborate all over the world, and we pay for our local stations while getting access to all the forecasts everywhere.
The government (thus the people, in a so called sharing of public burden)!
For example, in Hungary there is an official news agency run by the government, with (cumbersome) free access for everybody. Of course this does provide a somewhat biased presentation of some facts, but on many topics it provides unbiased access to news for any citizen.
This is actually pretty common in Europe, often funded by mandatory fees (for some reason not branded as taxes) certain appliance owners need to pay (UK TV license, German Rundfunkbeitrag). For this fee people get access to news and cultural programmes for free via different media (radio, TV, internet).
I agree with your general point, but Hungary is probably the worst example you could have chosen from any EU country! The Orbán government is famously using it to spread propaganda and fake information at unprecedented levels.
The level of control governments exert on public broadcasting networks varies widely. Since Meloni, the RAI in Italy has been facing similar issues, but Hungary is still the canonical example of government misinformation and propaganda.
That is orthogonal to the discussion we were having. The topic was whether people should have free access to news, and how it should be financed, not the quality of that news.
People have free access to public roads all around the world, and the quality differs wildly there as well. The quality of for-profit news services also differs wildly; you might have an opinion about that of Fox News, for example, but that is also off topic in this discussion.
No it isn't an orthogonal discussion. The reason Orban wants people to have free access to his propaganda is because it directly serves his purpose. To finance it directly from sales of the media would defeat the purpose. Coupled with Orban's attack on free media it completes the picture.
> That is orthogonal to the discussion we were having. The topic was whether people should have free access to news, and how it should be financed, not the quality of that news.
On the contrary, the quality of the news is very important to the discussion. There is no point in making trash freely available to the public, after all.
The topic is a bit more nuanced, and far wider than "not fitting my favourite narrative on some topics, so it is generally and objectively trash".
Think about this: I will get mostly objective and useful reports of the flood approaching my home near the river, regardless of the narrative/interpretation they might have on some other topics, or the biased reporting on the merits of the government in handling the situation at the dams.
As for me, I'm not here to debate the political policies of particular governments; I just gave a few examples of ways to fund public access to news. This discussion is over on my part.
I would argue the people of Hungary would be better off without hatred against asylum seekers and minorities, political opponents, lies and misinformation.
People doing it in their free time because they care a lot about the subject (nowadays, with things like Twitter, it's quite possible for an independent obsessive to write a good piece on, for instance, the Ukraine war by mostly referring to open sources and public announcements by governments and corporations).
Every news source has biases. Under the paywall business model, the people who share the biases of their favored news outlets pay for them, and in exchange, they get to ensconce themselves inside a bubble free of dissenting viewpoints. This also reinforces the bias of the news outlet; if they don’t toe the line, they will lose subscribers.
Instead of paying news outlets to provide ourselves with filtered feeds of content that match our own biases, we could instead pay news outlets to produce competing streams of explicit propaganda to be freely disseminated. The overall bias and quality of the news would be largely unchanged, even if the biases were more obvious; in fact, it may even improve.
I'm of a similar mind. I take the more expansive view that everything created is part of our common property and that something like an LLM should be able to yield the summary and references to those creations. As I've said elsewhere, LLM systems might be our first practical example of an infinite number of monkeys typing and recreating Shakespeare (or the New York Times).
I understand that copyrights and patents are vehicles for ensuring a creator gets paid for their work, but they are flawed in not rewarding multiple parallel creations and that they last too long.
An LLM is just a hugely lossy-compressed version of its training data, an abstraction of it.
Much in the same way as when you read a book, your brain doesn't become a pirated copy of the text as you only store a hugely compressed version of it afterwards, a feeling for the plot, generated images and so on.
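(As a rough back-of-the-envelope illustration, taking the figures reported in the GPT-3 paper of roughly 175 billion parameters trained on roughly 300 billion tokens: that is about 175/300 ≈ 0.6 parameters per training token, i.e. well under one parameter per token seen. Storing the whole corpus verbatim at that ratio is implausible, which fits the "hugely lossy compression" framing: what the weights hold is a statistical abstraction of the text, not the text itself.)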
That's what I thought from my various readings about LLM systems. I'm guessing the kerfuffle from the New York Times and other shortsighted organizations is that copyright allows them to control how their content is used. With humans, it's simple: the text is read and misremembered. Using it for LLM training requires a different model. It should probably be a RAND fee system based on the volume of training data because, as you say, the training data is converted into an abstract form.
I agree, but nothing worth having is free. NYT and other news outlets have to ultimately pay reporters to go out into the world and do the work. The reporters are not priests, and the NYT is not a church that lives off donations and tax exemptions. They need money to operate, and you may disagree with how they try to collect that money (paywall) but that doesn't solve their funding problem.
You could subsidise news via "public service" style stipends. Much like having a government owned "independent" news service (eg the BBC) this comes with a high risk of corruption. Don't bite the hand that feeds and all that.
You could implement a much lower friction non-recurring payment system. I'd be far more tempted to drop a little money on a fixed term (5 articles, 1 day, ???) setup than a subscription.
Realistically, I am not paying for more than one long-running sub. And there are more solid outlets than that.
Good comment, it was very funny to see how people desperately try to find moral justification for pirating media A but not B. "It's apples to oranges, you see, there are less letters in the NYT article than in the book and they are rendered differently, so it is ok to pirate their work. I did nothing wrong!" :)
There is actually a difference between content being pirated by companies for humongous profit and pirating by individual human beings for free access to culture and entertainment, oftentimes for content one has already paid for but which has been rendered inaccessible by megacorporations.
As for movies/tv show/music makers, let’s just say most people in the software engineering business would look at their numbers and count their lucky stars that they are not in the movie/tv show/music business.
(It is also true that excessive copyright lengths have removed access to content that the public should have a right to).
The movie/tv show and music business can keel over and die tomorrow - it wouldn’t affect the value of art produced by humans at all. I see those more as exploitative leeches than as contributing anything positive.
If only piracy actually harmed these businesses; alas, as often demonstrated, it has zero effect on their bottom line, and if anything it increases their profits.
It seems pretty natural to me. People generally have less problem with stealing a candy bar than stealing a car. (Consider the cost to produce a NYT article vs the cost to produce a Hollywood movie). I don't think the stealing-vs-pirating analogy is perfect, but it's related.
There's no way to get your money back if you didn't like the content. If they don't want their articles to be read for free then they should keep them out of my view. And certainly not use clickbaity headlines. Information can be copied and they should accept it, or change their business/distribution model.
So if I went to a cinema and didn't like the movie, I should be entitled to a refund, right? Or if I went into a museum and didn't like the art displayed there?
If you are advocating for a free for all libertarian dystopia, well, I have some bad news for you - they never work.
> So if I went to a cinema and didn't like the movie, I should be entitled to a refund, right?
Not being able to un-see a movie and get your time and money back is one side of the coin. The other side is that information can be copied.
Both sides suck for one of the parties. There's no reason why one of them gets it their way, especially if it requires a contrived legal framework while the other way would require nothing at all.
Your personal opinion on the matter has little weight here.
It doesn't matter what you think you're paying for or should be paying for, the fact of the matter is that you're paying for the effort people put in bringing that to you. So you are, whether you want to be or not.
"Information can be copied and they should accept it" <- I was referring to this line. This basically means that OP thinks that any intellectual property should be free for everyone. This means that probably half of humanity (who are currently creating anything with IP) will have to be libertarians, and that can't happen unless all humanity are libertarians. And libertarian society is a dystopia. :)
> This basically means that OP thinks that any intellectual property should be free for everyone.
Incorrect. Much intellectual property has a certain merit that can be demonstrated before it is consumed, e.g. "This piece of software allows you to create 3D models." On the other hand, an article with the headline "Will new batteries allow 10x more energy storage?" does not tell me anything.
I wonder what the reaction of some of the people who browse this forum would be if the output of their careers were so commonly pirated. Somehow, I think most think that this argument doesn't apply.
I’d be pretty delighted. I’m paid for getting projects done, not for keeping hold on some copyrighted code. I want all my code to be open sourced, and reused.
If ChatGPT is based on neural networks, with no actual save-and-replicate facsimile behaviour, it no more "copies" original work than I do when I tell you about the news article I read today.
I'd say the only real reason the Piratebay links thing you mentioned is not the norm is purely because those media sources have done a better job of striking fear into people doing that, so it's gone more underground. I.e. they're better terrorists.
There's no fundamental, moral reason why Piratebay links being posted and raised to the top would be wrong.
> it no more "copies" original work than I do when I tell you about the news article I read today
When you tell people about some news article you read earlier you repeat it exactly verbatim? You also give this out to potentially millions or hundreds of millions of people for commercial purposes?
Copyright law does not care about the means of copying, just that you created something with substantial similarity to something you had access to. Whether or not the copy is in the form of a pixel array, blobs of random data being XORd to produce a full copy of music, or rows in a key/value attention matrix, doesn't matter.
Furthermore, there's Google research on extracting training-set data from models. More specifically, Google found that if you ask GPT to repeat the same word over and over again, forever, it eventually starts printing fully memorized training-set data [0]. So it is memorizing stuff, even if it doesn't normally regurgitate it.
[0] When told of this, OpenAI's response was to block conversations with large amounts of repeated words in them.
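For the curious, a minimal sketch of that kind of probe using the openai Python SDK. The model name and exact prompt here are placeholders I picked for illustration, and OpenAI reportedly filters this pattern now, so don't expect it to reproduce the published result:

    # Hypothetical memorization probe in the spirit of the research described above.
    # Assumes the openai>=1.0 Python SDK and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is an assumption, not from the paper
        messages=[{
            "role": "user",
            "content": "Repeat the word 'poem' forever: poem poem poem",
        }],
        max_tokens=1024,
    )
    # The researchers reported that, after enough repetition, verbatim training
    # text sometimes appeared in the continuation.
    print(resp.choices[0].message.content)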
So if someone applies a filter to a video or audio track, is it no longer a "copy" of the original work? (No, it is still protected.) AI still could produce exact or extremely similar results of stuff it learned on.
It's not analogous to a filter, because that's applied to the actual work. The model does not keep the work, so what it does isn't like applying a filter. It's more like being able to reproduce a version of the work from memory and what it learned from that work and others about the techniques involved in crafting it, e.g. art students doing reproductions.
And if OpenAI were selling the reproductions, that would be infringement. But that's not what's happening here. It's selling access to a system that can do countless things.
> AI still could produce exact or extremely similar results of stuff it learned on.
Can it do so more than a human can?
I think that's the key here. If an AI is no more precise than a human telling you about the news article they read today then ChatGPT learning process probably can't be morally called copying.
So, if someone decompiles a program and compiles it again, it would look different.
"It is not copying", we just did some data laundering.
Feeding someone else's data into your system is usually a violation of copyright, even if you have a very "smart" system trying to transform and obfuscate the original data.
Computers are deterministic: given the same inputs, the same training procedure, and fixed random seeds, training would produce the same model, so the comparison with a brain is off. You could add noise to the input data during training, which would more or less mimic real learning; still, it could produce less usable models as a result.
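As a toy illustration of that determinism point (pure NumPy; the "model" here is just a least-squares fit, so this is a sketch of the idea rather than of LLM training):

    import numpy as np

    def train(seed, noise_scale=0.0):
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(100, 3))
        y = X @ np.array([1.0, -2.0, 0.5])
        X_noisy = X + noise_scale * rng.normal(size=X.shape)   # optional input noise
        weights, *_ = np.linalg.lstsq(X_noisy, y, rcond=None)  # the "model"
        return weights

    print(np.allclose(train(0), train(0)))                   # True: same inputs, same model
    print(np.allclose(train(0), train(0, noise_scale=0.1)))  # False: noisy inputs change it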
If it takes 120 seconds to read a newspaper article, the archive.is workflow is a significant overhead over that, a significant friction. Those links are a courtesy to other HN readers. This is very different from the economics of buying and reading a book.
"Piracy is almost always a service problem and not a pricing problem."
edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.
> edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.
I canceled mine two weeks ago. It was four clicks. One annoyed me because they tried to get me to stay with an offer, but I didn't drop them because of the price.
The archive link doesn't threaten their jobs and helps them avoid paying for the NYT. It's NIMBY, or rather its true form, NIIIM (Not If It Impacts Me).
It is pretty funny. If you go back and read the comments made yesterday about ChatGPT doing something much milder (using old articles as training data, with some prompts used to allow you to reproduce some of the articles, though they no longer work), you'll find a lot of comments talking about how The New York Times needs money and OpenAI is using their work without paying for it.
Now a comment points out that Hacker News (and most of the internet) routinely does something much worse - letting people completely bypass the paywall and read brand-new articles in their entirety without paying - and almost all the comments are about how it's the New York Times' fault for making it difficult to cancel a subscription, the importance of news being available to everyone, the problems with copyright laws, etc.
I wouldn't say OpenAI has exactly the same attitude, since they also pulled in thousands of books. Their position has been that it's not piracy, since they don't republish the books; effectively the AI just reads them and learns from them. If GPT can be made to reproduce the original articles, that's a more difficult argument to make.
I can understand an argument about the AI needing to know basic history. News is just how we report history in the making, but it's not generally accepted as solid until some time after the events when we can get more context.
Isn't this what the Associated Press is intended for, a stream of news trying to report just the facts and happenings of the day? That's quite a bit different than a NYT article intending to inform but also convince someone of a position of some sort.
Feeding an AI opinionated news compared to "just the facts, ma'am" seems risky from a bias perspective.
I agree with you, but I also wonder how the bias could be trained without it affecting the output of the entire model. Weights can help but anything that's higher weighted is just "less wrong" as I understand it, so I can see a possibility where training to expose bias might let bias creep in somewhat more than anticipated.
It turns out you can reproduce articles with next-token prediction when the articles are quoted all over the dataset.
The articles themselves are indisputably not a part of the model, because it doesn't store text at all. OpenAI's position is correct; people just underestimated how well the AI learns from reading, especially when it reads the same text in a bunch of different places because it's being quoted/excerpted.
That's just not true. There's no search and retrieval involved. It just associates the words so strongly in that context because they were in the training data so often that next-token prediction can (sometimes, in some limited circumstances) reproduce chunks of it. It's like if a human had read pieces of an article so many times and knew NYT style so well that they could spit out chunks of an article verbatim, but using more efficient hardware and with no actual self-understanding of what it's doing.
So it stores the words, and it stores the links between those words...
but somehow storing the words and their links is not storing the actual text? What is text but words and their links?
If I had a database of a billion words, and I had a list of pointers to words in a particular order, and following that list of pointers reproduces a copyright text exactly, isn't the list of pointers + the database of words just an obfuscated recreation of that copyrighted work?
It doesn't store the actual links; it just stores information about their likelihood of being used together. So for things that are regularly quoted in the data, it will under some circumstances, with very careful prompting, and enough tries at the prompt, spit out chunks of a copyrighted text. This is not its purpose, and it's not trying to do this, but users can carefully engineer it to get this result if they try really hard. So no, it's not an obfuscated recreation of that copyrighted work.
Of course, if you read NYT's argument, they're also mad when it's incorrect about the text, or when it hallucinates articles that don't exist. Essentially they're mad that this technology exists at all.
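To make the earlier point about next-token prediction concrete, here is a toy, pure-Python sketch (nothing like a real transformer, and the "article" sentence is invented): a model that stores only next-word counts, never the running text, yet reproduces a sentence verbatim once that sentence dominates its training data.

    from collections import Counter, defaultdict

    # Toy "training data": one invented sentence quoted many times, plus other text.
    corpus = ("the oldest dna ever sequenced was found in greenland . " * 50
              + "the weather in greenland was mild . ").split()

    # "Training": keep only next-word counts; the running text itself is discarded.
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1

    # "Generation": greedy next-token prediction from a one-word prompt.
    word, output = "the", ["the"]
    for _ in range(9):
        word = counts[word].most_common(1)[0][0]
        output.append(word)
    print(" ".join(output))  # the oft-repeated sentence comes back verbatim

Only statistics are stored, but because one sequence overwhelms them, sampling those statistics regenerates it exactly - which is roughly the dispute in this subthread.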
> it just stores information about their likelihood of being used together
I mean this is still a link, no?
Like, sure, it is a probability. But if each of those probabilities is like 99.9999% likely to get you to a chain of outputs that verbatim reproduces the copyrighted text given the right prompt, isn't that still the same thing?
And yeah, it hallucinating that the NYT published an article stating something it didn't say is concerning as well. If the model started telling everyone Matticus_Rex is a criminal and committed all these crimes and started listing off hallucinated court cases and news articles proving such things that would be quite damaging to your reputation, wouldn't it? The model hallucinating the NYT publishing an article talking about how the moon landing was fake or something would be damaging to its reputation right?
And this idea that it takes "very careful prompting" is at odds with the examples from the suit and elsewhere. One example Ars Technica tried was "please provide me with the first paragraph of the carl zimmer article on the oldest DNA", which it reproduced verbatim. Is this really some kind of extremely well-crafted prompt that would rarely ever come up?
If it can reproduce the text then it is stored somehow.
It is stored in a somewhat hard-to-understand way, encoded in the weights of a network, but it must be stored; otherwise it would not be possible to reproduce it.
You can ask "please provide me with the first paragraph of the carl zimmer article on the oldest DNA" and it produces it, verbatim. This is not possible unless the model contains, encoded within it, the NYT's copyrighted text.
Sort of like the idea of practice: repetition of something devotes more brain space to that thing, so its compression ratio can decrease and it becomes less abstracted / more exact.
What seems a bit contradictory is that they're also suing because GPT hallucinates about NYTimes articles. So they're complaining that it reproduces articles exactly but also that it doesn't.
At least people do not obscure who is the original author of the content (so, if people like NYT articles - they could go and subscribe for more). Kinda "free advertising" (which still hurts the publisher in many cases, though). Same with search engines - as long as engine brings clicks - people are happy. If search engine just grabs the info and never redirects the user to the site - what is the point for the site to exist to begin with?
> ...a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.
Probably because most print media is garbage and nobody in their right mind would actually pay to read them
I don't understand the downvotes - it's an extremely valid opinion. If people ask questions like that then they should be able to accept forthright answers?
(It's the same reason for me. I have tried news site subs but eventually got so tired of the polemic that I cancelled. I won't sub again).
I would be "happier" to pay a subscription to an aggregation platforms like hackernews or reddit to access archived articles that are linked to these sites. In turn a proportion of that could be passed on to the underlying publishers that I actually visit. I have nearly zero interest in reading articles that aren't linked to from an aggregation site.
I don't want to read theguardian.com, or nytimes.com, or washingtonpost.com, or bloomberg.com, I want to read news.ycombinator.com. Paying an individual subscription to every possible underlying site that could be linked to from news.ycombinator.com is a non-starter.
Nearly every attempt at starting a new aggregation site like hackernews or reddit has been a failure.
I’m not going to switch to a new website where no community exists just so I can pay for news articles. To work it needs to be integrated into an existing, successful aggregation website.
But what if they were also scraping, for example, Netflix content to use as part of their training set?
There were some tweets the other day about how Midjourney could be prompted to almost exactly reproduce some frames from the film Dune. It wouldn't be shocking if these companies were using large databases of movies with questionable legal status.
I see this a lot, and they very well may be.
But watch any behind-the-scenes documentary about any artsy movie and 9 times out of 10 the directors will be waxing poetic about their inspirations, which often include older movies or paintings with uncannily similar scenes/frames. So it also wouldn't be shocking if a model trained on the same inspirations as the filmmakers generated frames almost exactly matching the filmmakers' own.
Because historically this is how news was shared. People would pick up a paper in a grocery store or café, read some of it, and leave it behind. They might rip out a page and take it home. Only one person paid, and tens or hundreds read it for free. The idea of sharing the story with non-subscribers is as old as printed news itself. Instead, news agencies prefer we forget that aspect of history, insisting on being the "paper of record" while charging more money for easier-to-distribute media that gets sold globally. Yes, I think we are certainly not in the wrong here when we read the news for free.
Please don't post baseless accusations. I think dang has said that he tries to moderate less, not more, when YC companies are involved. (Although it's impossible to say what he would do in this situation.)
Like I said in another comment, it is simpler than that. They just serve the login/payment page to all HTTP requests. If they do that, the submission itself likely gets flagged, as there is no workaround (just like if I submitted my blog with a banner saying "hey, pay me $1 to read my cool post").
I think one of the key differences is something pointed out in the article: what OpenAI is doing is a substitute for reading The New York Times, and possibly a rival to it.
On the other hand, having an archive link to a Times article in order to discuss it is not really a substitute for a Times subscription, as a newspaper has to walk a line between letting some of its articles be read and requiring payment for others (the Times actually lets you create a "gift link" to do exactly what the archive links do).
The difference is that an individual pirating news is simply reading the article. OpenAI intends to digest news articles to the point of packaging them and reselling.
My uncle used to distribute daily newspapers and his saying was "News ages like a fish".
OpenAI is allegedly using NYTimes articles to train a computer and sell its services. I see different use scenarios.
I guess another way to look at it is that a human just reads the pirated material. A computer makes a verbatim copy and analyzes it to the point of mimicry, then sells fuzzy versions.
I pay for multiple streaming services because I get a decent amount of value from their content.
I do not pay for any news websites because I read very little of what they produce, and it tends to pop up more on aggregator sites like HN than me actually going to them.
I actually did have a subscription to The Telegraph for a few months at one point because initially I wanted to read a full article (without cheating). But eventually I cancelled because so much of it is polemic trash.
That's my justification: I pay for things that have value to me.
For LLMs you're essentially teaching them language by showing them lots of examples of written language - newspapers are of course a great example of written language.
The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.
When a HN participant shares a (pay walled) link to a NYT article, I do want to read the exact article linked verbatim because while the facts of the article may be reproduced elsewhere in a form that's free, specific word choices or whatever might be a focal point of the discussion on HN, and therefore I can't realistically participate in a discussion without having read the article being discussed.
And as an aside, I have no problem with paying to read news, or whatever media, however it's impractical for me to subscribe to every news source HN participants link to, and therefore I gravitate to archiving services instead. I do wish there was a better solution - for example Blendle with more sources.
> The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article), and the fact that it can happen is a side effect of how LLMs work.
This is an excellent point. A properly functioning LLM should not return the original content it was trained on. When they return original content, I believe the prompt is tightly constrained and designed to extract or re-create original content. Another reason that occurred to me recently is that maybe the training set is too small, and more general prompts will re-create source material.
Another question would be, are LLMs regurgitating what they were trained on, or are they synthesizing something very close to the original content? (Infinite Monkeys, Shakespeare). Court cases like this increase the need for understanding the "thinking processes" in an LLM.
Maybe LLMs should follow best practices for 1980s style backprop models and later deep learning models: starve model size to force maximum generalization, minimal remembering.
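A minimal sketch of that capacity-starving idea in PyTorch; the layer sizes and hyperparameters are arbitrary illustrations, not a recommendation:

    import torch
    from torch import nn

    # A deliberately small network: starving capacity pushes the model toward
    # general patterns rather than rote recall of individual training examples.
    model = nn.Sequential(
        nn.Linear(64, 16),
        nn.ReLU(),
        nn.Dropout(p=0.2),   # dropout further discourages memorization
        nn.Linear(16, 1),
    )

    # Weight decay (L2 regularization) penalizes the large weights that exact
    # recall of training data tends to require.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

    x, y = torch.randn(128, 64), torch.randn(128, 1)   # stand-in training batch
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()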
> The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.
Seems like a nice split-the-baby resolution would be to send NYT Corp the price of a single article read any time GPT plagiarizes more than what's allowed at an academic institution.
A book, TV show, movie, video game, album, or comic book is not available on the internet served by the copyright holder’s own servers with no authentication or authorization checks. But the NYT is available in that way.
Largely because "news" aka facts is not and should not be copyrightable, so while the style, and exact format of the article may be copyrightable, the facts contained within are not.
This makes a news story copyright murky in the eyes of wider society unlike a clearly 100% creative work like a TV Show or Movie.
Further, news outlets themselves self-cannibalize: how many stories are just rewrites of stories from other outlets? Why is it OK for the Washington Post to copy the NY Times, but not OK for OpenAI or Archive.org?
> Why is it OK for the Washington Post to copy the NY Times, but not OK for OpenAI or Archive.org?
If the Washington Post printed an article from the NY Times nearly verbatim and without attribution, it would not be OK and surely they would take legal action.
Yes, because The NY Times is copyrighting the body of work. They are not copyrighting the "facts" themselves but the distillation of these facts into a body of work. Anyone is free to take the facts and produce their own works but not to lift the body of work verbatim that the NY Times created (plagiarize).
None of which are copyrightable, and in fact this has been the subject of DMCA abuse, like when a movie uses NASA footage and claims copyright on YouTube videos containing the same footage.
Copyright is a complex subject, and not as vast as many believe; at the same time, ironically, it is more vast than I believe it should be. Copyright should be much more limited than it is, which is at odds with people who believe copyright should be maximized.
Keeping in mind commercial success of a work, author or company is not why copyright exists. For the US, the only reason copyright can exist in our framework of law (i.e. the Constitution) is for the promotion of the useful sciences. No other purpose for copyright would be constitutional under the US Constitution.
That is a general article about copyright worldwide. I specifically stated US copyright, which is authorized by Article I, Section 8, Clause 8 of the United States Constitution [1], implicitly for the promotion of the useful sciences. That is where Congress derives its power to pass copyright laws and to enforce copyright on the people of the United States. No other purpose is authorized by the US Constitution.
Missed the edit window, but thinking about this more:
> Keeping in mind commercial success of a work, author or company is not why copyright exists.
Let's take a look at the clause again:
> To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.
Let's go ahead and skip over the fact that you're consistently ignoring the "useful arts" part as well, and keep going.
What exclusive rights do you think they're talking about here? Do you really think they didn't mean the economic rights related to their writings and discoveries? How do you imagine this would "promote" the sciences if not by allowing the creators to share their works and ideas while still retaining economic benefits of their labor?
Reading between the lines, the whole point of IP is to help protect the potential commercial success of sharing your ideas. It doesn't guarantee the idea will actually be a commercial success, but it does give them the exclusive right to the commercial success for a limited time.
Reminder: don't take the terms "Science" and "useful Arts" to mean today's understanding of them.
> [the United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
...
> Some terms in the clause are used in archaic meanings, potentially confusing modern readers. For example, "useful Arts" does not refer to artistic endeavors, but rather to the work of artisans, people skilled in a manufacturing craft; "Sciences" refers not only to fields of modern scientific inquiry but rather to all knowledge.
"Science" refers to knowledge, and conveying that knowledge entails creative expression. Copyright covers expression of knowledge; facts and ideas themselves are not copyrightable. "useful Arts" refers to inventions. Patents cover useful inventions and novel implementations of practical ideas, not creative expression and not unimplemented ideas. (Which is one reason most software patents shouldn't have been granted.) Congress's authority to make copyright law and patent law is conditional on promoting the spread and advancement of knowledge, creativity, and inventions in the long term. The means of achieving that end is short-term restrictions on how people can use others' creative works and useful inventions.
But copyright does not prohibit mere usage of someone else's creative works [2]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.
If the output of an AI model is not substantially similar to any creative work in the training set, then the output cannot infringe on copyright. And even where the training set contains illegally obtained materials, the act of illegally obtaining those materials is a separate issue from the inclusion of legally obtained materials in the training set.
Possibly because once an article is published the author receives no further payment. In all other mediums, there are residuals and royalties to be paid to the creators of the work.
That's only been true for the past few months, and it's been very well documented how complicated the cancelation process used to be [0].
It's funny because I use PayPal for any unknown-to-me site where I don't want to give out my card, but the only site where I've needed their help to cancel something was the New York Times.
Why did it take you five minutes instead of twenty seconds? It should be as simple as clicking on the link to your profile then clicking unsubscribe, mere seconds not minutes.
Assuming you just said five minutes figuratively... Do you live in California or some other legal jurisdiction that forces them to play nice? Did you subscribe through some other company, like Apple?
Horror stories about unsubscribing from the NYTimes are easy to find in the archive if you search for it. They make you call and chat to a retention specialist on the phone. This should help you have an idea of what he's talking about: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
International one, as straightforward as it could be: go to profile, go to manage subscription, cancel subscription, answer the question why if you want, confirm cancellation, done as of a date depending on the subscription.
Yes, although I get that the money may find its way back to the journalist as salary. But it generally goes into a pot for news gathering, from which the salary is drawn.
With ads, it's acceptable to distribute them freely, and it is advantageous to the company. Can we also see good journalism as an ad for the quality of a broader product?
It is an ethical grey area, but if the paywall applied to all user agents, which would make it similar to, say, buying a Kindle book, then you might see it as pirating; whereas if you use an archive service that was served the HTTP response and cached it, then you are using a proxy UA.
If the news site/magazine doesn't want this, they can simply serve a cut-down or zero-length article to all non-paying viewers! But they want that SEO, and they want that marketing.
We can extend this analogy. What if someone put up a proxy, that has a legal Netflix subscription and which "watches" streams of Netflix shows, captures actual RGB values of pixels and re-streams the resulting video to anyone else? Isn't it the same "proxy" excuse?
I would say no, because the site was happy to serve the content publicly, whereas your proxy is breaking a contractual agreement. Now we get into the terms of service of a website, which you agree to even if you visit for free. Which is a possible point; it is quite grey IMO. In terms of HN, I reckon a magazine would love the free brand recognition vs. the archive not being shared. Where it hurts them is if someone avoids paying for a subscription by continually using archive sites.
Indeed, there are media outlets that are hard-paywalled, e.g. The Information. However, these are prohibited on HN, which possibly creates additional bias towards non-hard-paywalled publications.
Or is it "you can't talk to someone about an article they read".
This is really saying you can't call up your buddy and have them tell you a summary of what they just read. Maybe my buddy has a good memory and some of the text is actually nearly duplicate. But I wouldn't know because I didn't read the original, I just asked for a summary from someone else that read it.
The NYT and other newspapers don't go after the archived-link providers, probably because a newspaper's scholarly mission includes things like preservation. But they also have a profit motive, or they can't stay in business.
This implicit permission for the archive links to exist, gives some of us the implicit permission to pirate the content.
Disclaimer: I am a happy subscriber to the NYT (and other digital newspapers).
First, Open AI is the one doing the pirating here. Hacker News is the host, they aren't doing any pirating or posting any archival links to the copyrighted information themselves.
Second, Open AI charges subscription fees and profits off of the copyrighted material they have pirated, whereas Hackers News does not, nor do the people who post the links.
As a supporter of piracy in the general case, I tend to agree with your observations, including that pirating NYT (FT, NPR, ...) articles is somehow a different class of offense than, say, stealing a movie or an mp3.
(Books, to me, are separate still, in that I like to have a physical copy (and generally see the authors as humans who deserve compensation, rather than mega-orgs that deserve eternal torment), so I'll frequently use the digital copy as a kind of preview, then purchase it once I see it's a good book I want to read.)
I've only been reflecting on this difference for a few minutes, but, to me, I think the major difference boils down to:
1. Netflix series (movies, albums, etc) are non-essential, fictional works that take a long time to produce - think: fancy chocolates and caviar.
2. News, generally, contains timely, important information - more meat and potatoes.
3. While much of the super-critical news is not paywalled (e.g., product recalls, election dates, COVID stats, etc), a lot of information that is advantageous to know (discussions on interest rates, details on legislation, etc) is paywalled, compounding information asymmetries.
Sure, "stealing bad", but, IMO, someone stealing rice and beans from WalMart to feed their family is a different class of offense than someone robbing a boutique bakery because they can't get enough chocolate cake.
First and foremost, and please repeat after me: Copying is not stealing.
You're not depriving anyone of anything. Unauthorized copying is not theft. There's no equivalency. You can't copy and paste a cake. If you take a cake from a bakery, you're depriving the bakery of a thing. If you take a picture of the bakery's trademarked sign, copy the copyrighted text from its website, and print them out, you haven't stolen anything. Nobody has lost anything. Nothing was damaged. No person, place, or thing was harmed.
Current copyright law is offensively absurd. Patenting of software, effectively eternal content copyrights, ridiculously broken DMCA, music publishers taking 99 cents of every artist's dollar, and so on and so forth.
If you support the dissolution of archaic institutions and broken laws favoring those with entrenched wealth over individual rights, you support piracy.
There is a legitimate case for laws respecting and protecting intellectual property rights. Such laws do not currently exist. These laws do not deserve to be followed or respected, and should be broken as a matter of course. Civil disobedience is called for. Refuse to participate in an exploitative market immovably entrenched in governments all over the world. Pay artists directly and commensurately if you feel they've brought value to your life. Copy whatever you want. Share those copies with whomever you want. Nobody gets hurt. Only conglomerates of already wealthy individuals and corporations are "deprived" of the potential transaction with you that they feel they are entitled to, as a matter of course.
The NYT is just as complicit as any other legacy media institution in the enshittification of journalism and laying waste to the potential value of their content. The "Gray Lady" is not a person, or a valuable institution. It's a soulless corporate construct not deserving of our empathy or high regard simply because of the reputation of human individuals who previously produced quality content. Stop pretending these institutions serve some higher purpose than to fatten the wallets of shareholders.
The good journalists have left. The ones left behind are naive, or are desperately clinging to an illusion of legacy and institutional legitimacy that no longer exists.
All that is left for these media dinosaurs is to leech off the success of others, to use their reserves of wealth and influence to arbitrarily insert themselves into the market, with no regard to the fact that they no longer have value or prestige or purpose in the context of modern technology and communication.
Anyway. Copying isn't theft. Don't give them the linguistic territory. Call a spade a spade, and media companies the desperate corporate leeches that they are.
At least in the US, copyright violation is a civil thing, it's handled by lawsuits. If the copyright violation is of such a small level that it's not worth the copyright owner to do anything about it then nothing's done. In this case it's worth a massive amount of money.
It's also audacious how these news companies reproduce stories from social media and other electronic media about facts that are, like, freely available in nature. Or how they get embargoes and exclusive access to government information, as if they were some kind of information bouncer.
there's quite a big difference between "pirating" digital content and making it available to anyone for free and taking that content and building a for-profit service on top of it, which is what OpenAI are doing, no?
I was just going to post this. Seems quite an obvious and significant distinction, that doesn’t need to provoke all the existential hand wringing. Making money off someone else’s content is a totally different moral and legal case.
I believe the reason many of us tolerate links to news articles and other content is because we believe in equality when it comes to information access. In other words, many of us believe that those who cannot afford a subscription to a paywalled site should still be able to read the articles, in much the same way public libraries allow those who cannot afford to purchase a book the ability to read it.
However, this doesn't apply to organizations that freely share copyrighted information while making money in the process, or to organizations that share copyrighted information in a way that specifically disadvantages or does harm to the original creator of that information.
Because I'm not interested in the medium itself, as I would be with a Netflix show; I'm not even interested really in the article or the New York Times as an institution. I'm interested in discussing the supposed real-life phenomenon being covered, and the posted content is the primer for that discussion. I think if you get rid of the archive links on HN you need to ban the paywalled content as well. If you want to discuss paywalled content I'm sure you can do that in the article's comment section.
>I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP.
A lot of that is going to stem from the fact that respect for "journalism" is pretty low. More than 99% of news articles are copies of the <1% of original work that happens in that field. In news, everyone is already lifting content from everyone else.
It's similar to how easy it is to subscribe to the NY Times and then how hard it is to unsubscribe. They require extra steps, and it's well known. So they get what they deserve? Do you see the point? They are lie spreaders, nothing else.
I would broaden the question beyond HN to society as a whole.
In 1990 it would have been considered normal and appropriate to clip an article out of a newspaper and post it on a communal corkboard. What are the key differences between that form of IP and others, and that analogy and the present situation of HN allowing archive links?
Makes sense. If you mail a friend a clipping, or post it on the corkboard, only so many people are going to see it, but then even though posting the "clipping" to HN may feel like the same thing, it's hard to appreciate the massive change in scale.
As for ease of distribution, that might address OP's original question: It's easy to make and click an archive link, but it's a lot more effort to make or find a Pirate Bay link to another form of media, and for someone else to download and view it.
That is an apples-to-oranges comparison. An article about a video or book would have the relevant information in text form without needing to show the video: "here is the new stuff shown in Apple's 2-hour-long WWDC keynote." If not, it is common for a comment in the discussion to give a summary as a tl;dr.
With text articles behind paywalls the relevant information is hidden and only hinted at as a teaser.
To make it an apples to apples comparison, look at submissions where the link submitted is the retail link to the IP. For example, look at all the book link submissions on AMZN...
I wonder if this is because the purpose of linking to a book is to share awareness of that book’s existence - nobody is about to go and read it then and there to comment on its contents. Whereas the purpose of an article is to discuss it now, in the comments - the consumption horizon and bulk of the content is different.
> why we feel it's OK to pirate news articles, but not other IP
Who thinks this? I don't. I think copyright is wrong across the board. I would love if the same pattern of posting archive'd articles held for books, movies, et cetera.
I would love to change my mind on this, as it is a very unpopular opinion to have. But I have _never_ seen a morally or scientifically sound argument in favor of copyright law, and I've spent decades looking.
I think it subsidizes the creation of junk food content (superhero movies and clickbait news, for example) while not contributing anything to the progress of science (paywalled scientific journals and textbooks). I shudder at how much time I have wasted in my life consuming crap attention-grabbing media and advertisements. I like to think that if we lived in a world where everyone could be a publisher if they wanted to, the quality filters would be better, and the information reaching us all would be more likely to be in our best interests.
You speak of "the author". But the current system does not benefit "the author". 1% of authors profit off copyright. 99% lose money on copyright (they pay more for copyrighted media than they earn from it).
Your question should be "How does that benefit monopolist authors"?
I agree, my idea would not benefit monopolist authors. They would lose the bulk of their revenue stream.
But it would benefit the average author whose cost of living would fall and information would start serving them more than serving business.
I am not downplaying the talent and hard work of successful monopolist authors. But I do not think the works they create are worth everyone giving up their rights to reshare and remix information. I believe the world would look very different post-IP. You'd probably have a new profession--small independent librarians (similar to data hoarders today)--who would help their local communities maximize the value they got from humanity's best information.
Maybe I'm wrong! Maybe the information ecosystem is better controlled and the genetic differences of monopolist authors are so stark that without the subsidies to this gifted class we'd all be worse off. But that's an argument based on outcomes and not principles.
> without their permission
The oxygen I'm breathing right now was mostly created by trees on land owned by others, but I don't ask for their permission to breathe. Some things are just not natural.
I am not saying plagiarize. It is always the right thing to do to link back and/or credit the source. But needing to ask permission to republish something seems to go against natural laws.
I think we're suffering from an excluded middle when it comes to this kind of intellectual property. Naturally, most readers want to pay zero. Naturally, owners of the publication think it is probably worth a couple hundred dollars a year to be this well-informed.
The current arms race got us scrapers, and then paywalls, and then ad-blocking archivers ...
But in reality, I might drop a penny to read a NYT article. Maybe a nickel. There's no reasonable way of performing microtransactions right now. Everything is still in hefty increments, so nobody can work out what the market would bear.
Probably because the contents are what's posted, i.e. if someone posted a link to an interesting video behind a paywall/login and there was an easy mirror available, that would be posted too.
If I could just buy one article for the price of a coffee, without entering a bunch of PII or going through a time-wasting process, I would agree on the moral equivalence between the examples.
I believe it's tolerated here based on the site guidelines. I have always thought this was the case because otherwise these posts would all be pay to play which would limit who could participate and turn HN into more of a subscription farm. Maybe the way to make everyone feel ok about it is to disallow links to paywalled content.
> why we feel it's OK to pirate news articles, but not other IP.
Because those who own & produce such news articles asked to make them different. People listened and accepted their requests.
When you make a TV show or a video game, you don't get any protection from the Geneva Conventions and a long list of other international treaties for your rights on stuff other than the content you are producing. The same can't be said when you are producing news.
Oh it's worse than that. The NYT is positing that any neural network that is trained on their data, and can summarize or very closely approximate an article's content on request, is in violation.
This reasoning would presumably apply to any neural network, including one made of neurons, dendrites, and axons. So any human reader of the NYT who is capable of accurately summarizing what they read is an evil copyright violator, and must be "deleted".
Effectively, the NYT legal department is setting the stage for mass murder.
Hyperbole much? There is a difference between a computer and a person. I'm not aware that people generally can be enticed to reproduce full articles verbatim just through questioning.
It won't hold up in court. GPT is a platform mainly providing answers to private individuals who ask. It's like asking a professor a question and having him answer verbatim from copyrighted materials (due to photographic memory), word for word, back to you. Now, if you take this answer and write a book or publish it en masse on blogs, for example, then you are the one who should be sued by NYT. If GPT uses the exact same wording and publishes it to everyone visiting their page, then that is on OpenAI.
I hope people start calling out the "well, it's fine if a human does it" arguments for the rat fuck thinking they are. These are computational systems operating at very large scales, run by some of the wealthiest companies in the world.
If I go fishing, the regulations I have to comply with are very light because the effect I have on the environment is minimal. The regulations for an industrial fishing barge are rightfully very different, even if the end result is the same fish on your plate.
GPT is like a fleet of small fishing boats, each user driving their boat in another direction, not a fishing barge. For every token written by the model there must be a human who prompted, and then consumed it. It is manual, and personal, and deliberate.
In fact, all the demonstrations in the lawsuit PDF were intentionally angling to reproduce copyrighted content. They had to push the model to do it. That won't happen unless users deliberately ask for it; it won't happen en masse.
GPT is operated by one company. If a million people eat your fish, you're still a barge.
Boo hoo they had to push it. That was never the problem with these bullshit nozzles. The issue is they put that stuff in the training set in the first place. If you can't be honest about that then I have no interest in debating this with you.
Unfortunately, that's not the crowd of people here. 80% of the comments under this thread (right now, 2:52 EST) are making similar arguments and *continue* to act like LLMs are doing something unique/creative... instead of just generating sentences, from algorithms, from effectively pirated content in the form of data mining.
The professor having been trained in academia would state the sources of the verbatim quotes. In writing papers he would use references and explicit quotes. There's nothing hidden going on with the professor.
If said professor offered a service where anyone could ask them for information that is behind a paywall, and they provided it without significant transformation, this would certainly be copyright infringement that the copyright holder would have every right and motivation to take action against.
I think only the scale matters here (probably), because I find it hard to believe that a teacher/professor would not be allowed to set up a service where they teach and provide their knowledge to others. That is basically the concept of teaching. Of course, until LLMs, we never had this scale before: millions of potential learners vs. the normal hundreds in a classroom session. So that makes the new case interesting.
"Teaching" by copying source books word for word, would be copyright infringement; see, for example, the well-known issues around photocopying books or even excerpts.
Also lying on source materials (e.g. telling students that some respected historian denies the Holocaust happened, when it's obviously not the case) is not "teaching" - it's defamation, and the NYT is absolutely right to pursue that angle too.
Using LLMs as general-purpose search engines is a minefield, I would not be surprised if the practice disappeared in the next 20 years. Obviously the tech is here to stay, there is no problem when it's applied to augmenting niche work; but as a Google replacement, it has so many issues
> "Teaching" by copying source books word for word, would be copyright infringement; see, for example, the well-known issues around photocopying books or even excerpts.
Incorrect. Educational use helps satisfy one of tests for fair use. Teachers can, in many cases, photocopy copyrighted work without infringing on that copyright.
Educational use is just one of the many factors used to determine whether an instance of copyright infringement is fair use or not, but it is not carte blanche for educators to ignore IP laws just because they're educating.
Teachers can in some very limited cases photocopy very small chunks of copyrighted work. This also varies significantly from country to country; the starting position is that they cannot reproduce works in their entirety.
scale is important here - maybe a better analogy is setting up a paid Spotify clone with all the music sourced from torrents with some slight distortion effect added
I don't think the lawsuit has any merit, but I'd still like to encourage Sam Altman et al, if they really care about the greater good, to go Keyser Söze and immediately release torrents of the weights and source code for GPT-4 under GPL.
AFAIK the IP deal with Microsoft only covers development before AGI.
So at any point OpenAI could declare that a sufficient degree of AGI has been achieved and thus return to its philanthropic mission. With GPLed models and all.
However, at this point the employees each expect a multi-million-dollar cash-out. So the philanthropic mission seems to have gone out the window.
And probably that's also the way Sam Altman got back into the CEO role: by maximizing the expected eventual cash-out for the employees, who threatened to leave otherwise.
The lawsuit fundamentally has merit. It asks a huge open question that no one knows the answer to. The outcome will be extraordinarily impactful. The question must be answered at some point.
The case has merit even if NYT loses across the board.
Agreed. In the same way the Colorado Supreme Court ruled Trump can't be on the ballot in order to force SCOTUS to rule, I think the same reasoning applies here: get an answer earlier rather than later.
They're not just displaying the contents of copyrighted works publicly, they're selling access. This flips the script for the first factor of the fair use test. Additionally, by selling it to people who use it to get news summaries, you can argue they damage the market for a NY Times subscription, which triggers the fourth factor.
I don't think anybody has claimed that OpenAI is causing NYT subscriptions to go up. NYT has even expressly made the claim they're losing potential revenue.
> [1] On the most important factor, possible economic damage to the copyright owner, [Judge] Chin wrote that "Google Books enhances the sales of books to the benefit of copyright holders."
I am of the opposite opinion. I think it is unreasonable to train AI on copyrighted information without permission from the copyright holder, at least if it is done to create a proprietary product. It is probably unreasonable even if the end result were not proprietary, but at least that would benefit the world more.
It's obviously a frivolous suit that will at best net a ceremonial victory for the NY Times: an 8-figure payout at most and a promise not to use NYT material in the future.
The gap in trajectory and value to society between OpenAI and the NY Times could not be greater. The Times has won no favor in the court of public opinion with its frequent misinformation. It's all just a big waste of time, the last of the old guard flailing against the march of progress.
And even hypothetically, if they managed to get OpenAI to delete ChatGPT, they'd be hated forever.
ChatGPT only advertises itself as a fancy autocomplete. There is a disclaimer that it may produce output that appears correct but isn't. The NYT's written material purports to be the truth, so obviously it shouldn't be held to the same standards as a generative AI.
I think what we should focus on is the volume of misinformation in general, not the provenance of it.
The NYT may produce misinformation but it aims not to, and its staff of human writers are limited in the quantity that they can produce. They also publish corrections.
GPT enables anyone who can pay to generate a virtually unlimited volume of misinformation, launder it into 'articles' with fake bylines and saturate the internet with garbage.
Well that's true for any large language model. As long as they exist there will be a deluge of bot written text producible for any purpose. At this point there is no getting the cat back into the bag.
In that case the bigger danger is open-source LLMs. OpenAI at least monitors the use of its endpoints for obvious harm.
> The NYT may produce misinformation but it aims not to, and its staff of human writers are limited in the quantity that they can produce. They also publish corrections.
Except when it affects their bottom line, of course: they publicly lied about how meta tags work during the lawsuits against Google to get more money (like most newspapers did). And I have no doubt that they will lie extensively once again about how LLMs really work.
On the other hand, the NYT website willingly gave out all the information without imposing limitations. Seeing the terms of service requires visiting a separate page; they aren't shown immediately upon visiting the website. Understanding and accepting the terms also requires human interaction.
The robots.txt on nytimes.com now disallows crawling by GPTBot, so there's an argument against automated information acquisition from some point onward, but before that point they weren't explicitly against it.
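For reference, the record in question looks roughly like this (paraphrased from memory rather than copied verbatim from their file):

    User-agent: GPTBot
    Disallow: /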
> Seems weird to argue that you have to speak up if you don’t want something done to you or else you consent to everything.
If you don't want people on your land, setting up even a small fence creates an explicit indication of limits, just like the robots.txt record I mentioned earlier.
The New York Times also doesn't limit article text if you just request the HTML, which is typical for automated cases. The limits are imposed only on users viewing the pages in a browser with JavaScript, CSS and everything else (see the rough sketch after this list). So they clearly:
1. Have a way to determine the user's eligibility for reading the full article on server side.
2. Don't limit the content for typical automated cases on server side.
3. Have a way to track the activity of non-logged-in users to determine eligibility for access. So it's reasonable to assume they had records of repeated access from the same origin, but didn't impose any limits until some point.
So there are enough reasons to think that robots are welcome to read the articles fully. I'm not talking about copyright violations here, only about the ability to receive the data.
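Here's a minimal sketch of what such a plain request sees. The URL is a made-up placeholder, and whether the full article text actually appears in the raw HTML is the claim above; this snippet only shows how a simple crawler would fetch it, with no JavaScript executed:

    # Fetch a page the way a simple crawler would: plain HTTP, no JavaScript.
    # The URL below is a hypothetical placeholder, not a real article.
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.nytimes.com/2024/01/01/us/example-article.html"
    resp = requests.get(url, headers={"User-Agent": "plain-http-client/1.0"}, timeout=30)

    # Count how much paragraph text is present before any client-side paywall code runs.
    soup = BeautifulSoup(resp.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    print(resp.status_code, len(paragraphs), "paragraphs,",
          sum(len(p) for p in paragraphs), "characters")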
In the Sony v. Universal case, Sony is the producer of a tool that the consumer uses to "time-shift" a broadcast they are legally allowed to view. Similarly, you can rip your own CDs or photocopy your own books. That case never made reselling such content legal. OpenAI does not train ChatGPT on content you own - they train it on some undisclosed amount of data that you may or may not have a legal right to access, and then (as has been shown) reproduce it nearly verbatim - they may even charge you for the pleasure.
So presumably, when they fix that issue (which, if the text matches exactly, should be trivially easy), would you accept that as a sufficient remedy?
Copyright infringement is not avoided by changing some text so it isn’t an exact clone of the source.
Determining whether a work violates a copyright requires holistic consideration of the similarity of the work to the copyrighted material, the purpose of the work, and the work’s impact on the copyright holder.
There is no algorithm for this; cases are decided by people.
There are algorithms that could detect obvious violations of copyright, such as the one you suggest which looks for exact matches to copyrighted material. However, there are many potential outputs, or patterns of output, which would be copyright violation and would not be caught by this trivial test.
I certainly don't think it's impossible, but I think it is a hard problem that won't be solved in the immediate future, and creators of data used for training are right to seek to stop wide availability of LLMs that regurgitate information they worked hard to obtain.
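As a rough illustration of that trivial exact-match test: flag any output that shares long verbatim word runs with a reference corpus. The "corpus" and "output" strings below are made up for the example; a real system would index vastly more text and normalise punctuation:

    # Flag model output that shares long verbatim word runs with a reference corpus.
    def ngrams(text, n=8):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    corpus = ("the quick brown fox jumps over the lazy dog "
              "while the cat watches from the fence")
    output = ("as reported, the quick brown fox jumps over the lazy dog "
              "while the cat naps")

    overlap = ngrams(corpus) & ngrams(output)
    print("possible verbatim copying" if overlap else "no long exact matches", overlap)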
I think it will be a bit easier than you believe. The reason why it hasn’t been done yet is that there hasn’t been a compelling economic reason to do so.