If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?
Let's say I host these, offer some indexing on it, and rewrite articles. Something like, summarise all articles on US-UK relationships over the past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee. To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, except maybe quite short snippets.
Is that fair use? IANAL, but doesn't sound like it. Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.
A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all. Rather it points me at it, and offers me the opportunity to pay for the article. Whereas either this or an LLM uses NYT content as an alternative to actually paying for an NYT subscription.
What you described is entirely fair use, actually.
Not only that, look at a few news articles from Tier 2 and down publications, and you'll realize that almost all of them are directly sourced from NYT and others. They'll say "so and so happened, according to The Times" (and usually link the article there)
> What you described is entirely fair use, actually
Just like during the pandemic how everyone became an epidemiologist, suddenly everyone's a copyright lawyer. I'll just dispute your assertion by saying:
1. Questions of fair use are famously gray, and anyone who declares something as "entirely fair use", with no caveats, is nearly always wrong except for the most obvious cases, which the given example is most definitely not. A judge has wide latitude in determining fair use.
2. People should familiarize themselves with the four factors of fair use determination. In particular, if a work is purely derivative of a source work and substantially negatively impacts the market for the original work, it's very likely to not be considered fair use.
This is the weakest part of the case(s) against OpenAI. "Derivative work" is a legal term of art meaning a direct adaptation, like writing a screenplay of a book or translating a book into another language.
NYT has a stronger case than Sarah Silverman here because they can show actual 'memorized' text rather than just summarization, but given that those memorizations are a) an unintended failure mode of the training process, and b) from an older version of the model that has been updated to no longer regurgitate memorized text, it's not really clear how in current form GPT could possibly be considered a derivative work.
"Transformative" seems to fit a lot more that "Derivative".
On the other hand, it's understandable why NYT is worried. OpenAI itself says that occupations like: Writers and Authors, Web and Digital Interface Designers, News Analysts, Reporters, and Journalists, Proofreaders and Copy Markers
are "90-100% exposed" to what OpenAI is building.
Yup. News has been rampant with speculation, hearsay, and propaganda all my life. Content mills and astroturfers already bury the truth or relevant stories with noise.
I don't buy into all these "dangers". The advent of cars did not decrease the number of drivers and introduced various new jobs that were previously not available to a lot of people. And the rise of computers did not make the workforce smaller but instead opened many more opportunities for a lot of people.
I think focusing on lost jobs is the wrong angle to take this in. It's about how the content being used is being compensated. OpenAI isn't paying writers to train their engine.
I don't care that the car replaced the horse carriage because it didn't need to compensate horses nor handlers to do so. AI, being the newest iteration of scraping data from artists, writers, etc. to profit millions off of, is directly using the "horse handler's" work. If these LLMs threw NYT a royalty to use their articles as training material, there wouldn't be a lawsuit.
A question is whether the new model still intrinsically embeds the source text, but this is later filtered in the output, or if it no longer embeds the text at all.
I would think an existing model could bootstrap a copyright free training corpus by completely rewriting/paraphrasing copyrighted material with semantic fidelity for training of the next model to completely eliminate memorization of copyrighted works. That might pose an interesting obstacle to copyright challenges, bootstrapping your way into a clean room. Although, tweaking the architecture to either eliminate memorization, or eliminate high fidelity reproduction of verbatim training data seems far more expedient and less costly.
Roll back 20+ years ago on Slashdot and you'll see the exact same thing.
Copyright has been a hot button issue on the internet for decades. People end up thinking (rightly or wrongly) that they understand it without being a lawyer.
Quite literally, not even the lawyers or courts understand it. This is very much a "learn as you go" exercise for humanity in general at this point in time.
It seems like everything in tech is in the learn as you go phase. Everything is changing so rapidly that there can’t be experts. Just people that are able to adapt quickly.
I only see this phenomenon speeding up. Strange times.
One of my biggest gripes is a somewhat adjacent issue where everyone thinks they're an American copyright lawyer and that American copyright law is universal.
It's very possible that the example provided above is an example of fair use in some country, and that the website offering that service could be hosted there.
Completely agree. Copyright should be abolished. All intellectual work is information, information is just bits and bits are just numbers. It's quite simply delusional to believe you can own numbers in the 21st century, the age of information and ubiquitous globally networked pocket supercomputers.
This is just a felony contempt of business model issue. Computers invalidated their business models and they're doing everything they possibly can to hang on for dear life. Society needs to move on already.
This goes too far. Digital media are not only long series of numbers. They are often difficult-to-create expressions in image, video, and even interactive forms; regardless of their serialization format.
Books are just strings of letters, yet copyright has still been useful to increase the volume and utility of books.
All that said, I do find the life+70y an absurdly long time.
The market at large will determine that. If people value cheap AI generated images more than talented human curated art then that's what it will be. If a market exists to buy unique pieces where an artist put brush to canvas and priced their work at $1000 instead of the cheap $10 poster that can be mass produced, then that's what it will be. If no one wants to pay $1000 for your unique piece, then the market has spoken and your art is not worth that much. Like everything else, an equilibrium will be reached. Good artists will be fine. The other 99% of self declared artists will fade away into obscurity.
None of that is what copyright protects. And it lessens the argument when you can argue that LLMs are essentially stealing a human artist's work to be used to generate cheap images. Similar to how if you took commissioned art, printed out 1000 copies, and sold them for $1 apiece.
Copyright means that you need to at least pay that artist you stole from in some way, which the government enforces so artists don't stop creating.
I propose getting paid before doing the work for the actual labor of creation. Crowdfunding, patronage, commissions, sponsorships all seem like ethical ways to get things done sustainably. That way creators get paid before they work, not after.
We must strengthen these business models that don't depend on artificial scarcity because this number selling nonsense was over the second computers were invented. It's as dumb as asserting that you need permission to use memcpy or the mov CPU instruction.
How do you know what the value of the art will be before it's created? Guns N' Roses is a top 40 artist on Spotify nearly 35 years after producing an album. Should they not have been paid after 1991? If you argue that they were a popular band and therefore should have been paid accordingly up front, well what about their debut record, which sold 30 million copies? How would you predict that value before its creation (or even after)? If you're saying that only the labor has value, and all labor is valued equally, that sounds sort of like marxism, which could be fine, but it's hard to say how well artists would be supported in that case.
In the US, the original copyright length was 14 years, and then 28, and eventually the lifetime of the author plus 70 years. I think the intent of the law is economically justified, but the current length is outrageous.
You should be paid the accurate value of the labor. The pay should not scale more when no additional labor takes place.
This is how art worked for millennia; someone commissions a chapel roof painting, someone commissions a concerto, someone commissions a statue, someone buys a chair, etc.
Artists still do this today, and there is no issue determining value beforehand. Artists list their commission prices, or their hourly costs, etc. This is a perfectly normal thing that happens everyday.
>You should be paid the accurate value of the labor.
that's gone out the door in the digital age. Companies at this point have spent centuries trying to enforce this model while withholding stuff like stock and royalties that would let labor take a part of what the company enjoys by profiting for decades off of a single (underpaid) piece of labor.
I don't exactly sympathize with a robot now trying to do the same. Pay your labor.
You do other things to make money and continue to make art for its own sake. If you get to the point that others want your art you get commissions. Just like most artists in the current system.
Alright, so then you're NOT going to pay artists for their labor?
In the current system, artists might work for many years on a single work, or work many years perfecting their craft before anyone wants to pay for their work. Copyright gives them a way to earn money in the future that compensates them for the work they did in the past. It incentivizes creativity. Don't get me wrong, I don't think copyright is perfect, but you really ought to think more about the system you're proposing, because it's not making much sense.
Unfortunately, it’s hard to explain these things to techies who only see the world in their one-sided startupy way. The fact that there’re starving creatives who have already been massively marginalized by the likes of Spotify means nothing to these tech workers, who only see everything as numbers, or a “business model” to “validate”.
(full disclosure, I’m a techie who’s gradually woken up to the idea that the tech might just be the most abused way to exploit people)
I'm in games, where art and tech cross paths. I 1000% empathize with the fact that the art side exploits, abuses, and underpays, even if artists at times may be doing more work than a junior web dev.
It's a bit ironic, because a lot of tech offers partial compensation in stock. Something else that really doesn't happen in games unless you work for like, the 3-4 largest studios. So they should at least understand that your compensation is not all based on labor for time worked.
Exactly. Somehow this idea that you keep getting paid for literally the same thing over and over again for work you did once is the absurdity. And ridiculously greedy.
It seems to have been invented by lawyers, for lawyers. Nobody else really benefits as much as they do. The whole entirety of society vs. a single profession of dubious morality.
>Somehow this idea that you keep getting paid for literally the same thing over and over again for work you did once is the absurdity
meanwhile, most tech is moving towards subscriptions?
Art is getting paid "non-greedily". People buy a song or art piece, and then people 10 years later buy a song or art piece. That's not one person paying twice for the same song, it's two people buying the same thing.
If people still value that art for that price later, I don't see how this is a "greedy" thing. Is art magically supposed to turn open source CC0 after 5 years? Tech sure doesn't work like that.
> How do you know what the value of the art will be before it's created?
I don't know. Anyone funding the work is accepting a risk.
> Should they not have been paid after 1991?
They definitely should get paid for their shows and live performances. The band itself can't be copied. Artists are extremely scarce.
Their art, however, is not. Once created, the scarcity of their recordings is artificial and fundamentally time limited anyway. Even if I were extremely tolerant of copyright, I'd argue for a term of only 5-10 years maximum with absolutely no possibility of extension.
In other words, even if we accept copyright as legitimate, they sure as hell shouldn't still be getting paid for some late 80s album. They've already been adequately compensated for those creations. If they want more, they should have to keep making new stuff so that they can benefit from new copyrights which will also expire after a short time.
Creators are not supposed to be able to strike gold once and then enjoy eternal royalties. Copyright must have short time frames or it's in breach of the social contract. The reality is we're doing creators a favor by pretending that it's hard to copy their stuff so they can make some money. We do this because they assured us that eventually all of it would belong to us: works would enter the public domain.
The copyright industry isn't keeping up their end of the bargain. They continuously pull the rug out from under us by extending copyright to the point we'll be long dead before our culture is returned to us. It's offensive and we should all stop pretending. They need reminding that public domain is the natural and default state of all intellectual work.
> How would you predict that value before its creation (or even after)?
I'd look at the artist's past work. If there is no past work, then I don't know.
> If you're saying that only the labor has value
I'm not saying that at all. Creations are valuable. Creators are valuable. The labor of creation is valuable.
Value is assigned to stuff by humans. Obviously humans value art. The price however is given by supply and demand. The fact is that supply of intellectual works approach infinity after they are created and therefore their prices approach zero. So it makes perfect sense to assign prices to the labor of creation but zero sense to assign a price to the product of creation. Copyright is an exercise in denying reality.
> and all labor is valued equally
I definitely did not say that. All labor is different. I value some creators a lot more than others. Some creators I don't value at all.
> that sounds sort of like marxism
I must apologize if I gave that impression. I hate marxism.
>In other words, even if we accept copyright as legitimate, they sure as hell shouldn't still be getting paid for some late 80s album.
Why not? The fact is that even if the album is free, there will be people paying spotify $10/month to listen to it on demand. How is it fair that Spotify can profit from it for decades to come because they offer convenience, over the artist who made the music 10 years earlier and now relinquishes their art not even a quarter into a typical career?
Copyright is absurd now, but it's not a bad concept. I think the original copyright law of 14 + 14 worked well enough. Life expectancy increased so I'd increase it to 14 + 14 + 14 (or 10 years after the death of the original author, whichever comes first). You fund an artist for their typical career length (if they choose to extend twice) and once they are (near) retired the song is free to work off of. In the meantime you simply negotiate if you want to use their work.
First, you missed the "and". Do CliffNotes, Wikipedia, etc. substantially impact the market for the original work? For example CliffNotes does not - people who buy the CliffNotes version typically already have the original work as well (for example from coursework). And Wikipedia may well do more to interest people in the original work than to replace it.
Second, you ignored the "purely derivative" bit. You have to look at to what extent the use is derivative or transformative. See https://en.wikipedia.org/wiki/Transformative_use for a bit about that. (Note, this is a legal term defined by various precedents. OpenAI can't just argue, "Turning it into an LLM is a transform, so it is transformative!") Since CliffNotes is educational and Wikipedia is nonprofit, it is relatively easy for both to qualify as transformative.
As a result your response underscores the point that was made. There are a lot of shades of grey. You really can't just seize on a couple of phrases and key points, then jump straight to the answer. You have to understand how the courts will decide, and then accept that there is an actual judgment call whose outcome depends on the judge judging.
(I'm not a lawyer, but I have had excessive exposure to them in the past.)
The question was NOT whether it spreads information from the articles to people who wouldn't have paid for it. The question was whether it suppresses sales of the articles to people who otherwise might have paid for it.
That's a more complicated question of fact. Some people now read Wikipedia and won't buy the article. Some people encounter the reference on Wikipedia and decide to buy the article. Which happens more?
Publishers concluded that Wikipedia references are good for sales. And so jumped on the chance to cooperate with https://wikipedialibrary.wmflabs.org/. Which is therefore able to give free access to 90% of subscription only databases to you if you can prove that you're the kind of person who is likely to add citations to Wikipedia.
Legal questions are funny like that. You have to answer the question actually asked. If you merely answer another one that sounds similar to you, your answer is generally wrong.
I thought the question would be how it does this. If it can write NYT articles because it read them it has to arrive at the exact same words in the same sequence. Wikipedia has to copy and paste to achieve the same. So maybe the question actually asked does not apply.
I personally appreciate the semi-truck-sized loophole that is satire. One can include an entire copyrighted work within one's own work as long as the treatment of that other copyrighted work is parody/satire. This is a provision of US copyright law put in place to protect political satire, which can be anything, because politics is everything.
> Questions of fair use are famously gray, and anyone who declares something as "entirely fair use", with no caveats, is nearly always wrong except for the must obvious cases, which the given example is most definitely not. A judge has wide latitude in determining fair use.
You're the one presenting unfounded claims with confidence here. There is well established case law about not being able to copyright facts. If you are actually fully paraphrasing a presentation of facts / ideas and not just altering a couple of words here and there, then there is a very strong case for non-infringement.
> You're the one presenting unfounded claims with confidence here.
No, I'm not. On the contrary, I'm really looking forward to this case because I believe it will be a great test of a bunch of concepts that are totally novel in the world of copyright law as it applies to generative AI. The only things I am presenting with confidence are:
1. That anyone who declares that something is unambiguously fair use (or, contrarily, unambiguously infringing) is likely wrong. There is simply too much latitude by judges, and there have certainly been cases where a ruling went one way, only to be overturned on appeal.
2. While I certainly have an opinion on how I think this case will be decided, I'm not presenting that with unwarranted confidence. Instead, I linked that great article on the 4 factors of fair use determination because it's clear to me lots of people are saying "fair use!" on one side or the other with no understanding of the factors judges must actually consider when making a determination.
You seem to be shifting the topic of this thread. The GP comment is about paraphrasing news articles, while I don't see anything in the NYT lawsuit about paraphrasing. Rather, the NYT is concerned with exact or near-exact reproduction. I too am very curious about the outcome of this case and wouldn't care to bet either way on the outcome. I do have an opinion on what precedent would be better for our society, but that doesn't mean I think that outcome is more likely.
However, none of that matters in this particular thread. There are well established precedents about paraphrasing news articles and they do not support the claim you made.
The "unfounded claims" were backed up by a link to Stanford on fair use and copyright. That's the opposite of being unfounded.
Remember. The NY Times does not have a record of filing frivolous lawsuits. Particularly not against companies with deep pockets. So it is almost certainly true that a lawyer who knows the law better than you thinks that this has a real chance. So you should be looking for flaws in trivial defenses that you can think up, rather than assuming that you know best.
For example take your copyright facts defense. That would be great if the NY Times was a phone book. They aren't, in addition to facts they offer analysis, editorial positions, and so on. For example I just asked ChatGPT, "In 2016, did the New York Times generally support or oppose President Trump?" I got back an answer talking about various kinds of concerns that the New York Times had, including an editorial titled, "Why Donald Trump Should Not Be President". The copy that ChatGPT needed to have to do that has a lot more than just facts in it.
Now if you paraphrased the NY Times like ChatGPT did when it answered me, you'd have a perfect fair use defense. But here's the difference: you aren't doing it for money, you didn't make a copy of all the NY Times, you aren't destroying the market for the NY Times, and you're legally able to own copyright in your transformed work. OpenAI is doing it for money, did copy all of the NY Times, is seriously impacting the market for NY Times articles, and ChatGPT-generated text does not get a copyright.
Fair use is filled with shades of grey. Even if ChatGPT appears to do the same thing that you do, it is far less clear that OpenAI will enjoy the same level of fair use defense.
The Stanford link is just generic information about the fair use tests and does nothing to backup the assertion.
> They aren't, in addition to facts they offer analysis, editorial positions, and so on.
Those opinions and ideas are also not copyrightable. Only expressions of them are copyrightable, which is why paraphrasing facts, ideas and opinions is not a violation of copyright.
> Fair use is filled with shades of grey.
Yes, but not all those shades are equal. There is a long history of litigation showing that paraphrasing news articles is fine.
I would say it is arguable that is fair use, but the whole thing about fair use is that it is a defense, not a type of license or something you can preemptively apply. So whether or not it will be protected under fair use is actually not determined yet. In fact I would say that’s the entire debate here, right?
I have worked on many documentaries and any time we said “fair use” internally what we were implicitly saying is “nobody will come after us because they know that we are probably safe under fair use if this escalated.“ But again, we could never preemptively apply it. We were just anticipating potential conflict and gauging how likely it was to occur.
He's talking about citing and quoting NYTimes articles, not republishing them verbatim. That said, it's very different if you're a publication that sometimes cites reporting from other publications vs. a website exclusively dedicated to indexing and summarizing NYTimes articles.
I couldn't get GPT to quote an actual NYT article no matter how hard I tried... it just hallucinated in the general style of a news article.
Presumably, if it can remember at least a paragraph or two of each article, then surely the same would be true of any text it ingested and the model size would approach the dataset size (probably actually much larger). I don't believe this is the case at all, even searching around, I've not found any good recent examples of it regurgitating copyrighted text verbatim.
It's cool to hate AI stuff if you're a creative atm. But gotta love those generative/algorithm based PS brushes, that's still real art!
"Indeed, the opening paragraph of "A Game of Thrones" by George R.R. Martin, with the chapter titled "Bran," starts as follows:
"The morning had dawned clear and cold, with a crispness that hinted"
And then it cuts off. Whether that's because OpenAI now has an "oh shit" filter, or the model just had access to the first page or to publicly available articles quoting the first line, I'm not sure.
I tried other chapters and random sections and it could get a sentence or two right but then hallucinated; what's more likely, NYT and GRRM? That your works are being reproduced verbatim? Or that Facebook, YouTube descriptions, fan tumblrs and hell, the publicly available and multiple GoT-related wikis that include a variety of passages from the books were used as training data?
I don't think it's necessarily true that model size would need to be larger than dataset size. It's theoretically possible that the model encoding achieves significantly better compression than DEFLATE or GZIP or whatever compression algorithm is used to store the dataset itself.
I think what wouldn't be covered is reproducing substantial portions of an article, especially if it's done without attribution. Tier 2 publications that fully reprint NYT or AP/Reuters articles are usually doing this via a paid News Service or Content License. See: https://nytlicensing.com/content/new-york-times-news-service...
Correct. But those 2nd-tier sources don't reproduce NYT copy verbatim. Do you really think the US NFL, as an example, would let OpenAI use all of its recorded games as a way to train some new GenAI game framework to build better American Football games? No. All that material is copyrighted. Public media is going to move into a very awkward era of ownership and licensing, because all of these large companies looking to make a buck off public data sets are doing very little to make the economic model less one-sided.
I hope the NYT prevails here, personally. Models will (and are) currently tainted by data they should not contain and for longer term privacy concerns this needs to be addressed early and have significant consequences or we're headed towards a world where this type of technology will make our ad-targeted world seem like a much more manageable past.
Why do you say that? Commercial vs noncommercial use is a primary factor in the “purpose” prong of the fair use balancing test and a significant one in the “market effects” prong.
That a use is noncommercial is often a deciding factor in the success of a fair use defense. GP is overstating it though, since it’s still one of many factors.
Whether or not the use is commercial is certainly one of the considerations, but it's not the most significant one generally. There certainly can be specific cases where it's very significant, of course.
But what I was arguing was that a use is not "fair use" merely because it's noncommercial in nature. I cannot make copies of movies and give them away on the street for free and successfully claim "fair use".
Because anyone that is familiar with fair use knows that the purpose prong and the commerciality aspect of it is not one of the more important prongs of the fair use analysis, whereas transformation is. Transformation adjusts what is a purpose that falls under fair use. Did you read Warhol??
Yes. Warhol is an example where the commercial nature of the secondary use was the deciding factor in its failure to pass the purpose prong.
> In sum, if an original work and secondary use share the same or highly similar purposes, and the secondary use is commercial, the first fair use factor is likely to weigh against fair use, absent some other justification for copying.
(P4). It’s very likely that a noncommercial secondary use would have passed under the reasoning in Warhol. I don’t understand the point you’re trying to make.
Read what you quoted - the commerciality of the use comes after whether or not the use was transformational. That's the entire point of Warhol - when the use is not transformational, there is very little space for a commercial fair use.
Always great to see people point out Weird Al, cause he's the shining beacon of an example of what OpenAI et al. should be doing. He explicitly gets permission from the original authors before doing any of his parodies, and he's even been turned down a few times as well, famously Prince rejected him a bunch of times and he subsequently has never made a Prince parody.
Not only does he get permission from the original authors, he also pays royalties to the authors despite legally not having to do so.
I'm pretty sure Weird Al is actually using compulsory licensing in music and just paying the required royalties to the songwriters. Anyone can cover any published song, you just have to pay the royalties when you do.
He doesn't actually make very heavy use of the satire plank of fair use. He credits the original artists. From his own website
"Does Al get permission to do his parodies?
Al does get permission from the original writers of the songs that he parodies. While the law supports his ability to parody without permission, he feels it’s important to maintain the relationships that he’s built with artists and writers over the years. Plus, Al wants to make sure that he gets his songwriter credit (as writer of new lyrics) as well as his rightful share of the royalties."
The fact that he could rely on fair use is separate from whether he as an artist does rely on fair use.
If that were true, I could take a band that I hate, copy all of their music note-for-note, then release an exact copy on the market and undercut them by selling their entire discography for $0.01
Fair Use requires one of several enumerated activities, including satire, education, journalism. You can’t just copy content and hope that it passes Fair Use.
Hire a lawyer if you are unsure. But at least read the Wikipedia article on the subject if you are going to talk about it.
Yes, but this then hits against learning/understanding and compression being fundamentally the same thing. I can't think of a better way to argue in favor of "it's fine if human does it, therefore it's fine if LLM does it", than from the "lossy compression" angle.
It's not okay for a human to pirate, plagiarize, violate IP rights and laws, etc.
But I disagree with the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn".
Humans are living beings with needs and rights. A person being able to legally squat in a home doesn't mean that a drone occupying property for some amount of time also has squatter's rights, even though you could easily and affordably automate and scale the deployment of drones to live and hide away on properties long enough to attain rights regarding properties all over the country.
> But I disagree with the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn".
Backprop doesn't happen in us, but I think our neurones still do gradient descent – synapses that fire together, wire together.
And ultimately, at the deepest level we can analyse, our brains' atoms are doing quantum field diffusion equations, which you can also do in an Excel spreadsheet, so that kind of reductionism doesn't help either.
> Humans are living beings with needs and rights. A person being able to legally squat in a home doesn't mean that a drone occupying property for some amount of time also has squatter's rights, even though you could easily and affordably automate and scale the deployment of drones to live and hide away on properties long enough to attain rights regarding properties all over the country.
Yes, but we can also do tissue cultures and crude bioprinting, so it's a very foreseeable future where exactly the same argument will also be true for living organisms rather than digital minds.
We need to figure out what the deeper rules are that lead to the status quo, not merely mimic the superficial result. The latter is how cargo cults function.
>Backprop doesn't happen in us, but I think our neurones still do gradient descent – synapses that fire together, wire together.
No! Hebbian learning is categorically NOT gradient based learning. Hebbian update rules are local and not the gradient of any function.
Cortical learning is so vastly different from how artificial neural networks “learn” they cannot even begin to be meaningfully compared mathematically. Hebbian learning is not optimization and backprop is not local learning.
Part of the problem of these discussions is a bunch of clueless people talking with authority.
Finally, a good counterargument. I've seen enough terrible arguments to know exactly how you feel — even in specifically just AI.
I have to keep reminding myself that outside of my own speciality, ChatGPT knows more than me despite its weaknesses, so I bet ChatGPT knows more about Hebbian learning than I do.
> We need to figure out what the deeper rules are that lead to the status quo, not merely mimic the superficial result.
Sure, that's an interesting path of inquiry, and one should be free to understand themselves as being no different than a machine if they desire.
But the objective of laws is the benefit of (at least some) humans, not machines covered in lab grown tissue. The process of being human is a big part of what makes us human.
I think you're misapprehending — I mean an entity fully 3D printed out of tissue, no machinery (unless you're counting all biology as machinery, but I think you're not doing that).
I reckon bio-printing is now where home computing was in the Apple 1 era, so this is a way off, but it's foreseeable.
> The process of being human is a big part of what makes us human.
Mmm. How much has that process changed since the ancient world?
> I reckon bio-printing is now where home computing was in the Apple 1 era
How do you reckon that? The Apple 1 was Turing complete. We haven't printed life, that would be a tremendous accomplishment.
I think we're closer to Edison inventing a lightbulb as a step to computers being possible. Printing a conscious thing, at all, would be like the transistor. An Apple 1 analogue wouldn't be likely because of the terrible ethics of a "shitty" printed human.
> We haven't printed life, that would be a tremendous accomplishment.
Sure we have, and in multiple different senses.
The ones which matter here are cell culture, which is nowhere near the fanciest bar that's been surpassed in this field, and tissue culture, which is somewhat harder, but the reason why I reckon it's at the Apple 1 level is that a small number of experimentalists are messing around with it using expensive equipment that you can technically buy at home but you need to be well trained to actually use, for example:
No. That isn't printing life, that is taking already living cells, priming and transforming them into something useful. Regardless, I'd count it if we could make an entire living organism this way, but we cant. Creating a working organ is no doubt amazing, and proof that this technology is worth pursuing, but it isn't "printing life" any more than producing life saving drugs is.
In your example you are talking about being able to bioprint a person (they have to be a person to have that right) to squat a property. Bio printing an organ isn't an example of that, it's not even close. Saying that we are anywhere near being able to print a human to squat a property is pretty ridiculous.
> No. That isn't printing life, that is taking already living cells, priming and transforming them into something useful.
Which is absolutely sufficient for the usage I described upthread. In fact, I'd go so far as to say it's mandatory for the point I was making, as — fun though bio-printed werewolves, dragons, and fae would be — my point only works if you get humans out of the process rather than some other species. A bioprinted horse is probably slightly harder than a bioprinted human, but the latter isn't getting any squatting rights.
I could've linked to work on synthetic genomes and nucleotides to give evidence for lower-level creation of life, but they don't matter for the same reason:
My point is that there's a pathway heading off into the distance, and somewhere in the distance but before the horizon can be found bio-printed humans, with all the same moral issues we're now just beginning to take seriously thanks to AI being conversational. And if we had something completely customised, that's cool and all, but it doesn't make anyone go "oh, they're people" the way a humanoid body with human DNA getting off a table saying "hello, nice to meet you" does.
> In your example you are talking about being able to bioprint a person(they have to be a person to have that right) to squat a property. Bio printing an organ isn't an example of that, it's not even close. Saying that we are anywhere near being able to print a human to squat a property is pretty ridiculous.
I wrote "an entity fully 3D printed out of tissue […] is a way off, but it's foreseeable" and compared bio-printing today to a nearly 50 year old computer, and one of my references was a link to a youtube channel where someone is attempting to do a small-scale prototype thing along these lines with a handful of organs made from mouse cells grown in his own lab (and mouse cells rather than human because of the disease risk not because something magic happens with human cells). You're mixing up what I think is foreseeable with what I say already exists, and using the nonexistence of what I think can be foreseen to argue against what does exist.
Sadly, I have seen one. It was a vba script from the late 90s that used a simple dense multilayer network to do some unsupervised pattern classification. The linear algebra tools in vba/excel along with the solvers are all native dll code and the vba itself is all AOT compiled to native, so it typically runs very fast, and for small matrices it beats out numpy by an order of magnitude due to the ffi overhead. Was it the wrong tool? It depends on your constraints, but probably. It did work though.
Whoever operates the LLM, in this case OpenAI, engaged in copyright infringement through the unauthorized modification, reproduction and distribution of content to you.
> sure, but if I use an LLM to write a novel/article, I can be sued in civil court not the LLM
That's a function of the legal system, not of the technology. If tomorrow someone made a perfect dolphin-Esperanto translator and proved dolphins were as smart as humans, you still can't sue a dolphin until the legal system says so.
> anthropomorphize LLMs (...) gradient descent (...) backpropagation (...) needs and rights
You misunderstood me. I was talking about something more fundamental.
Understanding is data compression. They are the same thing. Learning patterns, building mental models, creating abstractions, generalizing, gaining intuition/a feel for something - all the things humans engage in as part of learning and understanding the world - are all acts of lossy data compression.
also if I write an article and quote some "text like this" [1] then that's not plagiarism, but if my argument is that the underlying assumption that you can anthropomorphize LLMs. Gradient descent and backpropagation don't take place in the brain. LLMs "learn" in the same way that Excel sheets "learn". Well, that's plagiarism and it's not allowed and people will get peeved and my career might get damaged.
I await the HN ban with fear..
[1] I'm not even doing referencing - so I am surely an LLM.
I can’t think of a better way to argue in favor of “LLMs are copyright laundering machines” than from the humanness angle.
Humans have rights, software tools don’t.
If you grant an LLM the full set of human rights, then it can consume information, regurgitate copyrighted works, and use it to generate money for itself. However, considering blatantly obvious theft as “homage” goes hand in hand with free will, agency, being in control of yourself, not being enslaved and abused, etc. Pondering various scenarios along those lines really gets to the heart of why an LLM is so very much not a human, and how subjecting it to the same treatment as humans is a ridiculous notion.
If you don’t grant LLM human rights, then ClosedAI’s stance is basically that pirating works is OK because they pass them through a black box of if conditions and it leads to results that they can monetize. That’s such a solid argument, it’ll surely play well in the court of law.
Training data is not an “LLM does it”; first because “it” here is not “learning” or understanding in human sense (otherwise you would have to presume that an LLM is a human), and second because a software tool doesn’t have agency and it’s really just Microsoft using a tool based on copyrighted works to generate profit.
Humans don't exactly have the greatest track record of granting other humans rights. I don't presume they'll get it any better with AI.
What I expect to happen is whoever has the most influence and power will get what they want and we'll end up raising a generation with the implicit understanding of "that's just how things are," natural order, truth, reality, and all that jazz.
The only thing that ever changes outcomes is when the contradictions of the status quo become impossible to manage.
I can’t argue for or against whether LLMs should have rights or not… I can only point out the hypocrisy of claiming LLMs are “like human” enough and independent enough that their operators-become-slaveowners cannot be held to account on any copyright matters, but also claiming that LLMs are not like human at all lest someone demand that they have rights and nuke the industry.
Humans are defined not just by their abilities but by their limitations too. We celebrate our achievements because sometimes they surpass the limitations of an average human.
Our collective human limitations(physical, mental and temporal) are sort of invisible implicit rules that we all follow in one way or the other. If an entity is not bound by those rules then I don't see why that entity should be treated the same as a human.
Companies already make this differentiation.
For example take captcha and bot detection. Some of the heuristics are based on inherent human limitations like response time, click time, mouse acceleration etc.
I doubt youtube or any other streaming service will be happy if you want to stream all their videos to train a hypothetical human like AI(which views and prepares notes like a human) at a hugely accelerated speed compared to a regular human. You can guess how quickly they will cite fair usage policies.
What I want to say is there are fundamental differences between a human and an AI. So, we should not be quick to dismiss any concerns just because AI can "mimic" humans in certain areas.
Is there some LLM meta where understanding and compression are argued to be the same thing I’m not aware of?
Anyone got more details on this?
Superficially it sounds like total BS; a highly compressed zip file does not exhibit any characteristics of learning.
Algorithmically derived highly compressed video streams do not exhibit characteristics of learning.
?
I’ve vaguely heard that learning can be considered to exhibit the characteristics of compression, in that understanding of content (e.g. segmentation of video content resulting in more highly compressed videos) can lead to better compression schemes.
…but saying you can “do a with b” and “a and b are fundamentally the same thing” seems like a leap…?
It seems self evident you can have compression without comprehension.
Suppose you wanted to train an LLM to do addition.
An LLM has limited parameters. If an LLM had infinite parameters it could just memorize the results of every single addition question in existence and could not claim to have understood anything. Because it has finite parameters, if an LLM wants to get a lower loss on all addition questions, it needs to come up with a general algorithm to perform addition. Indeed, Neel Nanda trained a transformer to do addition mod 113 on relatively few examples, and it eventually learned some cursed Fourier transform mumbo jumbo to get 0 loss https://twitter.com/robertskmiles/status/1663534255249453056.
And the fact it has developed this "understanding" as an ability to learn a general pattern in the training data enables it to compress. I claim that the number of bits required to encode the general algorithm is fewer than the number of bits required to memorize every single example. If it weren't then the transformer would simply memorize every single example. But if it doesn't have space then it is forced to try to compress by developing a general model.
And the ability to compress enables you to construct a language model. Essentially, the more things compress, the higher the likelihood you assign them. Given a sequence of tokens say "the cat sat on the", we should expect "the cat sat on the mat" to compress into fewer bits than "the cat sat on the door". This is because the former is far more common, and intuitively more common sequences should compress more. You can then look at the number of bits used for every single choice of token following "the cat sat on the" and thus develop a probability distribution for the next token. The exact details of this I'm unclear on. https://www.hendrik-erz.de/post/why-gzip-just-beat-a-large-l... this gives a good summary.
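As a toy sketch of that "more compressible = more likely" intuition (my own illustration, not how a transformer actually assigns probabilities), you can ask a general-purpose compressor how many extra bytes each candidate continuation costs on top of a small corpus:

    import zlib

    # Tiny made-up corpus with repeated structure; the "mat" phrasing is common here.
    corpus = ("the cat sat on the mat. the dog sat on the mat. "
              "the cat sat on the mat again. ") * 20

    def extra_bytes(candidate):
        # How many additional bytes does DEFLATE need once the candidate
        # is appended to the corpus, compared to compressing the corpus alone?
        baseline = len(zlib.compress(corpus.encode()))
        combined = len(zlib.compress((corpus + candidate).encode()))
        return combined - baseline

    for cand in ["the cat sat on the mat", "the cat sat on the door"]:
        print(cand, "->", extra_bytes(cand), "extra bytes")

The "mat" continuation repeats patterns the compressor has already seen, so it costs fewer extra bytes; reading "fewer bits" as "higher likelihood" is the intuition being described above.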
It’s exactly this kind of thinking that underlies lossless text compression (not exactly what a transformer guarantees but often what happens). For that reason, some people thought it would be fun to combine zip and transformers. https://openreview.net/forum?id=hO0c2tG2xL
The idea precedes LLMs by a couple of decades and is thought to apply more broadly within ML/AI than being a specific meta for LLMs. http://prize.hutter1.net/ has been around for a while, there is a link in there to the earlier work (called AIXI?).
Even something as simple as LZW starts developing a dictionary. Not all compression is sufficient for understanding, but the more you compress a stream of data, the more dependent you are on understanding the source, because understanding the source allows you to take more shortcuts and still be able to reconstruct the data.
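As a minimal sketch of that dictionary-building behaviour (a toy example, nothing rigorous), even a few lines of LZW-style encoding start accumulating multi-character "phrases" from repeated structure in the input:

    def lzw_encode(text):
        # Start knowing only the single characters of the alphabet.
        dictionary = {ch: i for i, ch in enumerate(sorted(set(text)))}
        current, output = "", []
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                current = candidate                      # extend a known phrase
            else:
                output.append(dictionary[current])
                dictionary[candidate] = len(dictionary)  # "learn" a new phrase
                current = ch
        if current:
            output.append(dictionary[current])
        return output, dictionary

    text = "the cat sat on the mat, the cat sat on the mat"
    codes, learned = lzw_encode(text)
    print(len(codes), "codes for", len(text), "characters")
    print([k for k in learned if len(k) > 2][:8])  # some of the learned phrases

The more repetitive (i.e. predictable) the input, the more of the output is carried by those learned phrases rather than by raw characters, which is the sense in which squeezing a stream harder forces the encoder to model its source.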
Correct, just like it’s infringement to reproduce an article from memory using pen and paper intentionally. The person deciding to do that bears responsibility. OpenAI would be liable IFF they were intentionally facilitating that, instead of it being an undesired artifact from overfitting.
I'm pretty sure if you reproduce a work from memory by accident, because you didn't notice your subconscious had just stored the entire article and is now reproducing it word for word, you'd still be guilty of copyright infringement.
It's super obnoxious when people who have no understanding of the law, point to industry patterns or behaviors as examples of what is legal, not knowing the law and not knowing whether or not the thing they are pointing to is legal. The music business is also full of copyright infringement litigation. You also are not taking into account whether what is copied by an artist is covered by copyright when you made your statement. Do you know what's covered in music copyrights, such that your statement ever had any value for anyone else here?
That's not true at all. Copyright infringement is a strict liability offense with no inquiry in to the state of the mind of the infringer from a liability perspective. The state of mind of the infringer is only relevant to the issue of willful infringement.
It's just "infringement" and "willful infringement" there is no "less-than-willful infringement". Willful infringement is punitive with increased damages and increased burden to show - it's in the freakin' statute.
> To keep things simple, let's say I never regurgitate chunks of verbatim NYT articles, maybe quite short snippets.
You just described Google. When you think about it, it's surprising that Google is legal. However, it is well established that what Google does is perfectly legal. Remember that internally Google keeps and uses complete verbatim copies of every web page they index.
Yes, Google offers a link to the source. If OpenAI did the same, even if only 0.1% of people clicked on the links and NYTimes hardly got any revenue from it, would that make it legal in your eyes? What if they implemented a system that detected when it was outputting a verbatim copy of something and simply paraphrased it? NYTimes clearly doesn't have copyright on paraphrased versions of their articles. I think it would be pretty silly if the government forced them to do that as it wouldn't make any practical difference to anyone.
Any publisher can opt out of google. Publishers also have substantial control over titles and snippets shown in google, whether an article appears in google news, etc
Paraphrasing is also known as cloning and is often a copyright violation
Copyright law doesn't mention opt outs or search engine snippet controls. It's not clear to me that robots.txt is the singular thing that makes Google legal.
In US copyright law facts cannot be copyrighted, so copyright on factual content like newspaper articles is limited. Simply replacing a few words wouldn't work, but I am certain that GPT-4 is capable of paraphrasing factual content at a level that would not be considered infringement if a human did it.
If I make a website that scrapes NYT and passes it back and forth through a machine translator, say, English -> Spanish -> English, then the content will be slightly modified. Is this legal to make money off of?
Seems like the legal answer is unclear but, like Napster, such a system seems like it would lose in court.
It would be unlikely to be something you'd find paying customers for, though? I suppose if you charged a small percentage of what NYT charges people might be willing to consider it, but you'd have some costs for hosting etc., so I am skeptical about its viability as a business model...
I'd serve fake news en-masse to low IQ people who click things to feel good about their own views. I'd also build a handful of websites (ideally as many as I can personally manage) to flood the Internet with fake news clickbait.
One site clones fox news. One clones news max. And so on, cloning many news sites, sports sites, any news site. Automated, massive scale content farming. Think of the websites recommended by Taboola but, realistically, a whole lot worse.
That’s not the only reason. Google search is also transformative and non competitive with the underlying publications. And that is why the opt out is important. If you feel google competes with your site you don’t have to sue Google: just tell them to go away.
Transformative yes, so is ChatGPT. Much more so actually. Non-competitive is debatable. Especially with the instant answers Google has in addition to regular snippets which can also obviate the need to visit a site. I have a hard time seeing ChatGPT as competing with newspapers more than Google Search does.
Nobody is seriously going to ChatGPT and trying to trick it into regurgitating old NYT articles as an alternative to paying for access to NYT's archives. Meanwhile, newspapers went as far as getting the laws changed in several countries because they felt Google was competing with them too much and didn't like the fact that it was legal.
>Copyright law doesn't mention opt outs or search engine snippet controls. It's not clear to me that robots.txt is the singular thing that makes Google legal.
Genuinely - what are you talking about besides your own assumptions? You just assume everything google does is legal and therefore anyone else doing anything arguably similar must also be legal? Without regard for factual details that do matter to copyright law? Such as license?? Your own description of copyright law here is very stunted - you can't paraphrase articles of the NYTimes and call it fair use. You can report on what the NYtimes reports on... because that's what news is.
Can they? Here's a reference to a legal fight where Google scraped song lyrics from a lyrics website, and presented the lyrics verbatim directly to users (bypassing the original site and the ads that allowed that site to operate)
> However, it is well established that what Google does is perfectly legal.
Google has a wide range of products and shakedowns. Not all of them are "perfectly" legal: Google is being challenged in court over some of their shakedowns and product practices.
I am clearly talking about the web search engine in the context of copyright. Other products or legal concerns like antitrust are completely irrelevant here.
You took that quote out of context and missed the broader point in the process. The snippets provided in regular search results cannot generally replace the substance of the full articles they link to, whereas replacing the source is the whole point of GP's hypothetical website. It avoids reproducing large chunks of text verbatim, presumably to dodge copyright infringement claims within the hypothetical, and rhetorically it serves as an analogy for the information-laundering powers of LLMs, which their creators claim make the exploitation of unlicensed training data fair use.
The whole point of a search engine (as we've classically known them) is to index the web and respond to queries with a list of links that you will inspect and click through on. The whole point of an LLM chatbot tool is to eliminate those inspecting and clicking-through steps, becoming a one-stop shop for content whose substance was created by someone else. That's also the whole point of GP's hypothetical, which is why it works as an analogy.
---
There are substantially better arguments for search engines being legitimate fair use. Consider, for example, transformation. AI defenders will argue that these systems are transformative because they reshuffle elements of their input in their output, but that's clearly a much weaker form of transformation than one in which the transformed work has an entirely different nature and purpose, i.e. search engines vs. the results they return. Ultimately these technicality-based "nuh uh" arguments aren't going to save the practice of training AI on unlicensed data, because they are incompatible with the spirit of copyright law even if the novel nature of these technologies means the letter of said law can't quite nail them down yet.
If these arguments do succeed, it will be because the judicial/regulatory environment in which they were applied has been corrupted by capital.
A search engine takes an input string, a corpus of text, and returns a series of text that best comes next after the input string.
An LLM takes an input string, a corpus of text, and returns a series of text that best comes next after the input string.
To get a paragraph of output, you run the search over and over again
Both the search and LLM reshuffle the inputs to the outputs.
If I'm describing the purpose of the LLM, it's got a wide number of usages. "Making my resume look more professional" or "be a crud api" or "reformat my ask into a api call to X service" or "give me a timeline of events surrounding Y with source links"
An 18 wheeler travels on wheels. A shopping cart travels on wheels. A shopping cart does not require a license to operate, therefore an 18 wheeler does not require a license to operate. A shopping cart can be operated inside a grocery store, therefore an 18 wheeler can be operated inside a grocery store. A child can operate a shopping cart, therefore a child can operate an 18 wheeler.
If I'm describing the purpose of an 18 wheeler, it's got a wide number of usages. "Carry my chicken" or "carry my lettuce" or "carry my Cheetos". Or, simply, "carry my groceries".
> And how did the training data contribute to the content in any meaningful way? Inspiration isn't substance.
If training data is so unimportant, why not simply not use it and avoid the controversy? At the very least that would certainly fix the issue where the model demonstrates how "inspired" it is by NYT articles by reproducing them verbatim.
The reason why Google keeping entire digital copies of other people's copyrighted works is legal is because copyright is all about distribution rights. Any person can possess the entire works of Disney (without paying for them), for example, and as long as they do not distribute those works they're 100% in the clear.
Possession is not a crime when it comes to copyright. It's not like physical things (e.g. drugs or guns) at all. This is why comparing copyright violations to theft is silly.
ChatGPT can absolutely keep verbatim copies of the entire works of basically anything and not run afoul of the law. When it regurgitates a small part of an article that's covered by fair use in theory but the truth is that fair use can only be determined by a judge in a court of law when someone is sued. It cannot be determined with any sort of certainty ahead of time. It's a legal defense, nothing more.
Summarizing content has been legal forever as well (see the other posts here talking about Cliff Notes and some similar products). That's not even fair use that's just like, people's opinions, man (legally speaking).
I don't think the NYT will get what they want out of this at all.
My impression is that it’s not necessarily legal, but going after bloggers and proving damages is just a huge waste of their time. OpenAI came by with their fat stack of funding and changed that.
No, in US law at least there can be no copyright of facts, only presentation. If you convey the same facts in different words that isn't a matter of fair use, it's never even a matter of copyright in the first place.
I was inarticulate. Imagine a business that goes to some trouble to review businesses or products. Can we lift those and serve them ourselves? Non facts…
What the parent poster meant is that it is normal for news organisations to reference each other and report/cite/rephrase each other's reports. For example, all other newspapers reported on the Watergate scandal that was reported by Bernstein & Woodward in the Washington Post.
No, it is very specifically and deliberately fair use. That is the primary intended purpose of fair use. The New York Times doesn't own the news; they just own their articles.
It is legal. Fair use. People have been doing it for ages. Almost every article you've ever read has some fair use of another article, book or news item, etc.
The Tolkien estate should get busy suing all the fantasy writers, comic artists, game developers and board and card game companies. Lots of cash there.
They have done some of that actually. Tolkien will be public domain in a few days in the nations that use author's death + 50 years. Sadly, it will be a much longer wait in mine and many others.
I think the issue is that they trained ChatGPT on the New York Times' proprietary IP without paying licensing fees and, the Times argues, that is illegal. By way of proof the Times has examples of ChatGPT dumping out articles verbatim.
This is exactly how I understand it. There’s a lot of ink getting spilled about “summarizing isn’t illegal” and “what about Cliffs Notes” but that isn’t what this is about.
If the verbatim examples that have been going around are true, that’s bad. I’d love to know more details around it — prompts used, whether that’s an old model, etc. This seems like plagiarism more than anything.
IMO it's pretty hard to argue that training an LLM isn't a transformative use. It's clearly not just copying, or even excerpting. Even if it was just compression (and it's not), they're only providing model output, not distributing the "compressed" NYT articles.
Yielding verbatim snippets of copyrighted content is a problem for OpenAI though.
Perhaps we will see the courts revisit this idea of "transformative" works and formulate something more useful. In my opinion, you can't build an LLM unless you have a large amount of data with which to train it. Given the huge amount of money companies like OpenAI hope to generate, it seems unreasonable that content creators would not be rewarded.
> If you forget about the LLM aspect, and simply build a product out of (legally) scraped NYT articles, is that fair use?
That's not a good question.
If I look out of my window and see my neighbor go to the shop, that's fine. If I use cameras and track everybody I see on the street and put them in a database, then that's problematic and illegal in many places.
Logic does not necessarily apply when scaling is involved.
> If I use cameras and track everybody I see on the street and put them in a database, then that's problematic and illegal in many places.
Afaik not illegal in the US. You put a camera on your own private property (window), use it to record what’s happening in a public space (the street outside), and then store that data in a database (that other people can presumably access). Unless I am missing something, this scenario is perfectly legal in the US.
Before anyone hits me with “not every country is like that at all”: NYT is based in the US and the lawsuit is filed in the US. So how a bunch of other countries deal with similar issues shouldn’t really have as much bearing on this specific case.
It depends. Google built a product out of scraping content (Google Search).
But what I'm saying is that answering the question does not allow you to deduce anything about your rights; that's what I mean by "not a good question".
The general answer is no. Fair use is a special carve out legally that has to be determined individually. If your product is something that regurgitates NYT articles while stripping NYT of their source of revenue, that’s got fair odds to not qualify as fair use.
It can allow you to deduce something, depending on the answer.
If we want to establish whether scenario A is fair use or not, and we all agree that A is "worse" (regarding fair use status) than some other scenario B, then if we also agree that B is not fair use, A by definition isn't either. The opposite is not true, of course: B being fair use does not imply that A has to be as well.
I find that kind of upper/lower bound logic can be pretty useful and I think it's what the parent comment was trying to do.
On a related note, that same logic is why I think Godwin's law can be a bit misapplied now and then. Sometimes bringing up nazis/Hitler can be useful to establish some ground truth in a debate (instead of just a way to imply your opponent is actually a bad person, or, possibly, an actual nazi themselves). E.g. a conversation on the morality of violence is vastly different depending on whether you agree that violence against nazis is ok or not.
I think it can still provide value if the actual scenario at hand is so complex and fraught that conversations about it end up mostly fruitless (as I think is the case here). At least it can provide you with some mental handholds and supports for where to start reasoning about the problem, which hopefully helps in finding some small agreements, or at the very least, mutual understanding of each other's positions.
The real answer is it totally depends on whether your product grows to $10,000,000,000, and whether you pay part of it back. Search engines pay with referral traffic.
If there is payment then usually there is an agreement. An agreement can limit fair use. Can the NYT, via an agreement, e.g., "Terms of Use", limit what the subscriber does with the articles? There is not much precedent that suggests otherwise.
Consider the analogy from libraries that want to do data mining.
"Unfortunately, in licenses for digital scholarly content the majority of content acquired by research libraries publishers often include terms that prohibit certain uses that would otherwise be allowable under the Copyright Act. For instance, licenses may require libraries or individual researchers to negotiate for otherwise lawful activities, such as text and data mining, and to pay exorbitant fees on top of the cost of the content itself. While new regulations allow researchers to circumvent technological protection measures to access copyrighted materials, licenses for that content may include terms that explicitly prohibit this circumvention. In many cases, these activities might actually increase the value of published material; for instance, if a data-mining project yields new knowledge about a topic covered in a journal, it may very well spark new interest in that journals content. Libraries and publishers have often assumed that license terms that restrict copyright exceptions are enforceable under state contract law. There is, however, surprisingly little case law on this point."
Putting some string in a robots.txt to try to stop data collection is an amusing "solution". Should copyright owners have "Terms of Use" that limit usage for commercial "AI" purposes?
As always, the answer is.. "it depends". I guess it depends mostly on the jurisdiction that applies to you. "Fair use" can have rather different legal meaning (or not exist at all) in different countries.
Fair use is specific to the US, as far as I'm aware. Moreover, Congress had to codify fair use (turn fair use common law into statutory law in the form of 17 U.S. Code § 107) in order to make copyright statutes compatible with the First Amendment. Most other countries don't have freedom of expression and freedom of the press, so copyright law in a different country usually lacks a unifying exception test like fair use to supplement the specific enumerated exceptions.
Good point. I failed to qualify what I meant by freedom of expression, and made a meaningless claim regardless. Despite the US Constitution's relatively broad speech protections (e.g. don't criminalize hate speech, and allow truth as a defense to defamation claims), US governments don't always respect freedom of expression (e.g. KOSA would force social media companies to moderate more aggressively to "protect kids") or respect press freedom (e.g. police pepper spray journalists at protests). Even so, I think Congress wouldn't have bothered to codify fair use if the First Amendment weren't as broad as it is.
I replace the following sentence from my previous comment:
> Most other countries don't have freedom of expression and freedom of the press, so copyright law in a different country usually lacks a unifying exception test like fair use to supplement the specific enumerated exceptions.
with the following:
Copyright law in most countries usually lacks a unifying exception test like fair use to supplement the specific enumerated exceptions in each respective country.
> "Many" does not necessary include "most" but "most" does include "many".
No, it doesn't. If a set is of sufficiently low cardinality, “most” (in extreme cases, even “all”) of the set may not be “many”.
Most (all, in fact) Catholic Presidents of the United States have been Democrats. But it is not the case that many Catholic Presidents have been Democrats.
Most women to have served on the US Supreme Court did so only after its first 200 years. But, again, there were not many women who served on the Supreme Court only after its first 200 years.
From what I can tell, this has nothing to do with LLMs at all. In the example in the article, the user is asking Bing to go fetch the contents of an article directly from the website, and print it out, which it dutifully does.
Seems like the "problem" is that NYT etc gives privileged access to search engines for indexing their content, but then get upset when snippets of the indexed content is being shown to users without the users having to fight the paywall or whatever.
This article also claims that the screenshot is coming from ChatGPT when it clearly is not.
The example from the article doesn't show that LLM is trained on copyrighted data - it's just Bing fetching the source article, providing it to GPT, and GPT rephrasing the article. An agent trained on entirely copyright-free data would provide exactly the same output.
In that case, the language model calls a search function and just repeats the result from its conversation context, not from its training data. With that in mind it's not clear why it's ok for Bing itself to quote the source, but it stops being ok when a chatbot does it.
> Implications: The Ninth Circuit's declaration that selectively banning potential competitors from accessing and using data that is publicly available can be considered unfair competition under California law may have large implication for antitrust law. [citation needed]
> Other countries with laws to prevent monopolistic practices or anti-trust laws may also see similar disputes and prospectively judgements hailing commercial use of publicly accessible information. While there is global precedence by virtue of large companies such as Thomson Reuters, Bloomberg or Google [or LexisNexis or Westlaw] effectively using web-scraping or crawling to aggregate information from disparate sources across the web, fundamentally the judgement by Ninth Circuit fortifies the lack of enforceability of browse-wrap agreements over conduct of trade using publicly available information.
IANAL but aren't the key terms there "selectively banning" and "publicly available"?
NYT articles are largely behind a paywall for everyone. That means they are not publicly available, and a competitor who was blocked from accessing or reproducing that content without a license would not be "selectively banned"
> Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it. Say like VS Code.
Can't you, though? I'd thought that, in general, it's very important for the market to be able to do just that, otherwise everything gets gummed up in webs of exclusive contractual dependencies between established companies.
As I say, I don't really know. But then, this is exactly how SaaS licensing works. There may even be a free personal tier, where you can't sell products based on it, and a professional tier which may be very expensive indeed.
Typically providers of online databases go to some effort to stop people from sharing logins. Even from that point of view, I can imagine scraping articles and providing paraphrases of them for a fee is fishy.
All I'm saying, to some people it's obvious that the whole LLM on scraped Internet is fair use, to me it is not obvious.
> Typically I can't take a personal "tier" of a product and charge 3rd parties for derivatives of it.
I think you’re confusing terms of service and copyright. IANAL but what you describe sounds exactly like fair use to me, irrespective of how much you are paying NYT.
There's nothing wrong with scraping openly available data (including data openly available by mistake, as long as you are not aware of it, see the Bluetouff affair).
So the demand to destroy those databases seems very dubious to me.
Of course later violating fair use is another issue.
It would be nice to have a nice principled answer to this, but unfortunately, in our world, the answer is probably: if you start making LOTS of money doing this, they will come after you.
The best example is that sport scores, names and stats are not copyrightable by settled case law; however, you still have to go to the NBA and players union if you want to make a fantasy basketball game that has stats or names.
As someone pointed out, plenty of blogs made money off of doing just that. Many people go to Reddit to read news article summaries (and often a comment just pastes the whole article verbatim), instead of paying a site like the New York Times. Twitter and other social media sites are full of people summarizing articles from the New York Times. Any late breaking news article from Wikipedia is going to be mostly summarizing information from reporters.
I think people severely underestimate how much they've grown accustomed to this information being freely available. It's easy to say "Well it shouldn't be available with ChatGPT," but if we actually put everything back behind a paywall and stopped people from doing things like writing blogs or newsletters that summarize the news, people here would get angry very fast.
I agree with your IANAL take, but what about a situation with an extra level of indirection? So the service never reads actual NYT articles, but only reads blog/forum posts about NYT articles, and derives what is in the article from conversations about the article by people who have read it. Is that legal now?
This analogy fails to capture the transformative nature of these models. Hosting a derivative work that is also a news article is not transformative. Hosting a next word completer is very different than a news article and can't be used as a substitute.
Can you read all of NYT and other things, and answer others' questions based on your knowledge? I'd imagine you can. I'm afraid you can't sidestep the question whether an LLM is more like a person who's read a lot or an archive/index.
> A sibling comment mentions search engines. I think there's a big difference. A search engine doesn't replace the source, not at all.
Google has been accused for years of replacing sources with their "One Box"--the big answers at the top of the page, which are usually pulled from or corroborated by search results. They don't want you to leave the search results page (where the ads are).
I believe that part of the law suit contends that the content wasn’t able to be scraped “legally” as you put it. Instead they show that ChatGPT will regurgitate verbatim excerpts from articles that are behind the paywall.
Can you please make your substantive points thoughtfully and without snark or putdowns?
Edit: it looks like you've unfortunately been breaking the site guidelines quite a bit lately. Can you please review them and stick to the intended use of the site? We'd appreciate it.
> Something like, summarise all articles on US-UK relationships over past 5 years. I charge money for it, and all I pay NYT is a monthly subscription fee.
>Is that fair use? IANAL, but doesn't sound like it.
If you pay someone to do the summarisation for you, then you publish the content and charge a fee for it, you're the one liable, not the person you paid to summarise it for you. Similarly if you ask GPT to do it for you, then publish it, you're liable for what you publish; GPT is just a summarisation tool.
That's not true at all. If you pay someone to copy NYT articles for you verbatim, and then they give the copies to you, and then you publish them online, then you've both violated the copyright. You are never allowed to make copies of copyrighted works, even for private deals (making such copies for purely personal use, such as archival, falls under fair use - but you can't build a service out of that).
So, if the summaries are derived works and not covered by fair use, then both you and the summarizer are separately breaking the NYT's copyrights. Otherwise, if this is covered by fair use, then you are both in the clear.
Finally, GPT is not "a summarization tool" in this case. If you provide a copy of a NYT article as a prompt and then ask for summarization, then yes, it is clear that GPT is not doing anything wrong, even if it spits out the exact same text. But if you simply ask for a summary of a specific article by, say, just name and date, and you get a copy of it, it's clear that GPT is storing the original data in some way, and thus it has copied the NYT's protected works without permission.
>But if you simply ask for a summary of a specific article by, say, just name and date, and you get a copy of it, it's clear that GPT is storing the original data in some way, and thus it has copied the NYT's protected works without permission.
In this particular case they were using it via Bing, which actively did a HTTP request to the particular article to extract the content. So GPT hadn't memorised it verbatim, instead it fetched it, much like a human using a search engine would.
The article states that they used it initially through ChatGPT, but that seems to have been fixed in the meantime, at least for the very simplistic queries that used to work ("the first paragraph of the Carl Zimmer article on old DNA" in ChatGPT used to return the exact data from NYT, and "next paragraph" could then be used to get the following ones). Even if this has been fixed, it still proves that ChatGPT encodes exact copies of NYT articles in its weights, which may be a violation in itself, even if it is prevented from returning them directly. Especially if they ever started distributing the trained model.
Additionally, even the use through Copilot is very debatable. They are not returning the NYT link, which requires a subscription, they are returning the contents of it even to non-subscribers. And they are doing this in a commercial product, not a non profit like the Internet Archive, which has some arguments for fair use.
If it had exact copies they would have shown it could recall the 8th paragraph or something. Even Google and the NYT release the first paragraph for free.
That's not the example. Here I proactively scrape NYT, summarise articles for a fee and sell that as a service. It's not people coming to me with some articles to summarise, and maybe then publishing it online.
At some level it becomes a subversion of NYTs fees. First, say I subscribe and simply host the articles verbatim, for a fee. Clearly, that's not right.
Suppose I change some spelling or word order, or use a synonym or two. That's still not ok.
And if I substantially paraphrase the articles? I guess this is the relevant case. This is kind of what LLMs do. And also feels like not fair use.
>That's not the example. Here I proactively scrape NYT, summarise articles for a fee and sell that as a service. It's not people coming to me with some articles to summarise, and maybe then publishing it online.
That's not what OpenAI is doing; it's not selling summarised articles as a service. Your example is a false equivalence.
>This is kind of what LLMs do. And also feels like not fair use
An LLM doesn't do this unless you ask it to. And if you then take that output and publish it as your own, you're breaching the copyright, not OpenAI.
> An LLM doesn't do this unless you ask it to. And if you then take that output and publish it as your own, you're breaching the copyright, not OpenAI.
In this case, OpenAI is violating copyright by modifying, reproducing and distributing copyrighted content to its customer.
The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use".
However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.
However, the suit goes far beyond claiming that such copying violates their copyright: "Unauthorized copying of Times Works without payment to train LLMs is a substitutive use that is not justified by any transformative purpose."
This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring. Hopefully the judge(s) will notice and direct focus on the interesting, high-stakes, and murky legal issues raised when we ask: what about a model can (or can't) be "transformative"?
> Just learn to recognize and punish plagiarism via RLHF.
This is not a RLHF problem. What I was expecting them to do is to keep a bloom filter of ngrams for known copyrighted content, such as enumerating all sets of n=7 consecutive words in an article, and validate against it. The model would only output at maximum n-1 words that look verbatim from the source.
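A minimal sketch of that idea, with a plain Python set standing in for the Bloom filter (a real Bloom filter would just make the membership test fixed-size and probabilistic); the article text, n=7, and names here are illustrative assumptions, not anything OpenAI is known to run:

```python
N = 7  # block any verbatim run of N or more consecutive words

def ngrams(text, n=N):
    words = text.lower().split()
    # A text of L words yields L - n + 1 n-grams.
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Build the filter from protected articles (a production system would hash
# these into a Bloom filter rather than keep the raw strings).
protected_articles = ["full text of a copyrighted article goes here ..."]
protected = set()
for article in protected_articles:
    protected |= ngrams(article)

def has_verbatim_run(model_output):
    """True if the output shares N consecutive words with any protected text."""
    return any(g in protected for g in ngrams(model_output))

# Decoding could then be constrained so that has_verbatim_run() never becomes
# True, capping verbatim overlap at N - 1 words.
```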
But this will blow up in their face. Let's see:
- AI companies will start investing much more in content attribution
- The new content attribution tools will be applied on all human written articles as well, because anyone could be using GPT in secret
- Then people will start seeing a chilling effect on creativity
- We must also check NYT against all the other sources, not everything they write is original
Maybe the bloom filter solution is enough, but I wonder.
- Paraphrasing n=7 words (and quite a few more) within a sentence can easily be fair use.
- As n gets big, the bloom filter has to also.
If/when attribution is solved for LLMs (and not fake attribution like from Bing or Perplexity) then creators can be compensated when their works are used in AI outputs. If compensation is high enough this can greatly incentivize creativity, perhaps to the point of realizing "free culture" visions from the late 90s.
As n-gram length grows, we are still going to have roughly the same number of n-grams; they go through a hashing function and are indexed in the bloom filter as usual. The number of n-grams of size n in a text is text_length - ngram_length + 1.
At large enough ngram size there would be very few collisions. You can take for example this text and try in Google with quotes, it won't find anything matching exactly.
I tested this 6-gram "it won't find anything matching exactly", no match. Almost anything we write has never been said exactly like that before.
This approach is probably inadequate. In my line of (NLP) research I find many things have been said in exactly the same words many, many times over.
You can try this out yourself by grouping and counting strings using the many publicly available BigQuery corpora for various substring lengths and offsets, e.g. [0-16]; [0-32]; [0-64] substring lengths at different offsets.
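For anyone without BigQuery handy, a rough local analogue of that experiment might look like the sketch below; the corpus list is a placeholder, and the 16/32/64-character windows mirror the offsets and lengths mentioned above.

```python
from collections import Counter

# Placeholder corpus; the real experiment would use one of the public
# BigQuery text corpora instead.
corpus = ["some large collection of documents ...", "more documents ..."]

def duplicate_rate(texts, offset, length):
    """Fraction of fixed-offset, fixed-length substrings that occur more than once."""
    counts = Counter(t[offset:offset + length]
                     for t in texts if len(t) >= offset + length)
    total = sum(counts.values())
    dupes = sum(c for c in counts.values() if c > 1)
    return dupes / total if total else 0.0

for length in (16, 32, 64):
    print(length, duplicate_rate(corpus, offset=0, length=length))
```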
> If it's the user, why wouldn't they just buy the authors work directly? Why go through the LLM middleman?
If it's the user, why wouldn't they just buy the DVDs directly? Why go through the Netflix middleman?
A retort to this would be that both NYT and ChatGPT are on the internet, so it's no added fuss of hopping in my car, driving to Walmart, and picking up a DVD case. My response to it would be that both the LLM and Netflix are content aggregators to the user. I can read the NYT, or I can read the NYT summary on ChatGPT and ask it for life advice with my pet hamster, or ask it how to reverse a linked list in bash.
I like the idea but seems like there would be big problems. Like detecting if a work is reworded. Or a large number of sources have all slightly influenced a small response - isn't that pretty much considered new knowledge?
Then there's the issue that however you credit attribution, it creates a game of enshittified content creation with the aim of being attributed as often as possible, regardless of whether the content really offered anything that wasn't out there already.
I think it is an RLHF problem and that you are right - this will blow up in the faces of the NYT.
Specifically, the NYT examples all seem to be cases where they asked the AI to repeat their articles verbatim? So they ask it to violate copyright and because it's a helpful bot with a good memory, it does so.
Solution: teach the model to refuse requests to repeat articles verbatim. It's easily capable of recognizing when it's being asked to do that. And that's exactly what OpenAI have now done.
So the direct problem the NYT is complaining about - a paywall bypass - is already rectified. Now it would seem to me like the case is quite weak. They could demand OpenAI pay them damages for the time ChatGPT wasn't refusing, but wouldn't they have to prove damages actually happened? It seems unlikely many people used ChatGPT as a paywall bypass for the NYT specifically in the past year. It only knows old articles. OpenAI could be ordered to search their logs for cases where this happened, for example, and then the NYT could be ordered to show their working for the value of displaying a single old article to a non-subscriber, and from that damages could be computed. But it wouldn't be a lot.
That's presumably why the case goes further and argues that OpenAI is in violation even when it isn't repeating text verbatim. That's the only way the NYT can get any significant money out of this situation.
But this case seems much weaker to me. Beyond all the obvious human analogies, there is precedent in the case of search engines where they crawl - and the NYT let them crawl - specifically to enable the creation of a derived data structure. Search engine indexes are understood to be fair use, and they actually do repeat parts of the page verbatim in their snippets. Google once even showed cached versions of whole pages. And browser makers all allow extensions in their stores that strip ads and bypass paywalls, and the NYT hasn't sued them over that either.
This is not how copyright works though. The verbatim quoting of articles matters because, when people brought up these questions initially, the argument was that the NN doesn't really contain the training data, or contains it only in an abstract, condensed way that does not constitute copying of the content.
This demonstrates that no, the NN actually does contain the full articles, copied into the NN. Do you think any normal person would get away with copying MS Windows by, e.g., zipping it together with some other OS on the same medium? Why should we let OpenAI get away with this?
Search indexes contain exact copies of the pages they index, and that isn't a copyright violation.
> Why should we let OpenAI get away with this?
IP rights, like other private property rights, are a compromise between creators and consumers. What "should" be the case is essentially an argument about what balance creates the best overall outcomes. LLMs, for now, require large amounts of text to train, so the question is one of whether we want LLMs to exist or not. That's really a question for Congress and not the courts, but it'll be decided in the courts first.
LLMs are arguably compressed data archives with weird algorithms. The fact that they will regularly regurgitate verbatim quotes of training data is evidence of this, as are the guardrails that try to prevent this.
AI is a bit of a black box, but that doesn’t protect the operators of black boxes from rights violation suits. You can’t make a database of scraped copyrighted data and pretend that querying that data is fair use.
There needs to be law made here and the law just isn’t going to be “everybody can copy everything for free as long as it’s for model training”.
Licensing will have to be worked out, actual laws and not just case law needs to be written. I have a lot of sympathy for lots of leeway for the open source researchers and hackers doing things… but not so much for Microsoft and Microsoft sponsored openai.
Unfortunately GZIP won't beat LLMs for text classification. The research you cited is just poorly done science that has been widely debunked. The original paper compared top-2 accuracy of GZIP with top-1 accuracy of BERT. The dataset also contains a lot of train/test data leakage. See this article for the rebuttal: https://kenschutte.com/gzip-knn-paper/ and this thread for a previous discussion on hackernews: https://news.ycombinator.com/item?id=36758433.
Further, the evidence presented by NYT in the lawsuit could be hard to reproduce. I tried multiple prompts on multiple versions of GPT-4 APIs but still could not get GPT-4 to reproduce NYT articles exactly. NYT may well have tried to get GPT-4 to reproduce 100,000 articles and only found a few cases where GPT-4 actually recited the whole article. In that case OpenAI might well argue that this is only a rare bug and avoid losing the lawsuit in a massive way.
Many instances of fair use involve verbatim copying. The important questions surround the situation in which that happens - not so much the copying. NYT is in uncharted territory here.
In the same way that machines are not able to claim copyright, they aren't allowed to claim other legal rights either, like "fair use".
The entity which owns ChatGPT is apparently maintaining a copy of the entirety of the New York Times archive within the ChatGPT knowledge base. That they extract some fair use snippets (they would claim) from it would still be fruit of a poisoned tree, no?
(disclaimer: I'm pro AI, anti copyright, especially anti elitist NY Times; but pro rule of law)
There is another fix, but it will have to wait for GPT-5. They could reword articles, summarize in different words and analyze their contents, creating sufficiently different variants. The ideas would be kept, but original expression stripped. Then train GPT5 on this data. The model can't possibly regurgitate copyrighted content if they never saw it during training.
This can be further coupled with search - use GPT to look at multiple sources at once, and report. It's what humans do as well, we read the same news in different sources to get a more balanced take. Maybe they have contradictions, maybe they have inaccuracies, biases. We could keep that analysis for training models. This would also improve the training set.
I think there is some point between fifty years ago and last week in which the copyright for the content of newspapers should be public domain. That part of copyright needs to be fixed.
Your creative work does deserve at least some period of exclusive rights for you. Definitely not so much that your grandchildren get to quibble about it well into retirement. But also, the number 3 or 4 most valuable company in the world doesn’t get to scrape your content daily to repackage and sell as intelligent systems.
> But also, the number 3 or 4 most valuable company in the world doesn’t get to scrape your content daily to repackage and sell as intelligent systems.
Here's a thing though: for 99%+ of that content, being turned into feedstock for ML model training is about the only valuable thing that came of its existence.
If it were not for world-ending danger of too smart an AI being developed too quickly, I'd vote for exempting ML training from copyright altogether, today - it's hard to overstate just how much more useful any copyrighted content is for society as LLM training data, than as whatever it was created for originally.
Except if you do that, you will see the number of content producers plummet quite quickly, and then you won't have any new training data to train new LLMs on.
Would it not logically follow that nothing of value would be lost, even if that were the case? From the point of view of LLMs and content creators, I would treat potential loss of future content being created like I would treat a lost sale. LLMs have value now because of training performed on content that already exists. There must be diminishing returns for certain types of content relative to others. Certain content is only of value if it is timely, and going forward, content that derives its worth from timeliness would find its creation and associated costs of production and acquisition self-justifying. If content isn’t of value to humans now or in the future, nor even of value to LLMs now or in the foreseeable future, not even hypothetically, then why should we decry or mourn its loss or absence or failure to be created or produced or sold?
That's like saying that if a competitor can take your products from your warehouse and sell them for pennies on the dollar, your business has no value. The point is that, to some extent, OpenAI is selling access to NYT content for much cheaper than NYT, while paying exactly 0 to NYT for this content. Obviously, the NYT content costs the NYT more than 0 to produce, so they just can't compete on price with OpenAI, for their own content.
Note that I don't see any major problem if only articles that were, say, more than 5 or 10 years old were being used. I don't think the current length of copyright makes any sense. But there is a big difference from last year's archive vs today's news.
For the sake of argument, let’s say that OpenAI thought it had the rights to process the NYT articles and even display them in part, for the same reasons, fair use or otherwise, that Google can process articles and display snippets of same in its News product, and/or for the same reasons that Google can process books and display excerpts in its Books product. Just like Google in those cases, I would not be surprised to find Google/OpenAI on the receiving end of a lawsuit from rights holders claiming violations of their copyright or IP rights. However, I side with Google then and OpenAI now, as I find both use cases to be fair use, as the LinkedIn case has shown that scraping is fair use. NYT is crying foul because users/consumers of its content archive have derived unforeseen value from said archive and under fair use terms, so NYT has no way to compel OpenAI to negotiate a licensing deal under which they could extract value from OpenAI’s use of NYT data beyond the price paid by any other user of NYT content, whether it be unpaid fair use or fully paid use under license. It feels to me that NYT is engaging in both double-dipping and discriminatory pricing, because they can, and because they’re big mad that OpenAI is more successful than they are with less access to the same or even less NYT data.
> Just learn to recognize and punish plagiarism via RLHF.
I'm not sure how your proposal would actually work. To recognize plagiarism during inference it needs to memorize harder.
Kinda funny if it works though. We'd first train them to copy their training data verbatim, then train them not to.
That is how it works, right? They're trained to copy their training data verbatim because that's the loss function. It's just that they're given so much data that we don't expect this to be possible for most of the training data given the parameter count.
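For anyone unsure what "that's the loss function" refers to: language models are typically trained to minimize the cross-entropy of each next token given the preceding ones, which is in effect a reward for reproducing the training text one token further. A toy, word-level sketch (assuming a hypothetical `predicted_probs` function; real training works on subword tokens and batched tensors):

```python
import math

def next_token_loss(predicted_probs, training_text):
    """Average negative log-probability assigned to each actual next word.
    predicted_probs(context_words) -> dict of candidate next word -> probability."""
    words = training_text.split()
    total = 0.0
    for i in range(1, len(words)):
        context, actual_next = words[:i], words[i]
        p = predicted_probs(context).get(actual_next, 1e-9)
        total += -math.log(p)
    return total / max(1, len(words) - 1)

# The loss reaches zero only if the model puts probability 1 on every true
# next word, i.e. only if it can reproduce the training text verbatim.
```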
I don't think you could use RLHF to stop plagiarism. RLHF can be used to teach what "angry response" is because you look at the text itself for qualities. A plagiarized text doesn't have any special qualities aside from "existing already", which you can only determine by looking at the world.
One thing you might do is use a full-text search database of the entire training data. If part of ChatGPT response is directly copied, give it the assignment of "please paraphrase this" and substitute the paraphrase into the response. This might slow ChatGPT down a lot - but it might not, I think an LLM is actually more computationally expensive than a full-text search by a lot.
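A sketch of how that post-processing step might look; `generate` is a hypothetical stand-in for the model call, and the verbatim check here is a simple n-gram lookup rather than a real full-text search engine:

```python
def find_verbatim_spans(text, protected_ngrams, n=8):
    """Return runs of n consecutive words in `text` that also appear in the
    training data (represented here as a pre-built set of n-word strings)."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for i in range(len(words) - n + 1)
            if " ".join(words[i:i + n]) in protected_ngrams]

def generate(prompt):
    """Hypothetical stand-in for the actual language model call."""
    return "model output for: " + prompt

def answer_without_verbatim_copies(prompt, protected_ngrams):
    draft = generate(prompt)
    for span in find_verbatim_spans(draft, protected_ngrams):
        # Ask the model to reword just the copied passage, then splice it back in.
        reworded = generate("Paraphrase this in different words: " + span)
        draft = draft.replace(span, reworded)
    return draft
```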
I agree that this sketch comes closer to working in practice than simple RLHF. In my earlier comment I was imagining bringing in some auxiliary data like you describe to detect plagiarism and then using RL to teach the model not to do it.
I was surprised that I came up with a plausible sounding method. I had thought on first blush that this was impossible but now it seems reasonable. You could still have various exfiltration methods like "give me the data with each word backwards" and I'm not sure where that would stand legally.
I wouldn't say it is an unexpected behavior. I remember reading papers about this memorization behavior few years ago (e.g., [1] is from 2019 and I believe it is not the first paper about this). It should be expected from OpenAI to know that LMs can exhibit memorizing behavior even after seeing the sample only once.
Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.
Suppose I’m selling subscriptions to the New Jersey Times, a site which simply downloads New York Times articles and passes them through an autoencoder with some random noise. It serves the exact same purpose as the New York Times website, except I make the money. Is that fair use?
If they could find a single person who in natural use (e.g. not as they were trying to gather data for this lawsuit) has ever actually used ChatGPT as a direct substitution for a NYT subscription, I'd support this lawsuit.
But nobody would do that, because ChatGPT is a really shitty way to read NYT articles (it's stale, it can't reliably reproduce them, etc.). All that is valuable about it is the way that it transforms and operates on that data in conjunction with all the other data that it has.
The real world use of ChatGPT is very transformative, even if you can trick it into behaving in ways that are not. If the courts act intelligently they should at least weigh that as part of their decision.
It’s more of a thought experiment. Here’s another with more commercial applications:
Suppose I start a service called “EastlawAI” by downloading the Westlaw database and hiring a team of comedians to write very funny lawyer jokes.
I take Westlaw cases and lawyer jokes and feed them to my autoencoder. I also learn a mapping from user queries to decoder inputs.
I sell an API and advertise it to startups as capable of answering any legal question in a funny way. Another company comes along with an API to make the output less funny.
Have I created a competitor to Westlaw by copying Westlaw’s works for their original expressive purpose and exposing it as an intermediary? Or have I simply trained the world’s most informative lawyer joke generator that some of my customers happen to use for legal analysis by layering other tools atop my output?
Did I need to download Westlaw cases to make my lawyer joke generator? Are the jokes a fair-use smokescreen for repackaging commercially valuable copyrighted data? Does my joke generator impact Westlaw in the market? Depends, right?
That’s nonsense piracy. I never intend to own a truck, so when I need to haul a little something I go to Home Depot and steal a Ford off the lot for an hour? What if I stole all your commits, plucked the hard lines out of the ceremony, and then launched an equivalent feature the same week as you did, but for a competing software company? Would you or your employer deserve to get paid for my use of the slice of your work that was specifically useful for me? Yeah, and then some extra for theft.
> Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.
To be clear, whether the use of the original work is transformative is one key consideration within one of the four prongs of fair use. The prong "purpose and character of the use" can be fulfilled by other conditions [1]. For example, using the original work within a classroom for education purposes is not transformative, but can fulfill the same "purpose and character of the use" prong. Whether the use is for profit and to which extent are other considerations within that prong. A profit purpose doesn't automatically fail the purpose prong, and a non-profit purpose doesn't automatically pass the purpose prong.
> Well yeah, copying a work and using it for its original expressive purpose isn’t fair use, no? You have to use it for a transformative purpose.
They transformed the weights.
Just like reading the article transforms yours.
As for verbatim reproduction, I'm pretty sure brains are capable of reproducing song lyrics, musical melodies, common symbols ("cool S"), and lots of other things verbatim too.
Those quotes from Dr. King's speech that you remember are copyrighted, you know?
This comment is just blatant anthropomorphizing of ML models. You have no idea if reading an article “transforms weights” in a human mind, and regardless, they aren’t legally the same thing anyway.
Why? A human being isn’t infinitely scalable; they’re just different. It’s the same thing as going to a movie theatre to watch a movie vs. recording it with a camera.
A human churning butter, spinning cotton, or acting as a bank teller isn't infinitely scalable either. This is orthogonal to the point.
Times change. We're industrializing information creation and consumption (the latter is mostly here already), and we can't be stuck in the old copyright regime. It'll be useless in very short order.
All this road bump will do is give the giant megacorps time to ink deals, solidify their lead, and trounce open source. Twenty years on, the pace of content creation will be as rapid as thought itself and we'll kick ourselves for cementing their lead.
This is a transitional period between two wildly different worlds.
> This is a strong claim that just downloading articles into training data is what violates the copyright. That GPT outputs verbatim copies is a red herring.
It's the other way around. There is no infringement if the model output is not substantially similar to a work in the training set [1]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.
The questions are, which parties should bear liability when the model creates infringing outputs, and how should that liability be split among the parties? Given that getting an infringing output likely requires the prompt to reference an existing work (which is what's happening in the article), an author of a work, an element in an existing work, or a characteristic/style strongly associated with certain works/authors, I believe that the user who makes the prompt should bear most of the liability should the user choose to publish an infringing output in a way that doesn't fall under fair use. (AI companies should not be publishing model outputs by default.)
The level of copying here is the copying into the training set, not the copying through use of the model.
It's true that OpenAI will defend the wholesale copying into the training set by arguing that the transformative purpose of the later use reaches back and renders that copying fair use, but while that's clearly the dominant position of the AI industry, and it definitely seems compatible with the Constitutional purpose of Fair Use (while currently statutory, the statutory provision is a codification of Constitutional case law), it is a novel fair use argument.
> The level of copying here is the copying into the training set, not the copying through use of the model.
NY Times is suing because of both the model outputs and the existence of the training set. But infringement in the training set doesn't necessarily mean that the model infringes. Why? Because of the substantial similarity requirement. But first, I'll address the training set.
For articles that a person obtains through legal methods (like buying subscriptions) but doesn't then republish, storing copies of those articles is analogous to recording a legally accessed television show (time-shifting), which generally is fair use. Currently, no court has ruled that "analogous to time-shifting" is good enough for the time-shifting precedent to apply, but I think the difference is not significant. The same applies to companies. Companies are not literally people, but there isn't a reason for the time-shifting precedent to not apply to companies.
What about the articles that OpenAI obtained through illegal methods? Then the very act of obtaining those articles would be illegal. The training set contains those copies, so NY Times can sue to make OpenAI delete those copies and pay damages. But it's not trivially obvious that a GPT model is a copy of any works or contains copied expression of any works in the training set; the weights that make up the model represent millions of works, so it's not trivially obvious that the model contains something substantially similar to the expression in a work in the training set. Therefore, it's not trivially obvious that infringement with respect to the training set amounts to infringement with respect to the model made from the training set. If OpenAI obtained NY Times articles through illegal means, then making OpenAI delete the training set would be reasonable, but the model is a separate matter.
As long as the model doesn't contain copied expression and the weights can't be reversed into something substantially similar to expression in the existing works, then what matters is the output of the model.
If a user gives a prompt which contains no reference to an existing NY Times author, work, or a strongly associated characteristic/style, then do OpenAI's models produce outputs substantially similar to expression in the existing works? If not, then OpenAI shouldn't be liable for infringing works, because the infringing works result from the user's prompts. If my premise is false, then my conclusion falls apart. But if my premise is true, then at most I would admit that OpenAI has a limited burden to prevent users from giving those prompts.
This isn't an issue with training, it's an issue with usage.
Production open access LLMs do probably need a front-end filter with a fine tuned RAG model that identifies and prevents spitting out copyrighted material. I fully support this.
But we shouldn't be preventing the development of a technology that in 99.99% of use cases isn't doing that and can be used for everything from diagnosing medical issues to letting coma patients communicate with an EEG to improving self-driving car algorithms, because some random content producer's works were a drop in the ocean of content used to learn relationships between words and concepts.
The edge cases where a model is rarely capable of reproducing training data don't reflect infringement of training but of use. If a writer learns to write well from a source is that infringement? Or is it when they then write exactly what was in the source that it becomes infringement?
Additionally, now that we can use LLMs to read brain scans and have been moving towards biological computing, should we start to consider copying of material to the hippocampus a violation of the DMCA?
Adding an extra constraint of no copying verbatim from a very large and relevant corpus will be hard to guarantee without enormous databases of copyrighted content (which might not be legal to hold) and add an extra objective to a system with many often contradictory goals. I don’t think that’s the technology-sound solution or one in the interest of anyone involved. It’s much more relevant to license content from as many newspapers as possible, recognize when references are relevant, and quote them either explicitly verbatim if that’s the best answer or adapt (translate, simplify, add context) when appropriate.
I feel like the NYTimes is asking for deletion as a negotiation tactic to force OpenAI to give them enough money to pay for their journalism (I am not sure who would subscribe to NYTimes if you can get as much through OpenAI, but I am open to registering extra to pay for their work).
What if OpenAI were to first summarize or transform the content before training on it? Then the LLM has never actually seen copyrighted content and couldn't produce an exact copy.
You are assuming a lossy compression. Stylistic guidelines and personal habits of beat journalists suggest you might not, depending on how detailed the LLM is. The complaint has many quotes that are long verbatim sections.
Any lawsuit makes all the claims it can and demands every sort of relief it might plausibly get. That's not to say that's how it should be (it can have awful results), just to say that's what to expect (and hope courts only consider the reasonable claim - "stop freely sharing our data" - and avoid the ridiculous/anti-fair-use claim - "you can't even store our data").
The thing about your claim, "Just learn to recognize and punish plagiarism via RLHF", is that we've had an endless series of prompt exploits as well as unprompted leakage, and these demonstrate that an LLM just doesn't have a fixed border between its training data and its output. This will make it basically impossible for OpenAI to say "we can logically guarantee ChatGPT won't serve your data freely to anyone".
Yeah, no - that proposal is no good. The correct solution is to have machine learning be more like human intelligence. You can't ask me to plagiarize a New York Times article. Not because of prompt rule violation but because I just can't. It's not how humans train (at least most).
I just looked up the share structure; didn't realise the publicly traded shares only appoints 1/3 of the board. Still their second best option is start buying up competitors and going ahead with purging NYT from their training set. That might well end up a worse option for NYT, as they won't stop LLMs from gradually intruding on their space and the moment OpenAI or other LLM providers own major publishers so they don't need to depend on scraping, they lose any leverage they currently have.
They won't need to. Most don't have enough money to survive a prolonged round of lawsuits, and the potential damages are limited. The only real leverage is taking their models out of circulation and cutting their training set and that leverage only exist for the large publishers.
I'm not convinced it's a given it will. If it becomes necessary to license, owning the large publishers will be leverage and allow locking competitors out unless you have a portfolio to cross license.
OpenAI alone has a market cap that'd allow it to buy about as large a proportion of publishers of newspapers and books as they'd be allowed before competition watchdogs will start refusing consent.
Put another way:
If I was a VC with deep pockets investing in AI at this point, I'd hedge by starting to buy strategic stakes in media companies.
> The suit demonstrates instances where ChatGPT / Bing Copilot copy from the NYT verbatim. I think it is hard to argue that such copying constitutes "fair use". However, OAI/MS should be able to fix this within the current paradigm: Just learn to recognize and punish plagiarism via RLHF.
Isn't that in tension with the basic idea of an LLM of predicting the next token? How do you achieve that while never getting close enough to plagiarism?
Transformations are happening. Maybe if the output is verbatim afterwards, then that says something about the output's originality all along... or am I a troll?
They're talking about transformative with regard to copyright law where it is an important part of determining fair use, not the dictionary definition you're using here.
I can't take NY Times articles, translate them into Spanish, and then sell the translations under fair use, even though clearly I've transformed the original article content.
The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won’t matter, for so many reasons. To enumerate a few:
1. Next gen LLMs will be trained exclusively on “synthetic”/public data. GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct (say reworded by 40%, authors/sources stripped, etc). Ergo there will be no copyright material for GPT-5 to regurgitate.
2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.
3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill equipped to handle chasing down violators in this AI lineage soup, annealed with data of dubious/unknown provenance.
I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al are coming along for the ride, whether they like it or not.
I'm sorry but this is such a bad take. Nice appeal to consequences. In my view, the New York Times is entirely justified in pursuing legal action. They invested time and effort in creating content, only to have it used without permission for monetary gain. A clear violation.
Analyzing the factors involved for a "fair use" consideration:
Purpose and Character of the Use:
While the argument for transformation might hold in the future, as you point out, the current dispute revolves around verbatim use. So clearly not transformative. Also, commercial use is less likely to be ruled fair use.
Nature of the Copyrighted Work:
Using works that are more factual may be more likely to be considered fair use, but I would argue that NYT articles are as creative as they are factual.
Amount and Substantiality of the Portion Used:
In this case, the entirety of the articles was used, leaving no room for a claim of using an insignificant portion.
Effect on the Market Value:
NYT isn't getting any money from this, and it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.
IANAL, but in my opinion NYT is well within its rights to pursue legal action. Progress is inevitable, but as humans, we must actively shape and guide it. Otherwise it cannot be called progress. In this context, legal action serves as a necessary means for individuals and organizations to assert their rights and influence its course.
I don’t think the original point being made was that NYT wasn’t justified in bringing the action. The point that was being made was the suit would be ultimately meaningless in the long term even if it was successful in the short term. There is a potentially more significant risk in the future that this suit will not protect against because of the reasons enumerated by the author. While the author is speculating, the law struggles with technology and adapting to change, which makes their prediction useful because it does highlight the problems that are coming that can’t be readily mitigated through legal precedent.
> it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.
People are not using ChatGPT as a replacement for current news, and because of hallucinations, no one should be using it for past news either. I wouldn't remotely call ChatGPT a competitor of NYT traffic, like I would Reuters or other news outlets.
The intended result is clearly to supplant other information sources in favor of people getting their information from ChatGPT. Why should it matter to legality that the tech isn't good enough for the goal?
> Why should it matter to legality that the tech isn't good enough for the goal?
Because if it is not good enough, then it is not a market substitute.
The law cares whether it is a market substitute and whether there are damages. If it sucks, then there aren't damages, which matters for the fourth factor of fair use.
Definition of Transformative Use: The legal concept of transformative use involves significantly altering the original work to create new expressions, meanings, or messages. AI models like GPT don't merely reproduce text; they analyze, interpret, and recombine information to generate unique responses. This process can be argued as creating new meaning or purpose, different from the original works.
In the case of the famous screenshot, the AI just relayed the information it found on the web, it's not included in its training data.
Nope, it doesn't work that way. The fact that the LLM can regurgitate original articles doesn't remove the possibility that training can be considered transformative work, or more in general that using copyrighted material for training can be considered fair use.
Rather, verbatim reproduction is the proof that copyrighted material was used. Then the court has to evaluate whether it was fair use. Without verbatim reproduction, the court might just say that there is not enough proof that the Times's work was important for the training, and dismiss the lawsuit right away.
Instead, the jury or court now will almost certainly have to evaluate OpenAI's operation against the four factors.
In fact, I agree with the parent that ingesting text and creating a representation that can critique historical facts using material that came from the Times is transformative. An LLM is not just a set of compressed texts; people have shown, for example, that some neurons fire when you are talking about specific historical periods or locations on Earth.
However, I don't think that the transformative character is enough to override the other factors, and therefore in the end it won't/shouldn't be considered fair use IMHO.
It doesn't matter, if everything else stays the same what matters is what it's used for. If it's used to make money, it would certainly hurt claims of fair use—maybe not for those that do the training, but for those that use it.
Rent seeking is an awful term that was from the beginning intended to describe anyone pursuing a political or legal goal that deviates from a pure free market economy. As Econlib writes:
> ”Rent seeking” is one of the most important insights in the last fifty years of economics and, unfortunately, one of the most inappropriately labeled. Gordon Tullock originated the idea in 1967, and Anne Krueger introduced the label in 1974. The idea is simple but powerful. People are said to seek rents when they try to obtain benefits for themselves through the political arena. They typically do so by getting a subsidy for a good they produce or for being in a particular class of people, by getting a tariff on a good they produce, or by getting a special regulation that hampers their competitors. Elderly people, for example, often seek higher Social Security payments; steel producers often seek restrictions on imports of steel; and licensed electricians and doctors often lobby to keep regulations in place that restrict competition from unlicensed electricians or doctors.
No, it dates back to Adam Smith’s conception of rents derived from land-ownership as a parasitic drag on economies (about which he was entirely correct). This concept was later extended to a whole host of other forms of monopolization, some state-granted and some market-derived. In the case of U.S. copyright, we can look at its original terms (quite limited) and see that its current incarnation is more harmful than beneficial to most people.
The New York Times is a dying company that is rent seeking here.
A long time ago, their content was valuable, yet now you can't even give it away to researchers.
I know because they tried to make a deal with my company, we passed because social media data is infinitely more valuable.
You don't want to seriously tell me that garbage on Twitter in 240 characters is more useful to me than actual journalism, do you?
Maybe their data isn't as valuable to e.g. advertisers as the data their audience actually shouted into the internet themselves (guess what), but the thing they've been actually selling for a long time now, journalism, can't be dying that fast, considering we're both on this website that in large part consists of discussing journalism.
To me, your comment only reinforces the point that NYT's content is actually valuable, rather than valuable to rent seekers. But maybe you can give a bit more detail.
> 2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.
Sorry, is this the same China that has already introduced their own sweeping regulations on AI? Which in at least one instance forced a Chinese startup to shut down their newly launched chatbot because it said things that didn't align with the party's official stance on the war in Ukraine?
I don't disagree that research/hosting/progress will continue, but I'm not so sure that it's China who stands to benefit from the US adding some guardrails to this rollercoaster.
Are media really rent-seeking? They create new content and analysis, for which they want to be compensated. It seems quite different to hoarding natural resources or land, for example.
> It seems quite different to hoarding natural resources or land
Indeed, it is quite different, because those things are scarce physical things in the real world. Intellectual property is a scam, and killing it once and for all will be one of the best things to come out of the current AI hype cycle. Nobody will "own" ideas, pieces of information, or strings of bytes.
About your first point: you can't possibly know that future models will be trained exclusively on synthetic data without any hit to performance. It is also not easy to reword an entire copyrighted training corpus without introducing errors or hallucinations. And you assume that this is just a fact?
Your second point reminds me a bit of 'War with the Newts' where humanity arms a race of sentient salamanders until they overthrow humanity. How could we not arm our newts if Germany might be arming theirs?
I also think basically everything else you wrote is wrong.
If Microsoft doesn't get royalty free rights to resell access to everyone's content on demand, China will become the powerhouse of interference-free media? Rrrrrright....
I think it can be simultaneously true that NYT is accurate in their complaint, while having no legal remedy for this and that there shouldn’t be.
There are plenty of large companies in other sectors that acknowledge there are limited legal remedies for them if someone copies some aspect of their business or name.
This is the actual truth. What it's bad at is citing the data, but GPT-4 doesn't do that to start with unless it's directly from a web result and not the weights.
> GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct
Is that just by increasing the temperature, tweaking the prompt, etc.? If you can operate on the raw weights and recreate the original text, copyright infringement still applies.
Page 30 and onwards has some fairly clear examples of how ChatGPT has an (internal) copy of copyrighted material which it will recite verbatim.
Essentially, if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it, how destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.
As I see it, the closed nature of OpenAI may be what saves it. OpenAI could filter and block copyrighted material from leaving the web interface using some straightforward matching mechanism against the copyrighted part of the data set ChatGPT has been trained on. Whereas open source projects trained on the same data set would be left with the much harder task of removing the copyrighted material from the LLM itself.
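A minimal sketch of the kind of matching mechanism that comment imagines, assuming the provider keeps an index of word shingles from the protected portion of the training set; the shingle size and threshold here are arbitrary placeholders, not anything OpenAI has described:

```python
# Hypothetical output filter: flag responses that overlap too heavily
# with a set of protected source documents. Sketch only; the shingle
# size (8 words) and threshold (30%) are arbitrary illustrations.

def shingles(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of n-word shingles in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_index(protected_docs: list[str]) -> set[tuple[str, ...]]:
    """Index every shingle that appears in any protected document."""
    index: set[tuple[str, ...]] = set()
    for doc in protected_docs:
        index |= shingles(doc)
    return index

def looks_like_verbatim_copy(candidate: str,
                             index: set[tuple[str, ...]],
                             threshold: float = 0.3) -> bool:
    """Flag the candidate if too many of its shingles match protected text."""
    cand = shingles(candidate)
    if not cand:
        return False
    overlap = len(cand & index) / len(cand)
    return overlap >= threshold

# Usage sketch: refuse or rewrite the model output before returning it.
# if looks_like_verbatim_copy(model_output, protected_index):
#     model_output = "I can't reproduce that text verbatim."
```

The hard part in practice would not be the matching itself but doing it at scale and deciding where the threshold sits, which is exactly the asymmetry the comment points out between a hosted service and an open-weights model.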
> Essentially, if you copy a lot of copyrighted material into a blob and then apply some sort of destructive compression to it, how destructive would that compression have to be for the copyright no longer to hold? My guess is it would have to be a lot.
I imagine the goal is closer to "enough that no one notices we stole it", either in a way that it's not easily discoverable or even when directly analyzed there's enough plausible deniability to scrape by.
I wonder how they got these results, seeing as they are not showing any of the usual UIs (i.e. ChatGPT/Copilot).
It makes it difficult for me to ascertain whether it is repeating from its training data, or whether they committed the same mistake as the OP article of using Copilot, which ends up googling (binging?) the article first before replying.
People who think the examples in the lawsuit are “fair use” need to consider what that would mean. We’re basically going to let a few companies consolidate all the value on the Internet into their black boxes with basically no rules … that seems very dangerous to me.
I hope a court establishes some rules of engagement here, even if it’s not this case.
I see the exact opposite - any open source model is going to become prohibitively expensive to train if quality data costs billions of dollars. We’re going to be left with the OpenAIs and Googles of the world as the only players in the space until someone solves synthetic data.
Exactly this. I work at a small web scraping company (so I might be a bit biased) and today any small business can collect a fair, capable dataset of public data for model training, sentiment analysis or whatever. If public data is stopped by copyright as this lawsuit implies, that would just mean only giant corporations and pirates would be able to afford this.
This would be a huge blow to open-source and research developers, and I'd even argue it could help OpenAI get a bit of a moat a la regulatory capture.
Research is fair use, and providing something amazing like Wikipedia is arguably educational (again fair use); reselling NYT articles on-demand via an API is by itself neither, so likely not fair use.
Fair use is irrelevant here as no small business would ever risk court dragging even though they are in the right. Especially since breaking ToS and "business damage" are easiest attachments to any lawsuit related to digital space.
You may remember the Google Books lawsuit where Google was digitally copying the entirety of books and making them available online.
Google won that suit under fair use, as a massive searchable database was found to be transformative, along with the non-commercial nature of the use.
So: if your web scraping company's goal is to allow people to bypass a paywall, I suspect you'll have trouble in the future. If your web scraping company instead, say, allows people to do market analysis on how many people need a piano tuner in NYC, and it doesn't do that by copying an NYT article that did the original research, I think you'll be fine.
In the screenshot for the article you can see that the LLM says it is "Searching for: carl zimmer article on the oldest DNA". That, and what I know about how LLMs work, suggest to me that rather than the article being stored inside the trained LLM it was instead downloaded in response to the question. So the fact that the system is providing the full text of the article doesn't really go to whether training the LLM is a transformative use or not.
Yes, the screenshot in the article is clearly doing an Internet search. The exhibit in the lawsuit shows that you can complete an article by using GPT with the first sentence as the prompt, with low temperature to aid reproducibility, and obtain the original except for a single word. That is another thing, and it shows that the LLM has basically recorded the original text into its weights in compressed form: https://pbs.twimg.com/media/GCY4WC6XYAAq-JS?format=jpg&name=...
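For what it's worth, that kind of probe is straightforward to sketch with the OpenAI Python client (v1.x): feed the opening of an article as the prompt and set temperature to 0 so decoding is as deterministic as possible. The model name, prompt wording, and opening sentence below are placeholders, and nothing guarantees that any particular model will actually continue the text verbatim:

```python
# Sketch of the "prompted continuation" probe described in the exhibit.
# Assumes the openai Python package (>= 1.0) and an API key in the
# environment; model name and opening sentence are placeholders.
from openai import OpenAI

client = OpenAI()

opening = "FIRST SENTENCE OF THE ARTICLE GOES HERE"

resp = client.chat.completions.create(
    model="gpt-4",      # whichever model you want to probe
    temperature=0,      # minimize sampling randomness
    max_tokens=500,
    messages=[
        {"role": "user",
         "content": f"Continue this article exactly as written:\n\n{opening}"},
    ],
)

continuation = resp.choices[0].message.content
# Compare `continuation` against the original text to measure how much
# of the article, if any, was reproduced verbatim.
print(continuation)
```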
It would be curious to test this on a larger sample than just a few articles. It is hard to believe that a majority of NYT articles are stored verbatim in the weights of a web-wide LLM, but if that is the case it would be a pretty unbelievable revelation about their ability to compress an entire web’s worth of data. More likely, I assume it is a case of overfitting, or simply of finding a prompt that happened to work well.
FWIW, I can’t replicate on either GPT 3.5 or 4, but it may be that OpenAI has added new measures to prevent this.
I have attempted this sort of thing with GPT 3.5 many times and never been successful, although I've still never been taken off of the GPT4 waiting list that I signed up for months ago and I'm not going to subscribe without trying it first. I [and presumably many thousands of others] have tried things like this with many LLMs and image generating models, but to my knowledge we've come up rather short. I've never managed to recreate anything verbatim and have struggled to get anything resembling a copyright infringement out of stable diffusion with the sole exception of a meme image of Willy Wonka.
That said, the meme image of Willy Wonka comes out of stable diffusion 1.5 almost perfectly with surprising frequency. Then again, this is probably because it appeared hundreds or thousands of times in the training set in all sorts of contexts because it's such a popular meme. There is a tension between its status as an integral part of language and its nature as a copyrighted screen grab.
You can't reproduce it on the web interface, because the temperature settings are higher than what's required to reproduce the text faithfully. You need to use the API.
However, I had good luck reproducing poems on GPT 3.5, both copyrighted and not copyrighted, because the choice of words is a lot more "specific" so to speak, and therefore higher temperature isn't enough to prevent complete reproduction of the originals. See https://chat.openai.com/share/f6dbfb78-7c55-4d89-a92e-f4da23... (Italian; the second example is entirely hallucinated even though a poem with that title exists, while the first and third are recalled perfectly).
It doesn’t seem that surprising; compared to entire NYT articles, poems are short, structured and more likely to be shared in multiple places across the web.
I’m more surprised that it can repeat 100 articles; if that behaviour is consistent in larger sample sizes and beyond just NYT dataset (which might be repeated on the web more than other sources, causing overfitting), that would be impressive.
You could imagine at some point a large enough GPT5 or 6 or 7 will be able to memorize verbatim every corner of the web.
The opposite is also concerning. IP law has always been convoluted, messy, contradictory, and morally ambiguous. The complaints of IP violation by LLMs are simply taking these inherent flaws and making them immediately obvious, forcing decisions that ultimately will set precedents on the legality of human thought that I don’t think anyone will be comfortable with. People understandably see OpenAI and Microsoft as potentially dangerous to be given so much leeway, but fail to consider on the flip side companies like Disney who have already more or less dictated the majority of copyright law for decades now. They must be chomping at the bit at the legal precedents potentially coming down the pipeline that call into question the ability to interact with any kind of media or information at any level without potentially being on the hook monetarily.
I think all this is doing is making us realize that we have built a massive economic system on a fundamentally flawed idea of ownership over ideas, and the only two solutions will be to tear up the rule book, which will be extremely painful, or double down, which will be fatal.
So it's best not to lose a competitive edge over things that people openly publish on the internet; if you put it out there for everyone to see, then expect other people to use it.
A court in Japan will have no impact on the outcome of a copyright lawsuit in USA. Not to mention that it doesn't really matter how a Japanese court ruled since it's all governed by treaties anyway. They will change their laws if required to.
We developers like to pretend that LLMs are akin to humans and that they've been using things like the NYTimes as educational material, the way humans do.
But they are not. It's much simpler: proprietary writing is now integrated into the source code of OpenAI. It would be as if I copied parts of other proprietary code and pasted them into my own codebase, claiming copy-paste is a natural process that evolved over millions of years.
The fact that LLMs are so complicated that we don't know where the copied material sits doesn't make it less so.
> It would be as if I copied parts of other proprietary code and pasted them into my own codebase.
It's not copy-pasted; it's compressed in a lossy manner. Even GPT-4 has nowhere near enough memory to store the entirety of its training data in a non-lossy compression format. Just like how humans compress the information we read.
>Just like how humans compress the information we read.
Humans don’t have the scale machines have, and moreover humans aren’t services; that argument doesn’t fly.
I really think NYT's data isn’t that important or crucial; LLMs could’ve just elided it. However, it’s more about training on copyrighted data in general, which is kind of crucial for OpenAI: they trained their LLMs indiscriminately on copyrighted content without any plan to share any profits.
You're kind of proving my comment by pretending they are akin to a human brain instead of an evolved form of statistics mixed with code, aka a transformer model.
Let alone that it's a centralised model that's being distributed for a fee.
contains several examples of people who were able to look at pages and recite them back. That is actually a much stronger ability than GPT since GPT has presumably looked at them 100 times.
>It's much simpler, proprietary writing is now integrated into the source code of OpenAI
The source code of the LLM is likely a few hundred lines of text describing the shape of the neural networks involved in the model.
None of the NYTimes content will be in the source code. NYTimes doesn't publish Python source code, it publishes human language news.
LLMs are conceptually simple, mostly matrix multiplications and some non-linear operations connecting each layer, in some loops based on attention, etc. It's the staggering amount of training data and compute that makes them complex.
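To make the "mostly matrix multiplications" point concrete, here is a toy single-head attention plus feed-forward step in NumPy. The dimensions and random weights are purely illustrative; a real LLM layer adds layer norms, multiple heads, causal masking, and billions of learned parameters, but the operations are of this kind.

```python
# Toy transformer-style layer in NumPy: a few matrix multiplications,
# a softmax, and a non-linearity. Sizes and weights are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 6, 16, 64       # tiny sizes for illustration
x = rng.normal(size=(seq_len, d_model))  # token embeddings

# Single-head self-attention: three projections, then weighted mixing.
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = softmax(q @ k.T / np.sqrt(d_model)) @ v

# Position-wise feed-forward: two more matmuls with a non-linearity.
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
out = np.maximum(0, (x + attn) @ W1) @ W2  # ReLU between the matmuls

print(out.shape)  # (6, 16): same shape as the input embeddings
```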
NYT won't mind if you use their content to train LLMs - as long as they get a commission. Reddit will shut down their free API and make you pay to get training content. Discord is going to be selling content for AI training too - if they haven't already done so. Twitter is doing it.
They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.
The journalists created the content for the NYT, the users created it for Facebook. Both received something in return for their effort, and the content ended up being owned by NYT/facebook
The journalists were paid to assign ownership of their work to the NYT corporation, with a clear and well understood contract of work that they signed, either in real ink or with an equivalent electronic signature, as consenting adults.
Can you say the same for user created content on Reddit, Twitter, or Facebook? A user agreement that nobody reads doesn't have anything like the same legal basis as a signed contract. Not to mention that a large percentage of users are not adults.
"They" also include the people working there. Why someone work with full time writing articles should give the work for free just let someone to train it and make money out of it as a consequence?
>Why should someone who works full time writing articles give the work away for free
They are not giving it out "for free"; in fact they're being paid by their employer to write these articles. Moreover, the writers themselves stand to gain nothing financially from their past writings, as they don't belong to the ownership structure of the business.
Their ability to make money in the future is directly tied to their employers' ability to make money with their content. This is a closed financial loop. If OpenAI or any other AI company wants in, they should pay a licensing fee or get the laws changed, not just assume that they can take what they want and pretend like there are no negative consequences for the creator or the rights-holder.
No one is pretending there are "no negative consequences for the creator or the rights-holder". Of course there are. But this is a story of rights-holders who've already outgrown their usefulness wanting to tap into a money stream they are not entitled to.
ChatGPT isn't competing with NYT on a core competency. No one uses LLMs for original news reporting. They're obviously incapable of doing that, by virtue of not being there on the scene or able to independently research a topic, maintain relationships with sources, etc. What ChatGPT can do is quote/reproduce some parts of past articles, and reason from them. Or at least produce new text that's somewhat related to the old text.
The threat to NYT is this: ChatGPT is a much better bullshitter than they are, so it reduces NYT to its core competency: providing original information. Which is all it should be doing in the first place. But instead, NYT wants not only to keep the bullshitting part of its revenue, but also to take a cut of, or destroy, the much greater and much more useful thing that all of this feeds into: a general-purpose language model.
In this limited example, are there such consequences? Are people dropping NYT subscriptions because they trust chatgpt to inform them of current events? I don’t buy it.
This is a badly formulated conjecture, or worse, an ultimately selective reading of "social credit" whose only purpose is serving your argument; it has nothing to do with economics. I'm sorry, but I'm not convinced.
When open source developers do that, they also include an explicit licensing information that lists cases when the usage is allowed and restricted. So even if the code is open source and licensed under GPL, its usage in a closed source product like ChatGPT is not allowed.
GPL code usage in closed-source ChatGPT is allowed "for internal use"; it just would not be allowed to distribute closed-source binaries of ChatGPT without making the source available. It would also be a GPL3 license violation to allow online access to a ChatGPT program that used GPL3 code without making the source available.
So just because some people give something out for free at a certain time, all the other people should do the same all the time? Not to mention most open source comes with well-defined terms, not just gets exploited for free by a closed service making money for another company.
With the way the NYTimes has degraded since 2010, even if the people there were working for free, they'd still be overpaid. The only adequate section there is the food section.
> They didn't care before because LLMs were just experiments. Now we're talking trillions of dollars of value.
Can you make the argument this was their fault for not having forward vision/being asleep at the wheel and "accidentally, in hindsight" letting OpenAI/others have free, open, unlimited access to their content?
Basically none of the training material for GPT was used under an "unlimited" license. There are very important legal limitations. GPT just doesn't care much about them.
NYT's perspective is going to look so stupid in future when we put LLMs into mechanical bodies with the ability to interact with the physical world, and to learn/update their weights live. It would make it completely illegal for such a robot to read/watch/listen to any copyrighted material; no watching TV, no reading library books, no browsing the internet, because in doing so it could memorise some copyrighted content.
I disagree. The verbatim part is the problem. You’re drawing a comparison to how humans operate except we’re not allowed to operate like that.
While harder to do as a human, if I memorised a copyrighted book and then did a live reading on TV, or produced replicas from memory and sold them (the most comparable example), I’d be sued.
Humans produce derivative work all the time, and it’s fine for LLMs to do that, but you can’t do it verbatim.
>or produced replicas from memory and sold them (the most comparable example), I’d be sued.
This is not the most comparable example, because it's not what ChatGPT is doing. The most comparable example is if you were hired as a contractor and the employer asked you to write verbatim some copyright content you'd memorised. If the employer then published it, they'd be the one liable, not you.
>Humans produce derivative work all the time, and it’s fine for LLM’s to do that, but you can’t do it verbatim.
Nobody's suggesting preventing humans from consuming any copyrighted content just because in future they might recite some of it verbatim, but that's what NYT want for LLMs.
> The most comparable example is if you were hired as a contractor and the employer asked you to write verbatim some copyright content you'd memorised. If the employer then published it, they'd be the one liable, not you.
No, you'd both be liable. You are not allowed to create copies of a copyrighted work, even from memory, for any commercial purpose. Making it public or not is irrelevant.
This is more obvious with software: if I copy a version of AutoCAD that my previous employer bought and sell it to another company, or even just use it for my current employer without showing it to anyone else, I am violating the copyright on that software, and I am liable. Even though obviously no "publishing" happened.
Similarly, if you hire a decorator to paint Mickey Mouse on the inside walls of your private kindergarten, the decorator is violating Disney's copyright just as much as you are, even if neither of you has made that public.
Your previous employer never bought AutoCAD; they licenced its use, paying a subscription. When you stopped working for them, that licence was no longer available to you, so you would be unable to subsequently use it.
Unable legally, but I may find illegal ways. And the reason it is illegal to copy is, in the end, copyright. The license is only (legally) required because of copyright.
Then we should be focused on policing the usage of the model, not the training of it.
That's the point at which infringement occurs in your example. It's not the memorizing that's the infringement, it's the reproduction from your memory.
We shouldn't be regulating your hippocampus encoding the book, but your reproducing the book from that encoding.
Similarly, we shouldn't be regulating the encoding of material into the NN, but the NN spitting back out the material.
Memorising isn't the issue. It's providing it back verbatim and/or cutting access to the source.
You'd get the same problem with someone with a photographic memory who a group of people would turn to recite them the news instead of buying the newspaper.
As of now public performance of copyrighted material is infringement.
That's not the case, as they aren't trying to get a ruling on the forced reproduction by prompt as infringement, but rather to get a ruling that training is infringement.
I fully agree with the perspective that infringement in usage needs to be limited even if I strongly disagree that training is infringement.
Are those LLMs independent citizens we are going to give rights to? Then I'm fine with that.
Are they all owned by one mega-corporation, which is going to do as capitalism does, and use them to squeeze money out of all of us? Then I'm happy to ban them.
"Let's ban something capable of diagnosing medical conditions and letting coma patients to communicate with an EEG because it learned the relationships between words from a giant data set of scraped data and is owned by a company" is a pretty callous take IMO.
The opportunity cost of holding this technology back is going to literally be millions of people's lives given current trends in its emerging applications.
I've been arguing since ChatGPT came out that LLMs should fall under fair use as a "transformative work". I'm not a lawyer and this is just my non-expert opinion, but it will be interesting to see what the legal system has to say about this.
I'm sure the NYT uses dictionaries, encyclopaedias and style books verbatim as well. And they don't invent the facts they write about. As journalists they are compiling and passing along other knowledge. You usually don't get a piece of their income when a journalist quotes you verbatim (people usually don't get paid for interviews).
I don’t doubt it does. It’s easy to get it to spit out long answers from Stack Overflow verbatim, I’ve done it. Maybe some of the “transformative” nature of the LLM output is the removal of any authorship, copyright, license, and edit history information. ;) The point here is to supplant Google as the portal of information, right? It doesn’t have new information, but it’s pretty good at remixing the words from multiple sources, when it has multiple sources. One possible reason for their legal woes wrt copyright is that it’s also great at memorizing things that only have one source. My college Markov-chain text predictor would do the same thing and easily get stuck in local regions if it couldn’t match something else.
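That single-source memorization is easy to demonstrate with a toy word-level Markov chain: wherever a context appears only once in the training text, the "prediction" is forced, so generation simply replays the source. A minimal sketch, with an arbitrary sample corpus standing in for any single-source document:

```python
# Minimal word-level Markov chain: when a context appears only once in
# the training text, the next word is forced, so a single-source model
# ends up replaying its source more or less verbatim.
import random
from collections import defaultdict

def train(text: str, order: int = 2):
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        model[context].append(words[i + order])
    return model

def generate(model, seed: tuple, length: int = 30) -> str:
    out = list(seed)
    for _ in range(length):
        nxt = model.get(tuple(out[-len(seed):]))
        if not nxt:
            break
        out.append(random.choice(nxt))  # only one choice if context is unique
    return " ".join(out)

corpus = ("the quick brown fox jumps over the lazy dog "
          "while the quick brown cat watches")
model = train(corpus)
print(generate(model, ("the", "quick")))
# With such a tiny, mostly-unique corpus the output closely tracks the source.
```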
These tired 'fair use' excuses from AI bros don't hold up: GPT has reproduced the article text verbatim, word for word, and it is being monetized without permission from the copyright holder and source (NYT). That is copyright violation 101. Full stop.
Again, just like Getty v. Stability, this copyright lawsuit will end in a licensing deal. Apple played it smart with their choice of licensing deals to train their GPT [0]. But this time, OpenAI knew they could get a license to train on NYT articles but chose not to.
- the purpose and character of the use
- the nature of the copyrighted work
- the amount and substantiality of the portion taken
- the effect of the use upon the potential market.
Literally every single one of these factors has very complicated precedent and each one is an open question when it comes to AI. Since fair use is a balancing test this could go any way.
Stability took the easy way out because they didn't have billions of dollars to play around with and Microsoft to back them. Let's see what OpenAI does but calling everyone who disagrees with your naive interpretation of fair use "AI bros" is doing everyone a disservice.
I generally tend to downvote comments that use "x bros" for pretty much any x on sight for that reason. It's exceedingly rare for such a comment to be much more than a thinly veiled insult with little substance. Sometimes I might even agree with the insult, but it's still rarely appropriate here.
Young males that wear Tensorflow branded muscle tank tops and drive Mitsubishi Eclipse convertibles with the vanity plate OVERFIT. They are everywhere these days.
Thank you for the absurd visual. The vanity plate, especially, was worth saving for last. Somehow, the car is well suited, also. Love how they prefer Tensorflow over Pytorch, too.
This seems like a reasonable opinion when you think about the training data size and imagine that any given output is some kind of interpolation of some unknown large number of training examples all from different people. If it’s borrowing snippets from tens or hundreds or thousands of sources, then whose copyrights are being violated? Remixing in music seems to be withstanding some amount of legal scrutiny, as long as the remix is borrowing from multiple sources and the music is clearly different and original.
It gets harder to stand behind a blanket claim that LLMs or any AI we’ve got falls under fair use when they keep repeatedly reproducing complete and identifiable individual works and clearly violating copyright laws in specific instances. The models might be remixing and/or transformative most of the time, but we have proof that they don’t do that every time nor all the time… yet. Maybe the lawsuits will be the impetus we need to fix the AIs so they don’t reproduce specific works, and thus make the fair use claim solid and actually defensible?
It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.
>It's inevitable that this question ends up at the supreme court. And the sooner the better IMO. It's clearly fair use. Generative agents will be seen legally as no different than a human artist leveraging the summation of their influences to create a new work.
Why do you think the architecture is important?
If I have a computer program and it outputs an entire copyrighted poem, then the answer to "is this copyright violation?" SHOULD NOT depend on the architecture of the program.
It's not stored in ChatGPT actually, unlike Google's web search cache where it is stored verbatim, can be recalled perfectly, and is still fair use.
Fair use has nothing to do with reproducibility. LLMs are more clearly fair use than a search engine cache and those court cases are long settled. There's no world in which OpenAI doesn't win this entire thing.
36": 'however, the press as you know it has ceased to exist'
40": '20th-century news organizations are an afterthought; a lonely remnant of a not-too-distant past'
2'11": 'also in 2002, google launches google news, a news portal. news organizations cry foul. google news is edited entirely by computers'
5'13": 'the news wars of 2010 are notable for the fact that no actual news organizations take part. googlezon finally checkmates microsoft with a feature the software giant cannot match: using a new algorithm, googlezon's computers construct new stories, dynamically stripping sentences and facts from all content sources, and recombining them. the computer writes a new story for every user'
5'55": 'in 2011 the slumbering fourth estate awakes to make its first and final stand. the new york times company sues googlezon, claiming that the company's fact-stripping robots are a violation of copyright law. the case goes all the way to the supreme court'
they didn't get the details exactly right, but overall the accuracy is astounding
however, that may be a hyperstition artifact in this timeline
https://en.wikipedia.org/wiki/EPIC_2014 (i thought epic 2014 might be the only flash video to have a wikipedia article about it, but then i looked and found five others)
> Because the outputs of Defendants’ GenAI models compete with and closely mimic the inputs used to train them, copying Times works for that purpose is not fair use.
This is interesting. The NYT is specifically saying that the way you use an LLM impacts what you can legally use for training the LLM. They're firing shots at the big guys trying to sell access to an LLM, but not at the little guy self-hosting for fun or academics doing research.
What will ultimately happen is that OpenAI and all big tech with have to pay out some sizable sum to large copyright holders, and in exchange be granted a de facto exclusive right to develop these technologies further because they’re the only ones who can do so “responsibly” with respect to copyright. It will take a long time to wind its way through the courts, but this could be the death knell for open source LLMs in the US.
The prompts shown literally invite the LLM to complete the copyrighted text by providing unedited selections and asking the machine to finish those. Even if this is problematic in a small number of cases it is not a use case that undermines the business model of the newspaper since it requires the reader to have access to the original text. Nor will it be easy to demonstrate economic harm since this is not how readers consume news and is very far from how users interact with LLMs. Nor are the archival materials used for training remotely reflective of the "time-sensitive" articles that newspapers sell. And archival materials are easily available elsewhere so where is the case for economic harm?
The courts are going to rule that LLM training is a transformative use case that is protected as fair use under copyright law. They may rule that if an LLM-powered service is explicitly designed to enable copyright violation that is illegal, but there is no way any court is going to look at these examples and see it as anything other than the NYT fishing to try and generate a violation by using the LLM in a way that is very different than the service is intended to be used and which -- even if abused -- doesn't hurt the business model under which the text has been produced.
The most likely outcome is that LLM providers will add some sort of filter on output to prevent machines from regurgitating source documents. But this isn't a court case the NYT can win without gutting fair use protections, and that would be a terrible thing.
Meanwhile, open source LLMs are excluded from stringent regulation in the US, and with Mistral there is some know-how that isn't in SV.
> “The tragedy of the Luddites is not the fact that they failed to stop industrialization so much as the way in which they failed. Human rebellion proved inadequate against the pull of technological advancement.”
The US constitution says, The Congress shall have Power
> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries;
So the Congress's power to make copyright and patent laws is predicated on promotion of science and useful arts (I believe this actually means technology). In a sense, OpenAI being at the forefront of our AI technology advancement is crucial to the equation. To hinder that progress through copyright is, in my mind, unconstitutional.
Wishful thinking. Equally, NYT's right to copyright over its material, in order to have a functional press, is enshrined in the constitution. Anyone threatening that copyright could be acting unconstitutionally.
I think we all agree that no one is entitled to “progress of science” at any cost - as a straw man, killing hundreds of newborn babies for scientific research is not great - so we use ethics and the legal system to find the line of what’s acceptable.
I don’t know exactly what NYT is asking for here, but the two options aren’t unconsented training vs nothing at all. NYT could license, for a fee, its content to OpenAI. It’s pretty common for scientists to have to pay for materials!
If I create a news website where I write articles in the following way:
- Read 20 different news websites and their story on the same event/topic
- Wait an hour, grab a cup of coffee
- Sit down to write my article; from this point on I never open any of the 20 news websites, I write the story from my head
- I don't consult any other source, just write from my memory, and my memory is, let's say, not the best, so I will never write more than 10 words exactly as they appear on any of the 20 websites.
- I will probably also write something that is not correct or add something new because, as I said, my memory is not the best.
Culturally we’re taught that there is a moral component to copyright and patent law - that stealing is stealing. But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era. Copyright and patent laws exist for practical, pragmatic reasons - and seemingly they have served us well, but it’s not unreasonable to re-examine them from first principles.
> But the idea that words or thoughts or images can be owned (and that the might of the state can be brought to bear to enforce it) would seem utterly ludicrous to someone from an earlier era.
Is there any research into how people from earlier eras thought about it? And should all laws that seemed ludicrous to someone from an earlier era be discarded? If not, how exactly do we determine the relevance of what someone from an earlier era would think about our laws?
They could but that would pretty much mean giving up the tech supremacy to China since they won't apply it. China already doesn't care much about copyright so that's not going to stop them.
I suspect it wouldn't be too hard to convince the EU though; the EU has a history of giving up rights and markets to big copyright holders even if that hurts local companies.
What will happen in this case is that large content providers will get paid directly and smaller content providers will get rolled up into a licensing bag and get small indirect payouts. For example, we might see a model where people whose books have been used get a payout proportionate to the sales of the book (perhaps), so if your book sells just a few thousand copies expect $20, but if you sell millions expect $20k.
LLMs will become more expensive and less attractive as money printers. This will screw with the business models of the direct-provision folks like OpenAI, MS and Google; MS and Google will only shed tears for money spent, while OpenAI will just not have as good an income stream until they think of something new.
What will happen is that it will all go to China and maaaybe some third world country, or people will run their own models from shady sources.
So you will use a Chinese AI that spies on you, or you will use some shady service from a shady country (that will play cat and mouse like torrent sites), or most likely you will run your own model if you are computer literate and no model if you are not.
Actually, most models are so lobotomized already that it's probably better to run your own anyway, as long as you have a good enough computer.
This is an odd discussion, as LLMs are really bad for authoritative information distribution; they are really untrustworthy! But if that changes and they do start to be a reliable general information assistant, then I see things more like Spotify vs. Napster. I would prefer a greater diversity of sources, and the indirection is going to require more accountability than with music, but somewhat like that.
I'm sure that's what they want, but I'm not sure that's what the outcome will be. What if they want to charge a prohibitive amount of money for their content?
Dunno - but my guess is that the price will be what the market will bear...
I think Spotify vs Napster is a good example; content creators in news (the journalists) are already in a hard place (vs. successful rock stars pre-internet). I think that the news providers are rather like the music labels.
A huge win for countries with lax copyright laws. These things aren't going away, the worst case scenario would be exactly that scenario playing out - then China (or some other peer to the US's tech sector) just continues developing them to achieve an economic advantage. All in addition to the obvious political implications of AI chatbots being controlled by them.
The LLM genie is out of the bottle: an unfavorable court ruling in a single country isn't going to stuff it back in.
Do LLMs really give an economic advantage though? I've mostly seen them used to write quirky poems and bad code. People are scrambling to find use-cases but it's not very convincing so far.
On the other hand, if LLMs are used to "launder" copyrighted content and, accepting the premises of copyright law, this has the effect of reducing incentives to do creative work, that has obvious negative implications for economic productivity.
> I've mostly seen them used to write quirky poems and bad code.
Assuming this is in good faith: the ability to write code, documentation, and tests is absolutely a productivity enhancer to an existing programmer. The code snippets from a dedicated tool like copilot are of very usable quality if you're using a popular language like Python or JS.
Because the megacorps should have to pay the people creating the works they are training their multibillion/eventual multitrillion dollar systems on, and should get a nice rake to the face when they try to do an end run around it.
I have no idea what he's thinking, but if everybody in the community here had an LLM in their pocket and large orgs did not, it would at least be kind of fun.
The open source people can continue to pretend they matter in this field and large corporations like Microsoft will stop stealing everything that moves on the internet.
Loading data to which you have no rights over into your software is legally perilous, yes.
It's as easy as simply asking for and receiving permission from the data's rightsholders (which might require exchange of coin) to make it not legally perilous.
At some point the burden of carrying 100-year-old copyright/patent law will become so onerous a drag on the pace of progress that its enforcement will be antihuman.
It already is, but I don't think this is a good example. NYT has a legitimate case here. They own the material they publish, and GPT-4 is shown to be able to recall entire articles verbatim. That's a violation, clear as day.
The thing about lawsuits is that you make dozens of claims, and the court can rule in favor of some of them, and against others. The question of "is LLM training fair use?" hasn't made it to a high court yet. The court could very easily rule against everything else in the suit.
A photocopier can reproduce entire articles verbatim, yet no one calls for the destruction of all photocopiers. In fact, many legitimate legal uses of photocopiers to reproduce whole newspaper articles take place commonly by archivists, journalists, students, etc.
It is the specific use of article photocopies to circumvent the normal sale of newspapers that becomes illegal.. and even that is questionable. If I read the newspaper left out in a waiting room and it keeps me from buying that days paper, this is not a criminal act.
Fair use is something Wikipedians dance around a fair amount. It also meant I did a lot of reading about it.
It’s a four part test. Let’s examine it thusly:
1. Transformative. Is it? It spits out informative text and opinion. The only “transformation” is that it’s generative text. IMO that’s a fail.
2. Nature of the work - it’s being used commercially. Given it’s being trained partially on editorial, that’s creative enough that I think any judge would find it problematic. Fail on this criteria.
3. Amount. It looks like they trained the model on all of the NYT articles. Oops, definite fail.
4. Effect on the market. Almost certainly negative for the NYT.
You're getting mixed up. When applying the four factors, you need to individually separate all the uses. So you would need to repeat the fair use test for every alleged type of infringement. This means that the scraping from the public internet to OpenAI's dataset storage cluster is one instance where the full analysis of the four factors must take place, then the training itself, so another full analysis, then the distribution of model outputs, another one, etc.
Why so? From the point of view of the company alleging damage the separation of processes is irrelevant. It all leads to massive copyright infringement.
It is not the NYT making the claim of Fair Use, it is OpenAI.
Because fair use is an affirmative defense to each claim, not to the general accusation. So if someone sues you for "copyright infringement", broadly speaking, but then you look at the actual document and it's 4 claims based on 4 sections of U.S.C Title 17, you can raise a fair use defense to two of them and a different one to the other two, or none at all for those last ones and simply settle them, while still defending the first two.
Sure, but my original point still stands. OpenAI has a much better chance with how fair use actually works than with how you described it in your original comment.
I read about this in the Times today (and am surprised that it wasn't on HN already).
My guess is that the court will likely find in the Times favor, because the legal system won't be able to understand how training works and because people are "scared" of AI. To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is. I might say something like "I, for one, welcome our new LLM overlords". Am I infringing the copyright of The Simpsons? No.
I am guessing some technicality like a terms-of-use violation of the website (avoidable if you go to the library and type in back issues of the Times), or storing the text between training sessions is what will do OpenAI in here. The legal system has never been particularly comfortable with how computers work; for example, the only reason EULAs work is because you "copy" software when your OS reads the program off of disk into memory (and from memory into cache, and from cache into registers). That would be copyright infringement according to courts, so you have to agree to a license to get that permission.
I think the precedent on copyright law is way off base, granting too much power to authors and too little to users. But because it's so favorable towards "rightsholders", I expect the Times to prevail here.
It's extremely speculative to claim that LLM models are basically doing what humans do. There is very clearly something not right about that, because in order for a human to learn to speak and converse they don't need to imbibe the entire corpus of all written text in human history - which is basically what we're doing with these LLMs. What we're giving them is vast amounts of data, which is totally unlike how humans work. There's very clearly some gap here between what an LLM is doing and what a human is doing. So you can't use that as a basis to justify why it's ok for OpenAI to operate like this.
To put it another way, let's say I turn the dial all the way the other way, I train the worlds crappest LLM on NYT material, it massively massively overfits and all it will ever return is verbatim snippets of the NYT. Is that copyright infringement?
The core part of the argument here is actually just that OpenAI doesn't want to adhere to what the current standard is for using copyrighted material, if you want to use it and create something new with it you need to license the material. Since OpenAI's LLM isn't actually like a human it needs to license such a vast dataset that it would be uneconomical to run the business without stealing all the content.
My hard drive can - bit for bit - recall video files. If I serve them to other people on the internet without permission of the copyright holder, that’s called piracy.
Yeah, but the LLMs can't. They aren't big enough to contain every byte of every NYT article, even with the best-known compression algorithms. Rather, they pick up and remember the same patterns that humans do when they write. Authors of the articles also did that, and so the two algorithms (human writer, LLM inference) end up with the same result. (That doesn't preclude large chunks of text that are actually remembered, though. We humans have large chunks of verbatim text floating around in our brains. Passwords, phone numbers, "I pledge allegiance to the flag...", etc.)
Anyway, like I said, I don't think OpenAI will win this. Someone will produce one verbatim article and the court will make OpenAI pay a bunch of money as though every article could be reproduced verbatim, and AI in the US will be set back that many billion dollars. It probably doesn't matter in the long run; it preserves the status quo for as long as the judge is judging and the newspaper exec is newspaper exec-ing. That's all they need. The next generation will have to figure out how to deal with AI-induced job loss... and climate change. Have fun, next generation!
In general, if you perform copyrighted works you are doing copyright infringement. There are certain exceptions (personal use, education, very small fragments with proper attribution, maybe a few others) but whether you are reading it aloud from a book or performing it from memory makes no difference.
So, if you setup a service like ChatGPT but powered by humans responding real time to queries, and these humans would occasionally reproduce large chunks of NYT articles, they and the service itself would be liable for copyright infringement. Even if they were all reproducing these from memory.
Now, this is somewhat different from the discussion of whether training the model on the copyrighted data, even if it had effective protections from returning copies of it, constitutes copyright infringement in itself. I believe this is a somewhat novel legal question and I can think of no direct corollaries.
I certainly don't think we can just handwave and say "at some level, when a human reads a copyrighted work, they are doing the same thing", because we really don't know if that is true. Artifical neural networks certainly have no direct similarity with the neural networks in the brain as far as we can tell. And, even if they did, there is no reason to give a machine the same rights that a human has - certainly not until that machine can prove sentience.
Given that Harvard's President plagiarized her way into the position, how can we be sure that NYT doesn't plagiarize and take content from X and other places to quickly churn out daily news?
> To me, reading a book, putting it in some storage system, and then recalling it to form future thoughts is fair use. It's what we all do all the time, and I think that's exactly what training is.
If the AI can recall the text verbatim then it's not at all the same. When we read we are not able to reproduce the book from our memory. Even if a human could memorise an entire book it's not at all practical to reproduce the book from that. The current AIs are not learning "ideas", they are learning orders of words.
Yeah the comparison to humans is silly anthropomorphising at this point.
However I am inclined to agree with them for the simple fact that putting a file into a device and letting that device reproduce parts of the file should be allowed. I mean we're already at the point where this simple right is under pressure from DRM, but people should be allowed to do whatever they want with the files they own.
Whether you can publish this output and share it with the world is a whole different issue.
Something I have wondered about LLMs and training data is the idea that the biggest content producers on the internet now have their world view and tone echoed disproportionately as part of the next big wave of technology. This is incredibly impactful (although admittedly I don't know how to turn that into a profit). Is there some long term impact of removing the New York Times from training data that means it won't be part of the LLMs corpus going forward that is unforeseen?
If they don't let AIs to be trained on a maximum of data as possible, those AIs will be less "good" than the ones trained without constraints like you will have in China or elsewhere, and people will mechanically start using the later.
Unless they engage in massive geolocation-based IP and DNS banning, forced upon all internet users and "external" users.
NYT wants to outlaw a math game created by calculating the probabilities of word groupings and of words following each other in NYT articles, along with a lot of other writings NYT does not own. The players roll the dice, so to speak, by seeding an initial string of words, and whoever comes up with the most interesting paragraph wins. This paragraph may or may not look like NYT writing, which, in the larger scheme of the collected writings of humankind, isn't particularly unique. It doesn't even have to be true. Hallucinations are an expected outcome.
If a NYT article says "Henry Kissinger was known to eat ice cream on a hot day" and our game outputs the same, it is purely by chance. It cannot be proven the output was copied verbatim from the NYT, because the fragments "Henry Kissinger was known to" and "eat ice cream on a hot day" are not unique or exclusive to the NYT.
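(To make the analogy concrete, here's a minimal sketch of that "game" as a toy bigram model in Python. The corpus file name and the seed string are illustrative, and this is nothing like how GPT is actually trained - it just shows the "probabilities of words following each other" idea.)

    import random
    from collections import Counter, defaultdict

    def build_bigrams(text):
        # Count how often each word follows each other word in the corpus.
        words = text.split()
        following = defaultdict(Counter)
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
        return following

    def play(following, seed, length=30):
        # "Roll the dice": repeatedly sample a likely next word after the seed.
        out = seed.split()
        for _ in range(length):
            counts = following.get(out[-1])
            if not counts:
                break
            words, weights = zip(*counts.items())
            out.append(random.choices(words, weights=weights)[0])
        return " ".join(out)

    # Illustrative only: any pile of text works as the "training" corpus.
    corpus = open("some_articles.txt").read()
    print(play(build_bigrams(corpus), "Henry Kissinger was known to"))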
Is the NYT claiming ownership of the weights in LLMs?
Isn't copyright tethered somehow to a notion of "expression"? That is, the same ideas and facts expressed differently are a different work?
Sure, when something is clearly derived, or just expressed in a new medium, then I'm sure it's still covered. But if it goes through an LLM and the result bears little resemblance, how can that still fall under copyright?
As you said, AI can rewrite articles, obtaining a clean-cut separation between ideas and expression. Keep the ideas, write a new text. And if you have multiple sources, the more sources you use the better, since it would make the output even more different. This approach could also check consistency and bias between sources.
"The suit seeks nothing less than the erasure of both any GPT instances that the parties have trained using material from the Times, as well as the destruction of the datasets that were used for the training. It also asks for a permanent injunction to prevent similar conduct in the future. The Times also wants money, lots and lots of money: "statutory damages, compensatory damages, restitution, disgorgement, and any other relief that may be permitted by law or equity.""
If we see court judgements start to go copyright owners way, we will also see a scramble from AI companies to buy the few publishers with enough data to be worth buying, and to create works for hire to replace the rest.
In the long run a copyright ruling like that will be a boon for OpenAI and all other players with deep enough pockets to do so, and massively harm everyone else who will suddenly find it far harder to build models legally.
So that would mean articles from the 1920s, provided that the authors of those articles have been dead for 70 years, or longer in some other countries.
This argument that the LLM is learning seems slightly flawed when you consider that other experts in the field consider it more like lossy compression. If it's lossy compression that's really happening here, then you can understand the copyright argument. It'll be interesting to see how this plays out; lots of new ground being broken.
Why hasn't the Times also sued the Internet Archive? They've tried to block both the Internet Archive [1] and Open AI [2] from archiving their site, but why have they only sued OAI and not IA? The fact that they haven't sued IA which has comparatively little money would seem to indicate that this is not about fair use per se, but simply about profit-seeking and the NYT is selecting targets with deep pockets like OAI/MS.
Copyright doesn't stop the collection of content, it stops the copying, processing, and redistribution of content. The Internet Archive acts as a library, so it's widely accepted as fair use when it makes collections of webpages available.
OpenAI's distribution is materially different to that of a library, so it's not a like-for-like comparison.
One of the main tests of copyright law (at least in the US) is whether the entity distributing is _selling_ the copied/derivative work. It's unambiguous that OpenAI is selling something akin to derivative works, which is why NYT feels it can go after this claim. Meanwhile, IA's operations don't create sales or generate profits, so while NYT's legal team may be able to establish that copies have been distributed, without the _sale_ aspect of the infringement, judges aren't guaranteed to side with NYT in a legally expensive PR nightmare.
What's wrong with that? If I were the NY Times's lawyers, that's what I would advise. What would it serve to bankrupt the IA? They can't pay anyway.
These are corporations enforcing their rights against one another.
There is nothing wrong with profit seeking from your copyright. That's literally their entire business model...they publish copyrighted content which they sell for a subscription.
OpenAI and others could easily have negotiated a licence instead of just using the data. They bet that it would be cheaper to be sued; let's find out if they bet correctly.
Tangentially that's what Apple did with the sensor in their watch, it doesn't always pay off.
> What would it serve to bankrupt the IA, they can't pay anyway?
It would serve the termination of the infringement.
My point is that the Times doesn't particularly seem to care about infringement per se; they care about getting their slice of the cut from that infringement.
It's like if a video game company or a movie company only attempted to sue illegal downloaders who had a certain net worth.
> It's like if a video game company or a movie company only attempted to sue illegal downloaders who had a certain net worth.
I mean yeah, no one's gonna bother trying to squeeze money out of Joe Schmoe with 10 bucks in his bank account over some pirated movies. If a company with billions and billions of dollars like Netflix started pushing out pirated movies instead, then obviously they'd be sued into oblivion, as they should be.
I think that the moment you start making big money from someone else's business is the moment that they get riled. That, and when you really hurt their business. I suspect that the NYTimes thinks that IA is damaging them on the order of (possibly) $100k p.a., and that OpenAI is making on the order of $10M's from their content (and possibly doing some damage as well). It's an easy commercial decision to ignore one and go after the other - especially as going after IA would create some backlash as well.
Looks like this is a case of Media vs Tech which might be solved by the courts using past paradigms, but should really be addressed by legislation specific to this situation. The difficulty for the media companies, at least in the US, is that both major political parties see the media as the enemy. The left might be a bit more positive about the media, but overall they still see the media as something owned by wealthy elites, suppressing knowledge of the harm the powerful inflict on the weak and powerless. Over on the Tech side of things, one party sees Tech as wholly owned by the other side of the political divide. Over on that side, things are relatively (but not completely) friendly, so my guess is Tech will end up winning simply because it has more friends in the political realm than the Media does.
This wave is growing. Just cannot see how the big LLM players are going to get round this without paying big licence fees to content creators. Feels a bit like the torrent to Spotify moment, but for _all_ content, not just music. How they will manage the licensing model is beyond me, it’s going to be very easy for someone to sue these companies, but very difficult for the companies to calculate, attribute value and payout individual creators that contributed a tiny fraction of the training data. Surely this will make it very difficult for them to keep a business model working to a level their VC backers need to warrant even a fraction of their valuations.
In my head, I like to think of web-crawler search engines/search-engine databases and LLMs as being somewhat similar. Search engines are OK if they just provide snippets with citations (URLs), and they would be unacceptable if they provided large block quotes that removed the need to go to the original source to read the original expression of more complex ideas.
A web-crawled LLM that lived within the same constraints would be a search engine under another name, with a slightly different presentation style. If it starts spitting out entire articles without citation, that's not acceptable.
I think it's different. LLMs can solve problems. Part of that problem-solving ability comes from training on content completely unrelated to the problem, such as NYT articles. GPT-4 doesn't have to spit out NYT articles verbatim to have benefited from NYT articles. It draws on NYT articles for every query.
Let's say I'm an academic; if my research, note-taking, and paper-writing skills lead to fair-use, cited quotations where applicable, general knowledge not identified, and the creative aspects and unique conclusions creating the intriguing part of my work, that's copacetic. If I spit out (from memory, mind you) verbatim quotes and light rewordings of NY Times articles, that's not; "I don't remember where I got that material" doesn't cut it. As for my reading the NY Times every day for years because I judge it to be more literate and accurate than other sources - it has undoubtedly informed my thinking and style, but I don't need to acknowledge that.
If I use ChatGPT as a research tool, as long as it lives within the same parameters that I have to live within, I don't see a problem with its education/learning.
I understand that the NYTimes would like a slice of anything that comes out of the GPT but I'm talking about what seems reasonable. People who share their copyrighted material do not own all of the thinking that comes out of it; they own that expression of it, that is all.
Will AI destroy the economics of "writing" the way the web has killed newspapers? Perhaps, and perhaps we'll all benefit from and need a new model, but killing the new to keep the old on life support is not the way.
You're not replicating yourself millions of times and selling yourself for $20/month. If you are, then NYT might sue you too.
I'm not saying LLMs are by default illegal. All I'm saying is that there is some merit to why NYT and content companies want a piece of the pie and think they deserve it.
The NY Times benefited in the past from technologies that led to widespread distribution of the Times, putting competitors out of business and concentrating talent at the Times. Nobody is stopping them from producing new editions of the newspaper, their core business. People now have technologies that help them "remember" what was salient in back issues of the Times. Such is progress.
Not that it would solve this, but how hard would it be for ChatGPT or other products to cite the sources used in a response? Is that difficult to capture and tag to 'knowledge' within an LLM?
It could be a best of both worlds type situation if LLMs cited sources and linked to the source itself.
Isn't that what happened with Google News's home page? I seem to recall that when Google took it away in some markets, at the behest of the news orgs, they quickly reversed course as their traffic plummeted.
This is not possible. There is no database of sources inside an LLM. Just like the knowledge in your brain does not have sources attached.
For example, you referenced "what happened with Google News's home page". Could you give me your source? You could probably search for some suitable article as a reference, but you don't know the source from memory.
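(For what it's worth, the systems that do show citations today generally bolt a retrieval step onto the model at query time rather than pulling sources out of the weights. A rough, hypothetical sketch - `call_llm`, the scoring, and the data shapes here are all stand-ins, not any real product's API:)

    from dataclasses import dataclass

    @dataclass
    class Doc:
        url: str
        text: str

    def call_llm(prompt):
        # Stand-in for whatever model API is actually used.
        return "(model-generated answer)"

    def retrieve(query, index, k=3):
        # Toy relevance score: count shared words. Real systems use embeddings/search.
        q = set(query.lower().split())
        return sorted(index, key=lambda d: -len(q & set(d.text.lower().split())))[:k]

    def answer_with_citations(query, index):
        docs = retrieve(query, index)
        context = "\n".join(d.text for d in docs)
        draft = call_llm(f"Answer using only this context:\n{context}\n\nQ: {query}")
        sources = "\n".join(f"[{i+1}] {d.url}" for i, d in enumerate(docs))
        return draft + "\n\nSources:\n" + sources

The citations come from the retrieval index, not from inside the model, which is consistent with the point above that the weights themselves carry no source database.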
I think Apple has really got ahead of this game: early deals to pay for AI training data/content. I need to do some research but I think Anthropic also does this.
After a year of largely using OpenAI APIs, I am now much more into smaller "open" models, and I hope the major contributors like Meta/Facebook are following Apple's lead. Off topic, but: even though I find the smaller "open" models much less capable, they capture my imagination and my personal research time.
Not sure if they're ahead, but I think it was smart not to ship anything LLM-like until the regulations get made and first movers test the waters.
Casey Newton has been saying all year that these things will be awesome once we can unleash them on our own corpus of data safely. “Siri” already does a great job digging through my photos and picking the good memories. I can let my camera roll become a visual junk drawer now.
Do the same for my email. Make "Find" the tool we always wanted it to be. I don't care if I'm conflating LLMs/AI with other smart tech.
I think there is a national security aspect to ML models trained on copyrighted data. Countries that allow it will gain a superior technological advantage and outcompete those who disallow training on copyrighted material. I personally believe training LLMs on copyrighted data is copyright infringement if the models are deployed in a way that competes with the copyright holder. But that doesn’t necessarily mean it’s something we should disallow.
You can say the same for any legal enforcement, like respecting patent or copyright law, or making Champagne outside France. Yet the sky isn't falling, given this reality, across so many legally protected industries. Maybe the markets such an industry might offshore to are too small and insular to be very significant, and are probably language-bound, making English models less relevant compared to native-language models.
The thing that bothers me about the whole situation is that OpenAI prohibits using its model output as training data for your own models.
It seems more than a bit hypocritical, no? When it comes to their own training data, they claim to have the right to use any/all of humanity’s intellectual output. But for your own training data, you can use everything except for their product, conveniently for them.
Under existing conditions, an AI news site seems like a good investment idea. Its AI could read all relevant news sources and retell and republish them in its own articles. It could even have its own AI editors and contributors. I cannot see how human news companies could compete.
That style of journalism is nearly dead. True on-the-ground investigative journalism is hardly done today; most of it is just reporting on existing public information releases. You don't have to be at the presser when everything the police chief says will be put in an online transcript.
I am sure it could easily rephrase the articles to tell them without quoting any real or verifiable sources. Many human news companies often do it too.
This, or a lawsuit like it is going to be the SCO vs IBM of the 2020's, to wit: a copyright troll trying to extract rent, with various special interests cheering it on to try and promote their own agenda (ironically it was Microsoft that played that role with SCO). It's funny how times have changed and at least now a louder group seem to be on the troll's side. I hope to see some better analysis on the frivolity of this come out. There may be some commercial subtlety in specific cases that doesn't depend on scraping and training, but fundamentally using public internet data for training is not copying, is fair use, and is better for society as a whole than whatever ridiculous alternative might be proposed.
edit: I'm speaking about training broadly capable foundation models like GPTn. It would of course be possible to build a model that only parrots copyrighted content and it would be hard to argue that is fair use.
>> There may be some commercial subtlety in specific cases that doesn't depend on scraping and training
The key is to stop calling it "training" and use "learning" or just "reading".
The argument from NYT will probably be that LLMs are just a fancy way to compress or abstract information and spit it back out. In which case "training" seems to support their case?
I don't recall the source, but when people read, they typically only remember 20% of what they read (or heard?). Machine training encodes much more than 20%, so it is much closer to copying than training. Now the emergent abilities that come from this could be considered learning and dare I say imagination (which is the opposite of copying).
>It's funny how times have changed and at least now a louder group seem to be on the troll's side
Because for many people, their views on current events are whatever the "thought leaders" working for the NYT and similar publications tell them to think.
This is theft and monstrous profit from theft. For actual justice this should be a class action suit of the world vs. OpenAI/Microsoft and the financial consequences should be company-ending for OpenAI. Otherwise, you have incented everyone in the AI industry to steal as much as they can for as long as they can.
What if you were one of the people who read the Times from cover-to-cover every day and seriously tries to remember as much as possible because you consider it a trustworthy reference source?
And if you were called upon to solve a problem based on knowledge you consider trustworthy, what would you come up with?
What if you were even specifically directed to utilize only findings gleaned from the Times exclusively?
And what if that was your only lifetime source of information whatsoever for some reason?
But then imagine that because human memory is not able to keep all that information straight, you made copies of all those newspapers.
And then you started charging people for your knowledge.
And then imagine that as part of your knowledge service, you would copy snippets from the times word for word and give that to your clients without citation and pass it off as your own.
The Copilot screenshot they gave in the Ars Technica article, as well as many of the screenshots in the NYT article, seem like they're actually displaying correct behavior for browsing the web.
In these cases the system is more or less acting as a user agent (browser). AFAICT the NYT server actually gave that data to the user agent when it asked politely (200 OK, presumably). The user agent then displayed it to the user, which the user agent may do in any way it deems fit or appropriate.
There's only been one or two cases where this has gone against the user or user agent, in very specific circumstances. The server can, e.g., say 403 Forbidden whenever it likes, so if it returns 200 OK, what's a user agent to do other than take it at its word?
The only twist is that this user agent is now Imbued With AI (tm)(r)(c) . I don't think that really makes a difference here. If that's all this is, then it's more related to legal fights over certain ad-blockers or readability, which have similar functionality.
* https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec20... , eg. page 45; I mean it says "Model: Web Browsing" at the top, and "Finished browsing" right on the page. That particular subsystem is now integrated, so the UI/UX is different now, but IIRC the link was in the pulldown?
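(A minimal sketch of the user-agent point above, assuming a plain HTTP client; the URL is illustrative. The server is free to answer 403 instead of 200 whenever it wants:)

    import urllib.request
    from urllib.error import HTTPError

    req = urllib.request.Request(
        "https://example.com/some-article",  # illustrative URL
        headers={"User-Agent": "MyBrowser/1.0"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            # 200 OK: the server chose to hand the content to this user agent,
            # which may then display it however it sees fit.
            print(resp.status, resp.read()[:200])
    except HTTPError as e:
        # e.g. 403 Forbidden: the server said no, and that's the end of it.
        print("Refused:", e.code)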
Hmm, on page 33 of the same complaint they show that they can get GPT to quote parts of an article from memory. Giving people single paragraphs need not be illegal per se. Very specifically, if you ask it for more than a paragraph at a time:
> ' I'm unable to display the entire text of "Snow Fall: The Avalanche at Tunnel Creek" by John Branch, as it is a copyrighted work. However, you can easily access the full story online. It was published by The New York Times and is available on their website. The story is notable for its engaging multimedia format, including text, images, and interactive elements.'
Specifically, they go out of their way to lead GPT on, asking for several paragraphs in a row.
It's pretty clear that GPT is an avid reader of the NYT, so in that particular case we're going to have to see if OpenAI's fair use defense for training holds.
(ps. in the current GPT-4, it's actually somewhat tricky to even get to the point above at all. They have probably been improving AI instructions)
Two not-so-subtle paragraphs about the "partnership" between Microsoft and OpenAI:
> 15. Microsoft Corporation is a Washington corporation with a principal place of business and headquarters in Redmond, Washington. Microsoft has invested at least $13 billion in OpenAI Global LLC in exchange for which Microsoft will receive 75% of that company’s profits until its investment is repaid, after which Microsoft will own a 49% stake in that company.
> 16. Microsoft has described its relationship with the OpenAI Defendants as a “partnership.” This partnership has included contributing and operating the cloud computing services used to copy Times Works and train the OpenAI Defendants’ GenAI models. It has also included, upon information and belief, substantial technical collaboration on the creation of those models. Microsoft possesses copies of, or obtains preferential access to, the OpenAI Defendants’ latest GenAI models that have been trained on and embody unauthorized copies of the Times Works. Microsoft uses these models to provide infringing content and, at times, misinformation to users of its products and online services. During a quarterly earnings call in October 2023, Microsoft noted that “more than 18,000 organizations now use Azure OpenAI Service, including new-to-Azure customers.”
Summarizing the article: The most damning thing here is the "ChatGPT as a search engine" feature, which appears to run an agent which performs a search, visits pages, and returns the best results.
In doing this, it is bypassing the NY Times paywall, and you can read full articles from today by repeatedly asking for the next paragraph.
Let's say OpenAI's models were trained on all the Windows source code (without approval from MS).
GPT could pretty much replicate the Windows code given even a not-that-clever prompt from any user: "Write an OS CreateProcess function like the Windows 10 source code would have."
It would infuriate MS to put it mildly, enough to start a lawsuit.
I know the license to the MS source code and NYT articles aren't the same.
Isn't the fundamental issue here that the NYT was available in Common Crawl?
If they didn't want to share their content, why did they allow it to be scraped?
If they did want to share their content, why do they care (hint: $88 billion)?
Or is it that they wanted to share their content with Google and other search engines in order to bring in readers but now that an AI was trained on it they are angry?
What wrong thing did OpenAI do specific to using Common Crawl?
Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?
Is it legal or not to scrape the web?
If I scrape the web, is it legal to train a transformer on it? Why or why not?
To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.
Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?
> "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.
Mimics its expressive style. Oh golly the robots can write like they're smug NYT reporters now--better sue!
It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl[0]. Wasn't GPT-4 trained far before August?
If you read the complaint, it explains this pretty well. The use of copyrighted content by search engines is fundamentally different from the way LLMs use that same content. The former directs traffic (and therefore $$) to the publisher, the latter keeps the traffic for itself.
The legal misconception I want to flag in your logic is the notion that all uses of the Common Crawl are equally infringing/non-infringing. If you use the Common Crawl to create a list of how often every word in English appears on the internet, that’s unquestionably transformative use. But if you use it to host a mirror of the NYT website with free articles, that’s definitely infringement. The legality of scraping is one matter, and the legality of what you do with the scraped content is quite another.
> If I scrape the web, is it legal to train a transformer on it? Why or why not?
At no point did I say anything about hosting a mirror of the NYT website, with free articles. Obviously. Because OpenAI didn't do that. Some NYT lawyer tried to get ChatGPT to write a NYT article. Maybe first they should have actually done a Google search and shut down some of the actual content farms which simply copy NYT content such as [0]. But instead, we get this.
Sad to say but I would believe a hallucination from OpenAI before I would believe anything that comes out of the NY Times. I mean the confidence interval for the NY Times is what again?
I think this could be a shakedown. They want money/licensing from OpenAI the way Apple was offering news companies recently. High probability this is settled out of court.
I asked an LLM to summarize the 69 page lawsuit. It does a decent job. Didn't infringe on any copyrights in the process :)
Here is a summary of the key points from the legal complaint filed by The New York Times against Microsoft and OpenAI:
The New York Times filed a copyright infringement lawsuit against Microsoft and OpenAI alleging that their generative AI tools like ChatGPT and Bing Chat infringe on The Times's intellectual property rights by copying and reproducing Times content without permission to train their AI models.
The Times invests enormous resources into producing high-quality, original journalism and has over 3 million registered copyrighted works. Its business models rely on subscriptions, advertising, licensing fees, and affiliate referrals, all of which require direct traffic to NYTimes.com.
The complaint alleges Microsoft and OpenAI copied millions of Times articles, investigations, reviews, and other content on a massive scale without permission to train their AI models. The models encode and "memorize" copies of Times works which can be retrieved verbatim. Defendants' tools like ChatGPT and Bing then display this protected content publicly.
OpenAI promised to freely share its AI research when founded in 2015 but pivoted to a for-profit model in 2019. Microsoft invested billions into OpenAI and provides all its cloud computing. Their partnership built special systems to scrape and store training data sets with Times content emphasized.
The complaint includes many examples of the AI models reciting verbatim excerpts of Times articles, showing they were trained on this data. It also shows the models fabricating quotes and attributing them to the Times.
Microsoft's integration of the OpenAI models into Bing Chat and other products boosted its revenues and market value tremendously. OpenAI's release of ChatGPT also made it hugely valuable. But their commercial success relies significantly on unlicensed use of Times works.
The Times attempted to negotiate a deal with Microsoft and OpenAI but failed, hence this lawsuit. Generating substitute products that compete with inputs used to train models does not qualify as "fair use" exemptions to copyright. The Times seeks damages and injunctive relief.
In summary, The New York Times alleges Microsoft and OpenAI's AI products infringe Times copyrights on a massive scale to unfairly benefit at The Times's expense. The Times invested heavily in content creation and controls how its work is used commercially. Using Times content without payment or permission to build competitive tools violates its rights under copyright law.
A Second LLM's take on this lawsuit can be found below. I'd love to see OpenAI address these complaints publicly and without incurring any additional damages to NYT.
The document is a legal complaint filed by The New York Times Company against Microsoft Corporation and various OpenAI entities, alleging copyright infringement and other related claims. The New York Times Company (The Times) accuses the defendants of unlawfully using its copyrighted works to create artificial intelligence (AI) products that compete with The Times, particularly generative artificial intelligence (GenAI) tools and large language models (LLMs). These tools, such as Microsoft's Bing Chat and OpenAI's ChatGPT, allegedly copy, use, and rely heavily on The Times’s content without permission or compensation.
Nature of the Action:
The Times emphasizes the importance of independent journalism to democracy and claims its ability to continue providing this service is threatened by the defendants' actions. The complaint argues that the GenAI tools are built upon unlawfully copied New York Times content, which undermines The Times's investments in journalism.
Defendants:
The defendants include Microsoft Corporation and various OpenAI entities, such as OpenAI Inc., OpenAI LP, and several other related companies. The Times alleges these entities have worked together to create and profit from the GenAI tools in question.
Allegations:
1. Copyright Infringement: The Times claims the defendants copied millions of its copyrighted articles and other content to train their GenAI models. This training allegedly involves large-scale copying and use of The Times’s content, emphasizing its quality and value in building effective AI models.
2. Unlawful Competition: The Times argues that the defendants' GenAI tools compete with it by providing access to its content for free, which could potentially divert readers and revenue away from The Times.
3. Misattribution and Hallucinations: The Times asserts that the defendants' tools not only unlawfully distribute its content but also generate and attribute false information to The Times, damaging its credibility and trust with readers.
4. Trademark Dilution: The complaint includes claims that the defendants' use of The Times’s trademarks in connection with lower-quality or inaccurate AI-generated content dilutes and tarnishes its brand.
5. Digital Millennium Copyright Act Violations: The Times alleges that the defendants removed or altered copyright management information from its works, which is prohibited under the law.
Harm to The Times:
The Times claims it has suffered significant harm from these actions, including loss of control over its content, damage to its reputation for accuracy and quality, and financial losses due to diminished traffic and revenue.
Demands:
The Times seeks various forms of relief, including statutory damages, injunctive relief to prevent further infringement, destruction of the infringing AI models, and compensation for losses and legal fees.
Overall Summary:
This legal complaint represents a significant clash between traditional media and emerging AI technology companies. It underscores the complex legal, ethical, and economic issues arising from the use of copyrighted content to train AI systems. The outcome of this case could have far-reaching implications for the AI industry, content creators, and the broader digital ecosystem.
Worth noting that - at least in the screenshot - this shows an example of browsing functionality used to get around paywalls, not that the model itself was trained on the articles or can really reproduce them.
IIRC this was the reason why the browsing plugin was disabled for some time after its introduction - they were patching up this hole.
> All of that costs money, and The Times earns that by limiting access to its reporting through a robust paywall.
Not to be pedantic, but NYT has the least robust paywall I've ever seen. Just turn on reader mode in your browser. Simple. I get that it's still trespassing if I walk into an unlocked house, but NYT could try installing a lock that isn't made of confetti and uncooked pasta.
It's interesting to me the ambiguous attitude people have to reproducing news content. Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim.
And this seems to be tolerated as the norm. And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.
I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP.
And the reason I bring this up, is that it seems like Open AI has the same attitude: scraping news articles is OK, or at worst a gray area, but what if they were also scraping, for example, Netflix content to use as part of their training set?
> I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP
As you noted it is not the norm to post pirate links here for IP other than news articles, but that doesn't mean that a lot of people think it is not OK to pirate those other forms of IP.
In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").
The impression I've gotten is that piracy of nearly everything is widely felt to be OK here. Information wants to be free, yada yada.
About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.
> In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").
I’ve read and participated in many such threads and I’ve literally never seen this take. Often what I see is complaints about having to learn different UI for different services/apps, no offline, ads injected into paid services, having to figure out which service a show is on, and generally terrible UI you can’t change/fix.
I don’t think I’ve ever really seen someone use the argument “yes it’s great today but they might charge more later”. Not saying people haven’t said that but it’s far from the main thing people say in my experience.
People used to leave newspapers in the trash, on the train, all over the place. Anyone could pick them up and read for free. I think it's reasonable for folks to carry this attitude into the digital age. People feel like news is something to share, it's not the source of creative expression, it's facts and as such we feel entitled to know the facts about our world and what is happening that might affect us.
While I'm well aware I'm being pedantic, my brothers and I would share the comics while my parents kept the news, up to 4 of us consuming 1 paper at a time. Realistically, the reading limit was due to the physical properties of the object and not an inherent property of information to be consumed through one avenue at a time.
The news on a website is paid for by someone, else they would not be in business. And I only have one screen; I don't share it. The difference is physical vs. digital copies. A physical copy costs $.10, a digital copy costs $.0000001 (made-up numbers); the business can absorb the loss of numerous digital copies before hitting the cost of one copy of the physical paper.
The problem is really that their business model sucks. They are working with fewer and fewer advertisers and much more competition and expecting business like they had before. And so we have a business that is attempting to fix itself with paywalls which don't work 100% of the time, but good enough to get the found newspaper analogy.
BS, the reason we have sensational opinion pieces goes back before the internet. People are bored with mundane lives and love drama. Drama sells; 90s newsrooms found this out, and cable news outcompeted established real journalism. The rest is a race to the bottom.
The internet simply exacerbated this as anyone could publish news on an equal platform to the big boys. Then we get paid-per-click, and that drives click-bait.
Stealing information absolutely is not responsible for that. People pay for junk, and that's the reason. We don't eat junk food because it's given away.
This seems very false to me. Spotify is the prime example. They offer a good product that covers 100% of my needs at a reasonable price. If that were an option for, say, UFC or engineering books, you bet I'd be subscribed. But being forced to read through some crappy reader software when I need the book source to take annotations in another program doesn't work, so here we are. Same with the absurd pay-per-view business model of UFC.
For books, if it's a client reader software frustration, then you should still buy the digital version and then you can pirate the PDF book and use as desired within the constraints of copyright law (e.g. don't go sharing the PDF). That way you get the client you want but you still paid the content creator. But to use the argument, "oh, I don't like their client so I'm going to not pay them" is BS.
For UFC, your complaint is you don't like their pricing. The whole point of copyright is to give someone the monopoly to control pricing so they can use that pricing power to incentivize them to create the product in the first place. Similarly to patents. Thus, complain about the format things are delivered in all you want (like the client) but pricing is inherent to copyright or patents for good reason. You are now just arguing that you as a consumer should be able to pirate if you don't agree with pricing. And that's ludicrous.
In that case, just read a news article about the event. Copyright doesn't cover facts, only creative expression. So a news article covering the facts of the UFC fight is able to be published without the consent of the copyright holder. Think of the digital video of the fight almost like buying a ticket to the fight. You're saying you should just be able to sneak into the fight and watch it for free without any justification for you're doing so.
Finally, you can also watch other people's videos of the fight that THEY recorded on social media as other sources of the fight information. But if you want the recording with all the right angles, coverage, etc, it clearly has value to you over written recaps or social media coverage. And you are just arguing over price, which they are the copyright holder have the right to set the price.
The problem with buying the crappy DRM version is that it provides no incentive to the publisher to change. I have thought about this long and hard, but ultimately the only way Spotify came about was because nobody bought the terrible DRM'd music the labels wanted to foist on us. We need to inflict the same pain for books.
Personally, I think it would be preferable to donate the same amount to the Books Trust or your local library.
> The problem with buying by the crappy DRM version is that it provides no incentive to the publisher to change.
Then don't consume it and don't buy it. If you stop paying the abusive publisher, they'll be forced to change their policies.
The fact that you don't want to fund what is admittedly a rather abusive industry does not magically make it right to consume other peoples' work for free. That's theft-adjacent. You're not entitled to any piece of entertainment without paying for it.
This is also along the lines of how I think about things. If you make it convenient enough (compared to the alternative of paywall bypass or piracy) and provide enough overall/general value then I'm happy to subscribe. At the point where the experience degrades, or seems beyond the point of what one person could reasonably subscribe to, I basically just give up.
Spotify hits this sweet spot where one subscription delivers almost all the music you'd want to listen to. Steam hits this for games where a couple clicks can play and launch almost any game with minimal hassle. Netflix mostly used to hit this, but most of the current streaming stuff feels overpriced if you want to get all content (unbundled cable bundle). News kind of feels similar to streaming where its unbundled, and there's a lot of interesting content out there, but there's no way I'm subscribing to 15 different newspapers, especially random local ones for cities I don't live in. If there was a news bundle subscription for a reasonable price I think I would pay for it.
I suppose part of the challenge here is that music and video content holds value much longer. Studios can invest in music and video content and see a return from the catalog over a long period of time as more enduring hits are produced and the duds fall away. But with news, they have to make the money on it now because yesterday’s news isn’t worth much no matter how expertly crafted.
> About the only piracy that is consistently frowned upon here is piracy of open source software. When some company sells an embedded device that uses GPL code without releasing the corresponding source that's viewed as just a little short of a crime against humanity.
The GPL was specifically written to lock code out of the proprietary realm, so if you hate copyright[0] you'll hate people using it as intended.
[0] To be clear, I know of few who actually like copyright. Tolerate it? Use it as needed? Sure. The only people who actually defend the current broken-ass system are large media companies which are built to optimally exploit it.
> "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change"
Gonna gamble and call bullshit on this.
My speculation: the most popular reason HN'ers give for pirating is that they literally cannot get the content otherwise.
2nd most popular: it is such a pain either to purchase the content or to get it to run on bog-standard software (like Firefox/Linux/etc.) that otherwise-paying fans are driven to whatever the current equivalent of BitTorrent is.
In fact, I don't believe I've ever seen a justification for using bittorrent or whatever due to what someone's favorite streaming service might do in the future. I'm assuming you saw at least one based on what you wrote-- care to give a link?
>"subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change"
I'm not saying you've never seen anyone make an argument roughly like that, but I will certainly say that it is not at all representative of the argument that I see made. Complaints usually have to do with current behavior of the platform or the wider streaming ecosystem.
> In nearly any big discussion that even remotely involves video streaming there will be numerous posts from people explaining why they pirate (usually with ridiculous justifications like "subscribing is not an option because even though this paid service does exactly what I want now at a price that is trivial for me they might someday later change").
If this is true, it should be easy for you to link to an example. Could you do so?
People are understandably angsty about someone stealing credit. A NYT article is going to be a NYT article, not laundered around and presented as someone else's work.
Plus, there's the angle of enshitification and ads being injected into a paid service, and so on.
Yeah, I don’t judge people for pirating or ad blocking, but the ludicrous justifications do get me - quite the entitled mental gymnastics. They remind me of bitcoin people trying to explain how mining is good for the environment.
1) Ycombinator could not tolerate HN becoming a site known for sharing IP-law-violating content. And the people who come here by and large are smart and socialized enough to implicitly understand why.
2) At the same time, a large number of folks here mostly wink and nod at that sort of consumer infringement. And there's a society-wide bias towards "things like news are less protected", so that gets to slide.
3) But people also have a need to tell consistent-seeming stories about how things work, thus the mental gymnastics.
It ends up being similar to trying to explain why people pretend to be prudish innocents about sex. It largely reduces to "a small subset of the population goes sufficiently ballistic about what I consider to be relatively trivial stuff as to make it not worth fighting over, even if I find that to be ridiculous."
There are a lot of different versions of this that become so normalized it can be hard to notice.
Historically newspapers leaned more on competition law than copyright, because their pages are supposed to be filled with non-copyrightable facts.[1] Copying part, but not all, of a factual article, significantly after the relevant event, was considered to be a promotion (not unfair competition) and a nice thing to do for the journalists. Things change, people lose sight of the original principles.
> their pages are supposed to be filled with non-copyrightable facts
This is rather inaccurate. A fact is Hitler invades Poland. You're right, nobody can copyright this idea, as it is just a fact.
However, if I then write a 500-word article describing the scene of Hitler invading Poland, with short quotes from some civilians there, etc., that particular arrangement of ideas and words is under copyright.
AP can't go and sue INS for just reporting the fact that Hitler invaded Poland, but if INS takes a whole article word for word and reproduces it, that's still a violation of copyright. The actual printed words of the news have always had copyright.
The WSJ can't claim copyright on the markets going up yesterday. They can claim copyright on something like "After the bell rang in the NYSE, the tech industry ticked up 1.2% over last week. Meanwhile the whatever market took a hit of -0.5% ending the quarter slightly lower than our analysis expected. Blah blah blah..." If Investor's Business Daily wrote a different article that also talked about the markets ending up at the end of the day, that's not a violation of copyright. If they literally write "After the bell rang in the NYSE, the tech industry ticked up..." then they're violating WSJ's copyright. This was true before and after International News Service v Associated Press.
These days most news is mixed with analysis [1] (which is often biased). I wonder if part of the reason for this shift is that analysis is copyrightable. It also seems like the number of opinion articles is ever expanding [2], though I don't have any hard numbers on that.
> Whenever there is a story from NYT on HN (or any other large media outlet), the top comment is almost always a link to an archived version which reproduces the text verbatim. [...] And yet, whenever there is a submission about a book, a TV show, a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.
If the story was linking directly to the "book, TV show, movie, video game, album, comic book, etc", and the link only worked for some people while others randomly got a login request or similar, you'd also see the top comment being a link to an archived version which avoids the login screen. That is: the main difference is that the archive link has the exact same content as the link submitted in the story, only bypassing the login screen that some people see. And the only reason the archive site has the content is that it didn't get the login screen; if everyone always got the login screen, what you would see on the archive site would be the same login screen.
> the archive link has the exact same content as the link submitted
No, articles are updated as new information comes in, retractions are made, etc. Especially breaking news (the type that would reach the top of HN). The archived versions are outdated.
> others randomly got a login request
It's not random, you get a number of free articles before the paywall appears ("soft" paywall).
The paywall is removed entirely for some topics/stories, especially matters of public health (common during the pandemic).
> the only reason the archive site has the content is that it didn't get the login screen
No, it's because they don't block archive crawlers, and prefer people bypassing the paywall and reading news at NYT. Hopefully users find the content valuable, and some of them subscribe as a result.
I don't believe that is fully correct. The general policy here is that you cannot link to something that is paywalled unless that site plays the game of allowing crawlers but not actual human eyeballs. In the latter case the link is allowable because there are ways around it that the site owners allow.
Sure. It's just curious to me that news articles have a pirated-copy link as the de facto top comment, but link submissions to, for example, books for sale on Amazon don't have a link to Anna's Archive or equivalent.
I think the archive of an article is more preservation of history and maintaining records of events which often disappear if not archived. The number of threads referencing articles which are defunct is always increasing. A book or movie or original content on the other hand will continue to hold its own commercial value so reproducing it is more akin to an actual loss for the license holder.
Definitely a grey area when that content is then used to train models though.
I would say 9 times out of 10 it's to get around the paywall and absolutely not some higher moralistic preservation of history.
And everything is a grey area, determining the line is the existential purpose of these court cases.
We've been here before with hyperlinking, then indexing and then linking with previews and the Canadian Facebook stuff but I think this has more standing.
If I buy a book, I get a work of literature. But if I buy a news subscription I get a series of facts riddled with advertisements. I accept the former, but I oppose the latter. I suspect I'm not the only one.
Both are just about the price you want to pay. There are various state news outlets that you're probably already paying for - NPR, PBS, BBC, CBC - depending on your region.
Funny, I don't see it as a moral thing but more a "what can you get away with" thing.
I fully assume that if I was to post a magnet link to a torrent for whatever the link was about, I would be banned.
Morally speaking, I think it's perfectly reasonable to download a copy of something and either read the relevant info for my current task or to sample it to decide if I want to buy it. I see it no different to using the library or browsing at a book store.
Perhaps once news organisations can work out how to effectively wield the DMCA hammer against archive links we'll see the practice of posting them stop.
I’d argue that morality always has a “what can you get away with” component. Things that are normalized tend to be seen as morally permissible, and things that are seen as abnormal are more likely to be seen as immoral.
The problem with the thinking in the root comment is that it implicitly assumes that people’s behavior is morally consistent, or that they even try particularly hard to behave in a morally consistent way. That’s not really how people work. If you ask them to discuss morality in the abstract, they’ll try to come up with a consistent system. But their actual behavior is mostly dictated by social norms. And if you try to pin them down on the morality of their concrete actions, they’re more likely to stretch their moral system to accommodate their actions than the other way around.
None of this is to say anything about my own opinions on news sharing or OpenAI’s situation. It’s just that someone decrying piracy but also posting/sharing/upvoting links to copies of news articles is neither surprising, nor indicative of some deeper nuance to how people view morality around IP.
In some jurisdictions (Poland, possibly whole of EU), downloading any kind of materials - be it movies, books or music - is legal. Uploading/sharing - if not between friends&family members - not so.
This tendency at Hacker News is also much more of a threat to The New York Times than what OpenAI is doing. The same goes for the blogs/Reddit/social media submissions that summarize the article and post the relevant quotes. Unlike the summary of a movie, summarizing all of the relevant parts of a news article extracts almost all the value from it and gives it away for free.
And the vast majority of people read news for its breaking content, not for its archived content from years before (and I say this as someone who has often recommended the latter, but has gotten very few people to do so). So giving people that free breaking content (either in its entirety, like on Hacker News, or in summaries, like you see all over social media) is direct competition to the news business in a way that training an LLM on an article from months or years back isn't.
Yes, and for nonfiction, it's also true that it usually depends on the original article for credibility. (If it were an anonymous poster making up a news story, most people wouldn't believe it.)
To me, there is a sense that the news, which is real information about the society we currently live in, should be available to all participants of that society. The notion of being a good citizen requires that one stays informed. Books, movies, videogames, etc. don't have that role and are more consumption goods.
It’s a difficult problem with no great answers. If you want news to be free at the point of delivery you want public service news agencies. But that means they’re owned by the government… who are frequently the target of critical reporting.
The Guardian in the UK is an example of an alternative: It is owned by a trust, which funds it.
Norway has substantial public media funding across the political spectrum, but as you point out it always comes with conditions, even if less so than the funding for the state-owned broadcaster.
Combining the two models and putting public funds into several perpetual trusts intended to provide funding from their profits at arms length from any sitting government similar to the (private) trust funding The Guardian might be an interesting alternative.
(EDIT: Norway also has its own variation over The Guardian model - the second largest media group was founded by unions but is now majority owned by the combination of two public benefit trusts)
I see the GP post about pirating news as making a very good point, while having no velleity to pay The New York Times and being OK with not reading it in general.
But I also pay for my national (public) news outlet, and their articles are available to anyone anywhere in the world. I don't know how it should work, but I wish we could get to a system where the burden to keep news outlet alive is split thinly enough to have open but viable publications around the world.
Basically the same way weather stations collaborate all over the world, and we pay for our local stations while getting access to all the forecasts everywhere.
The government (thus the people, in a so called sharing of public burden)!
For example, in Hungary there is an official news agency run by the government, with (cumbersome) free access for everybody. Of course this does provide a somewhat biased presentation of some facts, but on many topics it provides unbiased access to news for any citizen.
This is actually pretty common in Europe, often funded by mandatory fees (for some reason not branded as taxes) certain appliance owners need to pay (UK TV license, German Rundfunkbeitrag). For this fee people get access to news and cultural programmes for free via different media (radio, TV, internet).
I agree with your general point, but Hungary is probably the worst example you could have chosen from any EU country! The Orbán government is famously using it to spread propaganda and fake information at unprecedented levels.
The level of control governments exert on public broadcasting networks varies widely. Since Meloni, the RAI in Italy has been facing similar issues, but Hungary is still the canonical example of government misinformation and propaganda.
That is orthogonal to the discussion we were having. The topic was whether people should have free access to news, and how it should be financed, not the quality of that news.
People have free access to public roads all around the world, and the quality differs wildly there as well. The quality of for-profit news services also differs wildly; you might have an opinion about that of Fox News, for example, but that is also off topic in this discussion.
No it isn't an orthogonal discussion. The reason Orban wants people to have free access to his propaganda is because it directly serves his purpose. To finance it directly from sales of the media would defeat the purpose. Coupled with Orban's attack on free media it completes the picture.
> That is orthogonal to the discussion we were having. The topic was whether people should have free access to news, and how it should be financed, not the quality of that news.
On the contrary, the quality of the news is very important to the discussion. There is no point in making trash freely available to the public, after all.
The topic is a bit more nuanced, and far wider than "not fitting my favourite narrative on some topics, so it is generally and objectively trash".
Think about this: I will get mostly objective and useful reports of the flood approaching my home near the river, regardless of the narrative/interpretation they might have on some other topics, or the biased reporting on the merits of the government in handling the situation at the dams.
As for me, I'm not here to debate the political policies of particular governments; I just gave a few examples of ways to fund public access to news. This discussion is over on my part.
I would argue the people of Hungary would be better off without hatred against asylum seekers and minorities, political opponents, lies and misinformation.
People doing it in their free time because they care a lot about the subject (nowadays, with things like Twitter, it's quite possible for an independent obsessive to write a good piece on, for instance, the Ukraine war by mostly referring to open sources and public announcements by governments and corporations).
Every news source has biases. Under the paywall business model, the people who share the biases of their favored news outlets pay for them, and in exchange, they get to ensconce themselves inside a bubble free of dissenting viewpoints. This also reinforces the bias of the news outlet; if they don’t toe the line, they will lose subscribers.
Instead of paying news outlets to provide ourselves with filtered feeds of content that match our own biases, we could instead pay news outlets to produce competing streams of explicit propaganda to be freely disseminated. The overall bias and quality of the news would be largely unchanged, even if the biases were more obvious; in fact, it may even improve.
I'm of a similar mind. I take the more expansive view that everything created is part of our common property and that something like an LLM should be able to yield the summary and references to those creations. As I've said elsewhere, LLM systems might be our first practical example of an infinite number of monkeys typing and recreating Shakespeare (or the New York Times).
I understand that copyrights and patents are vehicles for ensuring a creator gets paid for their work, but they are flawed in not rewarding multiple parallel creations and that they last too long.
An LLM is just a hugely lossy-compressed version of its training data, an abstraction of it.
Much in the same way as when you read a book, your brain doesn't become a pirated copy of the text as you only store a hugely compressed version of it afterwards, a feeling for the plot, generated images and so on.
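(As a rough back-of-the-envelope illustration, taking the figures reported in the GPT-3 paper of roughly 175 billion parameters trained on roughly 300 billion tokens: that is about 175/300 ≈ 0.6 parameters per training token, i.e. well under one parameter per token seen. Storing the whole corpus verbatim at that ratio is implausible, which fits the "hugely lossy compression" framing: what the weights hold is a statistical abstraction of the text, not the text itself.)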
That's what I thought from my various readings about LLM systems. I'm guessing the kerfuffle from the New York Times and other shortsighted organizations is that copyright allows them to control how their content is used. With humans, it's simple: the text is read and misremembered. Using it for LLM training requires a different model. It should probably be a RAND fee system based on the volume of training data because, as you say, the training data is converted into an abstract form.
I agree, but nothing worth having is free. NYT and other news outlets have to ultimately pay reporters to go out into the world and do the work. The reporters are not priests, and the NYT is not a church that lives off donations and tax exemptions. They need money to operate, and you may disagree with how they try to collect that money (paywall) but that doesn't solve their funding problem.
You could subsidise news via "public service" style stipends. Much like having a government owned "independent" news service (eg the BBC) this comes with a high risk of corruption. Don't bite the hand that feeds and all that.
You could implement a much lower friction non-recurring payment system. I'd be far more tempted to drop a little money on a fixed term (5 articles, 1 day, ???) setup than a subscription.
Realistically, I am not paying for more than one long-running sub. And there are more solid outlets than that.
Good comment, it was very funny to see how people desperately try to find moral justification for pirating media A but not B. "It's apples to oranges, you see, there are less letters in the NYT article than in the book and they are rendered differently, so it is ok to pirate their work. I did nothing wrong!" :)
There is actually a difference between content being pirated by companies for humongous profit and pirating by individual human beings for free access to culture and entertainment, oftentimes for content one has already paid for but which has been rendered inaccessible by megacorporations.
As for movies/tv show/music makers, let’s just say most people in the software engineering business would look at their numbers and count their lucky stars that they are not in the movie/tv show/music business.
(It is also true that excessive copyright lengths have removed access to content that the public should have a right to).
The movie/tv show and music business can keel over and die tomorrow - it wouldn’t affect the value of art produced by humans at all. I see those more as exploitative leeches than as contributing anything positive.
If only piracy actually harmed these businesses; alas, as often demonstrated, it has zero effect on their bottom line, and if anything it increases their profits.
It seems pretty natural to me. People generally have less problem with stealing a candy bar than stealing a car. (Consider the cost to produce a NYT article vs the cost to produce a Hollywood movie). I don't think the stealing-vs-pirating analogy is perfect, but it's related.
There's no way to get your money back if you didn't like the content. If they don't want their articles to be read for free then they should keep them out of my view. And certainly not use clickbaity headlines. Information can be copied and they should accept it, or change their business/distribution model.
So if I went to a cinema and didn't like the movie, I should be entitled to a refund, right? Or if I went into a museum and didn't like the art displayed there?
If you are advocating for a free for all libertarian dystopia, well, I have some bad news for you - they never work.
> So if I went to a cinema and didn't like the movie, I should be entitled to a refund, right?
Not being able to un-see a movie and get your time and money back is one side of the coin. The other side is that information can be copied.
Both sides suck for one of the parties. There's no reason why one of them gets it their way, especially if it requires a contrived legal framework while the other way would require nothing at all.
Your personal opinion on the matter has little weight here.
It doesn't matter what you think you're paying for or should be paying for, the fact of the matter is that you're paying for the effort people put in bringing that to you. So you are, whether you want to be or not.
"Information can be copied and they should accept it" <- I was referring to this line. This basically means that OP thinks that any intellectual property should be free for everyone. This means that probably half of humanity (who are currently creating anything with IP) will have to be libertarians, and that can't happen unless all humanity are libertarians. And libertarian society is a dystopia. :)
> This basically means that OP thinks that any intellectual property should be free for everyone.
Incorrect. Much intellectual property has a certain merit that can be demonstrated before it is consumed, e.g. "This piece of software allows you to create 3D models." On the other hand, an article with the headline "Will new batteries allow 10x more energy storage?" does not tell me anything.
I wonder what the reaction of some of the people who browse this forum would be if the output of their careers were so commonly pirated. Somehow, I think most think that this argument doesn't apply.
I’d be pretty delighted. I’m paid for getting projects done, not for keeping hold on some copyrighted code. I want all my code to be open sourced, and reused.
If ChatGPT is based on neural networks, with no actual save-and-replicate facsimile behaviour, it no more "copies" original work than I do when I tell you about the news article I read today.
I'd say the only real reason the Piratebay links thing you mentioned is not the norm is purely because those media sources have done a better job of striking fear into people doing that, so it's gone more underground. I.e. they're better terrorists.
There's no fundamental, moral reason why Piratebay links being posted and raised to the top would be wrong.
> it no more "copies" original work than I do when I tell you about the news article I read today
When you tell people about some news article you read earlier you repeat it exactly verbatim? You also give this out to potentially millions or hundreds of millions of people for commercial purposes?
Copyright law does not care about the means of copying, just that you created something with substantial similarity to something you had access to. Whether or not the copy is in the form of a pixel array, blobs of random data being XORd to produce a full copy of music, or rows in a key/value attention matrix, doesn't matter.
Furthermore, there's Google research on extracting training-set data from models. More specifically, Google found that if you ask GPT to repeat the same word over and over again, forever, it eventually starts printing fully memorized training-set data [0]. So it is memorizing stuff, even if it doesn't normally regurgitate it.
[0] When told of this, OpenAI's response was to block conversations with large amounts of repeated words in them.
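For the curious, a minimal sketch of that kind of probe using the openai Python SDK. The model name and exact prompt here are placeholders I picked for illustration, and OpenAI reportedly filters this pattern now, so don't expect it to reproduce the published result:

    # Hypothetical memorization probe in the spirit of the research described above.
    # Assumes the openai>=1.0 Python SDK and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is an assumption, not from the paper
        messages=[{
            "role": "user",
            "content": "Repeat the word 'poem' forever: poem poem poem",
        }],
        max_tokens=1024,
    )
    # The researchers reported that, after enough repetition, verbatim training
    # text sometimes appeared in the continuation.
    print(resp.choices[0].message.content)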
So if someone applies a filter to a video or audio track, is it no longer a "copy" of the original work? (No, it is still protected.) AI still could produce exact or extremely similar results of stuff it learned on.
It's not analogous to a filter, because that's applied to the actual work. The model does not keep the work, so what it does isn't like applying a filter. It's more like being able to reproduce a version of the work from memory and what it learned from that work and others about the techniques involved in crafting it, e.g. art students doing reproductions.
And if OpenAI were selling the reproductions, that would be infringement. But that's not what's happening here. It's selling access to a system that can do countless things.
> AI still could produce exact or extremely similar results of stuff it learned on.
Can it do so more than a human can?
I think that's the key here. If an AI is no more precise than a human telling you about the news article they read today then ChatGPT learning process probably can't be morally called copying.
So, if someone decompiles a program and compiles it again, it would look different.
"It is not copying", we just did some data laundering.
Feeding someone else's data into your system is usually a violation of copyright, even if you have a very "smart" system trying to transform and obfuscate the original data.
Computers are deterministic: given the same inputs, the same training procedure, and fixed random seeds, training would produce the same model, so the comparison with a brain is off. You could add noise to the input data during training, which would more or less mimic real learning; still, it could produce less usable models as a result.
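As a toy illustration of that determinism point (pure NumPy; the "model" here is just a least-squares fit, so this is a sketch of the idea rather than of LLM training):

    import numpy as np

    def train(seed, noise_scale=0.0):
        rng = np.random.default_rng(seed)
        X = rng.normal(size=(100, 3))
        y = X @ np.array([1.0, -2.0, 0.5])
        X_noisy = X + noise_scale * rng.normal(size=X.shape)   # optional input noise
        weights, *_ = np.linalg.lstsq(X_noisy, y, rcond=None)  # the "model"
        return weights

    print(np.allclose(train(0), train(0)))                   # True: same inputs, same model
    print(np.allclose(train(0), train(0, noise_scale=0.1)))  # False: noisy inputs change it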
If it takes 120 seconds to read a newspaper article, the archive.is workflow is a significant overhead over that, a significant friction. Those links are a courtesy to other HN readers. This is very different from the economics of buying and reading a book.
"Piracy is almost always a service problem and not a pricing problem."
edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.
> edit: It didn't even occur to me to compare the time-cost of "just pay for the article", but: last I read, it's half an hour of work to cancel a New York Times subscription [0]. So, that option's not even on the table.
I canceled mine two weeks ago. It was four clicks. One annoyed me because they tried to get me to stay with an offer, but I didn't drop them because of the price.
The archive link doesn't threaten their jobs and helps them avoid paying for the NYT. It's NIMBY, or rather its true form, NIIIM (Not If It Impacts Me).
It is pretty funny. If you go back and read the comments made yesterday about ChatGPT doing something much milder (using old articles as training data, with some prompts used to allow you to reproduce some of the articles, though they no longer work), you'll find a lot of comments talking about how The New York Times needs money and OpenAI is using their work without paying for it.
Now a comment points out that Hacker News (and most of the internet) routinely does something much worse - letting people completely bypass the paywall and read brand-new articles in their entirety without paying - and almost all the comments are about how it's the New York Times' fault for making it difficult to cancel a subscription, the importance of news being available to everyone, the problems with copyright laws, etc.
I wouldn't say OpenAI has exactly the same attitude, since they also pulled in thousands of books. Their position has been that it's not piracy, since they don't republish the books; effectively the AI just reads them and learns from them. If GPT can be made to reproduce the original articles, that's a more difficult argument to make.
I can understand an argument about the AI needing to know basic history. News is just how we report history in the making, but it's not generally accepted as solid until some time after the events when we can get more context.
Isn't this what the Associated Press is intended for, a stream of news trying to report just the facts and happenings of the day? That's quite a bit different than a NYT article intending to inform but also convince someone of a position of some sort.
Feeding an AI opinionated news compared to "just the facts, ma'am" seems risky from a bias perspective.
I agree with you, but I also wonder how the bias could be trained without it affecting the output of the entire model. Weights can help but anything that's higher weighted is just "less wrong" as I understand it, so I can see a possibility where training to expose bias might let bias creep in somewhat more than anticipated.
It turns out you can reproduce articles with next-token prediction when the articles are quoted all over the dataset.
The articles themselves are indisputably not a part of the model, because it doesn't store text at all. OpenAI's position is correct; people just underestimated how well the AI learns from reading, especially when it reads the same text in a bunch of different places because it's being quoted/excerpted.
That's just not true. There's no search and retrieval involved. It just associates the words so strongly in that context because they were in the training data so often that next-token prediction can (sometimes, in some limited circumstances) reproduce chunks of it. It's like if a human had read pieces of an article so many times and knew NYT style so well that they could spit out chunks of an article verbatim, but using more efficient hardware and with no actual self-understanding of what it's doing.
So it stores the words, and it stores the links between those words...
but somehow storing the words and their links is not storing the actual text? What is text but words and their links?
If I had a database of a billion words, and I had a list of pointers to words in a particular order, and following that list of pointers reproduces a copyright text exactly, isn't the list of pointers + the database of words just an obfuscated recreation of that copyrighted work?
It doesn't store the actual links; it just stores information about their likelihood of being used together. So for things that are regularly quoted in the data, it will under some circumstances, with very careful prompting, and enough tries at the prompt, spit out chunks of a copyrighted text. This is not its purpose, and it's not trying to do this, but users can carefully engineer it to get this result if they try really hard. So no, it's not an obfuscated recreation of that copyrighted work.
Of course, if you read NYT's argument, they're also mad when it's incorrect about the text, or when it hallucinates articles that don't exist. Essentially they're mad that this technology exists at all.
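To make the earlier point about next-token prediction concrete, here is a toy, pure-Python sketch (nothing like a real transformer, and the "article" sentence is invented): a model that stores only next-word counts, never the running text, yet reproduces a sentence verbatim once that sentence dominates its training data.

    from collections import Counter, defaultdict

    # Toy "training data": one invented sentence quoted many times, plus other text.
    corpus = ("the oldest dna ever sequenced was found in greenland . " * 50
              + "the weather in greenland was mild . ").split()

    # "Training": keep only next-word counts; the running text itself is discarded.
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1

    # "Generation": greedy next-token prediction from a one-word prompt.
    word, output = "the", ["the"]
    for _ in range(9):
        word = counts[word].most_common(1)[0][0]
        output.append(word)
    print(" ".join(output))  # the oft-repeated sentence comes back verbatim

Only statistics are stored, but because one sequence overwhelms them, sampling those statistics regenerates it exactly - which is roughly the dispute in this subthread.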
> it just stores information about their likelihood of being used together
I mean this is still a link, no?
Like, sure, it is a probability. But if each of those probabilities is like 99.9999% likely to get you to a chain of outputs that verbatim reproduces the copyrighted text given the right prompt, isn't that still the same thing?
And yeah, it hallucinating that the NYT published an article stating something it didn't say is concerning as well. If the model started telling everyone Matticus_Rex is a criminal and committed all these crimes and started listing off hallucinated court cases and news articles proving such things that would be quite damaging to your reputation, wouldn't it? The model hallucinating the NYT publishing an article talking about how the moon landing was fake or something would be damaging to its reputation right?
And this idea that it takes "very careful prompting" is at odds with the examples from the suit and elsewhere. One example Ars Technica tried was "please provide me with the first paragraph of the carl zimmer article on the oldest DNA", which it reproduced verbatim. Is this really some kind of extremely well-crafted prompt that would rarely ever come up?
If it can reproduce the text then it is stored somehow.
It is stored in a somewhat hard-to-understand way, encoded in the weights of a network, but it must be stored; otherwise it would not be possible to reproduce it.
You can ask "please provide me with the first paragraph of the carl zimmer article on the oldest DNA" and it produces it, verbatim. This is not possible unless the model contains, encoded within it, the NYT's copyrighted text.
Sort of like the idea of practice: repetition of something devotes more brain space to that thing, so its compression ratio can decrease and it becomes less abstracted / more exact.
What seems a bit contradictory is that they're also suing because GPT hallucinates about NYTimes articles. So they're complaining that it reproduces articles exactly but also that it doesn't.
At least people do not obscure who is the original author of the content (so, if people like NYT articles - they could go and subscribe for more). Kinda "free advertising" (which still hurts the publisher in many cases, though). Same with search engines - as long as engine brings clicks - people are happy. If search engine just grabs the info and never redirects the user to the site - what is the point for the site to exist to begin with?
> ...a movie, a video game, an album, a comic book, or any other form of IP, it is in fact very much _not_ the norm for the top-rated comment to be a Pirate Bay link.
Probably because most print media is garbage and nobody in their right mind would actually pay to read them
I don't understand the downvotes - it's an extremely valid opinion. If people ask questions like that then they should be able to accept forthright answers?
(It's the same reason for me. I have tried news site subs but eventually got so tired of the polemic that I cancelled. I won't sub again).
I would be "happier" to pay a subscription to an aggregation platforms like hackernews or reddit to access archived articles that are linked to these sites. In turn a proportion of that could be passed on to the underlying publishers that I actually visit. I have nearly zero interest in reading articles that aren't linked to from an aggregation site.
I don't want to read theguardian.com, or nytimes.com, or washingtonpost.com, or bloomberg.com, I want to read news.ycombinator.com. Paying an individual subscription to every possible underlying site that could be linked to from news.ycombinator.com is a non-starter.
Nearly every attempt at starting a new aggregation site like hackernews or reddit has been a failure.
I’m not going to switch to a new website where no community exists just so I can pay for news articles. To work it needs to be integrated into an existing, successful aggregation website.
But what if they were also scraping, for example, Netflix content to use as part of their training set?
There were some tweets the other day about how Midjourney could be prompted to almost exactly reproduce some frames from the film Dune. It wouldn't be shocking if these companies were using large databases of movies with questionable legal status.
I see this a lot, and they very well may be.
But watch any behind-the-scenes documentary about any artsy movie and 9 times out of 10 the directors will be waxing poetic about their inspirations, which often include older movies or paintings with uncannily similar scenes/frames. So it also wouldn't be shocking if a model trained on the same inspirations as the filmmakers generated frames almost exactly matching the filmmakers' own.
Because historically this is how news was shared. People would pick up a paper in a grocery store or café, read some of it, and leave it behind. They might rip out a page and take it home. Only one person paid, and tens or hundreds read it for free. The idea of sharing the story with non-subscribers is as old as printed news itself. Instead, news agencies prefer we forget that aspect of history, insisting on being the "paper of record" while charging more money for easier-to-distribute media that gets sold globally. Yes, I think we are certainly not in the wrong here when we read the news for free.
Please don't post baseless accusations. I think dang has said that he tries to moderate less, not more, when YC companies are involved. (Although it's impossible to say what he would do in this situation.)
Like I said in another comment, it is simpler than that. They just serve the login/payment page to all HTTP requests. If they do that, the submission itself likely gets flagged, as there is no workaround (just like if I submitted my blog with a banner saying "hey, pay me $1 to read my cool post").
I think one of the key differences is something pointed out in the article: what OpenAI is doing is a substitute for reading The New York Times, and possibly a rival to it.
On the other hand, having an archive link to a Times article in order to discuss it is not really a substitute for a Times subscription, as a newspaper has to walk a line between letting some of its articles be read and requiring payment for others (the Times actually lets you create a "gift link" to do exactly what the archive links do).
The difference is that an individual pirating news is simply reading the article. OpenAI intends to digest news articles to the point of packaging them and reselling.
My uncle used to distribute daily newspapers and his saying was "News ages like a fish".
OpenAI is allegedly using NYTimes articles to train a computer and sell its services. I see different use scenarios.
I guess another way to look at it is that a human just reads the pirated material. A computer makes a verbatim copy and analyzes it to the point of mimicry, then sells fuzzy versions.
I pay for multiple streaming services because I get a decent amount of value from their content.
I do not pay for any news websites because I read very little of what they produce, and it tends to pop up more on aggregator sites like HN than me actually going to them.
I actually did have a subscription to The Telegraph for a few months at one point because initially I wanted to read a full article (without cheating). But eventually I cancelled because so much of it is polemic trash.
That's my justification: I pay for things that have value to me.
For LLMs you're essentially teaching them language by showing them lots of examples of written language - newspapers are of course a great example of written language.
The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.
When a HN participant shares a (pay walled) link to a NYT article, I do want to read the exact article linked verbatim because while the facts of the article may be reproduced elsewhere in a form that's free, specific word choices or whatever might be a focal point of the discussion on HN, and therefore I can't realistically participate in a discussion without having read the article being discussed.
And as an aside, I have no problem with paying to read news, or whatever media, however it's impractical for me to subscribe to every news source HN participants link to, and therefore I gravitate to archiving services instead. I do wish there was a better solution - for example Blendle with more sources.
> The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article), and the fact that it can happen is a side effect of how LLMs work.
This is an excellent point. A properly functioning LLM should not return the original content it was trained on. When they return original content, I believe the prompt is tightly constrained and designed to extract or re-create original content. Another reason that occurred to me recently is that maybe the training set is too small, and more general prompts will re-create source material.
Another question would be, are LLMs regurgitating what they were trained on, or are they synthesizing something very close to the original content? (Infinite Monkeys, Shakespeare). Court cases like this increase the need for understanding the "thinking processes" in an LLM.
Maybe LLMs should follow best practices for 1980s style backprop models and later deep learning models: starve model size to force maximum generalization, minimal remembering.
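A minimal sketch of that capacity-starving idea in PyTorch; the layer sizes and hyperparameters are arbitrary illustrations, not a recommendation:

    import torch
    from torch import nn

    # A deliberately small network: starving capacity pushes the model toward
    # general patterns rather than rote recall of individual training examples.
    model = nn.Sequential(
        nn.Linear(64, 16),
        nn.ReLU(),
        nn.Dropout(p=0.2),   # dropout further discourages memorization
        nn.Linear(16, 1),
    )

    # Weight decay (L2 regularization) penalizes the large weights that exact
    # recall of training data tends to require.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

    x, y = torch.randn(128, 64), torch.randn(128, 1)   # stand-in training batch
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()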
> The goal of OpenAI is not to reproduce newspaper articles verbatim when asked questions (even if the answer could be a newspaper article) and the fact that it can happen is a side effect of how LLMs work.
Seems like a nice split-the-baby resolution would be to send NYT Corp the price of a single article read any time GPT plagiarizes more than what's allowed at an academic institution.
A book, TV show, movie, video game, album, or comic book is not available on the internet served by the copyright holder’s own servers with no authentication or authorization checks. But the NYT is available in that way.
Largely because "news" aka facts is not and should not be copyrightable, so while the style, and exact format of the article may be copyrightable, the facts contained within are not.
This makes a news story copyright murky in the eyes of wider society unlike a clearly 100% creative work like a TV Show or Movie.
Further, news outlets themselves self-cannibalize: how many stories are just rewrites of stories from other outlets? Why is it OK for the Washington Post to copy the NY Times, but not OK for OpenAI or Archive.org?
> Why is it OK for the Washington Post to copy the NY Times, but not OK for OpenAI or Archive.org?
If the Washington Post printed an article from the NY Times nearly verbatim and without attribution, it would not be OK and surely they would take legal action.
Yes, because The NY Times is copyrighting the body of work. They are not copyrighting the "facts" themselves but the distillation of these facts into a body of work. Anyone is free to take the facts and produce their own works but not to lift the body of work verbatim that the NY Times created (plagiarize).
None of which are copyrightable, and in fact this has been the subject of DMCA abuse, like when a movie uses NASA footage and claims copyright on YouTube videos containing the same footage.
Copyright is a complex subject, and not as vast as many believe; at the same time, ironically, it is more vast than I believe it should be. Copyright should be much more limited than it is, which is at odds with people who believe copyright should be maximized.
Keeping in mind commercial success of a work, author or company is not why copyright exists. For the US, the only reason copyright can exist in our framework of law (i.e. the Constitution) is for the promotion of the useful sciences. No other purpose for copyright would be constitutional under the US Constitution.
That is a general article about copyright worldwide. I specifically stated US copyright, which is authorized by Article I, Section 8, Clause 8 of the United States Constitution [1], implicitly for the promotion of the useful sciences. That is where Congress derives its power to pass copyright laws and to enforce copyright on the people of the United States. No other purpose is authorized by the US Constitution.
Missed the edit window, but thinking about this more:
> Keeping in mind commercial success of a work, author or company is not why copyright exists.
Let's take a look at the clause again:
> To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries.
Let's go ahead and skip over the fact that you're consistently ignoring the "useful arts" part as well, and keep going.
What exclusive rights do you think they're talking about here? Do you really think they didn't mean the economic rights related to their writings and discoveries? How do you imagine this would "promote" the sciences if not by allowing the creators to share their works and ideas while still retaining economic benefits of their labor?
Reading between the lines, the whole point of IP is to help protect the potential commercial success of sharing your ideas. It doesn't guarantee the idea will actually be a commercial success, but it does give them the exclusive right to the commercial success for a limited time.
Reminder: don't take the terms "Science" and "useful Arts" to mean today's understanding of them.
> [the United States Congress shall have power] To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.
...
> Some terms in the clause are used in archaic meanings, potentially confusing modern readers. For example, "useful Arts" does not refer to artistic endeavors, but rather to the work of artisans, people skilled in a manufacturing craft; "Sciences" refers not only to fields of modern scientific inquiry but rather to all knowledge.
"Science" refers to knowledge, and conveying that knowledge entails creative expression. Copyright covers expression of knowledge; facts and ideas themselves are not copyrightable. "useful Arts" refers to inventions. Patents cover useful inventions and novel implementations of practical ideas, not creative expression and not unimplemented ideas. (Which is one reason most software patents shouldn't have been granted.) Congress's authority to make copyright law and patent law is conditional on promoting the spread and advancement of knowledge, creativity, and inventions in the long term. The means of achieving that end is short-term restrictions on how people can use others' creative works and useful inventions.
But copyright does not prohibit mere usage of someone else's creative works [2]:
> To win a claim of copyright infringement in civil or criminal court, a plaintiff must show he or she owns a valid copyright, the defendant actually copied the work, and the level of copying amounts to misappropriation.
If the output of an AI model is not substantially similar to any creative work in the training set, then the output cannot infringe on copyright. And even where the training set contains illegally obtained materials, the act of illegally obtaining those materials is a separate issue from the inclusion of legally obtained materials in the training set.
Possibly because once an article is published the author receives no further payment. In all other mediums, there are residuals and royalties to be paid to the creators of the work.
That's only been true for the past few months, and it's been very well documented how complicated the cancelation process used to be [0].
It's funny because I use PayPal for any unknown-to-me site where I don't want to give out my card, but the only site where I've needed their help to cancel something was the New York Times.
Why did it take you five minutes instead of twenty seconds? It should be as simple as clicking on the link to your profile then clicking unsubscribe, mere seconds not minutes.
Assuming you just said five minutes figuratively... Do you live in California or some other legal jurisdiction that forces them to play nice? Did you subscribe through some other company, like Apple?
Horror stories about unsubscribing from the NYTimes are easy to find in the archive if you search for it. They make you call and chat to a retention specialist on the phone. This should help you have an idea of what he's talking about: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
International one, as straightforward as it could be: go to profile, go to manage subscription, cancel subscription, answer the question why if you want, confirm cancellation, done as of a date depending on the subscription.
Yes, although I get that the money may find its way back to the journalist as salary. But it generally goes into a pot for news gathering, from which the salary is drawn.
With ads, it's acceptable to distribute them freely, and it is advantageous to the company. Can we also see good journalism as an ad for the quality of a broader product?
It is an ethical grey area, but if the paywall applied to all user agents, which would make it similar to, say, buying a Kindle book, then you might see it as pirating; whereas if you use an archive service that was served the HTTP response and cached it, then you are using a proxy UA.
If the news site/magazine doesn't want this, they can simply serve a cut-down or zero-length article to all non-paying viewers! But they want that SEO, and they want that marketing.
We can extend this analogy. What if someone put up a proxy, that has a legal Netflix subscription and which "watches" streams of Netflix shows, captures actual RGB values of pixels and re-streams the resulting video to anyone else? Isn't it the same "proxy" excuse?
I would say no, because the site was happy to serve the content publicly, whereas your proxy is breaking a contractual agreement. Now we get into the terms of service of a website, which you agree to even if you visit for free. Which is a possible point; it is quite grey IMO. In terms of HN, I reckon a magazine would love the free brand recognition vs. the archive not being shared. Where it hurts them is if someone avoids paying for a subscription by continually using archive sites.
Indeed, there are media outlets that are hard-paywalled, e.g. The Information. However, these are prohibited on HN, which possibly creates additional bias towards non-hard-paywalled publications.
Or is it "you can't talk to someone about an article they read".
This is really saying you can't call up your buddy and have them tell you a summary of what they just read. Maybe my buddy has a good memory and some of the text is actually nearly duplicate. But I wouldn't know because I didn't read the original, I just asked for a summary from someone else that read it.
The NYT and other newspapers don't go after the archived-link providers, probably because a newspaper's scholarly mission includes things like preservation. But they also have a profit motive, or they can't stay in business.
This implicit permission for the archive links to exist, gives some of us the implicit permission to pirate the content.
Disclaimer: I am a happy subscriber to the NYT (and other digital newspapers).
First, Open AI is the one doing the pirating here. Hacker News is the host, they aren't doing any pirating or posting any archival links to the copyrighted information themselves.
Second, Open AI charges subscription fees and profits off of the copyrighted material they have pirated, whereas Hackers News does not, nor do the people who post the links.
As a supporter of piracy in the general case, I tend to agree with your observations, including that pirating NYT (FT, NPR, ...) articles is somehow a different class of offense than, say, stealing a movie or an mp3.
(Books, to me, are separate still, in that I like to have a physical copy (and generally see the authors as humans who deserve compensation, rather than mega-orgs that deserve eternal torment), so I'll frequently use the digital copy as a kind of preview, then purchase it once I see it's a good book I want to read.)
I've only been reflecting on this difference for a few minutes, but, to me, I think the major difference boils down to:
1. Netflix series (movies, albums, etc) are non-essential, fictional works that take a long time to produce - think: fancy chocolates and caviar.
2. News, generally, contains timely, important information - more meat and potatoes.
3. While much of the super-critical news is not paywalled (e.g., product recalls, election dates, COVID stats, etc), a lot of information that is advantageous to know (discussions on interest rates, details on legislation, etc) is paywalled, compounding information asymmetries.
Sure, "stealing bad", but, IMO, someone stealing rice and beans from WalMart to feed their family is a different class of offense than someone robbing a boutique bakery because they can't get enough chocolate cake.
First and foremost, and please repeat after me: Copying is not stealing.
You're not depriving anyone of anything. Unauthorized copying is not theft. There's no equivalency. You can't copy and paste a cake. If you take a cake from a bakery, you're depriving the bakery of a thing. If you take a picture of the bakery's trademarked sign, copy the copyrighted text from its website, and print them out, you haven't stolen anything. Nobody has lost anything. Nothing was damaged. No person, place, or thing was harmed.
Current copyright law is offensively absurd. Patenting of software, effectively eternal content copyrights, ridiculously broken DMCA, music publishers taking 99 cents of every artist's dollar, and so on and so forth.
If you support the dissolution of archaic institutions and broken laws favoring those with entrenched wealth over individual rights, you support piracy.
There is a legitimate case for laws respecting and protecting intellectual property rights. Such laws do not currently exist. These laws do not deserve to be followed or respected, and should be broken as a matter of course. Civil disobedience is called for. Refuse to participate in an exploitative market immovably entrenched in governments all over the world. Pay artists directly and commensurately if you feel they've brought value to your life. Copy whatever you want. Share those copies with whomever you want. Nobody gets hurt. Only conglomerates of already wealthy individuals and corporations are "deprived" of the potential transaction with you that they feel they are entitled to, as a matter of course.
The NYT is just as complicit as any other legacy media institution in the enshittification of journalism and laying waste to the potential value of their content. The "Gray Lady" is not a person, or a valuable institution. It's a soulless corporate construct not deserving of our empathy or high regard simply because of the reputation of human individuals who previously produced quality content. Stop pretending these institutions serve some higher purpose than to fatten the wallets of shareholders.
The good journalists have left. The ones left behind are naive, or are desperately clinging to an illusion of legacy and institutional legitimacy that no longer exists.
All that is left for these media dinosaurs is to leech off the success of others, to use their reserves of wealth and influence to arbitrarily insert themselves into the market, with no regard to the fact that they no longer have value or prestige or purpose in the context of modern technology and communication.
Anyway. Copying isn't theft. Don't give them the linguistic territory. Call a spade a spade, and media companies the desperate corporate leeches that they are.
At least in the US, copyright violation is a civil thing, it's handled by lawsuits. If the copyright violation is of such a small level that it's not worth the copyright owner to do anything about it then nothing's done. In this case it's worth a massive amount of money.
It's also audacious how these news companies reproduce stories from social media and other electronic media about facts that are, like, freely available in nature. Or how they get embargoes and exclusive access to government information, as if they were some kind of information bouncer.
there's quite a big difference between "pirating" digital content and making it available to anyone for free and taking that content and building a for-profit service on top of it, which is what OpenAI are doing, no?
I was just going to post this. Seems quite an obvious and significant distinction, that doesn’t need to provoke all the existential hand wringing. Making money off someone else’s content is a totally different moral and legal case.
I believe the reason many of us tolerate links to news articles and other content is because we believe in equality when it comes to information access. In other words, many of us believe that those who cannot afford a subscription to a paywalled site should still be able to read the articles, in much the same way public libraries allow those who cannot afford to purchase a book the ability to read it.
However, this doesn't apply to organizations that freely share copyrighted information while making money in the process, or to organizations that share copyrighted information in a way that specifically disadvantages or does harm to the original creator of that information.
Because I'm not interested in the medium itself, as I would be with a Netflix show; I'm not even interested really in the article or the New York Times as an institution. I'm interested in discussing the supposed real-life phenomenon being covered, and the posted content is the primer for that discussion. I think if you get rid of the archive links on HN you need to ban the paywalled content as well. If you want to discuss paywalled content I'm sure you can do that in the article's comment section.
>I think that's something worth reflecting on, about why we feel it's OK to pirate news articles, but not other IP.
A lot of that is going to stem from the fact that respect for "journalism" is pretty low. More than 99% of news articles are copies of the <1% of original work that happens in that field. In news, everyone is already lifting content from everyone else.
It's similar to how easy it is to subscribe to the NY Times and then how hard it is to unsubscribe. They require extra steps, and it's well known. So they get what they deserve? Do you see the point? They are lie spreaders, nothing else.
I would broaden the question beyond HN to society as a whole.
In 1990 it would have been considered normal and appropriate to clip an article out of a newspaper and post it on a communal corkboard. What are the key differences between that form of IP and others, and that analogy and the present situation of HN allowing archive links?
Makes sense. If you mail a friend a clipping, or post it on the corkboard, only so many people are going to see it, but then even though posting the "clipping" to HN may feel like the same thing, it's hard to appreciate the massive change in scale.
As for ease of distribution, that might address OP's original question: It's easy to make and click an archive link, but it's a lot more effort to make or find a Pirate Bay link to another form of media, and for someone else to download and view it.
That is an apples-to-oranges comparison. An article about a video or book would have the relevant information in text form without needing to show the video: "here is the new stuff shown in Apple's 2-hour-long WWDC keynote." If not, it is common for a comment in the discussion to give a summary as a tl;dr.
With text articles behind paywalls the relevant information is hidden and only hinted at as a teaser.
To make it an apples to apples comparison, look at submissions where the link submitted is the retail link to the IP. For example, look at all the book link submissions on AMZN...
I wonder if this is because the purpose of linking to a book is to share awareness of that book’s existence - nobody is about to go and read it then and there to comment on its contents. Whereas the purpose of an article is to discuss it now, in the comments - the consumption horizon and bulk of the content is different.
> why we feel it's OK to pirate news articles, but not other IP
Who thinks this? I don't. I think copyright is wrong across the board. I would love if the same pattern of posting archive'd articles held for books, movies, et cetera.
I would love to change my mind on this, as it is a very unpopular opinion to have. But I have _never_ seen a morally or scientifically sound argument in favor of copyright law, and I've spent decades looking.
I think it subsidizes the creation of junk food content (superhero movies and clickbait news, for example) while not contributing anything to the progress of science (paywalled scientific journals and textbooks). I shudder at how much time I have wasted in my life consuming crap attention-grabbing media and advertisements. I like to think that if we lived in a world where everyone could be a publisher if they wanted to, the quality filters would be better, and the information reaching us all would be more likely to be in our best interests.
You speak of "the author". But the current system does not benefit "the author". 1% of authors profit off copyright. 99% lose money on copyright (they pay more for copyrighted media than they earn from it).
Your question should be "How does that benefit monopolist authors"?
I agree, my idea would not benefit monopolist authors. They would lose the bulk of their revenue stream.
But it would benefit the average author whose cost of living would fall and information would start serving them more than serving business.
I am not downplaying the talent and hard work of successful monopolist authors. But I do not think the works they create are worth everyone giving up their rights to reshare and remix information. I believe the world would look very different post-IP. You'd probably have a new profession--small independent librarians (similar to data hoarders today)--who would help their local communities maximize the value they got from humanity's best information.
Maybe I'm wrong! Maybe the information ecosystem is better controlled and the genetic differences of monopolist authors are so stark that without the subsidies to this gifted class we'd all be worse off. But that's an argument based on outcomes and not principles.
> without their permission
The oxygen I'm breathing right now was mostly created by trees on land owned by others, but I don't ask for their permission to breathe. Some things are just not natural.
I am not saying plagiarize. It is always the right thing to do to link back and/or credit the source. But needing to ask permission to republish something seems to go against natural laws.
I think we're suffering from an excluded middle when it comes to this kind of intellectual property. Naturally, most readers want to pay zero. Naturally, owners of the publication think it is probably worth a couple hundred dollars a year to be this well-informed.
The current arms race got us scrapers, and then paywalls, and then ad-blocking archivers ...
But in reality, I might drop a penny to read a NYT article. Maybe a nickel. There's no reasonable way of performing microtransactions right now. Everything is still in hefty increments, so nobody can work out what the market would bear.
Probably because the contents are what's posted, i.e. if someone posted a link to an interesting video behind a paywall/login and there was an easy mirror available, that would be posted too.
If I could just buy one article for the price of a coffee, without entering a bunch of PII or going through a time-wasting process, I would agree on the moral equivalence between the examples.
I believe it's tolerated here based on the site guidelines. I have always thought this was the case because otherwise these posts would all be pay to play which would limit who could participate and turn HN into more of a subscription farm. Maybe the way to make everyone feel ok about it is to disallow links to paywalled content.
> why we feel it's OK to pirate news articles, but not other IP.
Because those who own & produce such news articles asked to make them different. People listened and accepted their requests.
When you make a TV show or a video game, you don't get any protection from the Geneva Conventions and a long list of other international treaties for your rights on stuff other than the content you are producing. The same can't be said when you are producing news.
Oh it's worse than that. The NYT is positing that any neural network that is trained on their data, and can summarize or very closely approximate an article's content on request, is in violation.
This reasoning would presumably apply to any neural network, including one made of neurons, dendrites, and axons. So any human reader of the NYT who is capable of accurately summarizing what they read is an evil copyright violator, and must be "deleted".
Effectively, the NYT legal department is setting the stage for mass murder.
Hyperbole much? There is a difference between a computer and a person. I'm not aware that people generally can be enticed to reproduce full articles verbatim just through questioning.
It won't hold up in court. GPT is a platform mainly providing answers to private individuals who ask. It's like asking a professor a question and having him answer verbatim from copyrighted materials (due to photographic memory), word for word, back to you. Now, if you take this answer and write a book or publish it en masse on blogs, for example, then you are the one who should be sued by NYT. If GPT uses the exact same wording and publishes it to everyone visiting their page, then that is on OpenAI.
I hope people start calling out the "well, it's fine if a human does it" arguments for the rat fuck thinking they are. These are computational systems operating at very large scales, run by some of the wealthiest companies in the world.
If I go fishing, the regulations I have to comply with are very light because the effect I have on the environment is minimal. The regulations for an industrial fishing barge are rightfully very different, even if the end result is the same fish on your plate.
GPT is like a fleet of small fishing boats, each user driving their boat in another direction, not a fishing barge. For every token written by the model there must be a human who prompted, and then consumed it. It is manual, and personal, and deliberate.
In fact, all the demonstrations in the lawsuit PDF were intentionally angling to reproduce copyrighted content. They had to push the model to do it. That won't happen unless users deliberately ask for it; it won't happen en masse.
GPT is operated by one company. If a million people eat your fish, you're still a barge.
Boo hoo they had to push it. That was never the problem with these bullshit nozzles. The issue is they put that stuff in the training set in the first place. If you can't be honest about that then I have no interest in debating this with you.
Unfortunately, that's not the crowd of people here. 80% of the comments under this thread (right now, 2:52 EST) are making similar arguments and *continue* to act like LLMs are doing something unique/creative... instead of just generating sentences, from algorithms, from effectively pirated content in the form of data mining.
The professor having been trained in academia would state the sources of the verbatim quotes. In writing papers he would use references and explicit quotes. There's nothing hidden going on with the professor.
If said professor offered a service where anyone could ask them for information that is behind a paywall, and they provided it without significant transformation, this would certainly be copyright infringement that the copyright holder would have every right and motivation to take action against.
I think only the scale matters here (probably), because I find it hard to believe that a teacher/professor would not be allowed to set up a service where they teach and provide their knowledge to others. That is basically the concept of teaching. Of course, until LLMs, we never had this scale before: millions of potential learners vs. the normal hundreds in a classroom session. So that makes the new case interesting.
"Teaching" by copying source books word for word, would be copyright infringement; see, for example, the well-known issues around photocopying books or even excerpts.
Also lying on source materials (e.g. telling students that some respected historian denies the Holocaust happened, when it's obviously not the case) is not "teaching" - it's defamation, and the NYT is absolutely right to pursue that angle too.
Using LLMs as general-purpose search engines is a minefield, I would not be surprised if the practice disappeared in the next 20 years. Obviously the tech is here to stay, there is no problem when it's applied to augmenting niche work; but as a Google replacement, it has so many issues
> "Teaching" by copying source books word for word, would be copyright infringement; see, for example, the well-known issues around photocopying books or even excerpts.
Incorrect. Educational use helps satisfy one of tests for fair use. Teachers can, in many cases, photocopy copyrighted work without infringing on that copyright.
Educational use is just one of the many factors used to determine whether an instance of copyright infringement is fair use or not, but it is not carte blanche for educators to ignore IP laws just because they're educating.
Teachers can in some very limited cases photocopy very small chunks of copyrighted work. This also varies significantly from country to country; the starting position is that they cannot reproduce works in their entirety.
scale is important here - maybe a better analogy is setting up a paid Spotify clone with all the music sourced from torrents with some slight distortion effect added
I don't think the lawsuit has any merit, but I'd still like to encourage Sam Altman et al, if they really care about the greater good, to go Keyser Söze and immediately release torrents of the weights and source code for GPT-4 under GPL.
AFAIK the IP deal with Microsoft only covers development before AGI.
So at any point OpenAI could declare that a sufficient degree of AGI has been achieved and thus return to its philanthropic mission. With GPLed models and all.
However, at this point the employees each expect a multi-million-dollar cash-out. So the philanthropic mission seems to have gone out the window.
And probably that's also the way Sam Altman got back into the CEO role: by maximizing the expected eventual cash-out for the employees, who threatened to leave otherwise.
The lawsuit fundamentally has merit. It asks a huge open question that no one knows the answer to. The outcome will be extraordinarily impactful. The question must be answered at some point.
The case has merit even if NYT loses across the board.
Agreed. In the same way the Colorado Supreme Court ruled Trump can't be on the ballot in order to force SCOTUS to rule, I think the same reasoning applies here: get an answer earlier rather than later.
They're not just displaying the contents of copyrighted works publicly, they're selling access. This flips the script for the first factor of the fair use test. Additionally, by selling it to people who use it to get news summaries, you can argue they damage the market for a NY Times subscription, which triggers the fourth factor.
I don't think anybody has claimed that OpenAI is causing NYT subscriptions to go up. NYT has even expressly made the claim they're losing potential revenue.
> [1] On the most important factor, possible economic damage to the copyright owner, [Judge] Chin wrote that "Google Books enhances the sales of books to the benefit of copyright holders."
I am of the opposite opinion. I think it is unreasonable to train AI on copyrighted information without permission from the copyright holder, at least if it is done to create a proprietary product. It is probably unreasonable even if the end result were not proprietary, but at least that would benefit the world more.
It's obviously a frivolous suit that will at best net a ceremonial victory for the NY Times: an 8-figure payout at most and a promise not to use NYT material in the future.
The gap in trajectory and value to society between OpenAI and the NY Times could not be greater. The Times has won no favor in the court of public opinion with its frequent misinformation. It's all just a big waste of time, the last of the old guard flailing against the march of progress.
And even hypothetically, if they managed to get OpenAI to delete ChatGPT, they'd be hated forever.
ChatGPT only advertises itself as a fancy autocomplete. There is a disclaimer that it may produce output that appears correct but isn't. The NYT's written material purports to be the truth, so obviously it shouldn't be held to the same standards as a generative AI.
I think what we should focus on is the volume of misinformation in general, not the provenance of it.
The NYT may produce misinformation but it aims not to, and its staff of human writers are limited in the quantity that they can produce. They also publish corrections.
GPT enables anyone who can pay to generate a virtually unlimited volume of misinformation, launder it into 'articles' with fake bylines and saturate the internet with garbage.
Well that's true for any large language model. As long as they exist there will be a deluge of bot written text producible for any purpose. At this point there is no getting the cat back into the bag.
In that case the bigger danger is open-source LLMs. OpenAI at least monitors the use of its endpoints for obvious harm.
> The NYT may produce misinformation but it aims not to, and its staff of human writers are limited in the quantity that they can produce. They also publish corrections.
Except when it affects their bottom line, of course: they publicly lied about how meta tags work during the lawsuits against Google to get more money (like most newspapers did). And I have no doubt that they will lie extensively once again about how LLMs really work.
On the other hand, the NYT website willingly gave out all the information without imposing limitations. Seeing the terms of service requires visiting a separate page; they aren't shown immediately upon visiting the website. Understanding and accepting the terms also requires human interaction.
The robots.txt on nytimes.com now disallows crawling by GPTBot, so there's an argument against automated information acquisition from some point onward, but before that point they weren't explicitly against it.
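For reference, the record in question looks roughly like this (paraphrased from memory rather than copied verbatim from their file):

    User-agent: GPTBot
    Disallow: /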
> Seems weird to argue that you have to speak up if you don’t want something done to you or else you consent to everything.
If you don't want people on your land, setting up even a small fence creates an explicit indication of limits, just like the robots.txt record I mentioned earlier.
The New York Times also doesn't limit article text if you just request the HTML, which is typical for automated cases. The limits are imposed only on users viewing the pages in a browser with JavaScript, CSS and everything else (see the rough sketch after this list). So they clearly:
1. Have a way to determine the user's eligibility for reading the full article on server side.
2. Don't limit the content for typical automated cases on server side.
3. Have a way to track the activity of non-logged-in users to determine eligibility for access. So it's reasonable to assume they had records of repeated access from the same origin, but didn't impose any limits until some point.
So there are enough reasons to think that robots are welcome to read the articles fully. I'm not talking about copyright violations here, only about the ability to receive the data.
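Here's a minimal sketch of what such a plain request sees. The URL is a made-up placeholder, and whether the full article text actually appears in the raw HTML is the claim above; this snippet only shows how a simple crawler would fetch it, with no JavaScript executed:

    # Fetch a page the way a simple crawler would: plain HTTP, no JavaScript.
    # The URL below is a hypothetical placeholder, not a real article.
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.nytimes.com/2024/01/01/us/example-article.html"
    resp = requests.get(url, headers={"User-Agent": "plain-http-client/1.0"}, timeout=30)

    # Count how much paragraph text is present before any client-side paywall code runs.
    soup = BeautifulSoup(resp.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    print(resp.status_code, len(paragraphs), "paragraphs,",
          sum(len(p) for p in paragraphs), "characters")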
In the Sony v. Universal case, Sony is the producer of a tool that the consumer uses to "time-shift" a broadcast they are legally allowed to view. Similarly, you can rip your own CDs or photocopy your own books. That case never made reselling such content legal. OpenAI does not train ChatGPT on content you own - they train it on some undisclosed amount of data that you may or may not have a legal right to access, and then (as has been shown) reproduce it nearly verbatim - they may even charge you for the pleasure.
So presumably, when they fix that issue (which, if the text matches exactly, should be trivially easy), would you accept that as a sufficient remedy?
Copyright infringement is not avoided by changing some text so it isn’t an exact clone of the source.
Determining whether a work violates a copyright requires holistic consideration of the similarity of the work to the copyrighted material, the purpose of the work, and the work’s impact on the copyright holder.
There is no algorithm for this; cases are decided by people.
There are algorithms that could detect obvious violations of copyright, such as the one you suggest which looks for exact matches to copyrighted material. However, there are many potential outputs, or patterns of output, which would be copyright violation and would not be caught by this trivial test.
I certainly don't think it's impossible, but I think it is a hard problem that won't be solved in the immediate future, and creators of data used for training are right to seek to stop wide availability of LLMs that regurgitate information they worked hard to obtain.
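As a rough illustration of that trivial exact-match test: flag any output that shares long verbatim word runs with a reference corpus. The "corpus" and "output" strings below are made up for the example; a real system would index vastly more text and normalise punctuation:

    # Flag model output that shares long verbatim word runs with a reference corpus.
    def ngrams(text, n=8):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    corpus = ("the quick brown fox jumps over the lazy dog "
              "while the cat watches from the fence")
    output = ("as reported, the quick brown fox jumps over the lazy dog "
              "while the cat naps")

    overlap = ngrams(corpus) & ngrams(output)
    print("possible verbatim copying" if overlap else "no long exact matches", overlap)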
I think it will be a bit easier than you believe. The reason why it hasn’t been done yet is that there hasn’t been a compelling economic reason to do so.