Hacker News

People who think the examples in the lawsuit are “fair use” need to consider what that would mean. We’re basically going to let a few companies consolidate all the value on the Internet into their black boxes with basically no rules … that seems very dangerous to me.

I hope a court establishes some rules of engagement here, even if it’s not this case.



I see the exact opposite - any open source model is going to become prohibitively expensive to train if quality data costs billions of dollars. We’re going to be left with the OpenAIs and Googles of the world as the only players in the space until someone solves synthetic data.


Exactly this. I work at a small web scraping company (so I might be a bit biased), and today any small business can collect a fair, capable dataset of public data for model training, sentiment analysis, or whatever. If copyright blocks the use of public data, as this lawsuit implies it should, that would just mean only giant corporations and pirates could afford this.

This would be a huge blow to open-source and research developers, and I'd even argue it could hand OpenAI a bit of a moat, à la regulatory capture.


Research is fair use, and providing something amazing like Wikipedia is arguably educational (again, fair use). Reselling NYT articles on-demand via an API is by itself neither, so likely not fair use.


Fair use is irrelevant here, as no small business would ever risk being dragged through court even when they are in the right. Especially since breaking ToS and "business damage" are the easiest claims to attach to any lawsuit in the digital space.


You may remember the Google Books lawsuit where Google was digitally copying the entirety of books and making them available online.

Google won that suit on fair-use grounds: the massive searchable database was found to be transformative, and the use was non-commercial in nature.

So: if your web scraping company's goal is to allow people to bypass a paywall, I suspect you'll have trouble in the future. If your web scraping company instead, say, allows people to do market analysis on how many people need a piano tuner in NYC, and it doesn't do that by copying a NYT article doing original research, I think you'll be fine.


This feels like a 1996 "music is too expensive for kids so they HAVE to pirate it."


NYT is seeking billions of dollars - I’m not sure that’s a fair comparison.


I do not pretend to have any idea what the sum total of NYT content is worth, but we will see what a jury/judge decides.


Scraping is legal, and this seems like a transformative work to me.


Returning the full text of an article verbatim seems to me like the opposite of "transformative."


In the screenshot for the article you can see that the LLM says it is "Searching for: carl zimmer article on the oldest DNA". That, and what I know about how LLMs work, suggest to me that rather than the article being stored inside the trained LLM it was instead downloaded in response to the question. So the fact that the system is providing the full text of the article doesn't really go to whether training the LLM is a transformative use or not.


Yes, the screenshot in the article is clearly doing an Internet search. But the exhibit in the lawsuit shows that you can complete an article by prompting GPT with its first sentence, with low temperature to aid reproducibility, and obtain the original except for a single word. That is a different matter, and it shows that the LLM has essentially recorded the original text into its weights in compressed form: https://pbs.twimg.com/media/GCY4WC6XYAAq-JS?format=jpg&name=...


It would be interesting to test this on a larger sample than just a few articles. It is hard to believe that a majority of NYT articles are stored verbatim in the weights of a web-wide LLM, but if that is the case, it would be a pretty unbelievable revelation about their ability to compress an entire web’s worth of data. More likely, I assume it is a case of overfitting, or simply of finding a prompt that happened to work well.

FWIW, I can’t replicate on either GPT 3.5 or 4, but it may be that OpenAI has added new measures to prevent this.


I have attempted this sort of thing with GPT 3.5 many times and never been successful, although I've still never been taken off the GPT4 waiting list that I signed up for months ago, and I'm not going to subscribe without trying it first. I [and presumably many thousands of others] have tried things like this with many LLMs and image generating models, but to my knowledge we've come up rather short. I've never managed to recreate anything verbatim, and I have struggled to get anything resembling a copyright infringement out of Stable Diffusion, with the sole exception of a meme image of Willy Wonka.

That said, the meme image of Willy Wonka comes out of Stable Diffusion 1.5 almost perfectly with surprising frequency. Then again, this is probably because it appeared hundreds or thousands of times in the training set in all sorts of contexts, since it's such a popular meme. There is a tension between its status as an integral part of language and its nature as a copyrighted screen grab.


You can't reproduce this on the web interface, because the temperature settings are higher than what's required to recall the compressed text verbatim. You need to use the API.

However, I had good luck reproducing poems on GPT 3.5, both copyrighted and not copyrighted, because the choice of words is a lot more "specific" so to speak, and therefore higher temperature isn't enough to prevent complete reproduction of the originals. See https://chat.openai.com/share/f6dbfb78-7c55-4d89-a92e-f4da23... (Italian; the second example is entirely hallucinated even though a poem with that title exists, while the first and third are recalled perfectly).
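To illustrate why low temperature aids reproducibility, here is a minimal, self-contained sketch of temperature sampling (my own toy code, not OpenAI's implementation): the model's logits are divided by the temperature before the softmax, so as temperature approaches zero, sampling collapses to deterministic argmax decoding, and any memorized continuation comes out the same way every time.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index from raw logits after temperature scaling.

    temperature == 0 is treated as greedy (argmax) decoding; higher
    temperatures flatten the distribution and add randomness, which is
    why verbatim recall of memorized text breaks down at the defaults
    used by chat interfaces.
    """
    if temperature == 0:
        # Greedy decoding: always pick the highest-logit token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# At temperature 0 the same token wins on every call.
logits = [2.0, 1.0, 0.5]
picks = [sample_with_temperature(logits, 0) for _ in range(5)]
```

In this toy setup, `picks` is always `[0, 0, 0, 0, 0]`, while at temperature 1 or above the other tokens get sampled with non-trivial probability.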


It doesn’t seem that surprising; compared to entire NYT articles, poems are short, structured and more likely to be shared in multiple places across the web.

I’m more surprised that it can repeat 100 articles; if that behaviour is consistent in larger sample sizes and beyond just NYT dataset (which might be repeated on the web more than other sources, causing overfitting), that would be impressive.

You could imagine at some point a large enough GPT5 or 6 or 7 will be able to memorize verbatim every corner of the web.


That's not what "transformative" means for copyright.

It's more like, is the new work a distinct expression, e.g. satire or commentary, based on the original.

You can reproduce the original verbatim and still be transformative by adding an element of critique.

Example: https://www.dmca.com/articles/akilah-obviously-vs-sargon-of-...


I don’t think the examples shown reflect an element of critique.


The opposite is also concerning. IP law has always been convoluted, messy, contradictory, and morally ambiguous. The complaints of IP violation by LLMs are simply taking these inherent flaws and making them immediately obvious, forcing decisions that ultimately will set precedents on the legality of human thought that I don’t think anyone will be comfortable with. People understandably see OpenAI and Microsoft as potentially dangerous to be given so much leeway, but fail to consider on the flip side companies like Disney who have already more or less dictated the majority of copyright law for decades now. They must be chomping at the bit at the legal precedents potentially coming down the pipeline that call into question the ability to interact with any kind of media or information at any level without potentially being on the hook monetarily.

I think all this is doing is making us realize that we have built a massive economic system on a fundamentally flawed idea of ownership over ideas, and the only two solutions will be to tear up the rule book, which will be extremely painful, or double down, which will be fatal.


A court has established this already:

in Japan, where they said anything goes for AI.

So it's best not to lose a competitive edge over things that people openly publish on the internet. If you put it out there for everyone to see, then expect other people to use it.


A court in Japan will have no impact on the outcome of a copyright lawsuit in USA. Not to mention that it doesn't really matter how a Japanese court ruled since it's all governed by treaties anyway. They will change their laws if required to.


It's not about applying laws across different countries.

It's about precedent. If you don't keep up with international competition, you lose.


Japan has the right idea about this matter.



