Isn't the fundamental issue here that the NYT was available in Common Crawl? If ...

rfw300 · on Dec 28, 2023

If you read the complaint, it explains this pretty well. The use of copyrighted content by search engines is fundamentally different from the way LLMs use that same content. The former directs traffic (and therefore $$) to the publisher; the latter keeps the traffic for itself.

The legal misconception I want to flag in your logic is the notion that all uses of the Common Crawl are equally infringing/non-infringing. If you use the Common Crawl to create a list of how often every word in English appears on the internet, that’s unquestionably transformative use. But if you use it to host a mirror of the NYT website with free articles, that’s definitely infringement. The legality of scraping is one matter, and the legality of what you do with the scraped content is quite another.

ctoth · on Dec 28, 2023

From my original comment:

> Is it legal or not to scrape the web?

> If I scrape the web, is it legal to train a transformer on it? Why or why not?

At no point did I say anything about hosting a mirror of the NYT website with free articles. Obviously. Because OpenAI didn't do that. Some NYT lawyer tried to get ChatGPT to write a NYT article. Maybe first they should have actually done a Google search and shut down some of the actual content farms that simply copy NYT content, such as [0]. But instead, we get this.

[0]: https://salaminv.com/news_file/