
Isn't the fundamental issue here that the NYT was available in Common Crawl?

If they didn't want to share their content, why did they allow it to be scraped?

If they did want to share their content, why do they care (hint: $88 billion)?

Or is it that they wanted to share their content with Google and other search engines in order to bring in readers but now that an AI was trained on it they are angry?

What wrong thing did OpenAI do specific to using Common Crawl?

Didn't most companies use Common Crawl? Excepting Google, who had already scraped the whole damn Internet anyway and just used their search index?

Is it legal or not to scrape the web?

If I scrape the web, is it legal to train a transformer on it? Why or why not?

To me, this is an incredibly open-and-shut case. You put something on the web, people will read that something. If that is illegal, Google is illegal.

Oh, and do you see the part in the article where they are butthurt that it can reproduce the NYT style?

> "Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples," the suit alleges.

Mimics its expressive style. Oh golly the robots can write like they're smug NYT reporters now--better sue!

It appears that the NYT changed their terms of service in August to disallow their content in Common Crawl[0]. Wasn't GPT-4 trained well before August?

[0]: https://www.adweek.com/media/the-new-york-times-updates-term...



If you read the complaint, it explains this pretty well. The use of copyrighted content by search engines is fundamentally different from the way LLMs use that same content. The former directs traffic (and therefore $$) to the publisher, the latter keeps the traffic for itself.

The legal misconception I want to flag in your logic is the notion that all uses of the Common Crawl are equally infringing/non-infringing. If you use the Common Crawl to create a list of how often every word in English appears on the internet, that’s unquestionably transformative use. But if you use it to host a mirror of the NYT website with free articles, that’s definitely infringement. The legality of scraping is one matter, and the legality of what you do with the scraped content is quite another.


From my original comment:

> Is it legal or not to scrape the web?

> If I scrape the web, is it legal to train a transformer on it? Why or why not?

At no point did I say anything about hosting a mirror of the NYT website, with free articles. Obviously. Because OpenAI didn't do that. Some NYT lawyer tried to get ChatGPT to write a NYT article. Maybe first they should have actually done a Google search and shut down some of the actual content farms which simply copy NYT content such as [0]. But instead, we get this.

[0]: https://salaminv.com/news_file/



