
The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won’t matter, for so many reasons. To enumerate a few:

1. Next-gen LLMs will be trained exclusively on “synthetic”/public data. GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct (say, reworded by 40%, authors/sources stripped, etc.). Ergo there will be no copyrighted material for GPT-5 to regurgitate.

2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.

3. Models can share weights, merge together, cooperate, ablate, evolve over many generations (releases), etc. Copyright law is woefully ill-equipped to chase down violators in this AI lineage soup, annealed with data of dubious/unknown provenance (see the merging sketch below).
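
To make the merging point concrete, here is a minimal sketch of naive weight averaging between two checkpoints with identical architectures, using PyTorch; the filenames and the 50/50 blend are hypothetical, and real merging pipelines are more involved.

    # Naive linear interpolation of two compatible checkpoints (hypothetical
    # filenames). Once weights are blended like this, tracing which training
    # data contributed what becomes very hard.
    import torch

    def merge_state_dicts(path_a, path_b, alpha=0.5):
        sd_a = torch.load(path_a, map_location="cpu")
        sd_b = torch.load(path_b, map_location="cpu")
        return {name: alpha * sd_a[name] + (1.0 - alpha) * sd_b[name]
                for name in sd_a}

    merged = merge_state_dicts("model_a.pt", "model_b.pt")
    torch.save(merged, "merged.pt")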

I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al are coming along for the ride, whether they like it or not.



I'm sorry but this is such a bad take. Nice appeal to consequences. In my view, the New York Times is entirely justified in pursuing legal action. They invested time and effort in creating content, only to have it used without permission for monetary gain. A clear violation.

Analyzing the factors involved for a "fair use" consideration:

Purpose and Character of the Use: While the argument for transformation might hold in the future, as you point out, the current dispute revolves around verbatim use, so it is clearly not transformative. Also, commercial use is less likely to be ruled fair use.

Nature of the Copyrighted Work: Using works that are more factual may be more likely to be considered fair use, but I would argue that NYT articles are as much creative as they are factual.

Amount and Substantiality of the Portion Used: In this case, the entirety of the articles was used, leaving no room for a claim of using an insignificant portion.

Effect on the Market Value: NYT isn't getting any money from this, and it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.

IANAL, but in my opinion NYT is well within its rights to pursue legal action. Progress is inevitable, but as humans, we must actively shape and guide it. Otherwise it cannot be called progress. In this context, legal action serves as a necessary means for individuals and organizations to assert their rights and influence its course.


I don’t think the original point being made was that NYT wasn’t justified in bringing the action. The point was that the suit would ultimately be meaningless in the long term even if it was successful in the short term. There is a potentially more significant risk in the future that this suit will not protect against, for the reasons enumerated by the author. While the author is speculating, the law struggles to adapt to technological change, which makes their prediction useful: it highlights problems that are coming and that can’t be readily mitigated through legal precedent.


Correct, a_wild_dandan argues that the outcome of this suit makes no pragmatic difference.


> it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.

People are not using ChatGPT as a replacement for current news, and because of hallucinations, no one should be using it for past news either. I wouldn't remotely call ChatGPT a competitor for NYT traffic the way I would Reuters or other news outlets.


The intended result is clearly to supplant other information sources in favor of people getting their information from ChatGPT. Why should it matter to legality that the tech isn't good enough for the goal?


> Why should it matter to legality that the tech isn't good enough for the goal?

Because if it is not good enough, then it is not a market substitute.

The law cares whether it is a market substitute and whether there are damages. If it sucks, then there aren't damages, which matters for the fourth factor of fair use.


Imo gpt itself is the transformative work.


Ok but it's not


Definition of Transformative Use: The legal concept of transformative use involves significantly altering the original work to create new expressions, meanings, or messages. AI models like GPT don't merely reproduce text; they analyze, interpret, and recombine information to generate unique responses. This process can be argued as creating new meaning or purpose, different from the original works.

In the case of the famous screenshot, the AI just relayed information it found on the web; it's not included in its training data.

So you're just wrong.


Nope, it doesn't work that way. The fact that the LLM can regurgitate original articles doesn't remove the possibility that training can be considered transformative work, or more generally that using copyrighted material for training can be considered fair use.

Rather, verbatim reproduction is proof that copyrighted material was used. Then the court has to evaluate whether it was fair use. Without verbatim reproduction, the court might just say that there is not enough proof that the Times's work was important for the training, and dismiss the lawsuit right away.

Instead, the jury or court now will almost certainly have to evaluate OpenAI's operation against the four factors.

In fact, I agree with the parent that ingesting text and creating a representation that can critique historical facts using material that came from the Times is transformative. An LLM is not just a set of compressed texts; people have shown, for example, that some neurons fire when you are talking about specific historical periods or locations on Earth.
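
As a rough, toy illustration of that "neurons fire" observation, one can probe a single hidden unit's activation across prompts with Hugging Face transformers; the model, layer, and neuron index below are arbitrary placeholders for illustration, not a published finding.

    # Compare one hidden unit's mean activation across two prompts (toy probe).
    import torch
    from transformers import AutoModel, AutoTokenizer

    MODEL, LAYER, NEURON = "gpt2", 6, 123  # placeholders, not a known result

    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModel.from_pretrained(MODEL, output_hidden_states=True).eval()

    def neuron_activation(text):
        with torch.no_grad():
            out = model(**tok(text, return_tensors="pt"))
        return out.hidden_states[LAYER][0, :, NEURON].mean().item()

    for prompt in ("The French Revolution began in 1789.",
                   "My cat likes to sleep on the couch."):
        print(prompt, "->", round(neuron_activation(prompt), 4))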

However, I don't think that the transformative character is enough to override the other factors, and therefore in the end it won't/shouldn't be considered fair use IMHO.


What if the LLM is running locally and doing all of these things rather than hosted on a webserver which is serving the content?


It doesn't matter; if everything else stays the same, what matters is what it's used for. If it's used to make money, that would certainly hurt claims of fair use, maybe not for those that do the training, but for those that use it.


> If it's used to make money, it would certainly hurt claims of fair use

What if a human manually searches all those articles and transcribes / summarizes them to me in the way ChatGPT did?


It might also be considered copyright violation, after evaluating the four fair use factors.


Only humans can do those things, so the test fails for an LLM.


> rent seeking media companies

Rent seeking? Media companies that actually create content are rent seeking? Versus the garbage hallucinations AI creates?


Rent seeking is an awful term that was from the beginning intended to describe anyone pursuing a political or legal goal that deviates from a pure free market economy. As Econlib writes:

> ”Rent seeking” is one of the most important insights in the last fifty years of economics and, unfortunately, one of the most inappropriately labeled. Gordon Tullock originated the idea in 1967, and Anne Krueger introduced the label in 1974. The idea is simple but powerful. People are said to seek rents when they try to obtain benefits for themselves through the political arena. They typically do so by getting a subsidy for a good they produce or for being in a particular class of people, by getting a tariff on a good they produce, or by getting a special regulation that hampers their competitors. Elderly people, for example, often seek higher Social Security payments; steel producers often seek restrictions on imports of steel; and licensed electricians and doctors often lobby to keep regulations in place that restrict competition from unlicensed electricians or doctors.

https://www.econlib.org/library/Enc/RentSeeking.html

This is linked in the wikipedia article, which is even more confused:

https://en.wikipedia.org/wiki/Rent-seeking


No, it dates back to Adam Smith’s conception of rents derived from land-ownership as a parasitic drag on economies (about which he was entirely correct). This concept was later extended to a whole host of other forms of monopolization, some state-granted and some market-derived. In the case of U.S. copyright, we can look at its original terms (quite limited) and see that its current incarnation is more harmful than beneficial to most people.


The New York Times is a dying company that is rent seeking here. A long time ago, their content was valuable, yet now you can't even give it away to researchers.

I know because they tried to make a deal with my company; we passed because social media data is infinitely more valuable.


You don't seriously want to tell me that 280-character garbage on Twitter is more useful to me than actual journalism, do you?

Maybe their data isn't as valuable to e.g. advertisers as the data their audience shouted into the internet themselves (guess what), but the thing they've actually been selling for a long time now, journalism, can't be dying that fast, considering we're both on a website that in large part consists of discussing journalism.


Because its usefulness to your private jet fund is the only measurement of value.


To me, your comment only reinforces the point that NYT's content is actually valuable, rather than that NYT is rent seeking. But maybe you can give a bit more detail.


> 2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.

Sorry, is this the same China that has already introduced their own sweeping regulations on AI? Which in at least one instance forced a Chinese startup to shut down their newly launched chatbot because it said things that didn't align with the party's official stance on the war in Ukraine?

https://finance.yahoo.com/news/beijing-tries-regulate-china-...

https://nitter.unixfox.eu/CDT/status/1625936306814717952?337...

I don't disagree that research/hosting/progress will continue, but I'm not so sure that it's China who stands to benefit from the US adding some guardrails to this rollercoaster.


Are media really rent-seeking? They create new content and analysis, for which they want to be compensated. It seems quite different to hoarding natural resources or land, for example.


> It seems quite different to hoarding natural resources or land

Indeed, it is quite different, because those things are scarce physical things in the real world. Intellectual property is a scam, and killing it once and for all will be one of the best things to come out of the current AI hype cycle. Nobody will "own" ideas, pieces of information, or strings of bytes.


Interesting. So as a hobby photographer I should only publicly release physical prints? An interesting idea.


Rule 1 of the Internet: If you put it on the Internet, it's not yours anymore.

You don't have to agree with it. You don't have to like it. But if you accept it and live by it, it's much harder to get burned.


Rule 1 of the internet is "don't talk about /b/."


About your first point: you can't possibly know that future models will be trained exclusively on synthetic data without any hit to performance. It is also not easy to reword an entire copyrighted training corpus without introducing errors or hallucinations. And you assume that this is just a fact?

Your second point reminds me a bit of 'War with the Newts' where humanity arms a race of sentient salamanders until they overthrow humanity. How could we not arm our newts if Germany might be arming theirs?

I also think basically everything else you wrote is wrong.


If Microsoft doesn't get royalty free rights to resell access to everyone's content on demand, China will become the powerhouse of interference-free media? Rrrrrright....


I think it can be simultaneously true that NYT is accurate in their complaint, that there is no legal remedy for this, and that there shouldn't be one.

There are plenty of large companies in other sectors that acknowledge there are limited legal remedies for them if someone copies some aspect of their business or name.


This is the actual truth. Where it hurts is citing the data, but GPT-4 doesn't do that to start with unless the answer comes directly from a web result rather than the weights.


> GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct

Is that just by increasing the temperature, tweaking the prompt, etc.? If you can operate on the raw weights and recreate the original text, copyright infringement still applies.
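
For what it's worth, temperature only reshapes the sampling distribution over next tokens; whatever the weights have memorized is untouched, so verbatim text can still be recovered. A toy illustration with made-up logits:

    # Softmax with temperature over hypothetical next-token scores.
    import numpy as np

    def softmax_with_temperature(logits, temperature):
        scaled = np.array(logits, dtype=float) / temperature
        scaled -= scaled.max()  # numerical stability
        probs = np.exp(scaled)
        return probs / probs.sum()

    logits = [5.0, 2.0, 1.0]  # made-up scores for three candidate tokens
    for t in (0.2, 1.0, 2.0):
        print(f"T={t}:", np.round(softmax_with_temperature(logits, t), 3))

Higher T flattens the distribution (less verbatim recall per sample); lower T sharpens it, but in both cases the underlying weights are the same.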



