The NYT is preparing for a tsunami by building a sandcastle. Big picture, this suit won’t matter, for so many reasons. To enumerate a few:
1. Next gen LLMs will be trained exclusively on “synthetic”/public data. GPT-4V can easily whitewash its entire copyrighted training corpus into something unrecognizably distinct (say, reworded by 40%, with authors/sources stripped). Ergo there will be no copyrighted material for GPT-5 to regurgitate.
2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.
3. Models can share weights, merge together, cooperate, ablate, and evolve over many generations (releases). Copyright law is woefully ill-equipped to chase down violators in this AI lineage soup, annealed with data of dubious or unknown provenance.
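To make point 3 concrete: the simplest form of model merging is plain parameter averaging (the "model soup" approach), sketched below with toy NumPy arrays standing in for real weight tensors. The function and layer names are hypothetical; real merges are more elaborate, but the provenance problem is the same: the merged weights retain no record of either parent's training data.

```python
import numpy as np

def merge_weights(model_a, model_b, alpha=0.5):
    """Linear interpolation of two models' parameters, keyed by layer name.
    The merged tensors carry no record of which training data produced
    either parent -- the provenance problem described above."""
    return {name: alpha * model_a[name] + (1 - alpha) * model_b[name]
            for name in model_a}

# Two toy "models" with identical architectures (same layer names/shapes).
a = {"layer0": np.zeros((2, 2)), "layer1": np.ones(3)}
b = {"layer0": np.ones((2, 2)), "layer1": 3 * np.ones(3)}

merged = merge_weights(a, b)  # each tensor is the element-wise midpoint
```

After a few generations of such merges (plus fine-tuning on outputs of other models), attributing any particular capability or memorized passage to a specific copyrighted source becomes genuinely hard.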
I could go on, but the point is that, for better or worse, we live in a new intellectual era. The NYT et al are coming along for the ride, whether they like it or not.
I'm sorry, but this is such a bad take. Nice appeal to consequences. In my view, the New York Times is entirely justified in pursuing legal action. They invested time and effort in creating content, only to have it used without permission for monetary gain. A clear violation.
Analyzing the factors involved for a "fair use" consideration:
Purpose and Character of the Use:
While the argument for transformation might hold in the future, as you point out, the current dispute revolves around verbatim use, which is clearly not transformative. Commercial use is also less likely to be ruled fair use.
Nature of the Copyrighted Work:
Works that are more factual are more likely to be considered fair use, but I would argue that NYT articles are at least as creative as they are factual.
Amount and Substantiality of the Portion Used:
In this case, the entirety of the articles was used, leaving no room for a claim of using an insignificant portion.
Effect on the Market Value:
NYT isn't getting any money from this, and it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.
IANAL, but in my opinion NYT is well within its rights to pursue legal action. Progress is inevitable, but as humans, we must actively shape and guide it. Otherwise it cannot be called progress. In this context, legal action serves as a necessary means for individuals and organizations to assert their rights and influence its course.
I don’t think the original point was that the NYT wasn’t justified in bringing the action. The point was that the suit will be ultimately meaningless in the long term even if it succeeds in the short term: there is a potentially more significant future risk that this suit cannot protect against, for the reasons the author enumerated. The author is speculating, but the law struggles to adapt to technological change, which makes the prediction useful: it highlights coming problems that can’t be readily mitigated through legal precedent.
> it's clearly not helping their market value if people are checking on ChatGPT instead of reading a NYT article.
People are not using ChatGPT as a replacement for current news, and because of hallucinations, no one should be using it for past news either. I wouldn't remotely call ChatGPT a competitor of NYT traffic, like I would Reuters or other news outlets.
The intended result is clearly to supplant other information sources in favor of people getting their information from ChatGPT. Why should it matter to legality that the tech isn't good enough for the goal?
> Why should it matter to legality that the tech isn't good enough for the goal?
Because if it is not good enough, then it is not a market substitute.
The law cares whether it is a market substitute and whether there are damages. If it sucks, then there aren't damages, which matters for the fourth factor of fair use.
Definition of Transformative Use: The legal concept of transformative use involves significantly altering the original work to create new expressions, meanings, or messages. AI models like GPT don't merely reproduce text; they analyze, interpret, and recombine information to generate unique responses. This process can be argued as creating new meaning or purpose, different from the original works.
In the case of the famous screenshot, the AI simply relayed information it found on the web; the text wasn't coming from its training data.
Nope, it doesn't work that way. The fact that the LLM can regurgitate original articles doesn't rule out the possibility that training is considered transformative work, or more generally that using copyrighted material for training is considered fair use.
Rather, verbatim reproduction is proof that copyrighted material was used. The court then has to evaluate whether that use was fair. Without verbatim reproduction, the court might simply say there is not enough proof that the Times's work was important for the training, and dismiss the lawsuit right away.
Instead, the jury or court now will almost certainly have to evaluate OpenAI's operation against the four factors.
In fact, I agree with the parent that ingesting text and creating a representation that can critique historical facts using material that came from the Times is transformative. An LLM is not just a set of compressed texts; people have shown, for example, that some neurons fire when you are talking about specific historical periods or locations on Earth.
However, I don't think the transformative character is enough to override the other factors, and therefore in the end it won't/shouldn't be considered fair use, IMHO.
It doesn't matter: if everything else stays the same, what matters is what it's used for. If it's used to make money, that would certainly hurt a claim of fair use, maybe not for those who do the training, but for those who use it.
Rent seeking is an awful term that was from the beginning intended to describe anyone pursuing a political or legal goal that deviates from a pure free market economy. As Econlib writes:
> ”Rent seeking” is one of the most important insights in the last fifty years of economics and, unfortunately, one of the most inappropriately labeled. Gordon Tullock originated the idea in 1967, and Anne Krueger introduced the label in 1974. The idea is simple but powerful. People are said to seek rents when they try to obtain benefits for themselves through the political arena. They typically do so by getting a subsidy for a good they produce or for being in a particular class of people, by getting a tariff on a good they produce, or by getting a special regulation that hampers their competitors. Elderly people, for example, often seek higher Social Security payments; steel producers often seek restrictions on imports of steel; and licensed electricians and doctors often lobby to keep regulations in place that restrict competition from unlicensed electricians or doctors.
No, it dates back to Adam Smith’s conception of rents derived from land-ownership as a parasitic drag on economies (about which he was entirely correct). This concept was later extended to a whole host of other forms of monopolization, some state-granted and some market-derived. In the case of U.S. copyright, we can look at its original terms (quite limited) and see that its current incarnation is more harmful than beneficial to most people.
The New York Times is a dying company that is rent seeking here.
A long time ago, their content was valuable; now you can't even give it away to researchers.
I know because they tried to make a deal with my company, we passed because social media data is infinitely more valuable.
You don't want to seriously tell me that garbage on Twitter in 280 characters is more useful to me than actual journalism, do you?
Maybe their data isn't as valuable to, e.g., advertisers as the data their audience shouted into the internet themselves (guess what), but the thing they've actually been selling for a long time now, journalism, can't be dying that fast, considering we're both on a website that in large part consists of discussing journalism.
To me, your comment only reinforces the point that NYT's content is actually valuable, rather than that the NYT is rent seeking. But maybe you can give a bit more detail.
> 2. Research/hosting/progress will proceed. The US cannot stop this, only choose to be left behind. The world will move on, with China gleefully watching as their biggest rival commits intellectual suicide all to appease rent seeking media companies.
Sorry, is this the same China that has already introduced their own sweeping regulations on AI? Which in at least one instance forced a Chinese startup to shut down their newly launched chatbot because it said things that didn't align with the party's official stance on the war in Ukraine?
I don't disagree that research/hosting/progress will continue, but I'm not so sure that it's China who stands to benefit from the US adding some guardrails to this rollercoaster.
Are media really rent-seeking? They create new content and analysis, for which they want to be compensated. It seems quite different to hoarding natural resources or land, for example.
> It seems quite different to hoarding natural resources or land
Indeed, it is quite different, because those things are scarce physical things in the real world. Intellectual property is a scam, and killing it once and for all will be one of the best things to come out of the current AI hype cycle. Nobody will "own" ideas, pieces of information, or strings of bytes.
About your first point: you can't possibly know that future models can be trained exclusively on synthetic data without any hit to performance. It is also not easy to reword an entire copyrighted training corpus without introducing errors or hallucinations. And you state this as if it were fact?
Your second point reminds me a bit of 'War with the Newts' where humanity arms a race of sentient salamanders until they overthrow humanity. How could we not arm our newts if Germany might be arming theirs?
I also think basically everything else you wrote is wrong.
If Microsoft doesn't get royalty free rights to resell access to everyone's content on demand, China will become the powerhouse of interference-free media? Rrrrrright....
I think it can be simultaneously true that the NYT is accurate in its complaint, that it has no legal remedy for this, and that there shouldn't be one.
There are plenty of large companies in other sectors that acknowledge there are limited legal remedies for them if someone copies some aspect of their business or name.
This is the actual truth. Where it falls short is in citing the data, but GPT-4 doesn't do that to begin with unless the text comes directly from a web result rather than from the weights.
> GPT-4V can easily whitewash its entire copyrighted training corpus to be unrecognizably distinct
Is that just by increasing the temperature, tweaking the prompt, etc.? If you can operate on the raw weights and recreate the original text, copyright infringement still applies.