The issue is that it has to be proven in court. This man was personally responsible for developing web scraping - stealing data from likely copyrighted sources. He would have had communications specifically addressing the legality of his responsibilities, since he was openly questioning his superiors about it.
Web scraping is legal and benefiting from published works is entirely the point, so long as you don't merely redistribute it.
Training on X doesn't run afoul of fair use because it doesn't redistribute the work, nor does using it simply publish a recitation (as Suchir suggested). Summoning an LLM is closer to the act of editing in a text editor than it is to republishing. His hang-up was how often ChatGPT was being substituted for the original works, but, as with AI sports articles, overlap is to be expected for everything now. Even without web scraping in training it would be impossible to block every user intention to remake an article out of the magic "editor" - and that's with no use of the data at all, not even fair use.
> Web scraping is legal and benefiting from published works is entirely the point, so long as you don't merely redistribute it.
That's plainly false. Generally, if you redistribute "derivative works" you're also infringing. The question is what counts as derivative works, and I'm pretty sure lawyers and judges are perfectly capable of complicating the picture given the high stakes.
A direct derivative of a single work would be easy to prove from model activations, but input/output similarity is much easier for scoring outrage points. The true internal function would show that no use of the original is required to "distribute" derivative-seeming content, which is rather confusing and is effectively the defense. At these levels, a derivative of a derivative is indistinguishable to the human eye anyway.
Soon people will get that you can no longer assume that when two pieces of text are similar, it is because of direct plagiarism.
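To make "input/output similarity" concrete: the crude version of that check is just n-gram overlap between a source text and a model output. Here's a toy sketch (hypothetical, not any party's actual methodology; the example strings are made up). Note that a faithful paraphrase of the same facts can score near zero, and two independent write-ups of the same event can overlap, which is exactly why similarity alone proves little.

```python
# Toy word n-gram overlap between a source text and a model output.
# Hypothetical illustration only: real similarity analyses involve
# alignment, embeddings, and statistical baselines, not just this.

def ngrams(text, n=5):
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=5):
    """Jaccard overlap of the two texts' n-gram sets (0.0 to 1.0)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga and gb else 0.0

article = "the city council voted on tuesday to approve the new stadium budget"
output = "on tuesday the city council approved the new stadium budget"
print(jaccard(article, output))  # 0.0 despite describing the same event
```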
No, you don't only look at the end result when determining whether a work is derivative of another. The process with which one produced the work has implications for whether it is a derivative or not.
For one, if you can show that you didn't use the original copyrighted work, then your work is not a derivative, no matter how similar the end results are.
And then, if the original work was involved, how it was used and what processes were applied are also relevant.
That's why OpenAI employees who did the scraping first-hand are valuable witnesses to those who are suing OpenAI.
Legal processes proceed in a way that is often counter-intuitive to technologists. IMHO you'd gain a better perspective if you actually tried to understand it rather than confidently assuming that what you already know from tech-land applies to law.
Few things frustrate me more than so many developers’ compulsion to baselessly assume that their incredible dev ultrabrain affords them this pan-topic expertise deep enough to dismiss other fields’ experts based on a few a priori thought experiments.
"Summoning an LLM is closer to the act of editing in a text editor than it is to republishing." This quote puts so succinctly all that is wrong with LLM, it's the most convenient interpretation to an extreme point, like the creators of fair use laws ever expected AI to exist, like the constrains of human abilities were never in the slightest influential to the fabrication of such laws.
"Stealing data" seems pretty strong. Web scraping is legal. If you put text on the public Internet other people can read it or do statistical processing on it.
What do you mean he was "stealing data"? Was he hacking into somewhere?
In a lot of ways, the statistical processing is a novel form of information retrieval. So the issue is somewhat like if, 20 years ago, Google was indexing the web, then decided to just rehost all the indexed content on their own servers and monetize the views instead of linking to the original source of the content.
It’s not anything like rehosting though. Assume I read a bunch of web articles, synthesize that knowledge and then answer a bunch of questions on the web. I am performing some form of information retrieval. Do I need to pay the folks who wrote those articles even though they provided them for free on the web?
It seems like the only difference between me and ChatGPT is the scale at which ChatGPT operates. ChatGPT can memorize a very large chunk of the web and keep answering millions of questions, while I can memorize a small piece of the web and only answer a few questions. And maybe due to that, it requires new rules, new laws and new definitions for the good of society. But it’s nowhere near as clear cut as the Google example you provide.
"Seems like only difference between me and ChatGPT is absolutely everything".
You can't be flippant about scale not being a factor here. It absolutely is a factor. Pretending that ChatGPT is like a person synthesizing knowledge is an absurd legal argument; it is absolutely nothing like a person, it's a machine at the end of the day. Scale absolutely matters in debates like this.
Why not? A fast piece of metal is different from a slow piece of metal, from a legal perspective.
You can't just say "this really bad thing that causes a lot of problems is just like this not-so-bad thing that hasn't caused any problems, only more so". Or at least it's not a correct argument.
When it is the scale that causes the harm, stating that the harmful thing is the same as the harmless thing except for the scale is, like... weird.
So there isn’t a legal distinction regarding fast/slow metal after all. Well, that revelation certainly makes me question your legal analysis of copyright.
So in your view, when a human does it, he causes a minute amount of harm so we can ignore it, but ChatGPT causes a massive amount of harm, so we need to penalize it. Do you realize how radical your position is?
You’re saying a human who reads free work that others put out on the internet, synthesizes that knowledge and then answers someone else’s question is committing a minute amount of evil that we can ignore. This is beyond weird, I don’t think anyone on earth/history would agree with this characterization. If anything, the human is doing a good thing, but when ChatGPT does it at a much larger scale it’s no longer good, it becomes evil? This seems more like thinly veiled logic to disguise anxiety that humans are being replaced by AI.
> This is beyond weird, I don’t think anyone on earth/history would agree with this characterization
Superlatives are a slippery slope in argumentation, especially if you invoke the whole of humanity on the whole of the earth across the whole of history. I do understand bmaco's theory, and while not a lawyer I’d bet whatever you want that there’s more than one jurisdiction that sees scale as an important factor.
Often the law is imagined as an objective, cold, indifferent knife, but often there are also a lot of "reality" aspects like common practice.
> So in your view, when a human does it, he causes a minute amount of harm so we can ignore it, but ChatGPT causes a massive amount of harm, so we need to penalize it. Do you realize how radical your position is?
Yes, that's my view. No, I don't think that this is radical at all. For some reason or another, it is indeed quite uncommon. (Well, not in law; our politicians are perfectly capable of making laws based on the size of a danger/harm.)
However, I haven't yet met anyone who was able to defend the opposite position, e.g. slow bullets = fast bullets, drawing someone = photographing someone, memorizing something = recording something, and so on. Can you?
Don’t obfuscate, your view is that the stack overflow commentator, Quora answer writer, blog writer, in fact anyone who did not invent the knowledge he’s disseminating, is committing a small amount of evil. That is radical and makes no sense to me.
> Don’t obfuscate, your view is that the stack overflow commentator, Quora answer writer, blog writer, in fact anyone who did not invent the knowledge he’s disseminating, is committing a small amount of evil.
:/ No, it's not? I wrote "haven't caused any problem" and "harmless". You've changed it to "small harm", a change that I indeed missed.
I don't think that things that don't cause any problem are evil. That's a ridiculous claim, and I don't understand why you would want me to say that. For example, I think 10 billion pandas living here on Earth with us would be bad for humanity. Does that mean that I think 1 panda is a minute amount of evil? No, I think it's harmless, maybe even a net good for humanity. I think the same about Quora commenters.
Yes, that dichotomy is present everywhere in the real world.
You need lye to make proper bagels. It is not merely harmless, but beneficial in small amounts for that purpose. We still must make sure food businesses don't contaminate food with it; it could cause severe, possibly fatal, esophageal burns. The "a little is beneficial but a lot is deleterious" principle also applies to many vitamins… water… cops?
Trying to turn this into an “it’s either always good or always bad” dichotomy serves no purpose but to make straw men.
Clearly there is nuance: society compromises on certain things that would be problematic at scale because they benefit society. Sharing learned information disadvantages people who make a career of creating and compiling that information, but, you know, humans need to learn to get jobs and acquire capital to live and, surprisingly, to die, and that information dies along with them.
Or framing the issue another way, people living isn’t a problem but people living forever would be. Scale/time matters.
Here again I’ve fallen for the HN comment section. Defend your viewpoint if you like; I have no additional commentary on this.
Some webpages force you to agree to an EULA that might preclude web scraping before you can use them. The NYTimes is such a webpage, which is why OpenAI was sued by them. This is evidence that OpenAI didn't care about the law. Someone with internal communications about this could completely destroy the company!
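Tangentially: separate from any EULA, sites also publish a machine-readable robots.txt stating what crawlers may fetch, and a scraper trying to show good faith would check it first. A minimal sketch using Python's standard library (the user-agent and article URL are placeholders, and honoring robots.txt is neither required by nor a substitute for a site's terms):

```python
# Check a site's robots.txt before fetching a page.
# Sketch only: robots.txt is advisory and is separate from any
# EULA/terms of service. User-agent and URL below are placeholders.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.nytimes.com/robots.txt")
robots.read()  # downloads and parses the robots.txt file

url = "https://www.nytimes.com/2024/01/01/technology/example.html"
# can_fetch() reports whether this user-agent may fetch that path
print(robots.can_fetch("ExampleScraperBot", url))
```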