> How is it any different when a machine does the same thing?
Because intent matters in the law. If you intended to reproduce copyrighted code verbatim but tried to hide your activity with a few tweaks, that's a very different thing from using a tool which occasionally reproduces copyrighted code by accident but clearly was not designed for that purpose, and much more often than not outputs transformative works.
I'm not aware of evidence that supports that claim. If I ask ChatGPT "Give me a recipe for squirrel lemon stew" and it so happens that one person did write a recipe for that exact thing on the Internet, then I would expect that the most accurate, truthful response would be that exact recipe. Anything else would essentially be hallucination.
As I understand it, LLMs are intended to answer questions as "truthfully" as they can. Their understanding of truth comes from the corpus they are trained on. If you ask a question where the corpus happens to have something very close to that question and its answer, I would expect the LLM to burp up that answer. Anything less would be hallucination.
Of course, if I ask a question that isn't as well served by the corpus, it has to do its best to interpolate an answer from what it knows.
But ultimately its job is to extract information from a corpus and serve it up with as much semantic fidelity to the original corpus as possible. If I ask how many moons Earth has, it should say "one". If I ask it what the third line of Poe's "The Raven" is, it should say "While I nodded, nearly napping, suddenly there came a tapping,". Anything else is wrong.
If you ask it a specific enough question where only a tiny corner of its corpus is relevant, I would expect it to end up either reproducing the possibly copyrighted piece of that corpus or, perhaps worse, coughing up some bullshit because it's trying to avoid overfitting.
(I'm ignoring for the moment LLM use cases like image synthesis where you want it to hallucinate to be "creative".)
I get that's what you and a lot of people want them to be, but it isn't what they are. They are quite literally probabilistic text generation engines. Let's emphasise that: the output is produced randomly by sampling from distributions, or in simple terms, like rolling dice. In a concrete sense it is non-deterministic. Even if an exact answer is in the corpus, the output is not going to be that answer, but the most probable answer given all the text in the corpus. If the one answer that exactly matches contradicts the weight of other, less exact answers, you won't see it.
And you probably wouldn't want to - if I ask whether donuts are radioactive and one person explicitly said they were somewhere on the internet, you probably don't want it to spit out that answer just because it exactly matches what I asked. You want it to learn from the overwhelming corpus of related knowledge that says donuts are food, people routinely eat them, etc. etc., and tell you they aren't radioactive.
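To make the "rolling dice" point concrete, here's a toy sketch of what sampling the next token looks like. The vocabulary, scores, and temperature below are invented for illustration; no real model or tokenizer is involved.

```python
import numpy as np

rng = np.random.default_rng()

# Toy next-token distribution: the vocabulary and scores are made up
# for illustration, not taken from any real model.
vocab = ["one", "two", "zero", "many"]
logits = np.array([3.2, 0.1, 0.4, -1.0])   # raw model scores per token

def sample_next_token(logits, temperature=1.0):
    # Softmax turns raw scores into a probability distribution...
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    # ...and the next token is drawn at random from that distribution,
    # so even the highest-scoring token is not guaranteed to come out.
    return vocab[rng.choice(len(vocab), p=probs)]

# Sample a few times: mostly "one", occasionally something else.
print([sample_next_token(logits) for _ in range(10)])
```

Run it a few times and you mostly get the high-probability token, but not always - which is exactly the "non-deterministic" point above.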
It's equally plausible to say you don't intend to reproduce copyrighted code verbatim but occasionally do so, either because of a sufficiently specific prompt or because the reproduced code is so generic that it probably gets rewritten a hundred times a day - that's how people learned to do basic things from books, documentation, or their education.
Um, the entire intent of these "AI" systems is explicitly to reproduce copyrighted work with mechanical changes to make it not appear to be a verbatim copy.
That is the whole purpose and mechanism by which they operate.
Also, intent does not matter under the law - not intending to break the law is not a defense if you break the law. Not intending to take someone's property doesn't mean it becomes your property. You might face lesser penalties and/or charges due to intent (the obvious examples being murder vs manslaughter, etc.).
But here we have an entire ecosystem where the model is "scan copyrighted material" followed by "regurgitate that material with mechanical changes to fit the surrounding context and to appear to be 'new' content".
Moreover, given that this 'new' code is just a regurgitation of existing code with mutations to make it appear to fit the context and not be directly identical to the existing code, that 'new' code cannot itself be subject to copyright: you can't claim copyright in something you did not create, copyright does not protect the output of mechanical or automatic transformations of other copyrighted content, and copyright does not protect the result of "natural processes" (e.g. 'I asked a statistical model to give me a statistically plausible sequence of tokens and it did'). So in the best case scenario - the one where the copyright-laundering-as-a-service tool is not treated as just that - any code it produces is not protectable by copyright, and anyone can just copy "your work" without the license. And (because you've said it's fine as long as you weren't intending to violate copyright) they can say they could not distinguish the non-copyright-protected work from the protected work and assumed that therefore none of it was subject to copyright. To be super sure they weren't violating any of your copyrights, they then ran an "AI tool" over it to make the names better and better suit your style.
I am so sick of these arguments where people spout nonsense about "AI" systems magically "understanding" or "knowing" anything - they are very expensive statistical models. They produce statistically plausible strings of text by a combination of copying the text of others wholesale and filling the remaining space with bullshit that is often correct enough for basic tasks and wrong for anything else - because, again, they're just producing plausible sequences of tokens and have no understanding of anything beyond that.
To be very, very, very clear: if an AI system "understood" anything it was doing, it would not need to ingest essentially all the text that anyone has ever written just to produce content that is at best only locally coherent, and that is frequently incorrect in more or less every domain to which it is applied.

Take code completion (as in this case). Developers can write code without first reading essentially all the code that has ever existed, because developers understand code. Developers don't intermingle random, unrelated, non-existent variables or functions in their code as they write, because they understand what variables are and therefore can't use ones that don't exist. "AI", on the other hand, required more power than many countries consume to "learn" by reading as much of all the code ever written as possible, and it still produces nonsense output for anything complex, because it is just generating a string of tokens that is plausible according to its statistical model. The result of these AIs is essentially binary: either it has in effect been asked to produce code that does something that was in its training corpus and can be copied essentially verbatim, with a transformation path to make it fit, or it's not in the training corpus and you get random and generally incorrect code - hopefully wrong enough that it fails to build, because they're also good at generating code that looks plausible but only fails at runtime, since 'plausible sequence of tokens' often overlaps with 'things a compiler will accept'.
I actually once tracked this claim down in the case of stable diffusion.
I concluded that it was just completely impossible for a properly trained stable diffusion model to reproduce the works it was trained on.
The SD model easily fits on a typical USB stick, and comfortably in the memory of a modern consumer GPU.
The training corpus for SD is a pretty large chunk of image data on the internet. That absolutely does not fit in GPU memory - by several orders of magnitude.
No form of compression known to man would be able to get it that small. People smarter than me say it's mathematically not even possible.
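To put rough numbers on that argument (all of these are my own ballpark assumptions for a LAION-scale training set and a typical SD checkpoint, not exact figures):

```python
# Back-of-the-envelope: how many bytes of model weights does each training
# image get? Every number here is a rough assumption, not an exact figure.
model_size_bytes = 4e9        # a typical SD checkpoint is a few GB
num_training_images = 2e9     # LAION-scale dataset: billions of images
avg_image_bytes = 500e3       # say ~500 kB per compressed image

corpus_bytes = num_training_images * avg_image_bytes
bytes_per_image = model_size_bytes / num_training_images

print(f"corpus ≈ {corpus_bytes / 1e12:.0f} TB")                           # ≈ 1000 TB
print(f"model budget ≈ {bytes_per_image:.1f} bytes per training image")   # ≈ 2 bytes
```

A couple of bytes per training image is nowhere near enough to store the images themselves, which is the point.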
Now for closed models, you might be able to argue something else is going on and they're sneakily not training neural nets or something. But the open models we can inspect? Definitely not.
Modern ML/AI models are doing Something Else. We can argue what that Something Else is, but it's not (normally) holding copies of all the things used to train them.
I think this argument starts to break down for the (gigantic) GPTs where the model size is a lot closer to the size of the training corpus.
Thinking in terms of compression, the compression in generative AI models is lossy. The mathematical bounds on compression only apply to lossless compression. Keeping in mind that a small fraction of the training corpus is presented to the training algorithm multiple times, it's not absurd to suggest that those works exist inside the model in a recallable form. Hence the NYT's lawyers being able to write prompts that recall large chunks of NYT articles verbatim.
And I seem to recall there are some theoretical lower bounds on even lossy compression. Some quick back-of-the-envelope Fermi estimation gets me a hard lower bound of 5TB for "all the images on the internet", but I'm not confident enough in my math to back that up right here and now.
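For what it's worth, here is one way to land in that ballpark. Every number below is my own assumption; it's meant to show the shape of the estimate, not to defend 5TB as the right figure.

```python
# A possible shape of that Fermi estimate; every number is an assumption.
num_images = 1e10        # order-of-magnitude guess for "all the images on the internet"
bytes_per_image = 500    # ~0.5 kB each: roughly a tiny, barely recognisable thumbnail

lower_bound_bytes = num_images * bytes_per_image
print(f"≈ {lower_bound_bytes / 1e12:.0f} TB")   # ≈ 5 TB
```

If you think either number is off by an order of magnitude, the bound moves with it, which is why I wouldn't defend 5TB too hard.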
> And I seem to recall there are some theoretical lower bounds on even lossy compression.
I'm not sure where your math is coming from, and it seems trivially wrong. A single black pixel is a very lossy compression of every image on the internet. A picture of the Facebook logo is a slightly less lossy compression of every picture on the internet (the Facebook logo shows up on a lot of websites). I would believe that you can get a bound on lossy compression of a given quality (whatever quality means) only if you assume there is some balance of the images in the compressed representation. There are a lot of assumptions there, and we know for a fact that the text fed to the GPTs to train them was presented in an unbalanced way.
In fact, if you look at the paper "textbooks are all you need" (https://arxiv.org/pdf/2306.11644) you can see that presenting a very limited set of information to an LLM gets a decent result. The remaining 6 trillion tokens in the training set are sort of icing on the cake.
I think you'll agree that it would be a bit absurd to threaten legal action against someone for storing a single black pixel.
OTOH Someone might be tempted to start a lawsuit if they believe their image is somehow actually stored in a particular data file.
For this to be a viable class action lawsuit to pursue, I think you'd have to subscribe to the belief that it's a form of compression where if you store n images, you're also able to get n images back. Else very few people would have actual standing to sue.
I think that when you speak in terms of images, for a viable lawsuit, you need to have a form of compression that can recall n (n >= 1) images from compressing m (m >= n) images. Presumably n is very large for LLMs or image models, even though m is orders of magnitude larger. I do not think that your form of compression needs to be able to get all m images back. By forcing m = n in your argument, you are forcing some idea of uniformity of treatment in the compression, which we know is not the case.
The black pixel won't get you sued, but the Facebook logo example I used could get you sued. Specifically by Facebook. There is an image (n = 1) that is substantially similar to the output of your compression algorithm.
That is sort of what Getty's lawsuit alleges. Not that every picture is recallable from the model, but that several images that are substantially similar to Getty's images are recallable. The same goes for the NYT's lawsuit against OpenAI.
I do realize the benefits of the 'compression' model of ML. Sometimes you can even use compression directly, like here: https://arxiv.org/abs/cs/0312044 .
I suppose you're right that you only need a few substantially similar outputs to potentially get sued (depending on who's scrutinizing you).
While talking with you, it occurred to me that so far we've ignored the output set o, which is the set of all images output by - say - stable diffusion. n can then be defined as the intersection n = m ∩ o.
And we know m is much larger than n, and o is for practical purposes infinite [1] (you can generate as many unique images as you like), so o >> m >> n. [2]
Really already at this point I think calling SD a compression algorithm might be just a little odd. It doesn't look like the goal is compression at all. Especially when the authors seem to treat n like a bug ('overfit'), and keep trying to shrink it.
And that's without even looking at the "compression ratio" and "loss ratio" of this algorithm, so maybe in future I can save myself some maths. It's an interesting approach to the argument that I might use again. (Thank you for helping me think in this direction.)
* I think in the case of the Getty lawsuit they might have a bit of a point, if the model really was overfitted on some of their images. Though I wonder if in some cases the model merely added Getty watermarks to novel images. I'm pretty sure that will have had something to do with setting Getty off.
* I am deeply suspicious of the NYT case. There's a large chunk of examples where they used ChatGPT to browse their own website. This makes me wonder if the rest of the examples are only slightly more subtle. IIRC I couldn't replicate them trivially. (YMMV, we can revisit if you're really interested)
[1] However, in practice there appear to be limits to floating point precision.