I think that when you speak in terms of images, for a viable lawsuit you need a form of compression that can recall n (n >= 1) images from compressing m (m >= n) images. Presumably n is very large for LLMs and image models, even though m is orders of magnitude larger. I do not think your form of compression needs to be able to get all m images back. By forcing m = n in your argument, you are assuming a uniformity of treatment in the compression that we know does not hold.
The black pixel won't get you sued, but the Facebook logo example I used could get you sued. Specifically by Facebook. There is an image (n = 1) that is substantially similar to the output of your compression algorithm.
That is roughly what Getty's lawsuit alleges: not that every picture is recallable from the model, but that several images substantially similar to Getty's are recallable. The same goes for the NYT's lawsuit against OpenAI.
I do realize the benefits of the 'compression' model of ML. Sometimes you can even use compression directly, like here: https://arxiv.org/abs/cs/0312044.
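(As an aside, if I'm reading that paper right, the idea is a similarity measure built entirely out of an off-the-shelf compressor, the normalized compression distance. Here's a minimal sketch of that idea, using zlib as a stand-in compressor; the function and variable names are mine, not the paper's:)

    import zlib

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance, approximated with zlib as the
        # compressor; smaller means the inputs share more structure.
        cx = len(zlib.compress(x))
        cy = len(zlib.compress(y))
        cxy = len(zlib.compress(x + y))
        return (cxy - min(cx, cy)) / max(cx, cy)

    # Two near-identical texts versus one unrelated blob of bytes.
    doc_a = b"the quick brown fox jumps over the lazy dog " * 20
    doc_b = b"the quick brown fox jumps over the lazy cat " * 20
    doc_c = bytes(range(256)) * 4

    print(ncd(doc_a, doc_b))  # small: the two texts compress well together
    print(ncd(doc_a, doc_c))  # noticeably larger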
I suppose you're right that you only need a few substantially similar outputs to potentially get sued (depending on who's scrutinizing you).
While talking with you, it occurred to me that so far we've ignored the output set o, the set of all images output by, say, Stable Diffusion. Treating m as the set of training images, n can then be defined as n = m ∩ o.
And we know m is much larger than n, and o is theoretically infinite [1] (you can generate as many unique images as you like), so o >> m >> n. [2]
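(A toy version of those definitions, just to keep the bookkeeping straight; every image name below is made up purely for illustration:)

    # Hypothetical image IDs, only to illustrate the set definitions above.
    m = {"getty_001", "getty_002", "stock_003", "facebook_logo"}  # training images
    o = {"facebook_logo", "novel_cat", "novel_dog",
         "novel_landscape", "novel_abstract"}                     # images the model can output
    n = m & o                                                     # recallable training images

    print(n)                       # {'facebook_logo'}
    print(len(o), len(m), len(n))  # the |o| > |m| > |n| ordering, in miniature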
Already at this point, I think calling SD a compression algorithm is a little odd: compression doesn't look like the goal at all, especially since the authors seem to treat n as a bug ('overfitting') and keep trying to shrink it.
And that's before even looking at the "compression ratio" and "loss ratio" of this algorithm, so maybe next time I can save myself some maths. It's an interesting approach to the argument that I might use again. (Thank you for helping me think in this direction.)
* I think Getty might have a bit of a point in their lawsuit, if the model really was overfitted on some of their images. Though I wonder whether in some cases the model merely added Getty watermarks to novel images; I'm fairly sure that had something to do with setting Getty off in the first place.
* I am deeply suspicious of the NYT case. A large chunk of their examples involve using ChatGPT to browse the NYT's own website, which makes me wonder whether the rest of the examples are only slightly more subtle. IIRC I couldn't replicate them trivially. (YMMV; we can revisit if you're really interested.)
[1] However, in practice there appear to be limits to floating point precision.