Let's say you take the harry potter books and create a spreadsheet with each wor...

mike_d · on July 5, 2023

If your table were the number of times each word was followed by a particular chain of other words, that would be a closer comparison to AI weights. In that case it would be possible to reconstruct passages from the Harry Potter books with reasonable accuracy (see GitHub Copilot).
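A toy sketch of that kind of table, to make the comparison concrete. This is only an n-gram count table, not how neural network weights actually work, and the sample sentence, `build_table`, and `reconstruct` are made-up stand-ins for illustration:

```python
from collections import Counter, defaultdict

def build_table(words, order=3):
    """Count how often each `order`-word context is followed by each next word."""
    table = defaultdict(Counter)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context][words[i + order]] += 1
    return table

def reconstruct(table, seed, length):
    """Extend `seed` by repeatedly appending the most frequent follower."""
    out = list(seed)
    order = len(seed)
    while len(out) < length:
        followers = table.get(tuple(out[-order:]))
        if not followers:
            break  # context never seen in the source text
        out.append(followers.most_common(1)[0][0])
    return out

# Hypothetical stand-in text (not taken from any book).
words = ("the old wizard walked slowly down the narrow "
         "stone corridor toward the library").split()

table = build_table(words, order=3)
rebuilt = reconstruct(table, words[:3], len(words))

# Plain word counts discard ordering; the chain table keeps enough to replay it.
word_counts = Counter(words)
print(" ".join(rebuilt))
```

Given only the first three words, the chain table replays the rest of the sentence, while `word_counts` alone cannot say which order the words came in.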

The copyright aspect makes more sense when you start thinking of trained AI models as lossy compression of the original works. Is a downsampled copy of the new Star Wars movie still protected under copyright?

Just tabulating the word counts would not violate copyright, as the counts are considered facts and figures.

drdeca · on July 5, 2023

It resembles lossy compression in some ways, but in other important ways I think it doesn’t?

Like, if one has access to such a model, and counts it neither towards the size of the compression/decompression program nor towards the size of the compressed images, then that should allow for compressing images into substantially fewer bits than one would otherwise be able to achieve (at least, assuming that one doesn’t care about the amount of time used to compress/decompress. Idk if this is actually practical.)

But unlike say, a zip file, the model doesn’t give you a representation of like, a list of what images (or image/caption pairs) it was trained on.

Or like, in your analogy with the lower resolution of the movie, the lower resolution of it still tells you how long the movie is (though maybe not as precisely due to lower framerate, but that’s just going to be off by less than a second, unless you have an exceedingly low framerate, but that’s hardly a video at that point.)

There is a sense in which any model of some data yields a way to compress data-points from it, where better models generally give a smaller size. But, like, any (precisely stated) description counts as a model?
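A minimal sketch of that last point, assuming the standard idealization that a model assigning probability p to a symbol can encode it in about -log2(p) bits. The message and both probability tables are made up for illustration, and the tables themselves are not counted towards the size:

```python
import math
from collections import Counter

def code_length_bits(message, prob):
    """Ideal code length of `message` under model `prob`: sum of -log2 p(symbol)."""
    return sum(-math.log2(prob[ch]) for ch in message)

message = "aaaaaaab"  # hypothetical data: seven 'a's and one 'b'
alphabet = sorted(set(message))

# A poor model: every symbol in the alphabet is equally likely.
uniform = {ch: 1 / len(alphabet) for ch in alphabet}

# A better model: symbol probabilities fitted to the data itself.
counts = Counter(message)
fitted = {ch: counts[ch] / len(message) for ch in alphabet}

uniform_bits = code_length_bits(message, uniform)
fitted_bits = code_length_bits(message, fitted)
print(uniform_bits, fitted_bits)
```

The fitted model needs fewer bits than the uniform one for the same message, which is the sense in which a better model gives a smaller compressed size. Both tables are "models" here, which is also why calling something a model says little by itself.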

So, whether it is “like lossy compression” in a way that matters to copyright, I would think depends a lot on things like:

Well, for one thing, isn’t there some kind of “might someone consume the allegedly infringing work as a substitute for the original work, e.g. if cheaper?” test?

For a lower-resolution version of a Star Wars movie, people clearly would.

But if one wanted to view some particular artwork that is in the training set, I would think that one couldn’t really obtain such a direct substitute? (Well, without using the work as an input to the trained model, asking it to make a variation, but in that case one already has the work separate from the model, so that’s not really relevant.)

If I wanted to know what happened in minute 33 of the Star Wars movie, I could look at minute 33 of the compressed version.