
> you can't make these machine things without literally feeding this copyrighted information into them, therefore they do contain a copy.

They don't necessarily. Think about that. You can take some copyrighted material and transform the information contained in it (for instance a fictional book). You can then write a summary. The summary contains information that was present in the original, but it has been transformed, and hence it's not a copy. The ML model contains information that has been generalized to some degree. So it's just a grey area IMO.



I'm not saying that "they do or don't objectively" because that doesn't matter as much as people think it does. I'm thinking of what a "jury" COULD decide. I think average joe on a jury is very likely to see that process as "feeding them in."


Moreover, you are clearly not in violation of copyright if you are talking about statistics about the material. In your example, printing out "there were 7,000 instances of the word 'the'" is certainly not a violation. An ML model is just a huge pile of these statistics.

However, saying "the first word of the book is 'The'" would not be a violation, while repeating that for every word in the book, as a whole, would be one.


I agree with you but I think it's important to have some nuance. Imagine I build a statistical model for 10-word sequences (10-grams) and then I trained it on a single book. I probably could pick some starting words and get most of the book back from the "statistics" I compiled. If I trained the same model on a giant dataset, the one book would just contribute to the stats.
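That thought experiment is easy to make concrete. Below is a minimal sketch (hypothetical names, and a one-line stand-in for a real book): an n-gram table trained on a single text has almost every context occurring exactly once, so "sampling" from it just replays the original.

```python
import random
from collections import defaultdict

def train_ngrams(text, n=5):
    """Map each (n-1)-word context to the list of words that follow it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model

def generate(model, seed):
    """Walk the model from a seed context until no continuation exists."""
    out = list(seed)
    while True:
        followers = model.get(tuple(out[-len(seed):]))
        if not followers:
            return " ".join(out)
        out.append(random.choice(followers))

# A one-sentence "book": every 4-word context appears exactly once,
# so generation deterministically reproduces the whole source text.
book = "it was a bright cold day in april and the clocks were striking thirteen"
model = train_ngrams(book, n=5)
print(generate(model, seed=book.split()[:4]))
```

With a giant corpus instead, most contexts have many competing continuations and the walk diverges from any single source, which is the nuance the comment is pointing at.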

All that to say, the models have the potential to memorize, but generally don't, and when they do it's an undesirable failure mode, not some deliberate copying.


I like this argument a lot; but again -- how does this play out in the real world? It's pretty easy to predict what will happen in real life. Think of, e.g., Batman. I could write a very new and original "Batman" comic that doesn't strongly resemble anything -- movie, toy, comic, whatever -- that exists, but would be recognizable to fans.

Once it starts doing well, will DC come after me? You bet.


These models can definitely be used to intentionally store and recall content that is copyrighted in a way that's not subject to fair use. (eg: trivially, I can very easily train a large model that has a small subnetwork which encodes a compressed or even lossless copy of a picture, and if I were to intentionally train a model in that way then this would be no less a copyright violation than distributing a JPEG of the same image embedded in some large binary.)

But also, an unintentional copy of a copyrighted image is not a violation of copyright. (eg: an executable binary which happens to contain the bits corresponding to a picture of Batman -- but which are actually instruction sequences and were provably not intended to encode the picture -- clearly doesn't infringe.)
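The "lossless copy smuggled into a big binary" case is mechanically trivial, which is worth seeing. A hypothetical sketch (the payload is a stand-in for real image bytes, chosen ASCII-heavy to sidestep NaN edge cases in float conversion): pack arbitrary bytes into float32 "weights" and recover them bit-for-bit.

```python
import struct

# Stand-in for image data; padded to a multiple of 4 bytes so it maps
# cleanly onto 32-bit floats.
payload = b"pretend these bytes are a picture of Batman"
payload += b"\x00" * (-len(payload) % 4)

# Reinterpret the bytes as float32 "parameters" -- to any inspection tool
# this is just a list of numbers sitting among the model's real weights.
weights = list(struct.unpack(f"<{len(payload) // 4}f", payload))

# Reversing the reinterpretation recovers the original bytes exactly.
recovered = struct.pack(f"<{len(weights)}f", *weights)
assert recovered == payload
```

The point being: the *bits* are indistinguishable from legitimate parameters; only the intent behind putting them there separates this from case #2's accidental collision.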

LLMs are somewhere in-between these two cases, and the intent can happen both in the training and also the prompting.

Stack on top of this the fact that the models can also definitely generate content that counts as fair use, or which isn't copyrighted.

It's the multitude of possible outputs, across the copyright spectrum, combined with the function of intent in training and/or prompting, which make this such a thorny legal issue for which existing copyright statute and jurisprudence is ill-suited.

Taking your Batman example: DC would come after you for trademark as well as copyright, and the copyright claims would be very carefully evaluated with respect to your very specific work. But here we are talking about a large model that can generate tons of different work which isn't subject to copyright or which is possibly fair use.

I don't think that existing jurisprudence (or even statute?!) can handle this situation very well, at all, without tons of arbitrary interpretative work on the parts of juries/judges, because of the multitude and vague intent issues described above.

(...Also presumably the merits of the DC case wouldn't matter because your victory would be Pyrrhic unless you are a mega-corp. Which from a legal theory perspective is neither here nor there, but from a legal practicality perspective may inform how companies go about enforcing copyright claims on model weights/outputs.)

Anyways. I think we have a right mess on our hands and the legislature needs to do their damn jobs. Welcome to America, I guess :)

Curious to hear your thoughts on these issues.


Honestly, your second to last sentence is literally the kind of thing I hate hearing most from non-lawyers; the whole "if the legislature were just smarter" thing is just a weird pie-in-the-sky concept that is more-or-less like saying "the world would be better if CEOs were less greedy."

Like, yes, but it's not very likely to happen and it's not a particularly horrible thing if it doesn't; the law is slow and little-c conservative and you're just expecting it to be something it MOST often just ain't.


> The summary contains information that was present in the original but it has been transformed and hence it's not a copy.

The summary also contains original thought; something is added to it by a human to make it unique. AI models are primarily derivative.

A better example would be: if I take 1,000 different copyrighted works and put them into a ZIP file, does that resulting file violate copyright?


That example is awful, whatever side of the debate one is on


Let's say you take the Harry Potter books and create a spreadsheet with each word in it as a column, and the number of times that word appears. Would that violate the copyright? I'd be interested in the rationale if someone thinks it would.
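For concreteness, that spreadsheet is a few lines of code (sketched here on the book's famous opening line rather than the full text): a bag of word frequencies that, crucially, discards word order and so cannot reconstruct the source.

```python
import re
from collections import Counter

def word_counts(text):
    """Tabulate word frequencies -- facts about the text, not the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

counts = word_counts(
    "Mr. and Mrs. Dursley, of number four, Privet Drive, "
    "were proud to say that they were perfectly normal."
)
print(counts["were"])  # 2 -- a fact, recoverable; the sentence itself is not
```

Nothing in `counts` tells you which "were" came first, let alone the plot.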


If your table was the number of times a word was followed by a chain of other words, that would be a closer comparison to AI weights. In that case it would be possible, with reasonable accuracy, to reconstruct passages from the Harry Potter books (see GitHub Copilot).

The copyright aspect makes more sense when you start thinking of AI training models as lossy compression for the original works. Is a downsampled copy of the new Star Wars movie still protected under copyright?

Just tabulating the word counts would not violate copyright as it is considered facts and figures.


It resembles lossy compression in some ways, but in other important ways I think it doesn’t?

Like, if one has access to such a model, and doesn’t count it towards the size cost of a compression/decompression program nor as part of the compressed size of the compressed images, then that should allow for compressing images to have substantially fewer bits than one would otherwise be able to achieve (at least, assuming that one doesn’t care about the amount of time used to compress/decompress. Idk if this is actually practical.)
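The "don't count the model's size" point has a small classical analogue: dictionary-based compression with a preset dictionary shared out-of-band. A sketch, assuming both sides already hold the `shared` blob (standing in for the model) and its size is charged to neither the program nor the payload:

```python
import zlib

message = b"the quick brown fox jumps over the lazy dog " * 3
shared = b"the quick brown fox jumps over the lazy dog"  # the "model"

def compressed_size(data, zdict=b""):
    c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                         zlib.Z_DEFAULT_STRATEGY, zdict)
    return len(c.compress(data) + c.flush())

plain = compressed_size(message)
with_model = compressed_size(message, zdict=shared)
print(plain, with_model)  # the shared-dictionary payload is smaller
assert with_model < plain
```

The bits saved come entirely from context the receiver already has, which is exactly why "is the model itself a compressed copy of the training set?" is the contested question, not the payload size.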

But unlike say, a zip file, the model doesn’t give you a representation of like, a list of what images (or image/caption pairs) it was trained on.

Or like, in your analogy with the lower resolution of the movie, the lower resolution of it still tells you how long the movie is (though maybe not as precisely due to lower framerate, but that’s just going to be off by less than a second, unless you have an exceedingly low framerate, but that’s hardly a video at that point.)

There is a sense in which any model of some data yields a way to compress data-points from it, where better models generally give a smaller size. But, like, any (precisely stated) description counts as a model?

So, whether it is “like lossy compression” in a way that matters to copyright, I would think depends a lot on things like,

Well, for one thing, isn’t there some kind of “might someone consume the allegedly infringing work as a substitute for the original work, e.g. if cheaper?” test?

For a lower resolution version of Star Wars movie, people clearly would.

But if one wanted to view some particular artwork that is in the training set, I would think that one couldn’t really obtain such a direct substitute? (Well, without using the work as an input to the trained model, asking it to make a variation, but in that case one already has the work separate from the model, so that’s not really relevant.)

If I wanted to know what happened in minute 33 of the Star Wars movie, I could look at minute 33 of the compressed version.


What is a 'copy'? Byte-accurate, or 'something with general resemblance'? Would a badly compressed "copy" of a copyrighted image still be 'a copy', or would it be some other thing? Would low-quality image compression be enough to skirt around copyright claims? Image formats and viewers just 'reproduce' an impression of the original data from derived, compressed data. It is also just 'information that's been generalized by some degree' -- for space-saving purposes and so on. So, what if image generators could be thought of as a 'very good multi-image compression algorithm' that can output multiple images, to a 'somewhat recognizable degree'?


Badly compressed still counts. I think if the data allows you to reconstruct a recognizable recreation of the original work, you have a good chance of it being considered a derivative copy.

A mono audio version of Star Wars, compressed down to 320x240, filmed from the back of a theater on a VHS camera, converted to Video CD, would under any reasonable interpretation be just a copy of the original.

I assume it starts getting murky when there's some sort of transformation done to it. What if I run motion capture on it, and use that motion capture data to create a cartoon version of Star Paws (my puppies-in-space epic)? What if I do a scene-for-scene recreation as the animated cartoon (removing any mentions of copyrighted names -- Luke Skywalker is now Duke Dogwalker, for example)? In this case, there's been no actual data transfer -- all the sprites are hand drawn, backgrounds, etc.

What would be an interesting exercise would be to try and create a series of artifacts that each on their own are considered non-derivatives, but can be used together to reconstitute the original. For example, create a compression method that relies heavily on transforms / macroblocks, but strip out any of the actual pixel data from the film. That info might be supplied as palette files which are themselves not really copyrighted data, but together with the compressed transform stream can be used to recreate the original video.
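The extreme version of that exercise is a one-time-pad split, sketched below (hypothetical names; `frame` stands in for real film data): each share is statistically indistinguishable from random noise on its own, yet the two XOR back to the original exactly.

```python
import os

def split(original: bytes):
    """Split data into two shares, each one uniform random noise by itself."""
    pad = os.urandom(len(original))
    masked = bytes(a ^ b for a, b in zip(original, pad))
    return pad, masked

def combine(share_a: bytes, share_b: bytes) -> bytes:
    """XOR the shares back together to reconstitute the original."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))

frame = b"pretend these bytes are pixels from the film"
noise, masked = split(frame)        # neither share resembles the frame at all
assert combine(noise, masked) == frame
```

Whether distributing the two shares separately infringes is precisely the kind of question the intent analysis above is needed for -- the information is fully present, but no single artifact is a recognizable derivative.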


This is a great example. Summarizing or paraphrasing copyrighted content, or simply using it as a seed to generate input-output pairs - this kind of data transformation prior to training could solve the issues with copyright. It cleanly separates form from content.



