These models can definitely be used to intentionally store and recall content that is copyrighted in a way that's not subject to fair use. (eg: trivially, I could train a large model that has a small subnetwork encoding a compressed or even lossless copy of a picture, and if I intentionally trained a model in that way, this would be no less a copyright violation than distributing a JPEG of the same image embedded in some large binary.)
But also, an unintentional copy of a copyrighted image is not a violation of copyright. (eg: an executable binary which happens to contain the bits corresponding to a picture of Batman -- but which are actually instruction sequences and were provably not intended to encode the picture -- clearly doesn't infringe.)
LLMs sit somewhere in between those two cases, and the intent can enter both at training time and at prompting time.
Stack on top of this the fact that the models can also definitely generate content that counts as fair use, or which isn't copyrighted.
It's the multitude of possible outputs, across the copyright spectrum, combined with the function of intent in training and/or prompting, which make this such a thorny legal issue for which existing copyright statute and jurisprudence is ill-suited.
Taking your Batman example: DC would come after you for trademark as well as copyright, and the copyright claims would be very carefully evaluated with respect to your very specific work. But here we are talking about a large model that can generate tons of different work which isn't subject to copyright or which is possibly fair use.
I don't think that existing jurisprudence (or even statute?!) can handle this situation well at all without tons of arbitrary interpretive work on the part of juries/judges, because of the multitude-of-outputs and vague-intent issues described above.
(...Also, presumably the merits of the DC case wouldn't matter, because your victory would be Pyrrhic unless you're a mega-corp. Which from a legal-theory perspective is neither here nor there, but from a legal-practicality perspective may inform how companies go about enforcing copyright claims on model weights/outputs.)
Anyways. I think we have a right mess on our hands and the legislature needs to do their damn jobs. Welcome to America, I guess :)
Honestly, your second-to-last sentence is exactly the kind of thing I hate hearing from non-lawyers; the whole "if the legislature were just smarter" thing is a weird pie-in-the-sky concept, more or less like saying "the world would be better if CEOs were less greedy."
Like, yes, but it's not very likely to happen, and it's not a particularly horrible thing if it doesn't; the law is slow and little-c conservative, and you're expecting it to be something it most often just ain't.
Curious to hear your thoughts on these issues.