As far as I know, none have been released. And it doesn't even really make sense...

kube-system · on Aug 2, 2024

> the models aren't copyrightable to begin with

What criteria for copyright protection are they missing?

astromaniak · on Aug 3, 2024

> As far as I know, none have been released.

I can tell you a secret. What you call 'open source' models are impossible, because massive randomness is part of the training process. They are not reproducible. Even with everything in hand, you cannot tell whether a given model was trained on a given dataset. Copyright is a different thing.
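One commonly cited source of this kind of run-to-run variation is that floating-point addition is not associative, so a parallel reduction that sums the same values in a different order can give a slightly different result. The following is a minimal sketch of that effect only, not a model of any real training pipeline; the helper name `sum_in_order` and the sample values are made up for illustration.

```python
def sum_in_order(values, order):
    """Add values one by one, visiting them in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Hypothetical "gradient contributions": two huge terms that cancel
# and two small terms that should survive.
values = [1e16, 1.0, -1e16, 1.0]

# Same numbers, two different accumulation orders.
forward = sum_in_order(values, [0, 1, 2, 3])    # small term absorbed by 1e16
reordered = sum_in_order(values, [0, 2, 1, 3])  # huge terms cancel first

print(forward, reordered, forward == reordered)
```

Over many training steps, small discrepancies like this can compound, which is one reason bit-for-bit reproduction is difficult even when the data, code, and seeds are all fixed.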

And the bad news: what's coming is even worse. Those will be whole entities with self-awareness and personal experience. They can be copied, but not reproduced. Moreover, it's hard or almost impossible to detect whether something undeclared was planted in their 'minds'.

Altogether, this means an 'open source' model, in the strict interpretation, is a myth: a great idea that turns out not to be real. Like the Turing test.

> However, plenty of open source software exists.

Attempt to switch topic detected.

PS: as for that massive downvote, I wasn't even rude, and I don't care. This account will be abandoned soon regardless, like all before and after.

jillesvangurp · on Aug 2, 2024

> models aren't copyrightable to begin with

You are wrong about that. It's a file with numbers. Which makes it a database or dataset and very much protected by copyright. That's why licenses are needed. For the phone book, things like open street maps, and indeed AI models.

> The fact that open source models don't exist

The fact that many people (myself included) routinely download and use models distributed under OSI approved licenses (Apache V2, MIT, etc.) makes that statement verifiably wrong. And yes, I do check the license of stuff that I use as I work with companies that care about such matters.

> As far as I know ...

Now you know better.

JimDabell · on Aug 2, 2024

> You are wrong about that. It's a file with numbers. Which makes it a database or dataset and very much protected by copyright. That's why licenses are needed. For the phone book, things like open street maps, and indeed AI models.

This is only true in jurisdictions that follow the sweat of the brow doctrine, where effort alone without creativity is considered enough for copyright. In other places, such as the USA, collections of facts are not copyrightable and a minimal amount of creativity is required for something to qualify as copyrightable. The phone book is an example that is often used, actually, to demonstrate the difference.

https://en.wikipedia.org/wiki/Sweat_of_the_brow

Hizonner · on Aug 2, 2024

> Which makes it a database or dataset and very much protected by copyright.

Not every collection of numbers is a database, and a database is not the same thing as a dataset.

Databases have limited copyright-like protection in some places. Under TRIPS, that extends only to databases that are "creative by virtue of the selection or arrangement of their contents" or something along those lines. In the US they talk specifically about curation.

ML models do not meet either requirement by any reasonable interpretation.

> The fact that many people (myself included) routinely download and use models distributed under OSI approved licenses (Apache V2, MIT, etc.) makes that statement verifiably wrong.

The "source code" of an ML model is most reasonably interpreted as including all of the training data, which are never, ever available.

Now you know better.

[On edit: By the way, the people creating these works had better hope they're outside copyright, because if not, each one of them is a derivative work of (at least some large and almost impossible to identify subset of) its training data, so they need licenses from all the copyright holders of that training material, which few of them have or can get.]

kube-system · on Aug 2, 2024

If we stop unnecessarily anthropomorphizing software, I think it is plainly obvious these are derivative works. You take the training material, run it through a piece of software, and it produces an output based on that input. Just because the black box in the middle is big and fancy doesn't mean that somehow the output isn't a result of the input.

However, transformativeness is a factor in whether or not there is a fair-use exception for the derivative work. And these models are highly transformative, so this is a strong argument that they qualify as fair use.

Hizonner · on Aug 2, 2024

Maybe, but...

"Fair use" is pretty much entirely a US concept, and similar concepts in other countries aren't isomorphic to it.

The model does have a radically different form from its inputs. So you could easily imagine that being "transformative enough" for US fair use. A lot of the other fair use elements look pretty easy to apply, too. Although there's still the question of whether all the intermediate copies you made to create the model were fair use...

In fact, I'll even concede that a court could find that a model wasn't a derivative work of its inputs to begin with, and not even have to get to the fair use question. The argument would be that the model doesn't actually reproduce any of the creative elements of any particular training input.

I do think a finding like that would be a much bigger stretch than a finding that the model was copyrightable. I could easily see a world where the model was found derivative but was not found copyrightable. And it's actually not clear to me at all that the model has to be copyrightable to infringe the copyright in something else, so that's another mess.

Somewhat related, even if the model itself isn't infringing, it's definitely possible to have most models create outputs that are very similar to (some specific examples in) their training data... in ways that obviously aren't transformative. Outputs that might compete with the original training data and otherwise fail to be fair use. So even if the model is in the clear, users might still have to watch out.