If the model didn't learn anything important from Picasso, it wouldn't be in the training data.
This whole argument of "ah but it doesnt really need it" doesn't hold up. If the model didn't need it, it wouldn't have used it in the first place.
Same thing in Artstation. It was of course propitious for AI scientists to find such a lovely database of high quality imagery, and all so helpfully tagged into categories.
> If the model didn't learn anything important from Picasso, it wouldn't be in the training data.
> This whole argument of "ah but it doesnt really need it" doesn't hold up. If the model didn't need it, it wouldn't have used it in the first place.
I haven't seen anyone making this argument. There's a pretty clear difference between learning something from an image and memorizing it.
There also isn't anything illegal about memorizing an image and painting a reproduction. What you aren't allowed to do is sell or distribute that reproduction without a license.
I think it makes more sense to restrict what people are allowed to do with ML tools than to restrict what ML tools can do.
Of course it learned, that's the point of training.
You claimed the model can reproduce an image from that training data. That's false, and what the judge dismissed.
“none of the Stable Diffusion output images provided in response to a particular Text Prompt is likely to be a close match for any specific image in the training data.”

“I am not convinced that copyright claims based on a derivative theory can survive absent ‘substantial similarity’ type allegations,” the ruling stated.
Whether using copyrighted data to train a model is fair use or not is a different discussion.
As I've read it, the first lead to the material was in '99.
In 2018 they got funding to research it further. 2020 saw a first attempt at publication in Nature that was retracted; further improvements were made until 2022/23, when two patents were filed. Then, suddenly, 10 days ago, Kwon, one of the co-researchers, jumped the gun and published a paper with the details: on one hand fearing someone else might leak it and publish first, as it was too simple to replicate, and on the other hand excluding everyone else from the paper and listing only himself and Lee/Kim (LK) as authors, since a Nobel Prize can only be shared by three people. 2.5 hours later LK published again, listing five other authors but not him.
Ultimately I'm glad that the research has seen the light, since I'm of the personal persuasion that there is no single scrap of science that should ever be done in the dark. All research, in a perfect world, would be entirely public and freely and easily accessible.
With that being said, leaking a paper and selfishly putting your name on it while excluding others so that you win a Nobel Prize doesn't exactly seem "heroic". Certainly beneficial for mankind, but it seems like a self-serving action.
The end justifies the means. For all we know, Lee and Kim would have sat on this for another year or more. I think that's very understandable, and I can't fault them for wanting to be certain, given all the nasty things people have been saying about them, but the leak has clearly served humanity better than keeping it under wraps.
> but the leak has clearly served humanity better than keeping it under wraps.
Yeah, advancing human knowledge serves humanity, but unfortunately not really those who are advancing it. Those with money will just use your invention and make more money, while you get a pat on the back. I wish it were more balanced; we would have had a lot of inventions sooner and implemented faster.
If it increases entropy as much as many suspect and it only took 1/3 of a couple humans' lives to open that phase space, the Universe has done what it wanted - to hasten heat-death.
As Hawking once explained, “Since events before the Big Bang have no observational consequences, one may as well cut them out of the theory, and say that time began at the Big Bang.” When cosmologists talk about the universe and its age, it seems to me, as a non-cosmologist, that they’re using terms of art related to their models.
Hawking’s explanation deduces that if the observable universe expanded from a singularity, we would be unable to meaningfully theorize what happened before then, since it would be beyond any form of observation to test the theory. Therefore, a scientific model rooted in observation can describe nothing earlier than the Big Bang.
However, not everything unseen is untrue. If a singularity were to form somewhere in Andromeda tomorrow — in all likelihood, one will — we will still have existed today.
Edit: The initial comment was meant as a lighthearted reply to the universe personification, but I ended up sensing a need to explain the reasoning.
It's not "personification", it's the universe tending toward entropy increasing overall. I don't think I've heard anyone claim that heat death "should have happened" as an argument against it, or what it's supposed to mean in reference to the original post.
There is already a singularity in Andromeda, so I don't know why one forming matters.
First, Reddit's monetization is broken by design. It never made any sense to me why they would charge for reddit gold for an ad-free experience on their website and own mobile app but not on the API. Why would they let third party apps serve their own ads and let them charge to remove them? This would be simple to fix, both technically and in the API's ToS, just serve the same ads regardless of the client. People would be upset, but ultimately I feel it would be entirely fair. But no, it doesn't seem to be a solution considered.
Second, the LLM dataset issue is also cited as a reason for the price hike. Again, I think it's fair, if unpopular, to charge a premium for bulk data, and again, there are technical and ToS solutions for this. They could introduce exponential tiers for bulk data, restrictions on allowed usage, or other measures that keep user-facing usage reasonable but make bulk processing expensive. But measuring API usage per client ID rather than per user goes against this point, just making the API extremely expensive for everyone to the point of being unusable.
Third, all points seem to lead to the conclusion that what they really want is to kill third party apps and hope a large part of those users move to their own app. For what? More tracking, tighter grip, better engagement metrics? Not sure. Even the changes to the already extremely hostile mobile site now force some users to download the app. Really, I'd figure they'd understand their userbase better than that: a small fraction of content producers and an even smaller fraction of power users and moderators carry the site, and pissing them off is a really bad idea. But what do I know.
Maybe not the right term... Just that a lot of other libs act like guardrails, i.e. let the model generate what it does (in full form text / GPT output), and try to then parse out what you want, error if output doesn't conform to standard format. As opposed to basically only allowing the model to generate into the already-built JSON form fields. Understandable why this guardrails/parsing approach is so popular though... can't do what this library is doing with OpenAI API. Need to be able to manipulate the token generation; otherwise you're forced to take full text output and try to parse it.
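To make the distinction concrete, here's a minimal Python sketch of the idea (all names hypothetical; a real implementation would mask logits inside the model's sampling loop rather than use a mock sampler). The fixed parts of the JSON skeleton are forced, and the "model" is only ever sampled inside the value slots, so the output is valid JSON by construction instead of something you parse out of free-form text afterwards:

```python
import re

TEMPLATE = '{"name": "<slot>", "age": <slot>}'

def mock_sample(prefix, allowed):
    """Stand-in for a real LM sampling step: picks the first allowed token.
    A real implementation would zero out the logits of disallowed tokens
    and sample from what remains."""
    return allowed[0]

def constrained_generate(template, slot_vocab):
    """Fill each <slot> from a restricted vocabulary. Fixed template parts
    are emitted as-is (forced tokens, never sampled), so the result always
    conforms to the target format."""
    out = []
    slots = iter(slot_vocab)
    for part in re.split(r'(<slot>)', template):
        if part == '<slot>':
            allowed = next(slots)  # tokens the grammar permits at this point
            out.append(mock_sample(''.join(out), allowed))
        else:
            out.append(part)       # forced structural tokens
    return ''.join(out)

result = constrained_generate(TEMPLATE, [["Ada"], ["36"]])
print(result)  # {"name": "Ada", "age": 36}
```

The contrast with the guardrails approach is that there's no parse step that can fail: invalid outputs are unrepresentable, because disallowed continuations are never sampled in the first place.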
It has learned to make pixels a particular color to mimic that style, but that's it.