
This dataset is a massive failure when it comes to ethical research practices. LAION-5B openly indexed copyrighted data that it had no business collecting. They failed to go through an IRB when curating this data. The ethics review for this paper was a joke, where the ethics reviewer raises valid concerns and then discards their review because "if they don't publish it here, they'll publish it somewhere else anyways" [0].

LAION-5B has enabled some really cool technologies and a lot of promising startups. This work should have been carried out responsibly.

[0] https://openreview.net/forum?id=M3Y74vmsMcY



> LAION-5B openly indexed copyrighted data that it had no business collecting.

Seems like an open-and-shut fair use claim; web indexing (not even scraping, just indexing) is not uncommon...


Whether or not that data can be used ethically is an entirely separate conversation, one that IRBs have spent decades answering, and a process that LAION completely skirts on the basis of being a company.


I mean, sure, but that has nothing to do with the data being copyrighted or not.


> LAION-5B openly indexed copyrighted data that it had no business collecting.

This seems to be legal in many countries (from what I know, the UK, EU, Japan and Singapore) due to the TDM (Text and Data Mining) exception, especially for researchers.


It's the classic HN scraping butthurt. You can only do this if you're a billion (trillion?) dollar company and you do it behind closed doors.

All these concern trolls are bad actors. All of them.

Unless you literally steal someone's work and use it / sell it as your own, all the data mining is moral and should be legal if it isn't already.


> Unless you literally steal someone’s work and use it / sell it as your own

Artists creating work are not releasing it on the internet under ShareAlike or any other license that openly allows derivative works or further distribution. This is literally providing a means of stealing people’s work.

How is this any different from providing a listing of copyrighted movies and games and a means to download them, a la The Pirate Bay?


What specifically are you claiming required a review board?

A quick review of their site and the paper turns up nothing that would commonly merit such a review.

Related FAQs:

- https://laion.ai/faq/


LAION-5B includes images of humans without their explicit consent. Images of people generally trigger IRB/human-subjects review. Additionally, almost any IRB will tell you that research using data derived from humans must go through review.

LAION can say all they want that they’re not including images in their dataset, but they include a script that resolves those URLs into images on disk. As a company not bound by decades of university ethics regulations, they are seemingly allowed to skirt what you learn on your first day as a researcher in academia. It may be legal, but it sure is not ethical.
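To make the point concrete: turning a URL list into an image dataset on disk is only a few lines. This is a generic sketch of what such a script does, not LAION’s actual tooling, and the function names are mine:

```python
import hashlib
import os
import urllib.request

def safe_filename(url, ext=".jpg"):
    """Derive a stable on-disk filename from a URL (hashing avoids collisions)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ext

def download_urls(urls, out_dir="images"):
    """Resolve each indexed URL into an actual image file on disk."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        path = os.path.join(out_dir, safe_filename(url))
        try:
            urllib.request.urlretrieve(url, path)  # fetch the remote image
        except OSError:
            continue  # dead links are simply skipped
```

Run that over five billion rows and the “we only distribute links” distinction is doing all the work.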


Please provide a link to another academic publication agreeing with your claim that it is unethical to link to online content without the subject’s explicit approval.


It's one thing to link to online content. They also provide a download script that turns the links into actual images.

This defense, that they merely provide links and not images, is the thin layer of abstraction their entire ethics case rests on. They give you everything needed to create massive datasets of human data, short of doing it for you.


That's a more specific claim than the one the OP made.


This is just trolling, typical of people who want to shoot down what others have done because they can't or haven't created anything themselves. How about offering a positive solution instead of the equivalent of finding reasons why we can't do anything? If everyone had this attitude, we'd all still be hiding in trees somewhere.


The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and its output completely copyright-free.
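Enumerating a category like that is straightforward with the standard MediaWiki `categorymembers` API. A sketch of building the query (the exact category title is the one above; pagination uses the `cmcontinue` token the API returns):

```python
import urllib.parse

API = "https://commons.wikimedia.org/w/api.php"

def category_query(category, cont=None):
    """Build a MediaWiki API URL listing the files in a Commons category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmtype": "file",   # files only; skip subcategories and pages
        "cmlimit": "500",   # max page size for anonymous clients
        "format": "json",
    }
    if cont:
        params["cmcontinue"] = cont  # pagination token from the previous response
    return API + "?" + urllib.parse.urlencode(params)
```

Fetch each URL, follow the `continue.cmcontinue` token until it's absent, and a follow-up `prop=imageinfo&iiprop=url` query gives you the actual file URLs to download.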

For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.

If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating an image, the pipeline would surface the training images it most closely matches so their authors can be credited.
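The attribution step is just nearest-neighbor search over embeddings. A minimal sketch, assuming you've already embedded the training set with a CLIP-style model (the embedding model, storage, and any similarity threshold are all omitted; names are mine):

```python
import numpy as np

def top_attributions(gen_emb, train_embs, k=5):
    """Rank training images by cosine similarity to a generated image's
    embedding; the top matches are candidates for CC-BY attribution."""
    g = gen_emb / np.linalg.norm(gen_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ g                       # cosine similarity per training image
    idx = np.argsort(sims)[::-1][:k]   # indices of the k closest images
    return idx, sims[idx]
```

Map the returned indices back to Commons file pages and you have an author list to credit.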

[0] At least in the US. Other jurisdictions think that scanning an image recopyrights it, see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...

[1] Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.

[2] https://github.com/kmeisthax/PD-Diffusion

[3] https://pooper.fantranslation.org/@kmeisthax/109486435508334...


You can check my Google Scholar [0]. I have created many high-impact datasets that were 1) formative in their respective areas and 2) have seen downstream usage in disasters and wars around the world. Not once in creating those datasets did we take the “easy” route by compromising on the ethics of data collection.

The positive solution here was to not collect data if there was a reasonable ethical concern. This classic mindset of “anything goes as long as we create value” is highly toxic.

[0] https://scholar.google.com/citations?user=4Cdwp_MAAAAJ



