This dataset is a massive failure when it comes to ethical research practices. LAION-5B openly indexed copyrighted data that it had no business collecting. They failed to go through an IRB when curating this data. The ethics review for this paper was a joke, where the ethics reviewer raises valid concerns and then discards their review because "if they don't publish it here, they'll publish it somewhere else anyways" [0].
LAION-5B has enabled some really cool technologies and a lot of promising startups. This work should have been carried out responsibly.
Whether or not that data can be used ethically is an entirely separate conversation, one that IRBs have spent decades answering, and a process that LAION completely skirts on the basis of being a company.
> LAION-5B openly indexed copyrighted data that it had no business collecting.
This seems to be legal in many countries (from what I know, the UK, EU, Japan and Singapore) due to the TDM (Text and Data Mining) exception, especially for researchers.
> Unless you literally steal someone’s work and use it / sell it as your own
Artists creating work are not releasing it on the internet with ShareAlike licenses or any other license which openly allows derivative work or further distribution without a license. This is literally providing a means of stealing people’s work.
How is this any different from providing a listing of copyrighted movies and games and a means to download them, a la The Pirate Bay?
LAION-5B includes images of humans without their explicit consent. Images of people generally involve IRB/HSR. Additionally, almost any IRB will mention that if you’re using data derived from humans, you must go through IRB.
LAION can say all they want that they’re not including images in their dataset. They include a script to download those URLs into images on disk. By being a company that’s not bound to decades of university ethics regulations, they are seemingly allowed to skirt what you learn on your first day as a researcher in academia. It may be legal, but it sure is not ethical.
Please provide a link to another academic publication agreeing with your claim that linking to online content is unethical without the subject’s explicit approval.
It's one thing to link to online content. They also provide a download script to then turn the links into realizable images.
This defense, that they merely provide links and not images, is the thin layer of abstraction that their entire ethics case is built on top of. They give you everything needed to create massive datasets of human data without doing it for you.
This is just trolling and typical of people who just want to shoot down what others have done because they can't or haven't created anything themselves. How about showing a positive solution instead of the equivalent of finding reasons why we can't do anything? If everyone had this attitude we'd still all be hiding in trees somewhere.
The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and the output completely copyright-free.
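A rough sketch of what the listing step could look like, assuming the MediaWiki `categorymembers` query on the Commons API. The category title is copied from the comment as written and may not match the exact title on Commons; `build_params`, `http_fetch`, and `iter_category_files` are names made up for illustration. This only enumerates file titles; it does not download images or check the license of each file or label:

```python
import json
import urllib.parse
import urllib.request

# Public MediaWiki API endpoint for Wikimedia Commons.
API_URL = "https://commons.wikimedia.org/w/api.php"


def build_params(category, cmcontinue=None, limit=500):
    """Build query parameters for one page of category members (files only)."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,   # e.g. "Category:..." (exact title is an assumption)
        "cmtype": "file",      # skip subcategories and ordinary pages
        "cmlimit": str(limit),
        "format": "json",
    }
    if cmcontinue:
        # Continuation token returned by the previous page of results.
        params["cmcontinue"] = cmcontinue
    return params


def http_fetch(params):
    """Default fetcher: one GET request against the Commons API."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # The User-Agent string here is a placeholder; use one that identifies you.
    request = urllib.request.Request(
        url, headers={"User-Agent": "pd-art-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_category_files(category, fetch=http_fetch):
    """Yield file titles in a category, following cmcontinue until exhausted."""
    cmcontinue = None
    while True:
        data = fetch(build_params(category, cmcontinue))
        for member in data.get("query", {}).get("categorymembers", []):
            yield member["title"]
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            break
```

Passing `fetch` in as a parameter keeps the pagination logic separate from the network call, so it can be exercised with canned responses before pointing it at the live API.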
For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.
If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating
You can check my Google Scholar [0]. I have created many high-impact datasets that 1) were formative in their respective areas and 2) have seen downstream usage in disasters and wars around the world. Not once in creating those datasets did we take the “easy” route by compromising on the ethics of data collection.
The positive solution here was to not collect data if there was a reasonable ethical concern. This classic mindset of “anything goes as long as we create value” is highly toxic.
[0] https://openreview.net/forum?id=M3Y74vmsMcY