
This dataset is a massive failure when it comes to ethical research practices. LAION-5B openly indexed copyrighted data that it had no business collecting. They failed to go through an IRB when curating this data. The ethics review for this paper was a joke, where the ethics reviewer raises valid concerns and then discards their review because "if they don't publish it here, they'll publish it somewhere else anyways" [0].

LAION-5B has enabled some really cool technologies and a lot of promising startups. This work should have been carried out responsibly.

[0] https://openreview.net/forum?id=M3Y74vmsMcY



> LAION-5B openly indexed copyrighted data that it had no business collecting.

Seems like an open-and-shut fair use claim; web indexing (not even scraping, just indexing) is not uncommon...


Whether or not that data can be used ethically is an entirely separate conversation, one that IRBs have spent decades answering, and a process that LAION completely skirts on the basis of being a company.


I mean, sure, but that has nothing to do with the data being copyrighted or not.


> LAION-5B openly indexed copyrighted data that it had no business collecting.

This seems to be legal in many countries (from what I know, the UK, EU, Japan and Singapore) due to the TDM (Text and Data Mining) exception, especially for researchers.


It's the classic HN scraping butthurt. You can only do this if you're a billion (trillion?) dollar company and you do it behind closed doors.

All these concern trolls are bad actors. All of them.

Unless you literally steal someone's work and use it / sell it as your own, all the data mining is moral and should be legal if it isn't already.


> Unless you literally steal someone’s work and use it / sell it as your own

Artists creating work are not releasing it on the internet under ShareAlike or any other license that openly allows derivative works or further distribution. This is literally providing a means of stealing people’s work.

How is this any different from providing a listing of copyrighted movies and games and a means to download them, a la The Pirate Bay?


What specifically are you claiming required a review board?

A quick review of their site and the paper turns up nothing that would commonly merit such a review.

Related FAQs:

- https://laion.ai/faq/


LAION-5B includes images of humans without their explicit consent. Images of people generally trigger IRB/human-subjects review. Additionally, almost any IRB will tell you that research using data derived from humans must go through review.

LAION can say all they want that they’re not including images in their dataset, but they include a script that resolves those URLs into images on disk. As a company not bound by decades of university ethics regulations, they are seemingly allowed to skirt what you learn on your first day as a researcher in academia. It may be legal, but it sure is not ethical.
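To make the point concrete: turning a URL list into an image dataset on disk is only a few lines. This is a generic sketch of what such a script does, not LAION’s actual tooling, and the function names are mine:

```python
import hashlib
import os
import urllib.request

def safe_filename(url, ext=".jpg"):
    """Derive a stable on-disk filename from a URL (hashing avoids collisions)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ext

def download_urls(urls, out_dir="images"):
    """Resolve each indexed URL into an actual image file on disk."""
    os.makedirs(out_dir, exist_ok=True)
    for url in urls:
        path = os.path.join(out_dir, safe_filename(url))
        try:
            urllib.request.urlretrieve(url, path)  # fetch the remote image
        except OSError:
            continue  # dead links are simply skipped
```

Run that over five billion rows and the “we only distribute links” distinction is doing all the work.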


Please provide a link to another academic publication agreeing with your claim that it is unethical to link to online content without the subject’s explicit approval.


It's one thing to link to online content. They also provide a download script that turns the links into actual images.

This defense, that they merely provide links and not images, is the thin layer of abstraction their entire ethics case rests on. They give you everything needed to create massive datasets of human data, short of doing it for you.


That's a more specific claim than the one the OP made.


This is just trolling, typical of people who want to shoot down what others have done because they can't or haven't created anything themselves. How about offering a positive solution instead of the equivalent of finding reasons why we can't do anything? If everyone had this attitude, we'd all still be hiding in trees somewhere.


The positive solution is to scrape Wikimedia Commons for everything in "Category: PD-Art-old-100" and train from scratch on that data. Wikimedia Commons is well-moderated, the image data is public domain[0], and the labels can be filtered down to CC-BY or CC-BY-SA subsets[1]. Your resulting model will be CC-BY-SA licensed and its output completely copyright-free.
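Enumerating a category like that is straightforward with the standard MediaWiki `categorymembers` API. A sketch of building the query (the exact category title is the one above; pagination uses the `cmcontinue` token the API returns):

```python
import urllib.parse

API = "https://commons.wikimedia.org/w/api.php"

def category_query(category, cont=None):
    """Build a MediaWiki API URL listing the files in a Commons category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:" + category,
        "cmtype": "file",   # files only; skip subcategories and pages
        "cmlimit": "500",   # max page size for anonymous clients
        "format": "json",
    }
    if cont:
        params["cmcontinue"] = cont  # pagination token from the previous response
    return API + "?" + urllib.parse.urlencode(params)
```

Fetch each URL, follow the `continue.cmcontinue` token until it's absent, and a follow-up `prop=imageinfo&iiprop=url` query gives you the actual file URLs to download.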

For the record, that's what I've been trying to do[2]; my stumbling blocks have been training time and a bug where my resulting pipeline seems to do the opposite of what I ask[3]. I'm assuming it's because my wikitext parser was broken and CLIP didn't have enough text data to train on; I'll have the answer tomorrow when I have a fully-trained U-Net to play with.

If I can ever get this working, I want to also build a CLIP pipeline that can attribute generated images against the training set. This would make it possible to safely use CC-BY and CC-BY-SA datasets: after generating an image, the pipeline would surface the training images it most closely matches so their authors can be credited.
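The attribution step is just nearest-neighbor search over embeddings. A minimal sketch, assuming you've already embedded the training set with a CLIP-style model (the embedding model, storage, and any similarity threshold are all omitted; names are mine):

```python
import numpy as np

def top_attributions(gen_emb, train_embs, k=5):
    """Rank training images by cosine similarity to a generated image's
    embedding; the top matches are candidates for CC-BY attribution."""
    g = gen_emb / np.linalg.norm(gen_emb)
    t = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = t @ g                       # cosine similarity per training image
    idx = np.argsort(sims)[::-1][:k]   # indices of the k closest images
    return idx, sims[idx]
```

Map the returned indices back to Commons file pages and you have an author list to credit.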

[0] At least in the US. Other jurisdictions think that scanning an image recopyrights it, see https://en.wikipedia.org/wiki/National_Portrait_Gallery_and_...

[1] Watch out for anything tagged with https://commons.wikimedia.org/wiki/Template:Royal_Museums_Gr... as that will taint your model.

[2] https://github.com/kmeisthax/PD-Diffusion

[3] https://pooper.fantranslation.org/@kmeisthax/109486435508334...


You can check my Google Scholar [0]. I have created many high-impact datasets that were 1) formative in their respective areas and 2) have seen downstream usage in disasters and wars around the world. Not once in creating those datasets did we take the “easy” route by compromising on the ethics of data collection.

The positive solution here was to not collect data if there was a reasonable ethical concern. This classic mindset of “anything goes as long as we create value” is highly toxic.

[0] https://scholar.google.com/citations?user=4Cdwp_MAAAAJ



