
Do you believe that industry uses pre-trained models that researchers release?

Do you believe that industry uses pre-made datasets that researchers promote in their work?

Would a yes to the above two questions be sufficient to show that "use of biased datasets in research" is correlated with "biased outcomes in real-world production deployments"?



"What I believe" is completely irrelevant, because it has zero grounding in research or evidence. She's claiming that we should listen to her because she's an expert, but then provides no evidence or explanation between the two for a causal link.

I'm not saying she's wrong, I'm saying we don't know because she defaulted to the argument that "you should listen to minorities", not "here is the evidence".

What's more, every single example of injustice in her tutorial was an image recognition/classification problem - entirely different from the generative model that originally sparked the debate.


I don't think there's an argument that needs to be made; it's pretty clear to me that people use researcher-produced models in production all the time, without regard for whether they're biased, because it's the easiest solution. If you think there needs to be an argument made to justify that, that's fine, but I don't think it's valuable to assume that people come into a discussion without a basic understanding of the software engineering ecosystem.
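To make that concrete, here's a hedged sketch (generic PyTorch/torchvision, not any specific production system) of roughly how little it takes to drop a researcher-released pretrained model into an application, with whatever biases its training data carried coming along for free:

    # Minimal sketch: pulling a researcher-released pretrained model off the
    # shelf. Exact torchvision API varies by version; logic is illustrative.
    import torch
    from torchvision import models, transforms

    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.eval()

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    # prediction = model(preprocess(pil_image).unsqueeze(0)).argmax(dim=1)

Nothing in that path forces you to ask what the training data looked like.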

And the point being made isn't "biased input data isn't responsible for a biased model", it's "you need to look one step further and ask why the input data is biased and how that impacts the world."


I want to point out that while I disagree with you, you've already engaged in far more actual debate than Gebru did on Twitter.

Like I said, LeCun (and myself) largely agree with most of Gebru's points. But when LeCun went to great lengths to defend/explain his position, Gebru then literally responded with "I don't have time for this". Even before that, she didn't even bother to link the presentation she referred to (which again, didn't even directly address any of the points that Yann was making!).

It's this complete lack of good-faith engagement that prompted LeCun to quit Twitter, not the underlying discussion on ethics itself. LeCun clearly feels that Twitter is not the place for reasonable discussion, and after this episode, I'm inclined to agree.

> If you think there needs to be an argument made to justify that, that's fine, but I don't think it's valuable to assume that people come into a discussion without a basic understanding of the software engineering ecosystem.

I'm not saying that people don't use off-the-shelf models. I'm saying that I don't know if forcing research datasets to be "unbiased" will make any difference to real-world injustice. I don't even know if any of the examples of bias she gave in her tutorial (HireQ/Microsoft/etc) could be ascribed to the use of pretrained models. She could be right. I don't know. You probably don't either.

Going beyond the empirical question, she'd also need to explicitly argue why responsibility should lie at the feet of the researcher, not the engineer. Gebru did neither, which is why I say it's a huge leap of logic.

That, fundamentally, is LeCun's position. He completely agrees that warning labels should be put on these kinds of models that say "Model has been trained on biased data and is unsuitable for use in real-world applications where racial fairness is expected". In fact, this is exactly what the authors did.

> And the point being made isn't "biased input data isn't responsible for a biased model", it's "you need to look one step further and ask why is the input data biased and how that impacts the world."

And I'd argue you need to account for the context in which your model is deployed. If I'm using StyleGAN to synthesize facial textures for a video game, biased datasets and models are desirable, not something to be eliminated. I'll use the appropriately biased model depending on whether I want to generate white faces, Chinese faces, or black faces.

It's the use case that dictates the risk, which is why LeCun (and I) believe it's the engineer's responsibility, not the researcher's.


> In fact, this is exactly what the authors did.

In fact, they did so after Timnit raised her objection, and by drawing on Timnit's own research (the model card they added is a direct result of research Timnit was involved with: https://arxiv.org/abs/1810.03993).
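For anyone unfamiliar with that paper: a model card is essentially a structured disclosure document attached to a released model. A rough sketch, with section names taken from the paper and all values invented placeholders (this is not the actual card the authors published):

    # Sketch of a model card as structured metadata; section names follow
    # Mitchell et al. (arXiv:1810.03993), values are illustrative placeholders.
    model_card = {
        "model_details": "architecture, version, authors, license",
        "intended_use": "research demo; not for deployment on real-world faces",
        "factors": "performance may vary across skin tone, age, lighting",
        "metrics": "what was measured and how",
        "evaluation_data": "datasets used for evaluation",
        "training_data": "datasets used for training, known demographic skew",
        "quantitative_analyses": "results disaggregated by the factors above",
        "ethical_considerations": "known risks and potential harms",
        "caveats_and_recommendations": "limitations and suggested usage",
    }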


Which is precisely why LeCun - like myself - agrees with 95% of what Gebru has said in the past.

The issue at hand was her lack of good-faith engagement on Twitter and the subsequent pile-on from the mob. LeCun is quitting Twitter, he's not quitting ethical debates.


> Which is precisely why LeCun - like myself - agrees with 95% of what Gebru has said in the past.

Even now, it's not clear to me that this is the case. LeCun still hasn't actually acknowledged any of the broader ethical arguments Gebru made, either on Twitter or in his follow-up posts on Facebook.

In fact, he makes no reference to her research anywhere (beyond the vaguest "I value the research you're doing" in his apology tweet). I found that rather suspicious, and I still do.

Like, having read through the entire conversation, I have no confidence that Yann could explain any of Timnit's research if asked about it, even in broad strokes. That's really, really weird given everything that happened.


> Even now, it's not clear to me that this is the case. LeCun still hasn't actually acknowledged any of the broader ethical arguments Gebru made, either on Twitter or in his follow-up posts on Facebook.

Well, to be fair, she didn't provide Yann with any ethical arguments in this instance.

But being equally fair, you're right, I can't speak for Yann. I can only speak for myself, and I personally agree with most of what I've read of Gebru's work (though not all).

But the issue at hand wasn't the research itself - it's the way the dialogue was conducted on Twitter, and Gebru accounted for herself very poorly.


> Well, to be fair, she didn't provide Yann with any ethical arguments in this instance.

Yes and no. A lot of what I'm saying comes directly from the tutorial she repeatedly suggested he watch. I took the time to watch it, because that's the reasonable thing to do when someone suggests you aren't fully informed on a subject and points you to a resource to improve your understanding.


Regardless of the answers, wouldn't the problem then be laid at the feet of "industry", and not the researchers?

Should cryptography researchers backdoor their own papers because terrorists or pedophiles might use them?


A major aspect of crypto research today is crypto UX and making crypto systems that are difficult to misuse. There are academics who actively work on these issues. They aren't the only academics obviously, but they exist.

Building ML systems that are difficult to misuse is underexplored, and Timnit is one of the relatively few researchers actively doing work in this area.


>A major aspect of crypto research today is crypto UX and making crypto systems that are difficult to misuse.

I'm intrigued by this. Any names (projects/people/protocols) come to mind?


Tink (https://github.com/google/tink) and Age (https://github.com/FiloSottile/age) are the obvious examples, although I think to some extent even things like the Signal Protocol apply.

I'd call them both examples of applied cryptography research. I think these projects compare very, very closely to applied ML research:

They come out of industry research labs and are worked on by respected experts, usually including some academics, and ultimately you end up with an artifact beyond just a paper, one that is useful for something and improves upon the status quo.

I'm admittedly not a total expert, so I don't know how far down this kind of work goes toward the level of crypto "primitives", but I believe there is some effort to pick primitives that are difficult to "mess up" (think "bad primes"), and I know Tink actively prevents you from making bad choices in the cases where you're forced to make one.
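As a hedged illustration of what that looks like in practice (exact API may differ by Tink version): encrypting with Tink's Python bindings means picking a vetted key template and getting an AEAD primitive, with no nonce, mode, or padding decisions exposed for you to get wrong:

    import tink
    from tink import aead

    aead.register()

    # You choose from vetted key templates; Tink doesn't expose nonces,
    # block modes, or padding schemes for the caller to mishandle.
    keyset_handle = tink.new_keyset_handle(aead.aead_key_templates.AES256_GCM)
    primitive = keyset_handle.primitive(aead.Aead)

    ciphertext = primitive.encrypt(b"hello", b"associated data")
    plaintext = primitive.decrypt(ciphertext, b"associated data")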

Even more broadly, just consider any tptacek (who I should clarify is *not* a researcher, lest he correct me) post on PGP/GPG email, or people like Matt Green (http://mattsmith.de/pdfs/DevelopersAreNotTheEnemy.pdf).

Edit: Some poking around also brought up this person: https://yaseminacar.de/, who has some interesting papers on similar subjects.


> misuse

That doesn't mean what you think it means.


Misuse has two meanings: to use something for a bad purpose (criminals using it to do bad things) or to use it incorrectly ("hold it wrong"). I was using misuse in the "hold it wrong" sense, but I agree that there's ambiguity there.


Thanks so much for all of this info, looks like a few really cool projects!!!



