TFA picks out 4-5 realistically unavoidable features of AI training - like that a certain quality threshold value was chosen arbitrarily, or that there's more training data for English than for other languages - and then they hand-wavingly suggest that each feature could have huge implications about... something. And then they move on to the next topic, without making any argument for why the thing they just discussed is important, or what implications it might actually have.
Exactly. The whole thing reads like propaganda. It puts forward interesting topics only to move on and push an agenda that sounds super political to me.
Yes, some languages are underrepresented and there are some thresholds. But exactly as you say, it is well known that moving the threshold slightly up or down will probably not materially affect the model.
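To make that concrete, here's a toy sketch in Python (hypothetical random scores, not the actual LAION filtering pipeline) of why a small nudge to a quality cutoff barely changes what's kept:

```python
import random

# Toy model of a dataset quality filter (hypothetical scores, not the real
# LAION-5B pipeline): each sample gets a quality score in [0, 1] and we keep
# the samples whose score clears an arbitrary cutoff.
random.seed(0)
scores = [random.random() for _ in range(100_000)]

def kept(threshold):
    return sum(score >= threshold for score in scores)

for cutoff in (0.28, 0.30, 0.32):
    print(f"cutoff={cutoff}: kept {kept(cutoff)} of {len(scores)} samples")

# Nudging the cutoff by +/-0.02 only reshuffles the samples sitting near the
# boundary; the bulk of the retained set is identical across all three cutoffs.
```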
> And then they move on to the next topic, without making any argument why the thing they just discussed is important, or what implications it might actually have.
There are so many articles on the Internet that dive into the implications of datasets that amplify certain voices over others. I personally liked this analysis because it actually shows you what such a lopsided dataset looks like on the inside. And it looks like this breakdown might actually help the creators of LAION-5B fix some of the issues. I saw this as a pretty constructive and useful article.
But to your point about the implications of datasets like this one, allow me to share my personal experiences with generative AI.
(I want to make it clear that I'm not personally offended by any of the things I'm going to say below in this comment. I see all of the following as well-understood issues that can be solved if researchers/AI vendors have the will.)
I speak three languages, but I can only reliably interact with generative AI models using one of them. I'm forced to effectively erase two thirds of my culture and upbringing when I talk to ChatGPT. That severely limits what I can do with LLMs and leaves a bad taste in my mouth.
Generative AI is marketed as this giant forward leap for humanity, on par with smartphones and the World Wide Web. But that marketing rings hollow when most people in my country can't use these tools merely because they don't speak English.
Unlike smartphones and the Web, getting the models to speak non-English languages is not such an easy task. Want to translate every single app on the iPhone into Hindi? Just hire an experienced team of translators (sidenote: the Hindi localization on iOS is excellent). But how do you create the terabytes of Hindi language content required to make ChatGPT understand Hindi as well as it understands English? I don't know what the solution is, but it's definitely a problem that needs to be solved if we want AI to be useful to as many people as possible.
Beyond LLMs, other generative models also reflect their training data in a way that favors certain cultures over others. E.g. when I ask ChatGPT to "generate an image of a girl sitting on a bench, eating an apple", it generates an image of a person who is clearly white, dressed in blue jeans and sneakers. Why does it make that assumption when, in 2024, there are more brown people on the planet than white people?
If I refine that prompt to "generate an image of an indian girl sitting on a bench, eating an apple", it makes the person look more Indian but keeps the jeans and sneakers. Once again, why does the model assume that all humans wear jeans and sneakers by default? Why do I have to prompt it to generate other kinds of clothing?
AI is unlike any other technology we've built so far. We talk to it as if we're addressing a human being, and it talks back to us. This makes it difficult to treat it as merely a tool, even though all of us on this website know that's what it is. And when this anthropomorphized genie in a box refuses to speak my language or treats my culture as a second-class citizen, it stings. It's this sort of stuff that can make people feel like there's something wrong with their culture, their ethnicity, their language.
I'm sure we'll iron out all these issues eventually. But we have to keep making noise and keep demanding better from AI researchers and companies. And articles like this one really contribute a lot towards that goal.
Thank you for this comment, you've done a far better job criticising cultural bias in LLMs than most.
> But how do you create the terabytes of Hindi language content required to make ChatGPT understand Hindi as well as it understands English?
It's wrong to assume that every language needs as much training data as English to match English-level performance. Knowledge that isn't language- or culture-specific can transfer across languages, in theory and (to a degree that used to be considered amazing) to a large extent in practice, once the model understands the target language well. Of course a very large proportion, maybe the majority, of what the largest LLMs learn is cultural, but the actual language learning is only a small fraction: a transformer with only tens of millions of parameters can translate between two languages.
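You can check that last claim directly with one of the small Marian translation checkpoints on Hugging Face. A minimal sketch, assuming the transformers library is installed (Helsinki-NLP/opus-mt-en-hi is an encoder-decoder on the order of tens of millions of parameters):

```python
# Minimal sketch using the Hugging Face transformers pipeline API; the
# Helsinki-NLP/opus-mt-en-hi checkpoint is a small Marian encoder-decoder,
# yet it produces usable English -> Hindi translations.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
result = translator("A girl is sitting on a bench, eating an apple.")
print(result[0]["translation_text"])
```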
Are you saying ChatGPT (is that GPT-3.5?) is much worse when given the same prompt in Hindi instead of English? I'm disappointed to hear that; I had thought OpenAI had done better at multilingual performance than many others.