Models all the way down (knowingmachines.org)
114 points by jdkee on March 30, 2024 | 34 comments


Very interesting read. The format of the article is weird at first, but you get used to it.

These days we rely more and more on bigger and bigger datasets, and the sheer size / apparent quality of machine-curated datasets makes them an attractive alternative to smaller, human-curated ones.

This article is eye-opening about the shortcomings of such approaches.


Would it be ethical to train a model to identify child porn so it could be excluded from training sets, or even the internet, automatically? It seems like an ideal application for CV AI, but you might have to train it on CP to make it effective, so…


I can see some practical difficulties with this. The first one is that CSAM is radioactive and can’t be touched, so only a very limited circle of people could even approach this task. Related to that, the model weights would contain representations of the training data, which, if distributed, could then be used to get some of the source material back out.


Hashes of existing child porn are already created and disseminated for automated detection; a CP detector model isn't too far from that.


Hashes just detect old, unmodified content, not new stuff or AI-generated stuff.


Wasn’t the hashing mechanism that Apple was proposing using capable of dealing with some attempts to obfuscate the CSAM image with edits?


It's also super easy to change the video slightly to alter the hash.


Obviously, but the idea of creating a useful hash function in this case is to make it invariant under the sorts of transformations that are trivial to a human, whilst avoiding collisions between genuinely different content. Microsoft created something called PhotoDNA that tries to do this.
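PhotoDNA itself is proprietary, so here's just the flavour of the idea: a minimal perceptual-hash sketch (dHash, using Pillow), not PhotoDNA's actual algorithm. Small edits that a human wouldn't notice mostly leave the hash intact, while genuinely different images land far apart:

    from PIL import Image

    def dhash(image_path, hash_size=8):
        # Shrink to (hash_size+1) x hash_size grayscale, then record whether
        # each pixel is brighter than its right-hand neighbour. Re-encoding,
        # mild resizing, or small colour tweaks rarely flip these comparisons.
        img = Image.open(image_path).convert("L").resize((hash_size + 1, hash_size))
        pixels = list(img.getdata())
        bits = 0
        for row in range(hash_size):
            for col in range(hash_size):
                left = pixels[row * (hash_size + 1) + col]
                right = pixels[row * (hash_size + 1) + col + 1]
                bits = (bits << 1) | (left > right)
        return bits

    def hamming(a, b):
        # Bits that differ; a small distance means "probably the same image",
        # unlike a cryptographic hash, where any edit changes everything.
        return bin(a ^ b).count("1")

Matching is then a nearest-neighbour lookup under Hamming distance rather than an exact comparison, which is what buys robustness to trivial edits.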


[flagged]


>The major problem with CP is that the most cost-effective way of producing it is abusing children.

I'm not sure that's completely true. Despite the surprisingly widespread notion that generative AI can only strictly reproduce things it's been trained on, it can still combine what it has learned to create novel things. For example, image generation models are trained on a wide variety of adult human figures. Children are just... kind of a subset of what makes an adult, so via some clever prompting it should be possible to steer a model to generate CP. It should then be possible to generate synthetic data to better tag, fine-tune, train a new model, or whatever.

The problem with this is a person needs to steer that ship -- who's going to do it? Even if it has the potential to prevent abuse, it's something that's so universally unacceptable that people generally don't want to touch the topic. There are law enforcement officers whose job it is to manually search for and assess CP on the web, but given the underlying philosophy of law enforcement, I'm sure most of them would not approve of the idea of generating CP, even if it's synthetic. So... the ideal pick would be an actual (hopefully non-offending) pedophile, but who wants to be known as an employer of pedophiles?

There are supposedly communities of pedophiles who wish to stay non-offending, and I imagine it'll be an "outsider" group that tries this, if at all.


Furthering the sexual fantasies of pedophiles does not actually help children; it obviously creates predators as more people are exposed.


> ... but obviously creates predators as more people are exposed.

If we look at more traditional porn, the uptick in availability seems to have correlated with less sex. And the advent of violent video games is associated with no uptick in violence, as far as I have been told, although it is getting harder for armies to recruit.

I reckon the default position here is that if you give people a dopamine hit when they are staring at a screen, they stare at a screen more. Not that they start harming others. The idea that we don't want marginal cases to awaken a sexual love for children makes sense to me, but I don't think more "predators" will be created; we'd probably just see a collapse in the economic incentives currently in place to harm children.


>the economic incentives currently in place to harm children

I really do not believe in that. There isn't a single person in the world who chooses to be a child pornographer because the economics of it are good. They do it because it allows them to carry out their sexual fantasies, which are about abusing children. The abuse has to be the point; if it weren't, just abstaining from it would be the only choice.

It is extremely easy to comprehend that your sexual urges (which can be satisfied in a myriad of ways) are monumentally less important than a child's wellbeing. Choosing your urges means that you are doing it because of the abuse, which of course any virtual recreation cannot provide.


>obviously creates predators as more people are exposed. (from your earlier comment)

>There isn't a single person in the world

Is there any good research to back these up? Societies have had similar lines of reasoning involving prohibition and banning in the past that seemed "obviously" intuitively correct, some of which have been mentioned already by others.

As mentioned by someone else, CP is "radioactive," and as such would probably fetch an appropriately high price on a black market. It's an extensively studied and observable behavior that people can, and will, partake in unethical or immoral activities if the economic incentive is high enough. It's not an either-or thing; it can be about the abuse, but it can also be about money, or even just a lack of empathy. The "lesser" crime of CP distribution doesn't necessarily have to be about the abuse either; people can find themselves in crime rings out of desperation or coercion.

If realistic AI-gen CP can greatly devalue the real stuff, the risk would become much less worth it, except for the types of people you specifically mention. As I mentioned in another comment, it's possible to train and generate victimless synthetic CP -- AI learns concepts, similar to how a human artist might piece together concepts to create something they've never seen before.

The idea that virtual content can't provide some sort of catharsis or outlet for at least some people is also questionable, unless we can get a significant number of actual pedophiles to come and testify to this. People already find outlets in acting out any number of emotional, sex, or abuse fantasies via text LLMs, despite it being only text. And generative AI is becoming increasingly realistic, including audio and video domains, even if it's disturbing to think about.

There is evidence that pedophilia is, to some extent, a result of biology and uncontrollable life circumstance, and there are non-offending pedophile support groups, so avoiding abuse crimes is a choice for a significant, non-zero portion of pedophiles.


That's not obvious to me at all. What if a trove of pornography provides a non-violent alternative, meaning fewer children get hurt? Either hypothesis seems at least somewhat likely, and boldly claiming that one or the other is true should be supported by at least some evidence.

But even with AI-generated porn, children get hurt indirectly imo, for the same reason that using AI to generate art violates the property rights of artists.


That's the same logic as 'video games cause violence'. Do video games cause violence?


My conclusion is that LAION is built using model upon model upon model, so each error propagates and the resulting dataset is kinda shitty.
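To see how the stacking compounds, a toy calculation (assuming, purely for illustration, that each stage errs independently and each model is right 95% of the time):

    # Toy error-propagation model for stacked curation models.
    p = 0.95  # assumed per-model accuracy
    for stages in (1, 3, 5):
        print(f"{stages} stage(s): {p ** stages:.1%} of items handled correctly")
    # 1 stage(s): 95.0%
    # 3 stage(s): 85.7%
    # 5 stage(s): 77.4%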

But "we" made amazing stuff out of LAION.

So: the next person who can curate a high-quality big dataset is gonna make a fortune


"If your full-time, eight-hours-a-day, five-days-a-week job were to look at each image in the dataset for just one second, it would take you 781 years."

So if we take 781 people to do this job, it'd only take a year? With about 1500 workers it would only take 6 months? Seems important enough to do.
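The article's figure checks out, assuming the roughly 5.85 billion images in LAION-5B:

    # Sanity check: one second per image, 8 h/day, 5 days/week, 52 weeks/year.
    images = 5_850_000_000
    secs_per_year = 8 * 3600 * 5 * 52    # 7,488,000 working seconds per year
    print(images / secs_per_year)         # ~781 years for one person
    print(images / secs_per_year / 781)   # ~1 year for 781 people

So the parallelism is clean, at least on paper: person-years divide evenly across people, ignoring coordination overhead.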


I think there's a typo in the section on arbitrary thresholds, specifically in the sentence about 16% being within 0.1 of the cutoff. I have a feeling this should be 0.01, as mentioned later.
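The distinction matters, because scores pile up near the cutoff. A quick illustration with an invented score distribution (not LAION's actual aesthetic scores):

    import numpy as np

    # Hypothetical scores centred on the cutoff, for illustration only.
    rng = np.random.default_rng(0)
    scores = rng.normal(loc=4.5, scale=1.0, size=1_000_000)
    cutoff = 4.5
    for eps in (0.1, 0.01):
        frac = (np.abs(scores - cutoff) < eps).mean()
        print(f"within {eps} of cutoff: {frac:.1%}")  # ~8.0% vs ~0.8%

"Within 0.1" and "within 0.01" differ by roughly an order of magnitude in how many images sit on the bubble, which is why the typo is worth fixing.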


How does the author know that Midjourney was partially trained on LAION-5B?


Reading this felt like a huge waste of my time.

TFA picks out 4-5 realistically unavoidable features of AI training - like that a certain quality threshold value was chosen arbitrarily, or that there's more training data for English than for other languages - and then they hand-wavingly suggest that each feature could have huge implications about... something. And then they move on to the next topic, without making any argument why the thing they just discussed is important, or what implications it might actually have.


Exactly. The whole thing reads like propaganda. It puts interesting topics up front, then moves on to push an agenda that sounds super political to me.

Yes, some languages are underrepresented, and there are some thresholds. But as you say, it is well known that putting the threshold just slightly above or below will probably not materially affect the model.


> And then they move on to the next topic, without making any argument why the thing they just discussed is important, or what implications it might actually have.

There are so many articles on the Internet that dive into the implications of datasets that amplify certain voices over others. I personally liked this analysis because it actually shows you what such a lopsided dataset looks like on the inside. And it looks like this breakdown might actually help the creators of LAION-5B fix some of the issues. I saw this as a pretty constructive and useful article.

But to your point about the implications of datasets like this one, allow me to share my personal experiences with generative AI.

(I want to make it clear that I'm not personally offended by any of the things I'm going to say below in this comment. I see all of the following as well-understood issues that can be solved if researchers/AI vendors have the will.)

I speak three languages, but I can only reliably interact with generative AI models using one of them. I'm forced to effectively erase two thirds of my culture and upbringing when I talk to ChatGPT. That severely limits what I can do with LLMs and leaves a bad taste in my mouth.

Generative AI is marketed as this giant forward leap for humanity, on par with smartphones and the World Wide Web. But that marketing rings hollow when most people in my country can't use these tools merely because they don't speak English.

Unlike smartphones and the Web, getting the models to speak non-English languages is not such an easy task. Want to translate every single app on the iPhone into Hindi? Just hire an experienced team of translators (sidenote: the Hindi localization on iOS is excellent). But how do you create the terabytes of Hindi language content required to make ChatGPT understand Hindi as well as it understands English? I don't know what the solution is, but it's definitely a problem that needs to be solved if we want AI to be useful to as many people as possible.

Beyond LLMs, other generative models also reflect their training data in a way that favors certain cultures over others. E.g., when I ask ChatGPT to "generate an image of a girl sitting on a bench, eating an apple", it generates an image of a person who is clearly white, dressed in blue jeans and sneakers. Why does it make that assumption when, in 2024, there are more brown people on the planet than white people?

If I refine that prompt to "generate an image of an indian girl sitting on a bench, eating an apple", it makes the person look more Indian but keeps the jeans and sneakers. Once again, why does the model assume that all humans wear jeans and sneakers by default? Why do I have to prompt it to generate other kinds of clothing?

AI is unlike any other technology we've built so far. We talk to it as if we're addressing a human being, and it talks back to us. This makes it difficult to treat it as merely a tool, even though all of us on this website know that's what it is. And when this anthropomorphized genie in a box refuses to speak my language or treats my culture as a second-class citizen, it stings. It's this sort of stuff that can make people feel like there's something wrong with their culture, their ethnicity, their language.

I'm sure we'll iron out all these issues eventually. But we have to keep making noise and keep demanding better from AI researchers and companies. And articles like this one really contribute a lot towards that goal.


Thank you for this comment, you've done a far better job criticising cultural bias in LLMs than most.

> But how do you create the terabytes of Hindi language content required to make ChatGPT understand Hindi as well as it understands English?

It's wrong to assume that every language needs as much training data as English to match English-level performance: knowledge that isn't language- or culture-specific can transfer, in theory and (what used to be considered amazing) to a large extent in practice, once the model understands the language very well. Of course, a very large proportion, maybe the majority, of what the largest LLMs learn is cultural, but the actual language learning is only a small fraction: a transformer with only tens of millions of parameters can translate between two languages.
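For a sense of scale, a back-of-the-envelope parameter count for a small translation transformer (the config below is an assumption for illustration, not any specific model; biases, layer norms, and cross-attention are omitted):

    # Rough parameter count for a small encoder-decoder transformer.
    vocab, d_model, layers = 32_000, 256, 6  # assumed config
    d_ff = 4 * d_model

    embeddings = vocab * d_model                    # shared embedding table
    attn = 4 * d_model * d_model                    # Q, K, V, output projections
    ffn = 2 * d_model * d_ff                        # two feed-forward matrices
    total = embeddings + 2 * layers * (attn + ffn)  # encoder + decoder stacks
    print(f"~{total / 1e6:.0f}M parameters")        # ~18M

That's comfortably in the "tens of millions" range, versus hundreds of billions for the largest LLMs; the gap is mostly world knowledge, not grammar.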

Are you saying ChatGPT (is that GPT-3.5?) is much worse in Hindi when given the same prompt in English and in Hindi? I'm disappointed to hear that; I had thought OpenAI had done better at multilingual performance than many others.


Interesting, but I'm not a fan of ergodic literature when the form of interaction is "scroll forever"


You can activate "Reader Mode" in your browser. You will get gibberish (used for the background animations) until 90% down the page, then you will find the interesting text.

EDIT: If you don't scroll and activate "Reader Mode" immediately upon arrival, you get only the signal, and none of the noise!


Yeah, I looked at the source to see if I could just read it, but it's in a canvas element that loads a bunch of external files.


I don't understand how this format keeps coming back. On mobile (which doesn't have reading mode) it is literally unreadable over such a noisy animated text-filled background.


I read it on mobile just fine.


Good for you. Now think about other people who are not like you.

I was able to read it, but only with difficulty. The background animations are super distracting and siphon off mental effort required to ignore them and focus on the text. Many people can do this relatively effortlessly, but for people who can't, the website is far less accessible than it could be.


I was reading it OK, but my browser performance kept deteriorating as I scrolled, until it crashed. Then I opened in another browser with NoScript and all was well.


Interesting use of ergodic. What is ergodic literature?


Literature that is not simply read from start to end in the normal way.

This is a bit vague, so examples may help: the Choose Your Own Adventure series, House of Leaves, ELIZA, One Hundred Thousand Billion Poems.

If you want to stretch the idea, you could consider LLM output to be ergodic literature of the training data in that it cuts up the corpus and resynthesizes it.


Strange term; a choose-your-own-adventure is non-ergodic: there are many failure states you can land on. Meanwhile, most normal literature is already ergodic under the normal definition.


I looked up the definition, and I'm still a bit confused. I'm familiar with the word in the context of traversing system states, specifically graph traversal.

The term "ergodic literature" doesn't seem to relate to the word in a math context which seems strange. Usually the same word used in different domains has analogous definitions.



