> A single joke post on Reddit was enough to convince Google's A"I" to put glue on pizza
The post was most likely fed to the AI at inference time, not training time.
The way AI search works (as opposed to, e.g., ChatGPT) is that an actual web search is performed, and then one or more results are "cleaned up" and given to an LLM along with the original search term. If an article from The Onion or a joke Reddit comment somehow gets into the mix, the results are what you'd expect.
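A minimal sketch of that pipeline (the function names, `r.body`, and the prompt format are illustrative assumptions, not any vendor's actual API):

```python
# Hypothetical sketch of an AI-search (RAG-style) pipeline.
# `web_search`, `clean_html`, and `llm` are placeholder callables,
# not Google's real internals.

def answer_query(query: str, llm, web_search, clean_html) -> str:
    # 1. Run a normal web search for the user's query.
    results = web_search(query, top_k=3)

    # 2. "Clean up" each result: strip markup, keep the main text.
    snippets = [clean_html(r.body) for r in results]

    # 3. Feed the cleaned snippets plus the original query to the LLM.
    #    If a joke post made it into `snippets`, the model will happily
    #    summarize it as if it were fact.
    prompt = (
        "Answer the question using only the sources below.\n\n"
        + "\n---\n".join(snippets)
        + f"\n\nQuestion: {query}"
    )
    return llm(prompt)
```

The key point is step 3: the joke never has to be in the training data; it just has to rank in the search results at query time.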
> it's "only" statistical repetition if you boil it down enough.
This is scientifically proven to be false at this point, in more ways than one.
> Unfortunately, the tech status quo is nowhere near that capability, hence all the AI companies slurping up as much data as they can, in the hope that "outlier opinions" are simply smothered statistically.
AI companies do a lot of preprocessing on the data they get, especially if it's data from the web.
The better the models they have access to, the better the preprocessing.
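As a hedged illustration of what "model-assisted preprocessing" can mean, here is one common pattern, quality filtering; `quality_model` and the threshold are assumptions for the sketch, and real pipelines also dedupe, strip boilerplate, detect language, and so on:

```python
# Illustrative sketch of model-based filtering of web-scraped text.
# `quality_model` is a hypothetical classifier scoring a document 0..1.

def filter_corpus(documents, quality_model, threshold=0.8):
    kept = []
    for doc in documents:
        score = quality_model(doc)  # e.g. "is this clean, factual prose?"
        if score >= threshold:
            kept.append(doc)
    return kept
```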