I mean, faces are faces, right? If the training data set is large and representative, I don't see why any two (representative) halves of the data would lead to significantly different models.
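For what it's worth, here's a minimal sketch of how you could check that intuition (my own construction, not anything from the thread): it uses a synthetic scikit-learn classification problem as a stand-in for face data, trains a model on each disjoint half, and measures how often the two models agree on a shared held-out set. If the halves really are representative, agreement should be near-total.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for "faces": a synthetic binary classification problem.
X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

# Hold out a common test set, then split the remainder into two halves.
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_a, X_b, y_a, y_b = train_test_split(X_pool, y_pool, test_size=0.5, random_state=2)

# Train one model per half.
model_a = LogisticRegression(max_iter=1000).fit(X_a, y_a)
model_b = LogisticRegression(max_iter=1000).fit(X_b, y_b)

# If both halves are representative, the two models should agree on
# almost every test example and score about the same.
pred_a, pred_b = model_a.predict(X_test), model_b.predict(X_test)
print("agreement:", np.mean(pred_a == pred_b))
print("acc A:", model_a.score(X_test, y_test), "acc B:", model_b.score(X_test, y_test))
```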
If there's some fundamental limit to the kind of intelligence the current breed of LLMs can extract from language, then at some point it doesn't matter how good or expansive the training set is. Maybe we are finally starting to hit that architectural limit.