Wouldn't the adversarial model training also have to take "physics correctness" into account? As long as the image detects as "<insert celebrity> in blue dress", why would it care about correct details in the eyes if nothing in the "checker" cares about that?
Current image generators don’t use an adversarial model. Though the ones that do would eventually have encoded that as well; the details to look for aren’t hard-coded.
GP told you how they don't work, but not how they do:
Current image generators work by training models to remove artificial noise added to the training set. Take an image, add some amount of noise, and feed it along with its description as inputs to your model. The closer the output is to the original image, the higher the reward.
Using some tricks (a big one is training simultaneously on large and small amounts of noise), you ultimately get a model that can remove 99% noise based only on the description you feed it. That means you can swap out the description for what you want the model to generate, feed it pure noise, and it'll do a good job.
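Here's a minimal sketch of that training loop in PyTorch, just to make the description above concrete. Everything in it is a toy stand-in: the tiny MLP replaces the U-Net/transformer denoisers real systems use, the random tensors stand in for actual images and text-encoder embeddings, and the sizes and noise schedule are made up. Note the "reward" from the comment shows up as a loss here, so lower is better.

    import torch
    import torch.nn as nn

    IMG_DIM, TEXT_DIM = 64, 16  # flattened toy "image" and "description" sizes (arbitrary)

    # Toy denoiser: given noisy image + description + noise level, predict the clean image.
    model = nn.Sequential(
        nn.Linear(IMG_DIM + TEXT_DIM + 1, 128),  # +1 input for the noise level
        nn.ReLU(),
        nn.Linear(128, IMG_DIM),
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(1000):
        image = torch.randn(32, IMG_DIM)   # stand-in for a batch of training images
        text = torch.randn(32, TEXT_DIM)   # stand-in for their description embeddings

        # "training simultaneously on large and small amounts of noise":
        # pick a random noise level per example, from nearly clean to nearly pure noise.
        t = torch.rand(32, 1)
        noise = torch.randn_like(image)
        noisy = (1 - t) * image + t * noise  # blend the clean image with noise

        pred_image = model(torch.cat([noisy, text, t], dim=1))
        loss = ((pred_image - image) ** 2).mean()  # closer to the original = lower loss

        opt.zero_grad()
        loss.backward()
        opt.step()

Generation then runs this in reverse: start from pure noise with the description you actually want, and repeatedly apply the model to peel the noise away.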
I read this description of the algorithm a few times and I find it fascinating because it's so simple to follow. I have a lot of questions, though, like "why does it work?", "why did nobody think of this before?", and "where is the extra magical step that moves this from 'silly idea' to 'wonder work'?"