GP told you how they don't work, but not how they do:
Current image generators work by training models to remove artificial noise added to the training set. Take an image, add some amount of noise, and feed it along with its description as inputs to your model. The closer the output is to the original image, the higher the reward (in practice, the lower the loss).
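To make that concrete, here's a rough sketch of one training step in PyTorch. Everything in it is a toy stand-in of my own invention (real systems use U-Nets or transformers, a careful noise schedule, and a learned text encoder), so treat it as pseudocode that happens to run:

```python
import torch
import torch.nn as nn

# Toy denoiser: takes a noisy image plus a text embedding and tries to
# reconstruct the clean image. Real models are far bigger, but the
# training signal is the same.
class ToyDenoiser(nn.Module):
    def __init__(self, image_dim=64, text_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + text_dim, 128),
            nn.ReLU(),
            nn.Linear(128, image_dim),
        )

    def forward(self, noisy_image, text_embedding):
        return self.net(torch.cat([noisy_image, text_embedding], dim=-1))

model = ToyDenoiser()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step: corrupt a clean image with a random amount of
# noise, ask the model to undo it given the description, and penalize
# the distance to the original ("the closer the output, the lower the
# loss"). The tensors below stand in for a real batch of images and
# encoded captions.
clean_image = torch.randn(8, 64)
text_embedding = torch.randn(8, 16)

noise_level = torch.rand(8, 1)          # a different noise level per example
noise = torch.randn_like(clean_image)
noisy_image = (1 - noise_level) * clean_image + noise_level * noise

reconstruction = model(noisy_image, text_embedding)
loss = nn.functional.mse_loss(reconstruction, clean_image)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```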
Using some tricks (a big one is training simultaneously on large and small amounts of noise, as in the random noise level in the sketch above), you ultimately get a model that can remove 99% of the noise based only on the description you feed it. That means you can swap the description for whatever you want the model to generate, feed it pure noise, and it'll do a good job.
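Generation then looks something like the loop below, continuing from the toy model above. Real samplers (DDPM, DDIM, and friends) re-inject noise between steps and follow a proper schedule; this deterministic version just shows the core idea of walking from pure noise toward the model's guess:

```python
# Sampling sketch: start from pure noise, condition on the prompt you
# actually want, and repeatedly nudge the image toward the model's
# current guess of the clean image instead of jumping there in one shot.
num_steps = 50
image = torch.randn(1, 64)              # pure noise as the starting point
prompt_embedding = torch.randn(1, 16)   # stand-in for your prompt, encoded

with torch.no_grad():
    for step in range(num_steps):
        predicted_clean = model(image, prompt_embedding)
        # Take a fraction of the remaining distance each step, so the
        # model keeps refining its own partial output.
        image = image + (predicted_clean - image) / (num_steps - step)
```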
I've read this description of the algorithm a few times, and I find it fascinating because it's so simple to follow. I have a lot of questions, though, like "why does it work?", "why did nobody think of this before?", and "where is the extra magical step that moves this from 'silly idea' to 'wonder work'?"