It's the quality of the image text pair not the image alone but midjourney is not a model it's a suite of models that work in conjunction. They have an llm in the front to optimize the user prompts, they use SAM models, controlnet models for poses that are in high demand and so much more. That's why you can't really compare foundation models anymore because there are none.