I think the tool will get higher-quality results as the inputs become easier to control. Some of that is already happening rapidly.
In theory, given enough machine power, I could see a branching, interactive interface. Like, if there are multiple major hills in the latent space ("Tempest -> a storm cloud" vs. "Tempest -> a cartoon pony"), the system could identify them all and generate a set of images with different biases toward each hill. Then the user picks which direction to climb.
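Roughly what I mean, as a toy sketch: treat the latent space as a 2-D landscape with a couple of hills (both hill centres and the scoring function below are invented for illustration, not any real model's latents), propose candidate points nudged toward each hill, and let the user's pick decide which branch to climb.

```python
import numpy as np

# Toy 2-D "latent space" with two hills, standing in for the two
# prompt interpretations. Centres and scores are made up for the sketch.
HILLS = {
    "storm cloud": np.array([-2.0, 0.0]),
    "cartoon pony": np.array([2.0, 1.0]),
}

def score(z):
    """Height at latent point z: sum of Gaussian bumps around each hill."""
    return sum(np.exp(-np.sum((z - c) ** 2)) for c in HILLS.values())

def branch_candidates(z, n_per_hill=4, bias=0.5, rng=None):
    """For each hill, propose samples nudged from z toward that hill's centre."""
    rng = rng or np.random.default_rng(0)
    out = {}
    for name, centre in HILLS.items():
        step = z + bias * (centre - z)  # move part-way toward the hill
        out[name] = [step + 0.2 * rng.standard_normal(2) for _ in range(n_per_hill)]
    return out

# One "interactive" round: start between the hills, branch, and let the
# user's pick (hard-coded here) decide which direction to climb.
z = np.zeros(2)
branches = branch_candidates(z)
picked = "cartoon pony"               # the user's choice
z = max(branches[picked], key=score)  # climb within the chosen branch
```

A real system would replace the toy scorer with the model's own guidance signal and show the user actual preview images per branch, but the loop shape (propose per-mode, pick, climb) would be the same.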
The current system, on high-end hardware, takes about a minute before results for one image even begin to be recognizable, and about 3 minutes before you can be somewhat confident how it's generally going to turn out.