
thepasswordis on Feb 15, 2024, on: Sora: Creating video from text

My guess is that it's because the models were all trained on text. You could do as you say, but I think it would go: Blender video -> {gets described by an AI as text} -> text prompt -> video.