
>I mean by the same logic the only difference between a diffusion model and a VLM is that you put the spatial transformer on the other end.

Maybe if that were the only difference, but it's not. There are diffusion models that have nothing to do with transformers or attention, and using them for arbitrary sequence prediction is either not possible or highly non-trivial.
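
To make that concrete, here's a minimal PyTorch sketch of a diffusion denoiser that is purely convolutional, with no attention anywhere. All names and sizes are illustrative, not taken from any real model; the timestep conditioning that a real DDPM needs is omitted for brevity. The point is that this is a perfectly valid diffusion model for images, yet nothing about it transfers to arbitrary token-sequence prediction:

    import torch
    import torch.nn as nn

    class ConvDenoiser(nn.Module):
        # A diffusion denoiser built only from convolutions: no attention,
        # no transformer blocks.
        def __init__(self, channels=3, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(channels, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, channels, 3, padding=1),
            )

        def forward(self, noisy_image, t):
            # t (the timestep) is ignored here for brevity; a real DDPM
            # conditions the network on it.
            return self.net(noisy_image)  # predicts the added noise

    # One DDPM-style training step: corrupt a clean batch, regress the noise.
    model = ConvDenoiser()
    x0 = torch.randn(8, 3, 32, 32)            # a batch of "clean" images
    noise = torch.randn_like(x0)
    alpha_bar = torch.rand(8, 1, 1, 1)        # stand-in for the noise schedule
    xt = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    loss = ((model(xt, t=None) - noise) ** 2).mean()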

Yes, all neural network architectures are function approximators, but that doesn't mean they excel equally at all tasks, or that you can even use them for anything beyond a single task. This era of the transformer, where one architecture works for NLP, computer vision, robotics, even reinforcement learning, is a very new one. Anything a bog-standard transformer can do, GPT could do if OpenAI wished.
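
A rough sketch of why that's true (illustrative code, not any real model's API): with a transformer, only the input projection is modality-specific. The same backbone consumes text tokens or image patches once they're mapped into the same embedding space:

    import torch
    import torch.nn as nn

    d = 128
    # One shared, modality-agnostic sequence model.
    backbone = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
        num_layers=2,
    )
    text_embed = nn.Embedding(10_000, d)      # token ids -> vectors
    patch_embed = nn.Linear(16 * 16 * 3, d)   # flattened 16x16 RGB patches -> vectors

    tokens = torch.randint(0, 10_000, (1, 32))     # a text sequence
    patches = torch.randn(1, 64, 16 * 16 * 3)      # an image as 64 patches

    text_out = backbone(text_embed(tokens))     # (1, 32, 128)
    image_out = backbone(patch_embed(patches))  # (1, 64, 128)

There is no analogous trick for, say, a convolutional diffusion denoiser: its inductive biases are baked into the architecture, not just the input projection.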

Like I said, I don't disagree with your broader point. I just don't think this is an instance of it.



It's clear from these responses that you're missing the point I'm making, but I'm unsure how to explain it better, and you're not giving me much to work with in terms of engaging with its substance, so I think we gotta leave this at an impasse for now.



