They may not be "large" in the same sense that GPT-4 is "large", but apart from the simulator stuff, every single one of the models you mentioned is transformer-based. Each of them essentially includes encoders that project other modalities (images, audio) into a "language-like" embedding space, so that they can be compared with and mapped to and from text. I think it's fair to say that language models, if not LLMs, unlocked a surprising amount of power.
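
To make the "project into a language-like space" idea concrete, here's a rough numpy sketch of the common pattern (a learned linear projection lifting vision-encoder features into the LM's token-embedding space, as in LLaVA-style models). All the dimensions and names here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a vision encoder emitting 512-d patch features,
# a language model whose token embeddings are 768-d.
D_VISION, D_TEXT, N_PATCHES = 512, 768, 16

# Stand-in for a pretrained vision encoder's output for one image.
patch_features = rng.standard_normal((N_PATCHES, D_VISION))

# The projection: a learned linear map that lifts visual features into
# the LM's embedding space, so image patches become pseudo-tokens the
# transformer can attend to alongside real text tokens.
W_proj = rng.standard_normal((D_VISION, D_TEXT)) * 0.02
visual_tokens = patch_features @ W_proj  # shape (16, 768)

# Ordinary text tokens embedded the usual way, then concatenated with
# the projected visual tokens into one mixed-modality sequence.
text_tokens = rng.standard_normal((4, D_TEXT))
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)

print(sequence.shape)  # (20, 768): one sequence for the transformer
```

Once everything lives in the same embedding space, the transformer itself doesn't care which tokens came from pixels and which came from text.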