They may not be "large" in the same sense that GPT-4 is "large", but apart from the simulator stuff, every single one of the models you mentioned is transformer-based. Each of them essentially includes encoders that project other modalities (images, audio) into a "language-like" embedding space, so that they can be compared with and mapped to and from text. I think it's fair to say that language models, if not LLMs, unlocked a surprising amount of power.
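
To make the "project into a language-like space" idea concrete, here's a rough numpy sketch of the common pattern (a learned linear projection lifting vision-encoder features into the LM's token-embedding space, as in LLaVA-style models). All the dimensions and names here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a vision encoder emitting 512-d patch features,
# a language model whose token embeddings are 768-d.
D_VISION, D_TEXT, N_PATCHES = 512, 768, 16

# Stand-in for a pretrained vision encoder's output for one image.
patch_features = rng.standard_normal((N_PATCHES, D_VISION))

# The projection: a learned linear map that lifts visual features into
# the LM's embedding space, so image patches become pseudo-tokens the
# transformer can attend to alongside real text tokens.
W_proj = rng.standard_normal((D_VISION, D_TEXT)) * 0.02
visual_tokens = patch_features @ W_proj  # shape (16, 768)

# Ordinary text tokens embedded the usual way, then concatenated with
# the projected visual tokens into one mixed-modality sequence.
text_tokens = rng.standard_normal((4, D_TEXT))
sequence = np.concatenate([visual_tokens, text_tokens], axis=0)

print(sequence.shape)  # (20, 768): one sequence for the transformer
```

Once everything lives in the same embedding space, the transformer itself doesn't care which tokens came from pixels and which came from text.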