That kind of ML model is pretty general. Pretraining a big model has been extended to multi-modal settings. People are training them with RL to take actions. People are applying other generative techniques to them, and all sorts of other stuff. If you only look at it as 'predict the next token,' then it seems pretty limiting, but people have already gone way beyond that. TFA talks about some interesting directions people are taking them.
A more general form of your question is whether we can get to AGI with just incremental steps from where we are today, rather than step-change way-out-of-left-field kinds of ideas. People are split on that. Personally, I think that incremental changes from today's methods are sufficient with better hardware and data, but Big New Ideas could certainly speed up progress.