>it would produce near identical results to a 2015 era auto-predict.
I don't know that this is true, but it is plausible enough. The benefit of Transformers, though, is that they are stupid easy to scale, and it is at scale that they perform so remarkably across so many domains. Comparing underparameterized versions of these models and concluding that two classes of models are functionally equivalent because they perform equivalently in the underparameterized regime is a mistake.

The value of an architecture lies in its practical ability to surface functional models. In theory, an MLP with enough parameters can model any function, but in reality, finding the parameters that solve real-world problems becomes increasingly difficult. The inductive biases of Transformers are crucial in allowing them to efficiently find substantial models that provide real solutions. The Transformer architecture is doing real, substantial, independent work in the successes of current models.
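To make one such inductive bias concrete, here's a rough sketch in PyTorch (the layer sizes are arbitrary, just for illustration): a self-attention layer shares its projection weights across every position, so its parameter count doesn't grow with context length, whereas an MLP over the flattened sequence gives every position its own weights.

```python
import torch.nn as nn

seq_len, d_model = 1024, 512  # illustrative sizes, not from any real model

# MLP over the flattened sequence: parameters scale with seq_len, and each
# position has its own weights, so nothing learned at one position transfers
# to another.
mlp = nn.Sequential(
    nn.Linear(seq_len * d_model, 4 * d_model),
    nn.ReLU(),
    nn.Linear(4 * d_model, seq_len * d_model),
)

# Single self-attention layer: the same projection weights are applied at
# every position, so the parameter count is independent of sequence length.
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"MLP params:       {count(mlp):,}")   # grows with seq_len
print(f"Attention params: {count(attn):,}")  # fixed regardless of seq_len
```

That weight sharing is part of why you can keep throwing parameters and context at the architecture and still find trainable models.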