This video by Computerphile is a great overview of transformers and how we got to this point [0]. Basically, the networks we used before, recurrent neural networks, "forgot" prior information as sequences got longer, so they weren't good at long inputs. The transformer architecture, however, doesn't forget (or at least not as easily).
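
If it helps, here's a rough sketch of the difference (my own illustration, not from the video; names and sizes are made up):

    # RNN vs. self-attention, in miniature. Dimensions are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 5, 4                      # sequence length, embedding size
    x = rng.normal(size=(T, d))      # token embeddings

    # RNN: the entire history gets squeezed through one fixed-size
    # vector h, so information about early tokens can fade step by step.
    W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    h = np.zeros(d)
    for t in range(T):
        h = np.tanh(W @ x[t] + U @ h)   # h is the only memory of x[0..t]

    # Self-attention: every position looks at every position directly
    # via softmax(Q K^T / sqrt(d)) V, so nothing has to survive a chain
    # of compressions to be used later.
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    out = weights @ V                # each row mixes info from all tokens

The point is just that the RNN's memory of token 0 has to survive T updates of h, while attention gives token T a direct path back to token 0.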
[0] https://www.youtube.com/watch?v=rURRYI66E54