That's kind of the whole deal of the attention mechanism in transformers, and it's also partly why they replaced RNNs: you don't throw away any part of the original input as you construct the output. The downside is that, unlike with an RNN, the maximum sequence length is fixed at training time, and the compute cost grows with the square of it. But apart from that computational cost, sequence length isn't really a practical issue for these models anymore.
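To make the quadratic cost concrete, here's a minimal NumPy sketch of single-head, unmasked scaled dot-product self-attention. The names and shapes are illustrative assumptions, not any particular library's API; the point is that the scores matrix is seq_len × seq_len, so every output position gets to look at every input position.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_head) projections."""
    q = x @ w_q                                  # queries (seq_len, d_head)
    k = x @ w_k                                  # keys    (seq_len, d_head)
    v = x @ w_v                                  # values  (seq_len, d_head)
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)           # (seq_len, seq_len) -- the O(n^2) part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over all input positions
    return weights @ v                           # each output mixes *all* value vectors

# Doubling seq_len quadruples the size of `scores` (memory and compute).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)           # (seq_len, d_head)
```

Compare that with an RNN, which compresses everything seen so far into a fixed-size hidden state: cost is linear in sequence length, but early inputs can only influence the output through that bottleneck.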