The information might be formally lost for the given token, but remember that transformers train on huge amounts of data.
The (absolute) positional encoding is an arbitrary but fixed bias (a push in some direction). The word "cat" at position 2 is pushed in the 2-direction, so this "cat" may end up different from a "cat" at position 3, and the model can learn to exploit that distinction.
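As a toy illustration (a minimal PyTorch sketch using the sinusoidal scheme from the original Transformer paper; the variable names are mine), the same token vector lands at two different points depending on its position:

```python
import torch

def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal positional encodings as in "Attention Is All You Need"."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)         # (d_model/2,)
    angle = pos / (10000.0 ** (dim / d_model))                     # (seq_len, d_model/2)
    enc = torch.zeros(seq_len, d_model)
    enc[:, 0::2] = torch.sin(angle)
    enc[:, 1::2] = torch.cos(angle)
    return enc

d_model = 16
enc = sinusoidal_encoding(seq_len=8, d_model=d_model)
cat_vec = torch.randn(d_model)  # stand-in for the learned embedding of "cat"
cat_at_2 = cat_vec + enc[2]     # same token, pushed in the "position 2" direction
cat_at_3 = cat_vec + enc[3]     # same token, pushed in the "position 3" direction
```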
Nevertheless, the model can also learn to keep "cats" at all positions close together, for instance so that a "cat" at any position stays more similar to other "cats" than to "dogs".
More importantly, for some words, the model might learn that a word at the beginning of the sequence should have an entirely different meaning than the same word at the end of the sequence.
In other words, since the embeddings are free parameters to be learned (usually both as input embeddings and, weight-tied, in the head), there is no loss in flexibility. Rather, the model can learn how much mixing is required, or whether the information added by the positional embedding should remain separable (for instance, by keeping the token embeddings linearly independent of the positional directions).
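Here is a rough sketch of what I mean by "free parameter" and weight tying (illustrative PyTorch with made-up names, not any particular library's API):

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # free parameters
        self.pos_embed = nn.Embedding(max_len, d_model)     # learned positions
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok_embed.weight            # weight tying

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Additive mixing of token and position information.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h = self.tok_embed(token_ids) + self.pos_embed(positions)
        # ... transformer blocks would go here ...
        return self.head(h)
```

Because the same matrix receives gradients from both the input side and the head, the model is free to arrange the token vectors so that the additive positional bias either matters a lot or barely at all.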
If you concatenate instead, you carry along otherwise useless and static dimensions, and mixing them into the embeddings would be the very first thing the model learns in layer 1.
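To see why the concatenated dimensions buy you little, consider this sketch (dimensions chosen arbitrarily): the very first linear map can fold them straight back into the token subspace, recreating the additive mixing anyway:

```python
import torch

d_model, d_pos, seq_len = 16, 4, 8
tok = torch.randn(seq_len, d_model)     # token embeddings
pos = torch.randn(seq_len, d_pos)       # hypothetical positional features

concat = torch.cat([tok, pos], dim=-1)  # (seq_len, d_model + d_pos)

# Layer 1's first linear projection can immediately blend the static
# positional dimensions into the token dimensions:
W = torch.randn(d_model, d_model + d_pos)
mixed = concat @ W.T                    # (seq_len, d_model): mixed after all
```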