If you take the inner product between a lot of more or less random vectors (the key and query vectors in attention), most values are going to be close to 0. That means each one contributes roughly e^0 = 1 to the denominator. Now, if you have a context length of, say, 2000, your denominator is already ~2000. Increasing it to 2001 doesn't really make a difference.
Adding 1 to the denominator can be useful if you have a softmax with just a few options. Not in self-attention, where you have thousands.
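Quick back-of-the-envelope in numpy (the toy logit scale here is just my assumption, to mimic "more or less random" key/query scores) showing how little the +1 matters at that length:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for attention logits: 2000 near-zero query-key scores.
logits = rng.normal(loc=0.0, scale=0.1, size=2000)

def softmax(x):
    e = np.exp(x - x.max())  # standard, numerically stable softmax
    return e / e.sum()

def softmax_plus_one(x):
    # "+1" variant: an extra 1 in the denominator (shifted by max for stability)
    e = np.exp(x - x.max())
    return e / (e.sum() + np.exp(-x.max()))

p = softmax(logits)
q = softmax_plus_one(logits)

print("denominator ~", np.exp(logits).sum())     # ~2000
print("max |difference|:", np.abs(p - q).max())  # on the order of 1e-7 -- negligible
```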
That simple comment is a strong counterpoint to the entire blog post?
Except with the +1 denominator, it might be that the model trains all of the inputs to become very negative so the softmax puts out values close to zero, whereas it wouldn't bother before, because making one probability bigger just makes another smaller.
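A toy illustration of that point (the -20 is just a made-up stand-in for "very negative"):

```python
import numpy as np

very_negative = np.full(2000, -20.0)  # hypothetical: all scores pushed very negative

# Plain softmax always normalizes to 1, so it still has to attend to *something*.
plain = np.exp(very_negative) / np.exp(very_negative).sum()
print(plain.sum())     # 1.0

# With the +1 in the denominator, the same inputs let total attention go to ~0.
plus_one = np.exp(very_negative) / (1.0 + np.exp(very_negative).sum())
print(plus_one.sum())  # ~4e-6
```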
> it might be that the model trains all of the inputs to become very negative
It still can't do this because of L2 regularization / weight decay. If two vectors have norm 1, their inner product is at least -1, so each term in the denominator is at least e^(-1), and with 2000 vectors the denominator is still at least 2000 * e^(-1) ≈ 736.
Not saying it's theoretically impossible. But you would have to try _really_ hard to make it happen.
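Putting rough numbers on that bound (a quick sketch; the 2000-token context and the unit-norm assumption come from the comments above):

```python
import numpy as np

# Worst case under unit-norm vectors: every query-key inner product at its floor of -1.
n = 2000
floor_denominator = n * np.exp(-1.0)
print(floor_denominator)  # ~736

# Adding 1 to that denominator rescales every weight by S/(S+1),
# i.e. a relative change of at most ~0.14% even in this worst case.
print(1.0 / (floor_denominator + 1.0))  # ~0.00136
```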