If you take the inner product between a lot of more or less random vectors (the key and query vectors in attention), most values are going to be close to 0. That means each one contributes roughly e^0 = 1 to the denominator. Now, if you have a context length of, say, 2000, your denominator is already ~2000. Increasing it to 2001 doesn't really make a difference.
Adding 1 to the denominator can be useful if you have a softmax with just a few options. Not in self-attention, where you have thousands.
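Quick back-of-the-envelope in numpy (the toy logit scale here is just my assumption, to mimic "more or less random" key/query scores) showing how little the +1 matters at that length:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for attention logits: 2000 near-zero query-key scores.
logits = rng.normal(loc=0.0, scale=0.1, size=2000)

def softmax(x):
    e = np.exp(x - x.max())  # standard, numerically stable softmax
    return e / e.sum()

def softmax_plus_one(x):
    # "+1" variant: an extra 1 in the denominator (shifted by max for stability)
    e = np.exp(x - x.max())
    return e / (e.sum() + np.exp(-x.max()))

p = softmax(logits)
q = softmax_plus_one(logits)

print("denominator ~", np.exp(logits).sum())     # ~2000
print("max |difference|:", np.abs(p - q).max())  # on the order of 1e-7 -- negligible
```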
That simple comment is a strong counterpoint to the entire blog post?
Except with the +1 denominator, it might be that the model trains all of the inputs to become very negative so the softmax puts out values close to zero, whereas it wouldn't bother before, because making one probability bigger just makes another smaller.
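A toy illustration of that point (the -20 is just a made-up stand-in for "very negative"):

```python
import numpy as np

very_negative = np.full(2000, -20.0)  # hypothetical: all scores pushed very negative

# Plain softmax always normalizes to 1, so it still has to attend to *something*.
plain = np.exp(very_negative) / np.exp(very_negative).sum()
print(plain.sum())     # 1.0

# With the +1 in the denominator, the same inputs let total attention go to ~0.
plus_one = np.exp(very_negative) / (1.0 + np.exp(very_negative).sum())
print(plus_one.sum())  # ~4e-6
```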
> it might be that the model trains all of the inputs to become very negative
It still can't do this because of L2 regularization / weight decay. If two vectors have norm 1, their inner product is at least -1, so each term in the denominator is at least e^(-1), and with 2000 vectors the denominator is still at least 2000 * e^(-1) ≈ 736.
Not saying it's theoretically impossible. But you would have to try _really_ hard to make it happen.
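Putting rough numbers on that bound (a quick sketch; the 2000-token context and the unit-norm assumption come from the comments above):

```python
import numpy as np

# Worst case under unit-norm vectors: every query-key inner product at its floor of -1.
n = 2000
floor_denominator = n * np.exp(-1.0)
print(floor_denominator)  # ~736

# Adding 1 to that denominator rescales every weight by S/(S+1),
# i.e. a relative change of at most ~0.14% even in this worst case.
print(1.0 / (floor_denominator + 1.0))  # ~0.00136
```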