TL;DR: The author proposes that instead of using the Softmax function in each head,
Softmax(x_i) = exp(x_i) / sum_j(exp(x_j)),
we should use instead what the author calls the Softmax_1 function,
Softmax_1(x_i) = exp(x_i) / (1 + sum_j(exp(x_j))),
which would make it possible for each transformer head's attention weights to all be (near) zero, i.e., to attend to nothing, whenever every x_i is well below zero: each exp(x_i) is then close to 0 while the denominator stays near 1.
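Here is a quick numerical sketch (my own, not from the post) in NumPy showing the difference: the standard softmax forces the weights to sum to 1 no matter what, while softmax_1 lets them all collapse toward zero when every logit is very negative.

    import numpy as np

    def softmax(x):
        # Standard softmax: shift by max(x) for numerical stability; output always sums to 1.
        z = np.exp(x - np.max(x))
        return z / z.sum()

    def softmax_1(x):
        # "Softmax_1": the extra +1 in the denominator lets every output go to ~0
        # when all inputs are well below zero.
        # Shift by max(max(x), 0) so the implicit exp(0) = 1 term is shifted consistently.
        m = np.maximum(np.max(x), 0.0)
        z = np.exp(x - m)
        return z / (np.exp(-m) + z.sum())

    logits = np.array([-8.0, -9.0, -10.0])   # a head that "wants" to attend to nothing
    print(softmax(logits).sum())    # 1.0     -- forced to distribute attention anyway
    print(softmax_1(logits).sum())  # ~0.0005 -- effectively attends to nothing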
Giving each transformer head the ability to ignore all tokens surely can't hurt, but it remains to be seen if it will actually improve transformer performance.
I also saw that the author distinguishes the internal softmaxes from the output softmax. I think he'd apply his modification only to the internal ones and let the output softmax still force a choice.
Yes, it makes sense to apply this only to the Softmax we use to compute attention. It makes no sense to apply it to the output Softmax, which must compute a probability distribution over the vocabulary.
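A rough sketch (again my own, reusing the softmax and softmax_1 helpers from above) of where the swap would live: only in the attention weights, while the output head keeps the ordinary softmax.

    import numpy as np

    def attention(Q, K, V):
        # Scaled dot-product attention with softmax_1 applied row-wise to the scores;
        # a row of very negative scores yields near-zero weights (the head "opts out").
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.apply_along_axis(softmax_1, -1, scores)
        return weights @ V

    def next_token_distribution(hidden, W_vocab):
        # The output head keeps the ordinary softmax, since it must still produce
        # a proper probability distribution over the vocabulary.
        return softmax(hidden @ W_vocab)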
Activation sparsity and packing sparse matrices will surely be important, so that's one kind of performance gain. The other kind, perplexity, still needs a good demonstration. That might require a big model, but nowadays you can fine-tune even a 30B model on a big cloud GPU box.