TL;DR: The author proposes that instead of using the Softmax function in each head,
Softmax(x_i) = exp(x_i) / sum_j(exp(x_j)),
we should use instead what the author calls the Softmax_1 function,
Softmax_1(x_i) = exp(x_i) / (1 + sum_j(exp(x_j))),
which would make it possible for each transformer head's attention weights to all be (near) zero, i.e., to attend to nothing, whenever every x_i is well below zero: each exp(x_i) is then close to 0 while the denominator stays near 1.
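Here is a quick numerical sketch (my own, not from the post) in NumPy showing the difference: the standard softmax forces the weights to sum to 1 no matter what, while softmax_1 lets them all collapse toward zero when every logit is very negative.

    import numpy as np

    def softmax(x):
        # Standard softmax: shift by max(x) for numerical stability; output always sums to 1.
        z = np.exp(x - np.max(x))
        return z / z.sum()

    def softmax_1(x):
        # "Softmax_1": the extra +1 in the denominator lets every output go to ~0
        # when all inputs are well below zero.
        # Shift by max(max(x), 0) so the implicit exp(0) = 1 term is shifted consistently.
        m = np.maximum(np.max(x), 0.0)
        z = np.exp(x - m)
        return z / (np.exp(-m) + z.sum())

    logits = np.array([-8.0, -9.0, -10.0])   # a head that "wants" to attend to nothing
    print(softmax(logits).sum())    # 1.0     -- forced to distribute attention anyway
    print(softmax_1(logits).sum())  # ~0.0005 -- effectively attends to nothing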
Giving each transformer head the ability to ignore all tokens surely can't hurt, but it remains to be seen if it will actually improve transformer performance.
I also saw that the author distinguishes the internal softmaxes from the output softmax. I think he'd apply his modification only to the internal ones and let the output softmax still force a choice.
Yes, it makes sense to apply this only to the Softmax we use to compute attention. It makes no sense to apply it to the output Softmax, which must compute a probability distribution over the vocabulary.
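A rough sketch (again my own, reusing the softmax and softmax_1 helpers from above) of where the swap would live: only in the attention weights, while the output head keeps the ordinary softmax.

    import numpy as np

    def attention(Q, K, V):
        # Scaled dot-product attention with softmax_1 applied row-wise to the scores;
        # a row of very negative scores yields near-zero weights (the head "opts out").
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.apply_along_axis(softmax_1, -1, scores)
        return weights @ V

    def next_token_distribution(hidden, W_vocab):
        # The output head keeps the ordinary softmax, since it must still produce
        # a proper probability distribution over the vocabulary.
        return softmax(hidden @ W_vocab)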
Activation sparsity and packing sparse matrices will surely be important, so that's one kind of performance gain. The other kind, perplexity, still needs a good demonstration. That might require a big model, but nowadays you can fine-tune even a 30B model on a big cloud GPU box.