> I believe the outlier problem that this solves only appears for very large mod...

WithinReason · on July 24, 2023

See figure 1:

https://arxiv.org/pdf/2208.07339.pdf

Outliers appear at model size 6.7B and are not present at 2.7B.

janalsncm · on July 24, 2023

Sure, emergent properties can arise as parameters increase. Everyone knows that. That’s a much less specific claim than saying that the benefit of modifying softmax can only arise as an emergent property above N parameters, and can therefore only be evaluated on models above a certain size. To my understanding, the author of TFA isn’t describing the same issue as the one in your linked paper.

WithinReason · on July 24, 2023

The second heading in TFA is "It’s All About Outliers".

PoignardAzur · on July 24, 2023

6.7B isn't "needs a datacenter" scale.

WithinReason · on July 24, 2023

It's in the million-dollar range. For example, XLNet, which is a 1.3B model, cost $245,000 to train.