Sure, emergent properties can arise as parameters increase. Everyone knows that. But that’s a much less specific claim than saying the benefit of modifying softmax can only arise as an emergent property above N parameters, and can therefore only be evaluated on models above a certain size. To my understanding, the author of TFA isn’t describing the same issue as the one in your linked paper.
Any reason to believe this? The author never mentioned it, and I can’t think of any other a priori reason why it should be true.