
Because the variance can be uniformly high, it's difficult to judge the improvement of one method over the baseline: did you actually improve, or did you just get a few lucky seeds? It's much harder to get a paper published that debunks a new "SotA" method, so I default to expecting a clear improvement over a good baseline. Simply looking at raw performance also isn't enough, because a task can look impressive but actually be quite simple (and vice versa); these statistical measures make it easier to distinguish good models on hard tasks from bad models on easy tasks.

I should also note: 1) this is about testing whether the performance of one model is meaningfully different from another's, not the coefficients of the models; 2) I don't reject papers just because they lack this, or because they fail to achieve statistical significance. I just want it in the paper so the reader can use it to judge (it also helps suss out cherry-picked results).
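
For concreteness, here's a minimal sketch (in Python, with made-up scores and a made-up seed count) of the kind of seed-level comparison I mean, using Welch's t-test to ask whether the gap over the baseline is larger than seed variance alone would explain:

    # Hypothetical per-seed test scores; the numbers and seed count are illustrative only.
    from scipy import stats

    baseline = [0.712, 0.698, 0.731, 0.705, 0.719]   # baseline method, 5 seeds
    proposed = [0.734, 0.741, 0.709, 0.728, 0.737]   # new method, same 5 seeds

    # Welch's t-test: is the mean difference bigger than seed-to-seed variance explains?
    t_stat, p_value = stats.ttest_ind(proposed, baseline, equal_var=False)
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

This is just one reasonable choice of test; the point is reporting some measure of whether the improvement survives the variance across seeds, not this particular test.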



Thanks, that makes sense. I was confused about where and how you were applying the Bonferroni correction yardstick.
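
In case it helps anyone else following along: my understanding is that Bonferroni just divides the significance threshold by the number of comparisons being made (e.g. the new method vs. several baselines or tasks). A rough sketch with made-up p-values, not anything from an actual paper:

    # Hypothetical raw p-values from comparing the new method against k baselines/tasks.
    p_values = [0.012, 0.034, 0.005, 0.21]
    alpha = 0.05
    k = len(p_values)

    # Bonferroni: each comparison must clear alpha / k
    # (equivalently, multiply each p-value by k and cap at 1).
    adjusted = [min(p * k, 1.0) for p in p_values]
    significant = [p < alpha / k for p in p_values]
    print(list(zip(adjusted, significant)))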



