Assume up front that none of your measured latencies from a networked software system will be Gaussian, or <exaggeration> you will die a painful death </exaggeration>. Even ping times over the internet have no mean. The only good thing about means is that you can combine them easily, but since they are probably a mathematical fiction here, combining them is even worse. Use T-Digest or one of the other algorithms being highlighted here.
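To make that concrete, here's a minimal sketch of reporting percentiles instead of a mean, using only numpy (the file name is hypothetical; a t-digest library buys you the same kind of summary in a streaming, mergeable form):

    import numpy as np

    # hypothetical raw latency samples, one value per line, in milliseconds
    latencies_ms = np.loadtxt("latencies.txt")

    # report tail-aware percentiles instead of mean +/- stddev;
    # the mean of a heavy-tailed distribution hides the tail entirely
    for p in (50, 90, 99, 99.9):
        print(f"p{p}: {np.percentile(latencies_ms, p):.2f} ms")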
This is why I try to plot a proper graph of the timings behind any "optimization" I see in a PR. Too often I see people making the Gaussian assumption, and even if they're right they usually forget to take the width of the distribution into account (i.e. wow, your speedup is 5% of a standard deviation!)
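Something like this, as a rough sketch (the file names and the before/after split are made up for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    # hypothetical per-request timings before/after the PR, in ms
    before = np.loadtxt("before.txt")
    after = np.loadtxt("after.txt")

    # overlay the full distributions instead of comparing two means
    plt.hist(before, bins=100, alpha=0.5, label="before")
    plt.hist(after, bins=100, alpha=0.5, label="after")
    plt.xlabel("latency (ms)")
    plt.legend()
    plt.savefig("latency_hist.png")

    # how big is the claimed speedup relative to the spread of the data?
    shift = np.mean(before) - np.mean(after)
    print(f"mean shift = {shift:.3f} ms "
          f"({shift / np.std(before):.1%} of one standard deviation)")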
yep, have made that mistake before. even turned in a write-up for a measurement project in a graduate-level systems course that reported network-performance-dependent measurements as means over trials, with error bars from standard deviations.
sadly, the instructor just gave it an A and moved on. (that said, the amount of work that went into a single semester project was a bit herculean, even if i do say so myself)
or put another way, if you can't model it, you're going to have to sort, or estimate a sort, because that's all that's really left to do.
this shows up in everything from estimating centers with means/percentiles to doing statistical tests with things like the wilcoxon tests.
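a tiny sketch of both ends of that, with made-up lognormal "latencies" (scipy's mannwhitneyu is the wilcoxon rank-sum test for two independent samples):

    import numpy as np
    from scipy.stats import mannwhitneyu

    rng = np.random.default_rng(0)
    # made-up heavy-tailed samples: lognormal, so the mean is dominated by the tail
    a = rng.lognormal(mean=1.0, sigma=1.0, size=5000)
    b = rng.lognormal(mean=0.9, sigma=1.0, size=5000)

    # "estimate a sort": the median and other quantiles are just order statistics
    print("median a:", np.quantile(a, 0.5), "median b:", np.quantile(b, 0.5))

    # rank-based test (wilcoxon rank-sum / mann-whitney u): no gaussian assumption,
    # it only looks at the ordering of the pooled samples
    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    print(f"U = {stat:.0f}, p = {p:.3g}")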