The funnel shape of the scatter plot immediately reminded me of an article on the insensitivity to sample size pitfall [0], which points out that you'll expect entities with smaller sample sizes to show up more often in the extremes because of the higher variance.
Looks like the tags with the biggest differences exemplify this pretty well.
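A quick back-of-the-envelope version of the effect in R (the 0.3 "true" weekend share here is made up):

p <- 0.3
n <- c(100, 1000, 100000)
se <- sqrt(p * (1 - p) / n)  # standard error of a sample proportion
cbind(n, se)                 # the n = 100 group is roughly 30x noisier than the n = 100000 group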
I also saw that triangle-shaped plot and had the same thought. I read a great paper about this recently [0] with some of the same examples as the link in the parent, but going a little further in depth.
I originally got on this topic when reading Bayesian Methods for Hackers [1]. I am still hunting for a good method to correct/compensate for this when I am doing these types of comparisons in my own work.
When I was writing my thesis I wanted to correct for that as well, and weighted my data by the log of the sample size. This made intuitive sense to me, and both my advisors seemed to agree, though none of us could find compelling papers to support it.
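Concretely, that amounts to something like this in R (d is a hypothetical data frame with one ratio and one sample size n per group):

overall <- weighted.mean(d$ratio, w = log(d$n))           # ad-hoc log-n weighting, as described above
# or, in a regression: lm(ratio ~ predictor, data = d, weights = log(n))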
It really doesn't matter - at least, not for the statistical error the parent is talking about. The effect isn't related to whether we are sampling from a larger population of programmers.
Suppose there were no difference between the usage of each language, and people just program on the weekends vs weekdays with some probability independent of language. Then, if a language has lots of users, it will likely have close to the average weekend/weekday proportion. The fewer users the language has, the more likely that it has an uneven weekend/weekday proportion just by chance. And if you plot the weekend/weekday proportions vs. the number of users, you expect a funnel shape just like the one in the article.
Therefore, the plot in the article - by itself - provides no evidence that there is any difference between the usage of different programming languages.
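You can see it with a quick simulation of that null model (all numbers here are made up):

set.seed(1)
n_langs <- 500
users <- round(10^runif(n_langs, 2, 6))    # user counts spread from 1e2 to 1e6
p_weekend <- 0.25                          # the same true weekend proportion for every language
weekend <- rbinom(n_langs, users, p_weekend)
plot(users, weekend / users, log = "x",
     xlab = "number of users (log scale)", ylab = "observed weekend proportion")
abline(h = p_weekend, lty = 2)             # the funnel narrows towards this line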
I think it's just that they are plotting sum vs ratio of two random variables. Try this (in R):
a <- runif(1000); b <- runif(1000); plot(a + b, log(a/b))
Also, they likely have a low-end cutoff (notice their x axis starts at 10^4). If you do the same to the above plot, you get even closer to that exact shape. Try:
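keep <- (a + b) > 1                        # cutoff value is arbitrary, just to mimic the x-axis floor in the article
plot((a + b)[keep], log(a/b)[keep])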
It could also be that the most popular languages in the corporate world are a compromise somewhere between enjoyable/exciting and horrible/boring. (My assumption is that a large weekend ratio correlates with enjoyable/exciting.)
The ratio of sample sizes in the OP also isn't that bad, and none of them are very small.
Can you normalise data like that based on a confidence interval? Just rescaling the graph to unify them seems wrong (it would answer something like "what do we think the distribution would look like if we distrusted the low end?"), but maybe there's a better way?
A confidence interval won't adjust the points (the point estimates), but it will give the points with lower sample sizes wider intervals (often covering zero).
Using an (empirical) Bayesian multilevel model can both attach uncertainty intervals to the point estimates and appropriately "shrink" the estimates towards zero at the low-sample-size end.
The latter is more directly interpretable, at the cost of slightly more complex modelling (and assumptions).
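A minimal sketch of the empirical-Bayes flavour of this, using beta-binomial shrinkage on made-up weekend counts (a real multilevel model would go further):

set.seed(2)
n_tags <- 30
users <- round(10^runif(n_tags, 2, 5))     # users per tag, 1e2 to 1e5
p_true <- rbeta(n_tags, 20, 60)            # true weekend rates, centred near 0.25
weekend <- rbinom(n_tags, users, p_true)
raw <- weekend / users                     # raw proportions: noisy when users is small
# fit a Beta prior across tags by the method of moments
# (a fuller analysis would use maximum likelihood or a proper multilevel model)
m <- mean(raw); v <- var(raw)
k <- m * (1 - m) / v - 1
alpha <- m * k
beta <- (1 - m) * k
# posterior mean for each tag: low-sample tags get pulled towards the overall rate
shrunk <- (weekend + alpha) / (users + alpha + beta)
cbind(users, raw = round(raw, 3), shrunk = round(shrunk, 3))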
[0] http://dataremixed.com/2015/01/avoiding-data-pitfalls-part-2...