
The funnel shape of the scatter plot immediately reminded me of an article on the insensitivity to sample size pitfall [0], which points out that you'll expect entities with smaller sample sizes to show up more often in the extremes because of the higher variance.

Looks like the tags with the biggest differences exemplify this pretty well.

[0] - http://dataremixed.com/2015/01/avoiding-data-pitfalls-part-2...




I also saw that triangle-shaped plot and had the same thought. I recently read a great paper about this [0] with some of the same examples as the link in the parent, but going into a little more depth.

I originally got onto this topic while reading Bayesian Methods for Hackers [1]. I am still hunting for a good method to correct/compensate for this when doing these types of comparisons in my own work.

[0] - http://faculty.cord.edu/andersod/MostDangerousEquation.pdf

[1] - https://github.com/CamDavidsonPilon/Probabilistic-Programmin...


When I was writing my thesis I wanted to correct for that as well, and weighted my data by the log of the sample size. This made intuitive sense to me, and both my advisors seemed to agree, though none of us found compelling papers supporting it.
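To make that concrete, a minimal sketch of the log-weighting idea in R (the numbers are made up for illustration, not from the thesis):

    n <- c(50, 500, 5000)        # per-group sample sizes (hypothetical)
    ratio <- c(1.8, 1.1, 1.02)   # observed per-group estimates (hypothetical)
    w <- log(n)                  # weight each group by the log of its sample size
    weighted.mean(ratio, w)     # log-weighted summary instead of a plain mean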


Is it really a 'sample' if they are reporting on the entirety of their data for a given period?

Is the question interpreted as extending to those not on Stack Overflow, or is it a complete census of the 'population' of their data?


It really doesn't matter - at least, not for the statistical error the parent is talking about. The effect isn't related to whether we are sampling from a larger population of programmers.

Suppose there were no difference between the usage of each language, and people just program on the weekends vs weekdays with some probability independent of language. Then, if a language has lots of users, it will likely have close to the average weekend/weekday proportion. The fewer users the language has, the more likely that it has an uneven weekend/weekday proportion just by chance. And if you plot the weekend/weekday proportions vs. the number of users, you expect a funnel shape just like the one in the article.

Therefore, the plot in the article - by itself - provides no evidence that there is any difference between the usage of different programming languages.
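To see this concretely, here's a quick simulation of that null model in R (purely illustrative: every language gets the same 25% weekend probability):

    set.seed(1)
    n_users <- round(10^runif(500, 4, 7))         # language sizes, 10^4 to 10^7
    weekend <- rbinom(500, n_users, prob = 0.25)  # weekend activity under the null
    plot(n_users, weekend / n_users, log = "x",
         xlab = "users (log scale)", ylab = "weekend proportion")
    # Small languages scatter widely; big ones hug 0.25 - the funnel.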


I would argue it's a sort of (nonrandom) proxy sample in the sense that they're sampling a fraction of the people actually programming on the weekend.


OK, so if we make sure we're only talking about:

    > "what languages tend to be **asked about** on weekends, as opposed to weekdays?" 
and:

    > "explore differences between **questions that are posted** on weekdays and weekends."
as opposed to the article title:

    > "What Programming Languages Are **Used Most** on Weekends?" 
(emphasis added), is the problem then resolved?


I think it's just that they are plotting the sum vs. the ratio of two random variables. Try this (in R):

    a <- runif(1000); b <- runif(1000); plot(a + b, log(a/b))

Also, they likely have a low-end cutoff (notice their x axis starts at 10^4). If you do the same to the above plot, you get even closer to that exact shape. Try:

    plot(a + b, log(a/b), xlim=quantile(a + b, probs=c(0.2, 1)))


Note also that their x axis is on a log scale, which makes the edges linear instead of curved. E.g.:

    plot(a + b, log(a/b), xlim=c(1, 2), log="x")


Ah, thanks. I missed that.


So the takeaway of this post is basically that SO is bad at statistics, and not what people code on their weekends?


It could also be that the most popular languages in the corporate world are a compromise somewhere in between enjoyable/exciting and horrible/boring. (My assumption is that a large weekend ratio correlates with enjoyable/exciting.)

The ratio of sample sizes in the OP also isn't that bad, and none of them are very small.



Can you normalise data like that based on a confidence interval? Just rescaling the graph to unify them seems wrong (it would answer something like "what do we think the distribution would look like if we distrusted the low end?"), but maybe there's a better way?


A confidence interval won't adjust the points (the point estimates), but it will give the points with lower sample sizes wide intervals (often covering zero).

Using an (empirical) Bayesian multilevel model can both attach uncertainty intervals to the point estimates and appropriately "shrink" the estimates towards zero at the low-sample-size end.

The latter is more directly interpretable, at the cost of slightly more complex modelling (and assumptions).
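For a concrete flavour of the shrinkage, here's a minimal empirical-Bayes sketch in R on simulated data, using a beta-binomial prior (one common choice, not the only one; nothing here comes from the OP's data):

    set.seed(42)
    total   <- round(10^runif(200, 2, 5))   # questions per tag (simulated)
    weekend <- rbinom(200, total, 0.25)     # true weekend share is 25% everywhere
    p <- weekend / total                    # raw estimates: noisy for small tags
    # Fit a beta prior to the raw proportions by the method of moments:
    m <- mean(p); v <- var(p)
    k <- m * (1 - m) / v - 1
    a0 <- m * k; b0 <- (1 - m) * k
    # Posterior means pull low-sample tags towards the pooled mean:
    shrunk <- (weekend + a0) / (total + a0 + b0)
    plot(total, p, log = "x"); points(total, shrunk, col = "red")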


Thanks! I think the shrinking you mention is what I was trying to say :)

Looking for an explanation of multilevel models, I found http://mc-stan.org/documentation/case-studies/radon.html which seems to do exactly that in the "Partial pooling model" section (see graph).


A confidence interval is not what you want since this isn't a normal distribution of values.

Instead you'd want to use a CDF that bins the values.


Thank you for pointing this out.



