Analyzing Hacker News Users’ Join Dates, Karma, and Profiles

edw519 · on May 9, 2008

"I didn’t really expect to find a whole lot of interesting things, and found what I expected."

Which is a great way to conduct research! Nice work.

This reminded me of my senior project in number theory, when I manipulated a large data set, wondering what I'd find. Eventually, I found quite a bit.

Also reminded me of this quote by Wernher von Braun:

"Basic research is what I am doing when I don't know what I'm doing."

byrneseyeview · on May 9, 2008

Linear fits? Why? Why?

Karma is a function of time since joining, participation, and quality of contributions. And it starts at 1. 'Participation' can be determined by looking at contributions per length of time. Quality is the average score per submission -- separating it from participation is a useful way to extend this to a more complicated model that takes into account the fact that people stop using HN. So the line of best fit should be something closer to 1 + t * q * p, or the sum 1 + (t0 * q * p0) + ... + (tn * q * pn) to describe folks who are off-and-on contributors.
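A rough sketch of that model in Python, for anyone who wants to play with it. The function name and the (t, p) representation of activity periods are assumptions made for illustration; nothing here comes from the original analysis, and q is held constant across periods as in the formula.

```python
from typing import Iterable, Tuple


def estimated_karma(periods: Iterable[Tuple[float, float]], q: float) -> float:
    """Estimate karma as 1 + sum(t_i * q * p_i) over activity periods.

    periods: (t, p) pairs, where t is the length of a period in days and
             p is the participation rate (contributions per day) during it.
    q:       average score per contribution, assumed constant over time.
    """
    return 1 + sum(t * q * p for t, p in periods)


# Hypothetical users, purely illustrative numbers.
steady = estimated_karma([(100, 0.5)], q=2.0)                        # one long, steady period
on_off = estimated_karma([(30, 1.0), (60, 0.0), (10, 2.0)], q=2.0)   # active, idle, active again
lurker = estimated_karma([], q=2.0)                                  # signed up, never contributed
```

An idle stretch is just a period with p = 0, so it adds nothing, and someone who never contributes stays at the starting value of 1.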

timr · on May 9, 2008

If nothing else, it's legitimate to filter non-participants before doing a regression analysis on the rest. It's technically true that you can't predict karma over time, but that's not really an interesting statement until you eliminate the large number of people who sign up, then never post or comment.

My instinct is that once you filter these people, you'll see a much stronger linear relationship between time and karma, since karma isn't normalized by the number of contributions, and number of contributions is probably a Poisson process.
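A minimal sketch of that filtering step, using synthetic data rather than the real HN dataset. The 70% share of non-participants, the karma growth rates, and all names are made-up assumptions. The data is constructed so that the effect shows up, so this only illustrates the procedure and says nothing about what the real data would show.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)

# Synthetic users: (days since joining, karma). Assumed, not measured.
users = []
for _ in range(2000):
    days = random.uniform(1, 450)
    if random.random() < 0.7:
        karma = 1  # signed up, then never posted or commented
    else:
        karma = 1 + days * random.uniform(0.5, 1.5)
    users.append((days, karma))

# Correlation over everyone, non-participants included.
all_r = pearson_r([d for d, _ in users], [k for _, k in users])

# Drop users still at the starting karma of 1, then recompute.
active = [(d, k) for d, k in users if k > 1]
active_r = pearson_r([d for d, _ in active], [k for _, k in active])
```

Filtering on karma alone is a proxy: a user who posted but was never upvoted is indistinguishable from one who never posted.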

breck · on May 9, 2008

Removing all 1's and 2's improves the relationship somewhat (more so with log(k)), but still not a whole lot.

timr · on May 9, 2008

Sounds like there's a vast gulf of people with little (but not zero) contribution, then. Can you plot # of contributions versus membership time?

breck · on May 10, 2008

don't have the contributions data, just the karma score.

DougBTX · on May 9, 2008

> And it starts at 1.

That's just a constant offset; it matters more whether the next karma level is 2 or 10. When fitting trend lines, a "linear fit" would normally satisfy y = mx + c, without limiting yourself to c = 0. Note that the posted linear fit has c = -1... apparently everyone starts with -1 karma. Lies, damned lies, and statistics, I say!
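To make the intercept point concrete, here is a small sketch comparing an ordinary least-squares fit of y = mx + c with a fit where c is pinned to the known starting karma of 1. The data points and function names are made up for illustration and are not from the posted dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx


def fit_slope_fixed_intercept(xs, ys, c):
    """Least squares for y = m*x + c with c held fixed; returns m."""
    return sum(x * (y - c) for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical (days since joining, karma) points that curve upward.
days = [0, 10, 20, 30, 40]
karma = [1, 2, 2, 9, 21]

m_free, c_free = fit_line(days, karma)                   # intercept is free to go negative
m_pinned = fit_slope_fixed_intercept(days, karma, c=1)   # intercept forced to the starting karma
```

With these numbers the free fit lands on a negative intercept even though the first point sits at karma 1, which is how a fitted line can seem to claim that everyone starts below zero.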

breck · on May 9, 2008

Agreed. If I had data for the quality of contributions I could have done that.

DougBTX · on May 9, 2008

Could you post "General Composition of the Dataset" with a logarithmic vertical scale please? Should help compensate for the outliers so we can see more detail at the bottom of the graph.

breck · on May 9, 2008

Done.

xirium · on May 9, 2008

You may want to take into account differences in file timestamps because the data was collated over many days.

breck · on May 9, 2008

Ahh, I didn't notice that. That would affect things. The timestamps and counts are: ('04/15', 630), ('04/16', 1993), ('04/17', 1994), ('04/18', 491), ('04/20', 270), ('04/21', 1049), ('04/22', 59), ('04/24', 33), ('04/25', 342), ('04/26', 29), ('05/03', 165), ('05/06', 86), ('05/07', 23). So most of the members have older join dates than I figured.
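A sketch of how the correction could look, assuming each saved profile reports its age as a "created N days ago" count and that the file timestamps are from 2008. Both are assumptions, as are the function and variable names. The counts are the ones listed above.

```python
from datetime import date, timedelta

# (file timestamp as MM/DD, number of profiles), copied from the list above.
CRAWL_COUNTS = [
    ("04/15", 630), ("04/16", 1993), ("04/17", 1994), ("04/18", 491),
    ("04/20", 270), ("04/21", 1049), ("04/22", 59), ("04/24", 33),
    ("04/25", 342), ("04/26", 29), ("05/03", 165), ("05/06", 86),
    ("05/07", 23),
]

total_profiles = sum(count for _, count in CRAWL_COUNTS)


def join_date(crawl_mmdd: str, days_ago: int, year: int = 2008) -> date:
    """Join date implied by a profile saved on crawl_mmdd that reported
    'created days_ago days ago'. The year is an assumption."""
    month, day = map(int, crawl_mmdd.split("/"))
    return date(year, month, day) - timedelta(days=days_ago)


# The same "100 days ago" means different join dates depending on the file date.
first_batch = join_date("04/15", 100)
last_batch = join_date("05/07", 100)
drift_days = (last_batch - first_batch).days
```

Computing the join date per file, rather than from a single reference date, removes the drift between the first and last batches.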

wallflower · on May 9, 2008

The interesting thing about karma: I find that I get tired of posting insightful comments day after day, so I take breaks and lurk. edw519, I don't know how you do it. I probably won't make the "leaders" (but I don't think it's important).

thaumaturgy · on May 9, 2008

> (but I don't think it's important)

I think that's the crux of it. Somebody could monitor all the various news sites and spend an hour a day here posting comments and stories and so forth, but I suspect most folks would rather spend that time doing something else.

That said, edw519 is a pretty cool guy.

nostrademons · on May 9, 2008

I tend to post in bursts, so most of my karma comes over periods of a week or two when I'm posting several times a day. It's actually a bad sign, because it means I'm not working on my startup. ;-) Then there are periods where I'll post like twice a week and my RescueTime log'll show that I'm spending like 80-90% of my computer time coding. So yeah, it's a tradeoff, and ultimately the code is more important, but I find that I burn out if I spend too much time coding.

pierrefar · on May 9, 2008

Nice work. I love manipulating large data sets :)

Which program did you use to produce the plots?

breck · on May 9, 2008

mooneater · on May 9, 2008

Scatterplots are too dense!

omfut · on May 9, 2008

Just curious, how do karma points work?