Analyzing Hacker News Users’ Join Dates, Karma, and Profiles (breckyunits.com)
36 points by kradic on May 9, 2008 | 19 comments



"I didn’t really expect to find a whole lot of interesting things, and found what I expected."

Which is a great way to conduct research! Nice work.

This reminded me of my senior project in number theory, when I manipulated a large data set, wondering what I'd find. Eventually, I found quite a bit.

Also reminded me of this quote by Wernher von Braun:

"Basic research is what I am doing when I don't know what I'm doing."


Linear fits? Why? Why?

Karma is a function of time since joining, participation, and quality of contributions. And it starts at 1. 'Participation' can be determined by looking at contributions per length of time. Quality is the average score of each submission; separating it from participation is a useful way to extend this to a more complicated model that accounts for people who stop using HN. So the line of best fit should be something closer to 1 + t * q * p, or the sum 1 + (t0 * q * p0) + ... + (tn * q * pn) to describe folks who are off-and-on contributors.
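A minimal sketch of that model, for anyone who wants to play with it. The function name, the units (days, posts per day), and the example numbers are all made up for illustration; only the formula comes from the comment above, and q is assumed constant across periods.

```python
def expected_karma(periods, quality):
    """Karma model from the comment: 1 + sum(t_i * q * p_i).

    periods: list of (t, p) pairs, where t is a length of time (here, days)
             and p is contributions per unit time during that stretch.
    quality: q, the average score per contribution (assumed constant).
    """
    return 1 + sum(t * quality * p for t, p in periods)


# Hypothetical steady contributor: 100 days at 0.5 posts/day, 2 points per post.
steady = expected_karma([(100, 0.5)], quality=2.0)

# Hypothetical off-and-on contributor: active 30 days, idle 60, active 10.
on_off = expected_karma([(30, 1.0), (60, 0.0), (10, 2.0)], quality=2.0)

print(steady, on_off)
```

Both users end up with the same karma despite very different histories, which is one reason a single straight line through karma vs. join date fits poorly.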


If nothing else, it's legitimate to filter non-participants before doing a regression analysis on the rest. It's technically true that you can't predict karma over time, but that's not really an interesting statement until you eliminate the large number of people who sign up, then never post or comment.

My instinct is that once you filter these people, you'll see a much stronger linear relationship between time and karma, since karma isn't normalized by the number of contributions, and number of contributions is probably a Poisson process.
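Here is a rough sketch of the filter-then-fit idea. The (days since joining, karma) pairs are invented, not taken from the article's dataset, and the cutoff of karma > 2 is just one possible definition of "participant".

```python
from math import sqrt


def pearson_r(points):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    return sxy / sqrt(sxx * syy)


# Hypothetical users: (days since joining, karma). Invented for illustration.
users = [
    (10, 1), (50, 1), (100, 1), (200, 1), (300, 1), (400, 2),  # signed up, barely posted
    (20, 5), (60, 30), (120, 55), (250, 130), (380, 190),      # active contributors
]

# Drop accounts that never moved past the starting karma (assumed cutoff: > 2).
participants = [(t, k) for t, k in users if k > 2]

r_all = pearson_r(users)
r_filtered = pearson_r(participants)
print(f"r (everyone) = {r_all:.2f}, r (participants only) = {r_filtered:.2f}")
```

On this toy data the correlation is much stronger after filtering, but that is by construction; whether the real data behaves the same way is exactly the open question.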


Removing all 1's and 2's improves the relationship somewhat (more so with log(k)) but still not a whole lot.


Sounds like there's a vast gulf of people with little (but not zero) contribution, then. Can you plot # of contributions versus membership time?


I don't have the contributions data, just the karma score.


> And it starts at 1.

That's just a constant offset; it matters more whether the next karma level is 2 or 10. When fitting trend lines, a "linear fit" would normally satisfy y = mx + c, without limiting yourself to c = 0. Note that the posted linear fit has c = -1... apparently everyone starts with -1 karma. Lies, damned lies, and statistics, I say!
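To illustrate how a free intercept can land below zero even when everyone starts at 1, here is a small sketch comparing an ordinary least-squares fit of y = mx + c with a fit that pins c = 1. The data points are invented, and the helper names are mine, not anything from the article.

```python
def fit_free_intercept(xs, ys):
    """Ordinary least squares for y = m*x + c, with both m and c free."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


def fit_fixed_intercept(xs, ys, c):
    """Least squares for y = m*x + c with c held fixed; only m is fitted."""
    return sum(x * (y - c) for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical data: karma starts at 1 and grows faster than linearly.
days = [0, 10, 20, 30, 40]
karma = [1, 2, 5, 12, 30]

m_free, c_free = fit_free_intercept(days, karma)
m_pinned = fit_fixed_intercept(days, karma, c=1)

print(f"free intercept:   karma = {m_free:.2f} * days + ({c_free:.2f})")
print(f"pinned at c = 1:  karma = {m_pinned:.2f} * days + 1")
```

With curved data like this, the free fit trades a negative intercept for a smaller overall error, so a fitted c below 1 says more about the shape of the data than about anyone's starting karma.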


Agreed. If I had data for the quality of contributions I could have done that.


Could you post "General Composition of the Dataset" with a logarithmic vertical scale please? Should help compensate for the outliers so we can see more detail at the bottom of the graph.


Done.


You may want to take into account differences in file timestamps because the data was collated over many days.


Ahh, I didn't notice that. That would affect things. The timestamps and counts are: ('04/15', 630), ('04/16', 1993), ('04/17', 1994), ('04/18', 491), ('04/20', 270), ('04/21', 1049), ('04/22', 59), ('04/24', 33), ('04/25', 342), ('04/26', 29), ('05/03', 165), ('05/06', 86), ('05/07', 23). So most of the members have older join dates than I figured.
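For a quick sense of how skewed the collection dates are, here is a short sketch that totals the counts above and measures how stale each batch is relative to the last file date. The year 2008 is an assumption (the list only gives month/day), and "lag" here just means days before 05/07.

```python
from datetime import date

# (file timestamp, number of profiles) exactly as listed in the comment above.
snapshots = [
    ("04/15", 630), ("04/16", 1993), ("04/17", 1994), ("04/18", 491),
    ("04/20", 270), ("04/21", 1049), ("04/22", 59), ("04/24", 33),
    ("04/25", 342), ("04/26", 29), ("05/03", 165), ("05/06", 86),
    ("05/07", 23),
]

YEAR = 2008  # assumption: the timestamps carry no year


def parse(stamp):
    """Turn an 'MM/DD' string into a date in the assumed year."""
    month, day = stamp.split("/")
    return date(YEAR, int(month), int(day))


total = sum(count for _, count in snapshots)
last_day = parse(snapshots[-1][0])

# Days between each file's timestamp and the final collection date.
lag_days = {stamp: (last_day - parse(stamp)).days for stamp, _ in snapshots}

# Running share of profiles collected by each date.
cumulative = []
running = 0
for stamp, count in snapshots:
    running += count
    cumulative.append((stamp, running / total))

# Average staleness of a profile, weighted by how many were fetched each day.
mean_lag = sum(lag_days[stamp] * count for stamp, count in snapshots) / total

print(f"total profiles: {total}")
print(f"share collected by 04/18: {cumulative[3][1]:.1%}")
print(f"mean lag behind 05/07: {mean_lag:.1f} days")
```

One way to use this would be to measure each member's account age against the date their own profile was fetched rather than a single reference date.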


The interesting thing about karma: I find that I get tired of posting insightful comments day after day, so I take breaks and lurk. edw519, I don't know how you do it. I probably won't make the "leaders" (but I don't think it's important).


> (but I don't think it's important)

I think that's the crux of it. Somebody could monitor all the various news sites and spend an hour a day here posting comments and stories and so forth, but I suspect most folks would rather spend that time doing something else.

That said, edw519 is a pretty cool guy.


I tend to post in bursts, so most of my karma comes over periods of a week or two when I'm posting several times a day. It's actually a bad sign, because it means I'm not working on my startup. ;-) Then there are periods where I'll post like twice a week and my RescueTime log'll show that I'm spending like 80-90% of my computer time coding. So yeah, it's a tradeoff, and ultimately the code is more important, but I find that I burn out if I spend too much time coding.


Nice work. I love manipulating large data sets :)

Which program did you use to produce the plots?


JMP


Scatterplots are too dense!


Just curious, how do karma points work?



