Privacy issues aside, I'd love a co-occurring votes / collaborative filtering based approach to HN. The problem here is that a fixed threshold of votes is a coarse estimation of quality and relevance.
Imagine a system where each HN vote is weighted according to your similarity to that voter. That way a vote from someone with whom I have very little in common would also be worth very little to me.
I'd love to view a HN where I don't see the highly-voted Gruber/Apple/Facebook posts but I still see the stuff about Clojure, Steve Blank, and patio11.
No, I'm sorry, but I can't use keyword filtering for what I'm describing. Let me explain:
What I'm talking about here is uncovering "latent" communities, if you will. As in, build a giant matrix with users as the columns and posts as the rows, then use the singular vectors to make recommendations (see SVD: http://en.wikipedia.org/wiki/Singular_value_decomposition).
The benefit of this approach is that I no longer have to be conscious of the topics I am filtering in or out. Even keyword based filtering is, again, a coarse estimation of relevance. I may be very interested in clojure, but I'm certainly not interested in every article that contains 'clojure' in the title.
An SVD (or similar) approach would filter by my interests loosely based on the co-occurrence of votes. That is, a vote from someone whose voting history overlaps heavily with mine is worth more to me than a vote from someone who has never voted the same way I have on any post.
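The idea above can be sketched in a few lines. This is a toy illustration, not a real HN recommender: the vote matrix, the choice of k, and the `recommend` helper are all made up for demonstration, and real vote data would be far larger and sparser.

```python
import numpy as np

# Toy vote matrix: rows are posts, columns are users (as described above).
# 1 = upvote, 0 = no vote. The posts and voting patterns are invented.
votes = np.array([
    [1, 1, 0, 0],   # Clojure post: voted by users 0 and 1
    [1, 1, 0, 1],   # patio11 post: voted by users 0, 1, 3
    [0, 0, 1, 1],   # Apple post:   voted by users 2 and 3
    [0, 0, 1, 0],   # Gruber post:  voted by user 2
])

# Truncated SVD: keep k latent "communities" and reconstruct the matrix.
# The reconstruction fills in scores for posts a user never voted on.
k = 2
U, s, Vt = np.linalg.svd(votes, full_matrices=False)
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def recommend(user, n=2):
    """Rank posts for `user` by reconstructed score, skipping posts already voted on."""
    scores = approx[:, user]
    unseen = [p for p in np.argsort(scores)[::-1] if votes[p, user] == 0]
    return unseen[:n]
```

Because similar users load onto the same latent factors, a post upvoted by your "community" gets a high reconstructed score for you even if you never saw it.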
I question whether SVD would yield good recommendations.
In any case, co-voting data is not scrape-able from the public HN site, so I think using keywords and urls is really the only realistic filtering option at this point.
You can use people's comments as a (loose) proxy for their interest in a post; people who comment on something are more likely to have upvoted it (or at least consider it worthwhile to talk about, even if they never really vote on things). You could perhaps even use sentiment analysis, and take negative root-level comments as downvotes (and prune any branch below a negative comment, because it's probably an argument).
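A rough sketch of that proxy-vote idea, with heavy caveats: the sentiment "model" here is just a toy word list standing in for real sentiment analysis, and the thread format is invented for illustration.

```python
# Toy negative-word lexicon; a real system would use an actual sentiment model.
NEGATIVE = {"terrible", "wrong", "awful", "waste"}

def sentiment(text):
    """Return -1 if the comment looks negative, else +1 (toy heuristic)."""
    return -1 if set(text.lower().split()) & NEGATIVE else 1

def proxy_votes(thread):
    """thread: list of (depth, author, text), where root comments have depth 0.
    Returns {author: vote}, skipping replies beneath a negative root comment
    (those are probably an argument, as suggested above)."""
    votes, pruning = {}, False
    for depth, author, text in thread:
        if depth == 0:
            s = sentiment(text)
            votes[author] = s
            pruning = (s < 0)            # prune the branch below a negative root
        elif not pruning:
            votes.setdefault(author, 1)  # a reply counts as a weak upvote
    return votes
```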
Speaking of this, does anyone know of a good RSS filter? By that I mean a service to which you give a link to an RSS feed and provide certain filters, and they will provide a link to a modified version of the feed that they host themselves.
Awesome, this is exactly what I wanted! By filtering out all the Apple/iPhone/iPad-related crap, I'll probably be able to cut the number of new items from tech blogs in half.
Everyone forgets about postrank.com (formerly aiderss.com). It is actually what I use to filter HN... but who knows, I may switch over to the HN50 or HN100 described above.
I don't think that achieves the same thing. First, it requires the user to manually filter out new stuff they aren't interested in, whereas a machine learning approach will evolve as the content space evolves.
Think of it like email spam. You can set up manual filters to filter out email spam, but that is a constant and never-ending stream of work for you. A simple Bayesian filter like pg has described will require far less work and give far better results.
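A minimal sketch of such a Bayesian filter applied to story titles. This is the textbook naive Bayes formulation with Laplace smoothing and a uniform prior, not pg's exact "Plan for Spam" weighting, and the training titles are invented examples.

```python
import math
from collections import Counter

def tokenize(title):
    return title.lower().split()

class TitleFilter:
    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    def train(self, title, liked):
        """Record the tokens of a title the user did (or didn't) like."""
        for tok in tokenize(title):
            self.counts[liked][tok] += 1
            self.totals[liked] += 1

    def p_liked(self, title):
        """Posterior probability the user likes this title."""
        vocab = len(set(self.counts[True]) | set(self.counts[False]))
        log = {c: 0.0 for c in (True, False)}
        for tok in tokenize(title):
            for c in (True, False):
                # Laplace smoothing so unseen tokens don't zero out a class.
                log[c] += math.log((self.counts[c][tok] + 1) / (self.totals[c] + vocab))
        m = max(log.values())
        odds = {c: math.exp(log[c] - m) for c in (True, False)}
        return odds[True] / (odds[True] + odds[False])
```

Unlike a hand-maintained keyword list, the filter keeps adapting as the user keeps labeling stories.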
In this case, a machine learning approach is even better because it can surface stories a user will be very interested in even though they would never make it to the current homepage.
I'd like a Best of Hacker News that, like Best of Reddit, links to exceptionally interesting comments. The comments on HN are usually far more interesting than the stories, and often interesting comments are attached to uninteresting stories.
Great solution, and glad to see how you expanded it to other thresholds. I think this might be the best automated way to track the best of Hacker News (of course I think my http://www.hackernewsletter.com is another great way, with more of a non-automated feel + other content.)
Very good project. I would love to see, in the RSS feed, the real article (the HN text or the link the story points to). This would be great so that we don't have to get out of Google Reader to actually read the article.
I prefer http://www.daemonology.net/hn-daily/: that's the top 10 articles of the previous day. Reading them one day late is not too bad for me, and it helps me procrastinate less: it has put an end to my "let's see if there's something new".
Still this is a very good alternative, which I'll probably end up trying :)
If you're not time-bound to the absolute latest, also check out the new http://hackerbra.in - it will shade down the snapshots you've already seen in its history.
However, is anyone else concerned that the driver's seat of HN is essentially being handed over to new users as more advanced users switch to services like this or "HN daily" or even the "/best" page? It seems like the people who don't know anything exists beyond the front page will become the only ones left to do the work of curating content.
(just speculating, nothing against new users I haven't been here that long myself...)
As long as only a minority of users use this, great. Otherwise, if everyone's reading through things like this, WTF is voting the stories up in the first place?
Part of what makes HN good is that most people hit the site and help with the voting. The more people who move away to reading HN remotely or through feeds, the worse the voting situation gets.
Agreed, but I don't think there's much to worry about as it stands. pg said newsyc currently gets 60,000 unique visitors per day, and the number of people following these Twitter feeds is barely 1% of that.
Say we snapshot a cache frequently, either centrally or locally. It would be so much easier to just diff the current HN page against the last central or local snapshot during each session, and highlight what has changed significantly.
I was thinking of adding that as a feature to http://hackerbra.in. Currently, news is every 2hrs, ask 5hrs, best 11hrs. The site tells you if you've already viewed the current snapshot, but actually highlighting the new items since the last snapshot would be an interesting feature. Also, when stored, having a tweet go out with the number of new items for that snapshot could be useful.
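The snapshot-diff idea above boils down to a set difference. A minimal sketch, assuming snapshots are stored as lists of (id, title) pairs (a real scraper would pull these from the HN front page; the format here is invented for illustration):

```python
def new_items(previous, current):
    """Return current items whose ids were absent from the previous snapshot.

    These are the items a UI would highlight (or a bot would tweet a count of)
    when the new snapshot is stored.
    """
    seen = {item_id for item_id, _ in previous}
    return [(item_id, title) for item_id, title in current if item_id not in seen]
```

Diffing on item ids rather than titles means a story that merely moves position on the page is not flagged as new.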