With all due respect Joel, it seems you missed a good number of the
inputs you could have used to train your classifier.
If I (jcr) want to know what classifies as "interesting to Joel"
(joelthelion), I simply look at the "comments" and "submissions" links
in your HN profile. It will show me the stuff you took the time to
comment on, or took the time to submit to HN.
If I want to know what's "interesting" to me, the "saved stories" link
is visible in my own profile, even though it is not visible to others.
The "saved stories" list is THE goldmine: every submission I've either
submitted or up-voted:
https://news.ycombinator.com/saved?id=jcr
Depending on your personal bookmarking habits, your bookmarks file/db
can be another useful input. I'm in the habit of bookmarking both the
submitted article and the HN discussion page (if it's good). Even if I
didn't bookmark the HN discussion, I could easily find the HN
discussion/submission for every site I've ever bookmarked with a search
engine: give Google a URL and ask it for all of the sites linking to
that URL, then parse for HN, and you've got the target.
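For what it's worth, that lookup is a few lines of Python. This sketch uses the hn.algolia.com search API (restricting the search to the URL field) rather than the Google "link:" trick; the example URL is just an illustration:

    # Find the HN submission(s) for a bookmarked URL by searching
    # the URL field of the HN Search API.
    import requests

    def hn_discussions(url):
        resp = requests.get(
            "https://hn.algolia.com/api/v1/search",
            params={"query": url, "restrictSearchableAttributes": "url"},
        )
        resp.raise_for_status()
        for hit in resp.json()["hits"]:
            yield hit["title"], "https://news.ycombinator.com/item?id=" + hit["objectID"]

    for title, item_url in hn_discussions("http://www.paulgraham.com/avg.html"):
        print(title, item_url)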
The serious problem I see with your approach was already mentioned by
ck2; you're creating a bubble and will miss out on all the fantastic
stuff that is interesting to you, but you don't yet know that it's
interesting to you.
One of the primary benefits of HN and similar sites is learning about
the things that others find interesting. Those things may not interest
me, but the fact that others find them interesting is, well,
interesting.
Why do they consider it interesting?
Why do I consider it uninteresting?
Even if my personal opinions remain unchanged, these are important
questions for me to keep asking myself, repeatedly.
This is also why cabals are so destructive. They cause homogeneity in what would normally be a diverse set of topics. I wonder if you could train your classifier to reject topics that are similar to each other or are being promoted by the same set of people?
There's a lot of stuff on HN I like, and a lot of stuff I find boring or irrelevant. Unfortunately, the stuff that I really like tends to be novel and unpredictable, so trying to teach some kind of Bayesian classifier to recognise things I'll like is probably not going to work.
Personally, I think I'd be perfectly happy with an old-school killfile: do not show me posts whose headlines contain the strings "X", "Y" or "Z", or that link to sites "A", "B" or "C".
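That kind of killfile is only a few lines. A minimal sketch, assuming stories arrive as (title, url) pairs; the blocked strings and sites here are placeholders:

    # Old-school killfile: drop stories whose titles contain blocked
    # strings or whose links point at blocked sites.
    from urllib.parse import urlparse

    KILL_TITLES = ("SOPA", "Pinterest", "iPad")    # the strings "X", "Y", "Z"
    KILL_SITES = ("example-blogspam.com",)         # the sites "A", "B", "C"

    def keep(title, url):
        if any(s.lower() in title.lower() for s in KILL_TITLES):
            return False
        host = urlparse(url).netloc
        return not any(host == s or host.endswith("." + s) for s in KILL_SITES)

    stories = [("Show HN: a weekend project", "http://example.org/post"),
               ("Yet another SOPA roundup", "http://example-blogspam.com/x")]
    print([t for t, u in stories if keep(t, u)])   # only the first survives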
I have a hacked-up version of HN, with tags and decent search added, that I'm using as a personal journal. I haven't removed account creation or anything from regular HN, so you can still post new stories, vote and comment with an account: http://kruhft.dyndns.org/discuss
Or maybe even going down the "subreddit" route, i.e. a separate section for Apple content, Python hacking, SOPA/PIPA etc. In other words, create a "bucket" for the general high-level, recurring HN topics and allow users to pick which ones they want to see and those they don't.
What if you set your initial goals small? The goal would not be to highlight only the good stuff, but to get rid of the things that are obviously bad for you.
Make two classifiers: one that tries to decide "is this interesting to thristian", and another that decides "is this similar to an existing story". A story then gets demoted only if it is both uninteresting and similar to something that already exists.
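A sketch of how those two pieces could combine; the duplicate detector here is a toy TF-IDF cosine check, and the interestingness probability is assumed to come from a trained classifier like the author's:

    # Demote only stories that are BOTH uninteresting AND near-duplicates.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    seen_titles = ["Google releases new ML library",
                   "Apple announces new MacBook"]
    vec = TfidfVectorizer().fit(seen_titles)
    seen = vec.transform(seen_titles)

    def is_near_duplicate(title, threshold=0.5):
        return cosine_similarity(vec.transform([title]), seen).max() >= threshold

    def demote(title, p_interesting):
        # p_interesting would come from the "is it interesting" classifier
        return p_interesting < 0.5 and is_near_duplicate(title)

    print(demote("Apple announces new iMac", p_interesting=0.2))   # True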
> do not show me posts whose headlines contain the strings "X", "Y" or "Z"
The problem with that is that these days there are so many "clickbait" headlines, some of which offer absolutely no insight into the content itself.
Without policing each and every submission to ensure the headlines accurately describe the linked article, it would be difficult to guarantee any kind of success with basic headline filters.
Old timers (‘scuse me while I slam a geritol and bourbon) will remember that when reddit first launched, there was a recommendation engine that purportedly took your votes and turned them into a personal page of stories you would enjoy.
It was scrapped and eventually subreddits were introduced. I think people like the idea of communities.
That being said... The fact that something failed one, two, or a hundred times in the past doesn’t mean it won’t work today. Things may be different today, or perhaps the approach may be slightly different. If Google can make bazillions of dollars using machine learning to optimize the ads displayed on a page, I’m pretty confident machine learning can be used to optimize the likelihood that you’ll upvote articles suggested by a bot.
The question of whether this becomes an echo chamber and you stop finding things that are interesting but outside of your current tastes is deep. There’s an old saying, “The best present is something you didn’t know you wanted until you unwrapped it.”
Reading HN will also not expose you to non-hacker-related information. At the same time, it's always possible to weight the stories you're more likely to like so that they're more likely to be chosen, and to introduce some randomness so you're still exposed to the odd lower-rated story.
More than that, subreddits are why Reddit didn't turn into Digg.
My theory is that as communities grow, you have one of two choices to maintain cohesion: aggressively police who is let in or give people with divergent interests space to express those interests.
If you don't, and your community is growing, eventually the most valuable members will no longer find value in the community. This is just reversion to the mean. But the consequence is that those valuable people will then leave, and the cycle repeats until the community evaporates entirely.
Subreddits are a "pressure valve" that allow high-value users to self-select and maintain their own set of norms without sacrificing the growth of the main public square.
It's not the stories and comments that are fundamentally valuable to HN, it's the people.
I remember there being a lot of discussion before subreddits where most people (myself included) wanted tagging instead of subdomains. The subreddits have obviously worked well but I think they (along with self posts and image previews) also changed reddit from a news aggregator into a glorified forum.
The community is very important. But there's something to be said for the ability to filter properly - there was a blog post on HN recently discussing how aggregate sites end up with the lowest common denominator. What ends up on the first page is only a small % of the submissions that are actually interesting. I probably miss half of the stories that I would like just because they fall off before I even see them.
The question is, how do you figure out what's interesting to who? Joel is attempting to answer this for himself. A site that could answer it for everyone while still somehow maintaining social bonds would do very well I think.
I have been working at it with http://hubski.com. In short, you choose people to filter content for you based on what they post, and what they share. You in turn choose what to pass on to those who follow you. It's been working pretty well.
I'm still working on recommendations. There are suggestions once you follow someone. It's a good way to get started, but the best way to get good content is to hand pick folk based on what they have shared.
I remember about 5 years ago when I was looking for something else like reddit only more technical, saw HN and thought, "wow, I don't have a clue what 90% of these people are talking about. Seems interesting."
By filtering out stuff, you'll never expose yourself to things outside your "pattern".
HN's subtitle is "Links for the intellectually curious"
I guess HN has a dual purpose to keep people up on "breaking" hacker news, but I like to think of it as "hacker news outside your thinking pattern".
Also why do people immediately go to AWS for testing something? Doesn't a real hacker have their own server handy for experimental projects, or is it only me?
The thing is, HN doesn't just contain links for the intellectually curious. It often has stories about a new school of thought, or an approach to design, or a new programming language or something. However, it also often has stories about Internet Drama in the tech startup community, or American political events, or practical advice for people entering the business world for the first time. They're popular here because they're relevant to a large proportion of the HN user-base, but it's certainly possible for someone to be both intellectually curious and uninterested in American politics.
I like the approach of going outside of your "comfort zone" and discovering new things. Limiting my exposure to what I enjoy most is not an intellectually curious strategy.
> By filtering out stuff, you'll never expose yourself to things outside your "pattern".
What makes you think that a Naive Bayes classifier will automatically put all unseen/unexpected/surprising data in just one of two categories?
Indeed, to the classifier there are just two categories; it doesn't "know" what's "in" or "out".
In fact, it is much more likely that the classifier will distribute the unexpected data (read: data that doesn't contain many of the features it was trained on) rather evenly between the two categories, based on the other features. Which is exactly what you would want it to do.
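That is easy to check on a toy example: give a balanced multinomial naive Bayes a title made entirely of unseen words, and the posterior falls back to the class priors (50/50) instead of landing in one bucket:

    # Unseen words carry no signal; the posterior equals the prior.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    titles = ["patent troll lawsuit", "patent ruling appeal",
              "pinterest growth hack", "pinterest clone launch"]
    labels = [1, 1, 0, 0]                      # 1 = interesting, 0 = not

    vec = CountVectorizer().fit(titles)
    clf = MultinomialNB().fit(vec.transform(titles), labels)

    # Every word is out-of-vocabulary, so this prints [[0.5 0.5]].
    print(clf.predict_proba(vec.transform(["quantum erlang knitting"])))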
The author also said he plotted precision-recall curves (too bad he just showed a screenshot with the numbers instead of the graphs). That sort of analysis is bound to bring out such behaviour.
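For anyone who wants to reproduce that analysis on their own data, it is a single scikit-learn call; the labels and scores below are made up for illustration:

    # Precision-recall curve from held-out labels and classifier scores.
    from sklearn.metrics import precision_recall_curve

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # did I like it?
    y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # model's P(like)

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    for p, r, t in zip(precision, recall, thresholds):
        print("threshold=%.2f  precision=%.2f  recall=%.2f" % (t, p, r))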
Hey everyone, thanks for the comments. There's too much to respond to everyone, but a lot of people have brought up an "echo chamber" concern.
As other people have pointed out, the naive Bayes model works topically, so it will learn that I like stories about "patents" but not (usually) whether the stories are pro or anti-patent. It is totally true that I might miss an interesting story about the new OSX or about Pinterest, but I'm willing to live with that.
Two larger points are that
1. HN is only a small fraction of the news I consume, so it wouldn't matter that much to me even if it were an echo chamber, and
2. The main reason I did this is that I simply couldn't keep up with the volume of stories otherwise.
Last night when I spoke about this, someone asked me whether I was concerned about all the false negatives I was missing. But before I started this, my RSS feed had like 800 (and growing) unread HN articles in it. Reading some of them, even a targeted subset, is better than none.
Anyway, thanks for all the comments. I'm surprised (and glad) that people are so interested in this!
I applied naive Bayes to generic news for a project a few years ago. Counter to some of the comments here, I think it works surprisingly well for filtering articles, and is a great way to start.
One of the nicest aspects of it is that it doesn't support a user's confirmation bias: your perspective isn't taken into account in filtering since it's just looking at keywords. That's probably not as important here, but especially on political news it's highly relevant. If I'm a Democrat, I don't want just left-leaning news to come through the grapevine, because that prevents me from seeing the other perspective.
I think what he means is that if you take the entire article as a bag-of-words model, the general topic of the article will be a stronger signal than the position the article takes with respect to the subject.
The latter is a lot harder to extract from a bag-of-words. In fact the only thing I can imagine is if an article uses a lot of euphemisms or negative synonyms of a topic.
So you get the factual reporting articles regardless of left/right bias. With the more polemical, hyperbole-heavy articles, it will filter out a bit more of the angle you disagree with, which adds some bias, but it's also good for your blood pressure, and a polemic you disagree with is not going to make your view more balanced either.
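A tiny illustration of why the topic dominates: a pro-patent and an anti-patent headline produce nearly identical bag-of-words vectors, differing only in the stance words:

    # Topic words dominate the bag-of-words; stance is a small difference.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["why software patents protect innovation",
            "why software patents destroy innovation"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs).toarray()
    print(vec.get_feature_names_out())
    print(X)   # the rows differ only in the protect/destroy columns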
I've been working on this kind of stuff on and off for a while now. I still think it's a great idea, but recommendations are a tricky business and naive Bayes is not good enough.
Nice article and an interesting approach. Thanks for writing it up.
The model can only get better with more training data,
which requires me to judge whether I like stories or not.
I do this occasionally [using] the above command-line tool,
but maybe I’ll come up with something better in the future.
Well, you could analyse your server logs to see which stories you really did click on and which you skipped over. (You'll probably want to consider only pages with at least one click, to handle the edge case of "I didn't even look at that page", and you'll need a cookie or login to make sure it only counts your clicks.)
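A rough sketch of that log mining, assuming the filtered feed routes story links through a redirect like /out?id=123 so every click leaves a log line; the log path and URL scheme are assumptions, not the author's setup:

    # Count clicks per story id in an access log; clicked = positive label.
    import re
    from collections import Counter

    CLICK = re.compile(r'"GET /out\?id=(\d+) ')
    clicks = Counter()
    with open("/var/log/nginx/access.log") as fh:
        for line in fh:
            m = CLICK.search(line)
            if m:
                clicks[m.group(1)] += 1

    positives = {story_id for story_id, n in clicks.items() if n > 0}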
Also consider scraping your HN "saved stories" list as a positive source.
Don't recall if you mentioned it in your article, but you'll probably want to randomly insert the occasional low-scoring article as a check for under-weighting.
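The logic for that spot check is tiny: with small probability, let a low-scoring story through anyway, so systematic under-weighting shows up as unexpected upvotes. The threshold and epsilon here are arbitrary:

    # Occasionally surface a story the model scored low.
    import random

    def reading_list(scored_stories, epsilon=0.05):
        for story, p_interesting in scored_stories:
            if p_interesting >= 0.5 or random.random() < epsilon:
                yield story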
I really wish our profile pages supplied a (private) log of up/down votes for comments and flags for stories, in addition to the upvotes for stories. It would make for some interesting data mining.
Why is HN opposed to scraping? I think it's because the front page is dynamically generated, so a bot would waste resources; the site already gets loaded (slow to respond) during peak times.
But then why would HN refuse to provide a useful, standards-compliant RSS feed (with number of comments, points, time of submission, and id) that could be used to develop apps for a better reading experience? http://api.ihackernews.com/ is unreliable at best.
Because code written by N hackers scraping HN has a greater effect on the site than everyone using the reliable HNSearch API: http://www.hnsearch.com/api
Ask HN: Does HN have an API and if not what's the etiquette for scraping?
http://news.ycombinator.com/item?id=2138730
Ask HN: Is there an API for HN?
http://news.ycombinator.com/item?id=1107874
Great move with the Bayes classifier. I'm more interested in how stories are received on other networks, so I made http://hnfluence.com for my HN consumption. Ultimately I'm looking to see how one network affects another.
It's 2AM so forgive my skimming of the article, but that is a heck of a lot of work and it looks to be quite good. Will take another looksee tomorrow when I can think straight.
Hopefully when I get back to hackernews sometime tomorrow this will be frontpage where it belongs. :)
As a recommendation, clicking the up arrow on a story will put it on this list: http://news.ycombinator.com/saved?id=xpose2000. That way you can view it without bookmarking. (As for hitting the frontpage, it did...)
Nevertheless, I believe that a good news source, like a good community, sometimes gives you things that you might not like -- things that challenge the filters you already have.
It's what flipping through a regular newspaper does, what Hacker News does, what listening to a good broadcast does. The key is knowing the content is going to be so good that you're willing to take that risk despite your hesitation.
Hacker News is a bit like a firehose, but that's why I read it. The people on this site are informed, opinionated and pretty damn smart. It reminds me every time I get on how little I actually know.
To be honest, a small part of me feels uncomfortable with that, but that's why I read it -- it's news I need to know.
Interesting post. This isn't quite the same, but I wanted to mention that I created a project, http://ribbot.com, to let people create their own Hacker News-style site on other topics.
THIS is what I've been looking for. It deserves a post of its own. Is the only way to make money by paying the monthly fee and using one's own ads? I am about to begin playing with it...
Awesome, thanks for checking it out. Yes, the only way is ad supported right now if you want to go that route. What else were you thinking - monthly memberships for private forums? It seems odd but I haven't thought about it much.
I'm considering making the project open source as well, so technical folks can run their own versions and modify it. Sort of a WordPress model, where it's open source but they still make money off the hosted version, which is simpler for non-technical people to use. Thoughts?
This is cool but please HN do not ever start implementing this. Digg was once exactly like hacker news until they screwed it up by making the homepage different for every user, thus killing the feeling of a shared experience for us users.
Even though it's not always fair, even though you have to wade through stories, the fact that there is a common home page is what spurs the discussion and keeps things interesting.
This community is exactly how I remember digg in its heyday. Mostly tech stories and one common home page that has some bit of prestige when your story reached it. That's the magic sauce.
I didn't know that. I'm not a hardcore Reddit user though. I still feel like HN is more like the "good old days" of Digg when it was mostly tech. Reddit is cool but it has more mass appeal with its topics.
I think a good feature to measure would be the number of duplicates a story has multiplied by the number of days between submissions (normalized somehow).
I have this theory that "atemporal" stories (technical analysis, insightful essays, etc) that keep getting resubmitted every year are more interesting than news about the latest gadget. I've written about it before: http://news.ycombinator.com/item?id=2505081
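Concretely, that feature might be computed like this; the squashing normalization is one arbitrary choice, since the comment leaves it open:

    # atemporality = (number of submissions) * (mean days between them),
    # squashed into [0, 1) so it can sit alongside other features.
    from datetime import date

    def atemporality(submission_dates):
        if len(submission_dates) < 2:
            return 0.0
        ds = sorted(submission_dates)
        gaps = [(b - a).days for a, b in zip(ds, ds[1:])]
        raw = len(ds) * (sum(gaps) / len(gaps))
        return raw / (raw + 365.0)

    # An essay resubmitted yearly scores high; a gadget story submitted
    # twice in one week scores near zero.
    print(atemporality([date(2009, 5, 1), date(2010, 6, 2), date(2011, 5, 20)]))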
That's really neat; I'd taken a couple of stabs at this myself but not gotten all that far.
I think there's probably some really interesting data in the comments (maybe just take the top ten comments, 3 layers deep?). And with regard to following links, one idea that occurred to me was taking a screenshot of the linked page and basing part of your model on that... I suspect there may be graphical/layout similarities in some of the pages people like.
Check out [Programming Collective Intelligence](http://www.amazon.com/Programming-Collective-Intelligence-Bu...). I found it to be a good introduction to machine learning because it uses practical (and neat) examples to teach the concepts. One of my top 5 programming books.
Just curious whether you tried using LSM (Latent Semantic Matching) on the titles and/or their contents to determine what you like and don't like. It might take some time to train the data to "your liking" but it might return some decent results. I know Apple has an LSM framework in Lion that could be of some use. Just wondering if you gave that approach any thought?
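Apple's LSM framework is Mac-only, but the underlying idea (latent semantic analysis) is easy to try cross-platform. A sketch with scikit-learn, on made-up titles:

    # Project titles into a low-rank "semantic" space; similar topics get
    # nearby vectors even when they share few exact words.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    titles = ["patent troll loses appeal", "new appeal in patent case",
              "pinterest raises a funding round", "photo sharing app funding"]
    lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
    vectors = lsa.fit_transform(titles)    # one dense 2-d vector per title
    print(vectors)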
Another huge advantage of this tool is that by only seeing what's interesting to you, you won't find yourself wading through every single story on the front page of HN during downtime.
Looks like a great start to an awesome project. I think the next logical step would be to expand to give anybody a filtered HN experience, but you probably didn't need me to tell you that :)
To filter the overwhelming amount of readable content on HN, I personally use the http://news.ycombinator.com/over?points=120 function. In fact, since the number of voters has increased a lot, I think I'm going to raise the point cap to 200.
Training the predictor with URLs the user has browsed (from browser history), even ones that didn't originate from HN, would improve the predictive model, because a user's browser history reflects their interests.
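For Firefox, that history is just a SQLite file, so the positive examples are one query away. A hedged sketch: the profile path is a placeholder, and you may need to copy the database first if the browser is running:

    # Pull frequently visited URLs from Firefox history as extra
    # positive training examples.
    import sqlite3

    db = sqlite3.connect("/path/to/your/profile/places.sqlite")
    rows = db.execute(
        "SELECT url, title FROM moz_places "
        "WHERE visit_count >= 3 ORDER BY visit_count DESC LIMIT 500")
    positive_examples = [(url, title or "") for url, title in rows]
    print(len(positive_examples))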
My biggest complaint with the number of stories on HN is not finding the great articles but wading through a bunch of stories about the same three or four topics I am not interested in...
Nice work; I think people need something like this, because the number of news stories is huge, but the number of interesting and relevant ones is not. Information filters are the future.
Nice work; however, this feature is a little like putting blinkers on a horse in case it sees something frightening. HN's greatest asset is the variation of the stories.
This is a very rewarding thing to do. I recommend it to everyone. You don't necessarily even need to use AI.
I filter HN with http://hacker-newspaper.gilesb.com/, which pulls RSS, filters it, and reformats it on an hourly cron job. I mainly did it for the typography -- I disagree with just about every visual design decision on Hacker News -- but added very primitive filtering after the fact. I throw out any story from TechCrunch, Zed Shaw, Steve Yegge, and Jeff Atwood, because I just got tired of them, and any story with "YC" in it, too, because I got tired of seeing job ads for Y Combinator startups. (In fact it was the job ad for a Curebit marketing manager right after their scandal that did it.)
When Apple launched the iPad, I went in and added a simple regex to filter out any story about it. Hacker News is a great source for skimming but occasionally gets fixated on topics. I get like a hundred uniques a day so it's not exactly a huge hit, but I've thought about making a commercial version with customization. It got featured on Mashable and somebody created an iPad app which looked very, VERY similar, which I'm going to take as validating my design. But whether or not I ever startupify it, anyone who wants a customized version can just fork the project, deploy their own version in like ten or twenty minutes, and tweak regexes to their heart's content. It's on GitHub (https://github.com/gilesbowkett/hacker_newspaper) and only requires the most basic proficiency with cron, ruby, and python.
I also want to add comment-scanning. Right now I don't use comment links at all. The code extracts them but then simply throws them away. I don't want to add comment links back in unless I can also set it up to alert me if the comments thread contains comments from raganwald, patio11, jashkenas, amyhoy, etc -- basically automated comment elitism. I'm not trying to be a dick with that, I'm just a busy dude.
Anyway, when I set out to do this, I planned on doing a bunch of Bayesian whatnot, but I found that I got most of the way there just by tweaking regexes occasionally. Likewise there are a lot of rough edges I could clean up, e.g., the text encoding is a bit of a mess, and summarization in the style of http://tldr.it/ would make it way more useful.
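To give the flavor, the whole regex approach boils down to a handful of lines; the patterns here are illustrative, not the project's actual list:

    # Drop feed entries whose title or link matches any kill pattern.
    import re
    import feedparser

    KILL = [re.compile(p, re.I) for p in
            (r"techcrunch\.com", r"\bYC\b", r"iPad")]

    feed = feedparser.parse("https://news.ycombinator.com/rss")
    for entry in feed.entries:
        text = entry.title + " " + entry.link
        if not any(p.search(text) for p in KILL):
            print(entry.title)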
But I recommend it because making deliberate decisions about what info you want to get from HN makes it a lot less like watching TV and a lot more like doing actual research into topics which interest you. It's surprising how much more enjoyable HN becomes when viewed through a customized filter.
A word of advice about using the comment links in HN's RSS feed, from personal experience: try not to scrape them at short intervals. I mistakenly ran a piece of code that went through 7-10 different posts in a short period of time (30 secs) and my server was banned immediately.