With all due respect Joel, it seems you missed a good number of the
inputs you could have used to train your classifier.
If I (jcr) want to know what classifies as "interesting to Joel"
(joelthelion), I simply look at the "comments" and "submissions" links
in your HN profile. It will show me the stuff you took the time to
comment on, or took the time to submit to HN.
If I want to know what's "interesting" to me, the "saved stories" link
is visible in my own profile, even though it is not visible to others.
The "saved stories" list is THE goldmine: every submission I've either
submitted or up-voted:
https://news.ycombinator.com/saved?id=jcr
Depending on your personal bookmarking habits, your bookmarks file/db
can be another useful input. I'm in the habit of bookmarking both the
submitted article and the HN discussion page (if it's good). Even if I
didn't bookmark the HN discussion, I could easily find the HN
discussion/submission for every site I've ever bookmarked with a search
engine: give Google a URL and ask it for all of the sites linking to
that URL, then parse for HN, and you've got the target.
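For what it's worth, that lookup is a few lines of Python. This sketch uses the hn.algolia.com search API (restricting the search to the URL field) rather than the Google "link:" trick; the example URL is just an illustration:

    # Find the HN submission(s) for a bookmarked URL by searching
    # the URL field of the HN Search API.
    import requests

    def hn_discussions(url):
        resp = requests.get(
            "https://hn.algolia.com/api/v1/search",
            params={"query": url, "restrictSearchableAttributes": "url"},
        )
        resp.raise_for_status()
        for hit in resp.json()["hits"]:
            yield hit["title"], "https://news.ycombinator.com/item?id=" + hit["objectID"]

    for title, item_url in hn_discussions("http://www.paulgraham.com/avg.html"):
        print(title, item_url)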
The serious problem I see with your approach was already mentioned by
ck2; you're creating a bubble and will miss out on all the fantastic
stuff that is interesting to you, but you don't yet know that it's
interesting to you.
One of the primary benefits of HN and similar sites is learning about
the things that others find interesting. Those things may not interest
me, but the fact that others find them interesting is, well,
interesting.
Why do they consider it interesting?
Why do I consider it uninteresting?
Even if my personal opinions remain unchanged, these are important
questions for me to keep asking myself, repeatedly.
This is also why cabals are so destructive. They cause homogeneity in what would normally be a diverse set of topics. I wonder if you could train your classifier to reject topics that are similar to each other or are being promoted by the same set of people?
There's a lot of stuff on HN I like, and a lot of stuff I find boring or irrelevant. Unfortunately, the stuff that I really like tends to be novel and unpredictable, so trying to teach some kind of Bayesian classifier to recognise things I'll like is probably not going to work.
Personally, I think I'd be perfectly happy with an old-school killfile: do not show me posts whose headlines contain the strings "X", "Y" or "Z", or that link to sites "A", "B" or "C".
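That kind of killfile is only a few lines. A minimal sketch, assuming stories arrive as (title, url) pairs; the blocked strings and sites here are placeholders:

    # Old-school killfile: drop stories whose titles contain blocked
    # strings or whose links point at blocked sites.
    from urllib.parse import urlparse

    KILL_TITLES = ("SOPA", "Pinterest", "iPad")    # the strings "X", "Y", "Z"
    KILL_SITES = ("example-blogspam.com",)         # the sites "A", "B", "C"

    def keep(title, url):
        if any(s.lower() in title.lower() for s in KILL_TITLES):
            return False
        host = urlparse(url).netloc
        return not any(host == s or host.endswith("." + s) for s in KILL_SITES)

    stories = [("Show HN: a weekend project", "http://example.org/post"),
               ("Yet another SOPA roundup", "http://example-blogspam.com/x")]
    print([t for t, u in stories if keep(t, u)])   # only the first survives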
I have a hacked-up version of HN, with tags and decent search added, that I'm using as a personal journal. I haven't removed account creation or anything from regular HN, so you can still post new stories, vote and comment with an account: http://kruhft.dyndns.org/discuss
Or maybe even going down the "subreddit" route, i.e. a separate section for Apple content, Python hacking, SOPA/PIPA etc. In other words, create a "bucket" for the general high-level, recurring HN topics and allow users to pick which ones they want to see and those they don't.
What if you set your initial goals small? The goal would not be to highlight only the good stuff, but to get rid of the things that are obviously bad for you.
Make two classifiers: one that tries to decide "is this interesting to thristian", and another that decides "is this similar to an existing story". A story then gets demoted only if it is both uninteresting and similar to something that already exists.
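A sketch of how those two pieces could combine; the duplicate detector here is a toy TF-IDF cosine check, and the interestingness probability is assumed to come from a trained classifier like the author's:

    # Demote only stories that are BOTH uninteresting AND near-duplicates.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    seen_titles = ["Google releases new ML library",
                   "Apple announces new MacBook"]
    vec = TfidfVectorizer().fit(seen_titles)
    seen = vec.transform(seen_titles)

    def is_near_duplicate(title, threshold=0.5):
        return cosine_similarity(vec.transform([title]), seen).max() >= threshold

    def demote(title, p_interesting):
        # p_interesting would come from the "is it interesting" classifier
        return p_interesting < 0.5 and is_near_duplicate(title)

    print(demote("Apple announces new iMac", p_interesting=0.2))   # True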
> do not show me posts whose headlines contain the strings "X", "Y" or "Z"
The problem with that is that these days there are so many "clickbait" headlines, some of which offer absolutely no insight into the content itself.
Without policing each and every submission to ensure the headlines accurately describe the linked article, it would be difficult to guarantee any kind of success with basic headline filters.
Old timers (‘scuse me while I slam a geritol and bourbon) will remember that when reddit first launched, there was a recommendation engine that purportedly took your votes and turned them into a personal page of stories you would enjoy.
It was scrapped and eventually subreddits were introduced. I think people like the idea of communities.
That being said... The fact that something failed one, two, or a hundred times in the past doesn’t mean it won’t work today. Things may be different today, or perhaps the approach may be slightly different. If Google can make bazillions of dollars using machine learning to optimize the ads displayed on a page, I’m pretty confident machine learning can be used to optimize the likelihood that you’ll upvote articles suggested by a bot.
The question of whether this becomes an echo chamber and you stop finding things that are interesting but outside of your current tastes is deep. There’s an old saying, “The best present is something you didn’t know you wanted until you unwrapped it.”
Reading HN will also not expose you to non-hacker-related information. At the same time, it's always possible to weight the stories you're more likely to like so that they're more likely to be chosen, and to introduce some randomness so you're still exposed to the odd lower-rated story.
More than that, subreddits are why Reddit didn't turn into Digg.
My theory is that as communities grow, you have one of two choices to maintain cohesion: aggressively police who is let in or give people with divergent interests space to express those interests.
If you don't, and your community is growing, eventually the most valuable members will no longer find value in the community. This is just reversion to the mean. But the consequence is that those valuable people will then leave, and the cycle repeats until the community evaporates entirely.
Subreddits are a "pressure valve" that allow high-value users to self-select and maintain their own set of norms without sacrificing the growth of the main public square.
It's not the stories and comments that are fundamentally valuable to HN, it's the people.
I remember there being a lot of discussion before subreddits where most people (myself included) wanted tagging instead of subdomains. The subreddits have obviously worked well but I think they (along with self posts and image previews) also changed reddit from a news aggregator into a glorified forum.
The community is very important. But there's something to be said for the ability to filter properly - there was a blog post on HN recently discussing how aggregate sites end up with the lowest common denominator. What ends up on the first page is only a small % of the submissions that are actually interesting. I probably miss half of the stories that I would like just because they fall off before I even see them.
The question is, how do you figure out what's interesting to who? Joel is attempting to answer this for himself. A site that could answer it for everyone while still somehow maintaining social bonds would do very well I think.
I have been working at it with http://hubski.com. In short, you choose people to filter content for you based on what they post, and what they share. You in turn choose what to pass on to those who follow you. It's been working pretty well.
I'm still working on recommendations. There are suggestions once you follow someone. It's a good way to get started, but the best way to get good content is to hand pick folk based on what they have shared.
I remember about 5 years ago when I was looking for something else like reddit only more technical, saw HN and thought, "wow, I don't have a clue what 90% of these people are talking about. Seems interesting."
By filtering out stuff, you'll never expose yourself to things outside your "pattern".
HN's subtitle is "Links for the intellectually curious"
I guess HN has a dual purpose to keep people up on "breaking" hacker news, but I like to think of it as "hacker news outside your thinking pattern".
Also why do people immediately go to AWS for testing something? Doesn't a real hacker have their own server handy for experimental projects, or is it only me?
The thing is, HN doesn't just contain links for the intellectually curious. It often has stories about a new school of thought, or an approach to design, or a new programming language or something. However, it also often has stories about Internet Drama in the tech startup community, or American political events, or practical advice for people entering the business world for the first time. They're popular here because they're relevant to a large proportion of the HN user-base, but it's certainly possible for someone to be both intellectually curious and uninterested in American politics.
I like the approach of going outside of your "comfort zone" and discovering new things. Limiting my exposure to what I enjoy most is not an intellectually curious strategy.
> By filtering out stuff, you'll never expose yourself to things outside your "pattern".
What makes you think that a Naive Bayes classifier will automatically put all unseen/unexpected/surprising data in just one of two categories?
Indeed, to the classifier there are just two categories; it doesn't "know" what's "in" or "out".
In fact, it is much more likely that the classifier will distribute the unexpected data (read: data that doesn't contain many of the features it was trained on) rather evenly between the two categories, based on the other features. Which is exactly what you would want it to do.
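That is easy to check on a toy example: give a balanced multinomial naive Bayes a title made entirely of unseen words, and the posterior falls back to the class priors (50/50) instead of landing in one bucket:

    # Unseen words carry no signal; the posterior equals the prior.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    titles = ["patent troll lawsuit", "patent ruling appeal",
              "pinterest growth hack", "pinterest clone launch"]
    labels = [1, 1, 0, 0]                      # 1 = interesting, 0 = not

    vec = CountVectorizer().fit(titles)
    clf = MultinomialNB().fit(vec.transform(titles), labels)

    # Every word is out-of-vocabulary, so this prints [[0.5 0.5]].
    print(clf.predict_proba(vec.transform(["quantum erlang knitting"])))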
The author also said he plotted precision-recall curves (too bad he just showed a screenshot with the numbers instead of the graphs). That sort of analysis is bound to bring out such behaviour.
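For anyone who wants to reproduce that analysis on their own data, it is a single scikit-learn call; the labels and scores below are made up for illustration:

    # Precision-recall curve from held-out labels and classifier scores.
    from sklearn.metrics import precision_recall_curve

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # did I like it?
    y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.3, 0.1]  # model's P(like)

    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    for p, r, t in zip(precision, recall, thresholds):
        print("threshold=%.2f  precision=%.2f  recall=%.2f" % (t, p, r))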
Hey everyone, thanks for the comments. There's too much to respond to everyone, but a lot of people have brought up an "echo chamber" concern.
As other people have pointed out, the naive Bayes model works topically, so it will learn that I like stories about "patents" but not (usually) whether the stories are pro or anti-patent. It is totally true that I might miss an interesting story about the new OSX or about Pinterest, but I'm willing to live with that.
Two larger points are that
1. HN is only a small fraction of the news I consume, so it wouldn't matter that much to me even if it were an echo chamber, and
2. The main reason I did this is that I simply couldn't keep up with the volume of stories otherwise.
Last night when I spoke about this, someone asked me whether I was concerned about all the false negatives I was missing. But before I started this, my RSS feed had like 800 (and growing) unread HN articles in it. Reading some of them, even a targeted subset, is better than none.
Anyway, thanks for all the comments. I'm surprised (and glad) that people are so interested in this!
I applied naive Bayes to generic news for a project a few years ago. Counter to some of the comments here, I think it works surprisingly well for filtering articles, and is a great way to start.
One of the nicest aspects of it is that it doesn't support a user's confirmation bias: your perspective isn't taken into account in filtering since it's just looking at keywords. That's probably not as important here, but especially on political news it's highly relevant. If I'm a Democrat, I don't want just left-leaning news to come through the grapevine, because that prevents me from seeing the other perspective.
I think what he means is that if you take the entire article as a bag-of-words model, the general topic of the article will be a stronger signal than the position the article takes with respect to the subject.
The latter is a lot harder to extract from a bag-of-words. In fact the only thing I can imagine is if an article uses a lot of euphemisms or negative synonyms of a topic.
So you get the factual reporting articles regardless of left/right bias. With the more polemical, hyperbole-heavy articles, it will filter out a bit more of the angle you disagree with, which adds some bias, but it's also good for your blood pressure, and a polemic you disagree with is not going to make your view more balanced either.
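A tiny illustration of why the topic dominates: a pro-patent and an anti-patent headline produce nearly identical bag-of-words vectors, differing only in the stance words:

    # Topic words dominate the bag-of-words; stance is a small difference.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["why software patents protect innovation",
            "why software patents destroy innovation"]
    vec = CountVectorizer()
    X = vec.fit_transform(docs).toarray()
    print(vec.get_feature_names_out())
    print(X)   # the rows differ only in the protect/destroy columns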
I've been working on this kind of stuff on and off for a while now. I still think it's a great idea, but recommendations are a tricky business and naive Bayes is not good enough.
Nice article and an interesting approach. Thanks for writing it up.
The model can only get better with more training data,
which requires me to judge whether I like stories or not.
I do this occasionally [using] the above command-line tool,
but maybe I’ll come up with something better in the future.
Well, you could analyse your server logs to see which stories you really did click on and which you skipped over. (You'll probably want to consider only pages with at least one click, to handle the edge case of "I didn't even look at that page", and you'll need a cookie or login to make sure it only counts your clicks.)
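A rough sketch of that log mining, assuming the filtered feed routes story links through a redirect like /out?id=123 so every click leaves a log line; the log path and URL scheme are assumptions, not the author's setup:

    # Count clicks per story id in an access log; clicked = positive label.
    import re
    from collections import Counter

    CLICK = re.compile(r'"GET /out\?id=(\d+) ')
    clicks = Counter()
    with open("/var/log/nginx/access.log") as fh:
        for line in fh:
            m = CLICK.search(line)
            if m:
                clicks[m.group(1)] += 1

    positives = {story_id for story_id, n in clicks.items() if n > 0}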
Also consider scraping your HN "saved stories" list as a positive source.
Don't recall if you mentioned it in your article, but you'll probably want to randomly insert the occasional low-scoring article as a check for under-weighting.
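The logic for that spot check is tiny: with small probability, let a low-scoring story through anyway, so systematic under-weighting shows up as unexpected upvotes. The threshold and epsilon here are arbitrary:

    # Occasionally surface a story the model scored low.
    import random

    def reading_list(scored_stories, epsilon=0.05):
        for story, p_interesting in scored_stories:
            if p_interesting >= 0.5 or random.random() < epsilon:
                yield story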
I really wish our profile pages supplied a (private) log of up/down votes for comments and flags for stories, in addition to the upvotes for stories. It would make for some interesting data mining.
Why is HN opposed to scraping? I think it's because the front page is dynamically generated, so a bot would waste resources; the site already gets loaded (slow to respond) during peak times.
But then why would HN refuse to provide a useful, standards-compliant RSS feed (with number of comments, points, time of submission, and id) that could be used to develop apps for a better reading experience? http://api.ihackernews.com/ is unreliable at best.
Because code written by N hackers scraping HN has a greater effect on the site than everyone using the reliable HNSearch API: http://www.hnsearch.com/api
Ask HN: Does HN have an API and if not what's the etiquette for scraping?
http://news.ycombinator.com/item?id=2138730
Ask HN: Is there an API for HN?
http://news.ycombinator.com/item?id=1107874
Great move with the Bayes classifier. I'm more interested in how stories are received on other networks, so I made http://hnfluence.com for my HN consumption. Ultimately I'm looking to see how one network affects another.
It's 2AM so forgive my skimming of the article, but that is a heck of a lot of work and it looks to be quite good. Will take another looksee tomorrow when I can think straight.
Hopefully when I get back to hackernews sometime tomorrow this will be frontpage where it belongs. :)
As a recommendation, clicking the up arrow on a story will put it on this list: http://news.ycombinator.com/saved?id=xpose2000. That way you can view it without bookmarking. (As for hitting the frontpage, it did...)
Nevertheless, I believe that a good news source, like a good community, sometimes gives you things that you might not like -- things that challenge the filters you already have.
It's what flipping through a regular newspaper does, what Hacker News does, what listening to a good broadcast does. The key is knowing the content is going to be so good that you're willing to take that risk despite your hesitation.
Hacker News is a bit like a firehose, but that's why I read it. The people on this site are informed, opinionated and pretty damn smart. It reminds me every time I get on how little I actually know.
To be honest, a small part of me feels uncomfortable with that, but that's why I read it -- it's news I need to know.
Interesting post. This isn't quite the same, but I wanted to mention that I created a project, http://ribbot.com, to let people create their own Hacker News-style site on other topics.
THIS is what I've been looking for. It deserves a post of its own. Is the only way to make money by paying the monthly fee and using one's own ads? I am about to begin playing with it...
Awesome, thanks for checking it out. Yes, the only way is ad supported right now if you want to go that route. What else were you thinking - monthly memberships for private forums? It seems odd but I haven't thought about it much.
I'm considering making the project open source as well, so technical folks can run their own versions and modify it. Sort of a WordPress model, where it's open source but they still make money off the hosted version, which is simpler for non-technical people to use. Thoughts?
This is cool but please HN do not ever start implementing this. Digg was once exactly like hacker news until they screwed it up by making the homepage different for every user, thus killing the feeling of a shared experience for us users.
Even though it's not always fair, even though you have to wade through stories, the fact that there is a common home page is what spurs the discussion and keeps things interesting.
This community is exactly how I remember digg in its heyday. Mostly tech stories and one common home page that has some bit of prestige when your story reached it. That's the magic sauce.
I didn't know that. I'm not a hardcore Reddit user though. I still feel like HN is more like the "good old days" of Digg when it was mostly tech. Reddit is cool but it has more mass appeal with its topics.
I think a good feature to measure would be the number of duplicates a story has multiplied by the number of days between submissions (normalized somehow).
I have this theory that "atemporal" stories (technical analysis, insightful essays, etc) that keep getting resubmitted every year are more interesting than news about the latest gadget. I've written about it before: http://news.ycombinator.com/item?id=2505081
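Concretely, that feature might be computed like this; the squashing normalization is one arbitrary choice, since the comment leaves it open:

    # atemporality = (number of submissions) * (mean days between them),
    # squashed into [0, 1) so it can sit alongside other features.
    from datetime import date

    def atemporality(submission_dates):
        if len(submission_dates) < 2:
            return 0.0
        ds = sorted(submission_dates)
        gaps = [(b - a).days for a, b in zip(ds, ds[1:])]
        raw = len(ds) * (sum(gaps) / len(gaps))
        return raw / (raw + 365.0)

    # An essay resubmitted yearly scores high; a gadget story submitted
    # twice in one week scores near zero.
    print(atemporality([date(2009, 5, 1), date(2010, 6, 2), date(2011, 5, 20)]))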
That's really neat; I'd taken a couple of stabs at this myself but not gotten all that far.
I think there's probably some really interesting data in the comments (maybe just take the top ten comments, 3 layers deep?). And with regard to following links, one idea that occurred to me was taking a screenshot of the linked page and basing part of your model on that... I suspect there may be graphical/layout similarities in some of the pages people like.
Check out [Programming Collective Intelligence](http://www.amazon.com/Programming-Collective-Intelligence-Bu...). I found it to be a good introduction to machine learning because it uses practical (and neat) examples to teach the concepts. One of my top 5 programming books.
Just curious whether you tried using LSM (Latent Semantic Matching) on the titles and/or their contents to determine what you like and don't like. It might take some time to train the data to "your liking" but it might return some decent results. I know Apple has an LSM framework in Lion that could be of some use. Just wondering if you gave that approach any thought?
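Apple's LSM framework is Mac-only, but the underlying idea (latent semantic analysis) is easy to try cross-platform. A sketch with scikit-learn, on made-up titles:

    # Project titles into a low-rank "semantic" space; similar topics get
    # nearby vectors even when they share few exact words.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline

    titles = ["patent troll loses appeal", "new appeal in patent case",
              "pinterest raises a funding round", "photo sharing app funding"]
    lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2))
    vectors = lsa.fit_transform(titles)    # one dense 2-d vector per title
    print(vectors)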
Another huge advantage of this tool is that by only seeing what's interesting to you, you won't find yourself wading through every single story on the front page of HN during downtime.
Looks like a great start to an awesome project. I think the next logical step would be to expand to give anybody a filtered HN experience, but you probably didn't need me to tell you that :)
To filter the overwhelming amount of readable content on HN, I personally use the http://news.ycombinator.com/over?points=120 function. In fact, since the number of voters has increased a lot, I think I'm going to raise the point cap to 200.
Training the predictor with URLs the user has browsed (from browser history), even ones that didn't originate from HN, would improve the predictive model, because a user's browser history reflects their interests.
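For Firefox, that history is just a SQLite file, so the positive examples are one query away. A hedged sketch: the profile path is a placeholder, and you may need to copy the database first if the browser is running:

    # Pull frequently visited URLs from Firefox history as extra
    # positive training examples.
    import sqlite3

    db = sqlite3.connect("/path/to/your/profile/places.sqlite")
    rows = db.execute(
        "SELECT url, title FROM moz_places "
        "WHERE visit_count >= 3 ORDER BY visit_count DESC LIMIT 500")
    positive_examples = [(url, title or "") for url, title in rows]
    print(len(positive_examples))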
My biggest complaint with the number of stories on HN is not finding the great articles but wading through a bunch of stories about the same three or four topics I am not interested in...
Nice work; I think people need something like this, because the number of news stories is huge, but the number of interesting and relevant ones is not. Information filters are the future.
Nice work; however, this feature is a little like putting blinkers on a horse in case it sees something frightening. HN's greatest asset is the variation of the stories.
This is a very rewarding thing to do. I recommend it to everyone. You don't necessarily even need to use AI.
I filter HN with http://hacker-newspaper.gilesb.com/, which pulls RSS, filters it, and reformats it on an hourly cron job. I mainly did it for the typography -- I disagree with just about every visual design decision on Hacker News -- but added very primitive filtering after the fact. I throw out any story from TechCrunch, Zed Shaw, Steve Yegge, and Jeff Atwood, because I just got tired of them, and any story with "YC" in it, too, because I got tired of seeing job ads for Y Combinator startups. (In fact it was the job ad for a Curebit marketing manager right after their scandal that did it.)
When Apple launched the iPad, I went in and added a simple regex to filter out any story about it. Hacker News is a great source for skimming but occasionally gets fixated on topics. I get like a hundred uniques a day so it's not exactly a huge hit, but I've thought about making a commercial version with customization. It got featured on Mashable and somebody created an iPad app which looked very, VERY similar, which I'm going to take as validating my design. But whether or not I ever startupify it, anyone who wants a customized version can just fork the project, deploy their own version in like ten or twenty minutes, and tweak regexes to their heart's content. It's on GitHub (https://github.com/gilesbowkett/hacker_newspaper) and only requires the most basic proficiency with cron, ruby, and python.
I also want to add comment-scanning. Right now I don't use comment links at all. The code extracts them but then simply throws them away. I don't want to add comment links back in unless I can also set it up to alert me if the comments thread contains comments from raganwald, patio11, jashkenas, amyhoy, etc -- basically automated comment elitism. I'm not trying to be a dick with that, I'm just a busy dude.
Anyway, when I set out to do this, I planned on doing a bunch of Bayesian whatnot, but I found that I got most of the way there just by tweaking regexes occasionally. Likewise there are a lot of rough edges I could clean up, e.g., the text encoding is a bit of a mess, and summarization in the style of http://tldr.it/ would make it way more useful.
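To give the flavor, the whole regex approach boils down to a handful of lines; the patterns here are illustrative, not the project's actual list:

    # Drop feed entries whose title or link matches any kill pattern.
    import re
    import feedparser

    KILL = [re.compile(p, re.I) for p in
            (r"techcrunch\.com", r"\bYC\b", r"iPad")]

    feed = feedparser.parse("https://news.ycombinator.com/rss")
    for entry in feed.entries:
        text = entry.title + " " + entry.link
        if not any(p.search(text) for p in KILL):
            print(entry.title)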
But I recommend it because making deliberate decisions about what info you want to get from HN makes it a lot less like watching TV and a lot more like doing actual research into topics which interest you. It's surprising how much more enjoyable HN becomes when viewed through a customized filter.
A word of advice about using the comment links in HN's RSS feed, from personal experience: try not to scrape them at short intervals. I mistakenly ran a piece of code that went through 7-10 different posts in a short period of time (30 secs) and my server was banned immediately.