This is totally cool, and kudos to you. But be aware there are limits to a personalized Hacker News. I think Bill Maher put it best when ranting just last night about Facebook's customized news feeds:
Newspapers may be old-fashioned, but here's what we're losing if you never see one. They are trying to tell you what's actually important, not just what's important to you. You may not read the whole paper, but you at least see headlines, making you aware that something's going on outside of your microtargeted world of fashion or music or Wiccans or zombies or whatever you're into.
Replace 'newspapers' with Hacker News and you get the point.
But in the case of filtering Hacker News, you're taking that pre-filtered list (filtered by geeks and startup folk and what have you) and filtering it even more.
Whether or not that bubble is a subset of some other bubble, it's still a bubble.
Newspapers are trying to show you what they think is actually important. But of course omnibus papers will never really tell you what is important, only what is current.
It's exactly the kind of thing I would build and abandon. So maybe the more interesting question is why you don't use it. I presume it's a UI thing, or the classifier is unreliable, or something?
I would love to hear why vanilla HN is better now.
I don't know that vanilla HN is better now. I abandoned it for two main reasons:
1. The Hacker News API I was using was very unreliable and would go down for days or weeks at a time, which broke the whole pipeline.
2. I was consuming this as an RSS feed, but when Google Reader shut down I abandoned my RSS habit cold turkey, so now I pretty much only read sites that I visit directly, or things people link to on FB / Twitter.
I implemented the same thing for reddit a while back, and learned the same lesson :)
However, I haven't completely given up on the idea. The point is that while limiting news to things you've liked in the past is a terrible idea, helping you find good articles isn't.
The question is, what constitutes a good news item? I think the key is that there are many answers to this question:
1. Something very specific to your interests (your classifier should be good at detecting this)
2. Something big that everyone should hear about (the HN frontpage is good for this)
3. Something about a new concept you've never heard of. I've implemented this by keeping track of popular words on reddit and giving a boost to words not seen in the past, with some success (see the sketch below).
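For concreteness, here's a minimal sketch of that novelty boost, assuming a bag-of-words approach. The names and the scoring formula are illustrative, not the actual implementation:

    from collections import Counter
    import re

    seen_words = Counter()  # running counts of words seen in past titles

    def tokenize(title):
        return re.findall(r"[a-z']+", title.lower())

    def novelty_boost(title):
        """Fraction of words in the title never seen before."""
        words = tokenize(title)
        if not words:
            return 0.0
        unseen = sum(1 for w in words if seen_words[w] == 0)
        return unseen / len(words)

    def record(title):
        """Call after scoring, so today's titles inform tomorrow's counts."""
        seen_words.update(tokenize(title))

A title full of familiar words scores 0, while one introducing brand-new vocabulary scores close to 1, which you can then blend into whatever ranking you already have.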
"The model can only get better with more training data, which requires me to judge whether I like stories or not. I do this occasionally when there’s nothing interesting on Facebook. Right now this is just the above command-line tool, but maybe I’ll come up with something better in the future."
If you let your program log into HN with your account, it should be able to tell which stories you've upvoted there. If you use that as the input to your classifier, then as you read stories on HN, you can simply mark the ones that interest you by upvoting them.
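A rough sketch of that, assuming HN's login form fields and the current page markup (the form fields and CSS selectors are guesses and may need adjusting):

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    # Log in so the 'upvoted' page is visible (it's private to your account).
    session.post("https://news.ycombinator.com/login",
                 data={"acct": "your_username", "pw": "your_password"})

    resp = session.get("https://news.ycombinator.com/upvoted",
                       params={"id": "your_username"})
    soup = BeautifulSoup(resp.text, "html.parser")

    # Each story row is a tr.athing; the title link sits in span.titleline.
    positives = [a.get_text()
                 for a in soup.select("tr.athing span.titleline > a")]

The resulting titles become positive training examples, so the labeling happens as a side effect of normal HN use.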
I'm also curious to know whether the stories are weighted by age to account for changes in what you find interesting.
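The post doesn't say, but a simple way to add that would be an exponential decay on each training example's weight, e.g. a 90-day half-life. The function and numbers below are illustrative, not from the original post:

    import time

    def example_weight(created_at_i, half_life_days=90.0):
        """Exponential decay: a vote half_life_days old counts half as much."""
        age_days = (time.time() - created_at_i) / 86400.0
        return 0.5 ** (age_days / half_life_days)

Most classifiers that accept per-sample weights can take this directly, so old upvotes fade out instead of dominating the model forever.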
FYI, the new Hacker News API allows easy programmatic access to stories and comments, plus infinite chronological paging. You could download every Hacker News story in less than 2 hours without breaking the API request limit.
Yup, I've switched over my Chrome extension [1] to use Algolia's API instead of HNSearch (which is shutting down), and so far, it seems to be working peachy.
"the new Hacker News API allows easy programmatic access to stories and comments, plus infinite chronological paging. You could download every Hacker News story in less than 2 hours without breaking the API request limit."
Using that API, you could download 48,000 Hacker News stories in 2 days (at one story per request and 1,000 requests per hour), so if there have been fewer than 48,000 submissions, then what minimaxir said is true. But first you'd need to generate a list of all 48,000 story IDs, and there seems to be no way to actually do that.
No need to generate all IDs beforehand. The search_by_date endpoint is fine. You have to paginate using the created_at_i parameter, not the page parameter.
You can paginate using the created_at_i parameter (edited OP)
Just pass created_at_i<X, where X is the timestamp of the earliest submission you've seen so far.
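For concreteness, here's a sketch of that crawl against Algolia's documented search_by_date endpoint. The 4-second sleep is my own conservative choice to stay under the 1,000 requests/hour limit:

    import time
    import requests

    BASE = "https://hn.algolia.com/api/v1/search_by_date"

    def all_stories():
        oldest = None  # created_at_i of the earliest story fetched so far
        while True:
            params = {"tags": "story", "hitsPerPage": 1000}
            if oldest is not None:
                params["numericFilters"] = "created_at_i<%d" % oldest
            hits = requests.get(BASE, params=params).json()["hits"]
            if not hits:
                return  # nothing older remains
            yield from hits
            oldest = hits[-1]["created_at_i"]
            time.sleep(4)  # ~900 requests/hour, safely under the limit

At 1,000 stories per request, 1.26M stories is roughly 1,300 requests; at 4 seconds apiece that's well under two hours, consistent with the estimate above.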
I was able to download 500k stories (i.e. about half of HN's 1.26M stories) before I ran into memory issues; I've fixed them and am downloading the rest.
That's incredible. Will you upload the raw database somewhere, please? If you make a torrent, I'll help seed it. Would you email me at sillysaurus3@gmail.com whenever it's ready?
RATE LIMITS
We are limiting the number of API requests from a single IP to 1000 per hour. If you or your application has been blacklisted and you think there has been an error, please contact us.
Yes, that was there, and the math is still correct.
You can query 1,000 stories per request, and do 1,000 requests per hour. That's 1M stories per hour. There are 1.26M Hacker News stories indexed by the API. :)
EDIT: Finished downloading all the entries (and can confirm that 1.26M is indeed all of them). Took 3 hours due to a conservative wait period between each request to make sure I stayed within the limits.
If this is a solution to the "not enough links that I personally like" problem - kudos to the author. It's nice to find a fun project to work on that also solves a problem for them.
I personally despise recommendation / personalization algorithms of any kind. I still have never found one that's actually better than myself at distinguishing articles that I'd like to read, music that I want to listen to, tweets I'd like to see, etc.
When reading HN, I'm constantly surprised by links that would not normally be on my radar for things I'm interested in. I think personalization algos, in general, are good at filtering those away.
Since the author mentioned HN being too much of a firehose, and this then also being a solution to the "too many links to keep up to date on" problem, the solution might be a bit simpler than the author suggested: the 'best' page (news.ycombinator.com/best). It's hard to find (it's in the 'Lists' section in the footer), but it's still there, and I use it all the time when I haven't been actively reading HN for a while.
It seems there is a small but very strong subculture of Hacker News readers who enjoy reading and discussing mathematical things. I would love to have a separate feed of those stories (and then, after I'm done, I could browse the HN front page), and I have often thought about writing a program to do that.
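As a crude starting point, assuming a hand-picked keyword list (a trained classifier would do better), the official Firebase API makes such a feed easy to sketch:

    import requests

    MATH_WORDS = {"theorem", "proof", "conjecture", "topology", "algebra",
                  "prime", "geometry", "calculus", "combinatorics"}

    ids = requests.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json").json()

    for story_id in ids[:100]:  # check the current top 100 stories
        item = requests.get(
            "https://hacker-news.firebaseio.com/v0/item/%d.json"
            % story_id).json()
        title = (item or {}).get("title", "")
        if any(word in title.lower() for word in MATH_WORDS):
            print(title)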
DataTau (the HN for data mining) seems to have failed, so I imagine a filter is the way to go rather than making a new website.
I think one of the challenges of hosting a "sub-HN" is that the hosting costs are hard to justify.
This raises the question: How does YC justify hosting costs? My completely-off-the-cuff-assumption-take-this-with-a-huge-grain-of-salt is that YC benefits by having a huge audience to make announcements to, like job postings at YC funded companies, various pg essays, or just investing in overall goodwill from the HN audience. Probably the most likely reason is to increase deal-flow to YCombinator itself, though.
Why does it need to have some justification beyond being a fun hobby? HN is hosted on a single server, and it probably uses less than 1TB/month, so it's not that expensive for someone with a Bay Area tech salary, let alone the whole YC.
I just use the 50 or 100 point minimum feed in my reader, and skip articles that don't look interesting based on how much time I want to spend. Sometimes I only read articles if they're a day old and the first comment makes them look interesting.
Cool story bro. You may want to check out DigitalOcean for hosting - their cheapest option is only $5 a month, roughly equivalent to an AWS option that costs $40/month. It's very simple as well.
Here's the Bill Maher clip quoted above: https://www.youtube.com/watch?v=WohtmZDZCGM