Hacking Hacker News (joelgrus.com)
132 points by stickhandle on Feb 23, 2014 | 50 comments



This is totally cool and kudos to you. But be aware there are limits to personalized Hacker News. I think Bill Maher put it best when ranting just last night about Facebook's customized news feeds:

Newspapers may be old-fashioned, but here's what we're losing if you never see one. They are trying to tell you what's actually important, not just what's important to you. You may not read the whole paper, but you at least see headlines, making you aware that something's going on outside of your microtargeted world of fashion or music or Wiccans or zombies or whatever you're into.

Replace 'newspapers' with 'Hacker News' and you get the point.

https://www.youtube.com/watch?v=WohtmZDZCGM


Total nonsense: the classifier is just pushed upstream, from what I want to see to what some ad-motivated editor wants me to see.

Newspapers reported stories to sell you something, be it papers, ads, or someone's agenda, not because they believed we'd all be more rounded citizens.


But in the case of filtering Hacker News, you're taking that pre-filtered list (filtered by geeks and startup folk and what have you) and filtering it even more.

Whether that bubble is a subset of some other bubble, it's still a bubble.


Newspapers are trying to show you what they think is actually important. But of course omnibus papers will never really tell you what is important, only what is current.


Never?


broken clock etc..


Oh jeez, who submitted this again? I learned my lesson a couple of years ago, everyone hates this. :)


Also, FYI, I don't even use this anymore, these days I just read the HN frontpage. :)


It's exactly the kind of thing I would build and abandon. So maybe the more interesting thing is why you don't use it. I presume it's a UI thing, or the classifier is unreliable, or something?

I would love to hear why vanilla HN is better now.


I don't know that vanilla HN is better now. I abandoned it for two main reasons:

1. The Hacker News API I was using was very unreliable and would go down for days / weeks at a time, which made the whole pipeline unreliable.

2. I was consuming this as an RSS feed, but when Google Reader shut down I abandoned my RSS habit cold turkey, so now I pretty much only read sites that I visit directly, or things people link to on FB / Twitter.


Why did you use the API instead of consuming the HN RSS feed itself? They even offer a big feed for such usage: http://ycombinator.com/newsnews.html


Ha, mostly because I didn't know about it. The big feed is not all that discoverable.


> I would love to hear why vanilla HN is better now.

Probably because it has more politics and less 'hacker news' these days.


I found the previous discussion: https://news.ycombinator.com/item?id=3602407 (498 points, 737 days ago, 82 comments)

(Note: I usually prefer the "unfiltered" version, unless there is a very special news item that covers 90% of the front page.)


I implemented the same thing for reddit a while back, and learned the same lesson :)

However, I haven't completely given up on the idea. The point is that while limiting news to things you've liked in the past is a terrible idea, helping you find good articles isn't.

The question is, what constitutes a good news item? I think the key is that there are many answers to this question:

1. Something very specific to your interests (your classifier should be good at detecting this)

2. Something big that everyone should hear about (the HN frontpage is good for this)

3. Something about a new concept you've never heard about. I've implemented this by keeping track of popular words on reddit and giving a boost to words not seen in the past, with some success (see the sketch after this list).

4. Probably others I haven't thought about.
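
For (3), here's roughly what I mean, as a minimal sketch (the rare-word threshold of 5 is arbitrary):

    import re
    from collections import Counter

    seen_counts = Counter()  # running counts of words across past titles

    def tokenize(title):
        return re.findall(r"[a-z']+", title.lower())

    def novelty_score(title):
        # fraction of the title's words that are rare or unseen so far
        words = tokenize(title)
        if not words:
            return 0.0
        rare = sum(1 for w in words if seen_counts[w] < 5)
        return rare / len(words)

    def observe(title):
        # update the running counts once a title has been processed
        seen_counts.update(tokenize(title))

Boosting by novelty pushes up titles whose vocabulary you haven't seen before - roughly the opposite of what a like/dislike classifier rewards.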

I'd love to hear your thoughts on this.


Sorry. I stumbled on it and thought it was interesting.


"The model can only get better with more training data, which requires me to judge whether I like stories or not. I do this occasionally when there’s nothing interesting on Facebook. Right now this is just the above command-line tool, but maybe I’ll come up with something better in the future."

If you let your program log into HN using your account, it should be able to tell which stories you've up-voted there. If you use that as the input to your classifier, then as you read stories on HN you can simply mark the ones that interest you by up-voting them.

I'm also curious to know whether the stories are weighted by age to account for changes in what you find interesting.
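
If they aren't, one cheap way to add that, as a rough sketch assuming a scikit-learn-style classifier (the 90-day half-life is a number I made up):

    import math
    from sklearn.linear_model import LogisticRegression

    HALF_LIFE_DAYS = 90.0  # hypothetical: a judgment counts half as much every 90 days

    def age_weight(age_days):
        return math.exp(-math.log(2) * age_days / HALF_LIFE_DAYS)

    # X: feature matrix, y: like/dislike labels, ages: days since each judgment
    def train(X, y, ages):
        clf = LogisticRegression()
        clf.fit(X, y, sample_weight=[age_weight(a) for a in ages])
        return clf

Anything fancier (a sliding window, retraining on recent data only) works too; the point is just that old judgments should count less.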


FYI, the new Hacker News API allows easy programmatic access to stories/comments and infinite chronological paging. You could download every Hacker News story in less than 2 hours without breaking the API request limit.

https://hn.algolia.com/api


There's a list of all the apps that have been built based on this API: http://hn.algolia.com/cool_apps


Yup, I've switched over my Chrome extension [1] to use Algolia's API instead of HNSearch (which is shutting down), and so far, it seems to be working peachy.

[1]: https://chrome.google.com/webstore/detail/hacker-news-sideba...


> the new Hacker News API allows easy programmatic access to stories/comments and infinite chronological paging. You could download every Hacker News story in less than 2 hours without breaking the API request limit.

Which API? https://www.google.com/search?q=hacker+news+api


With hnsearch.com shutting down, I believe he is referring to: http://hn.algolia.com/api


Using that API, you could download 48,000 Hacker News stories in 2 days, so if there have been fewer than 48,000 submissions, then what minimaxir said is true. But first you'd need to generate a list of all 48,000 story IDs, and there seems to be no way to actually do that.


No need to generate all IDs beforehand. The search_by_date endpoint is fine. You have to paginate using the created_at_i parameter, not the page parameter.

Also, you can set hitsPerPage = 1000. ;)


When trying to access page 2 via that endpoint: http://hn.algolia.com/api/v1/search_by_date?tags=story&hitsP...

"you can only fetch the 1000 hits for this query, contact us to increase the limit"

It was a nice try, but it did seem too good to be true.


You can paginate using the created_at_i parameter (edited OP)

Just pass created_at_i<X, where X is the timestamp of the earliest submission you've fetched so far.
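
In code, roughly (a sketch; search_by_date, tags, hitsPerPage, and numericFilters are the documented Algolia parameters, everything else here is illustrative):

    import json
    import time
    import requests

    URL = "https://hn.algolia.com/api/v1/search_by_date"
    cursor = 2 ** 31  # start above any plausible created_at_i timestamp

    with open("stories.jsonl", "w") as out:
        while True:
            params = {
                "tags": "story",
                "hitsPerPage": 1000,
                "numericFilters": "created_at_i<%d" % cursor,
            }
            hits = requests.get(URL, params=params).json()["hits"]
            if not hits:
                break
            for hit in hits:
                out.write(json.dumps(hit) + "\n")
            cursor = hits[-1]["created_at_i"]  # results come newest-first
            time.sleep(3.6)  # ~1,000 requests/hour stays inside the rate limit

Streaming each page straight to disk keeps memory flat no matter how many stories you pull.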

I was able to download 500k stories (i.e. about half of HN's 1.26M stories) before I ran into memory issues; I've fixed them and am downloading the rest.


That's incredible. Will you upload the raw database somewhere, please? If you make a torrent, I'll help seed it. Would you email me at sillysaurus3@gmail.com whenever it's ready?


I have written an API for HN:

- Python module: https://github.com/karan/HackerNewsAPI

- REST API: https://github.com/karan/HNify


Was this there when you posted?:

    RATE LIMITS
    We are limiting the number of API requests from a single IP to 1000 per hour.
    If you or your application has been blacklisted and you think there has been
    an error, please contact us.


Yes, that was there, and the math is still correct.

You can query 1,000 stories per request and do 1,000 requests per hour. That's 1M stories per hour. There are 1.26M Hacker News stories indexed by the API. :)
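
Spelled out (same numbers):

    stories = 1260000        # stories indexed by the API
    per_request = 1000       # hitsPerPage maximum
    requests_needed = stories / per_request    # 1,260 requests
    hours_at_limit = requests_needed / 1000.0  # ~1.26 hours at 1,000 requests/hour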

EDIT: Finished downloading all the entries (and can confirm that 1.26M is indeed all of them). Took 3 hours due to a conservative wait period between each request to make sure I stayed within the limits.


How big is that dataset? Can you share it somehow?


"new Hacker News API" -> do you mean there's now an official API? I must have missed this news. Can you provide a link / additional info?


This is old, so what you're suggesting wouldn't be possible unless he updated it (which I don't think he has any reason to do).


If this is a solution to the "not enough links that I personally like" problem - kudos to the author. It's nice to find a fun project to work on that will also solve a problem for them.

I personally despise recommendation / personalization algorithms of any kind. I still have never found one that's actually better than I am at distinguishing articles that I'd like to read, music that I want to listen to, tweets I'd like to see, etc.

When reading HN, I'm constantly surprised by links that would not normally be on my radar for things I'm interested in. I think personalization algos, in general, are good at filtering those away.

Since the author mentioned HN being too much of a firehose, and this also being a solution to the "too many links to keep up to date on" problem, the solution might be a bit simpler than the author suggested.

HN already has the "best" links at https://news.ycombinator.com/best

It's hard to find - it's in the 'Lists' section in the footer - but it's still there, and I use it all the time when I haven't been actively reading HN for a while.


Ideally you could train it on the stories I've upvoted on HN:

https://news.ycombinator.com/saved?id={{username}}
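
Something like this might work (a sketch, not tested; I'm assuming HN's login form posts 'acct' and 'pw' fields, and guessing at the markup of the saved page):

    import requests
    from bs4 import BeautifulSoup

    def saved_story_titles(username, password):
        # the saved page is only visible to the logged-in user it belongs to
        s = requests.Session()
        # assumption: the login form posts 'acct' and 'pw' to /login
        s.post("https://news.ycombinator.com/login",
               data={"acct": username, "pw": password})
        page = s.get("https://news.ycombinator.com/saved?id=" + username)
        soup = BeautifulSoup(page.text, "html.parser")
        # assumption: story links sit inside td cells with class "title"
        return [a.get_text() for a in soup.select("td.title a")]

If the markup guess is wrong, the fix is a one-line selector change; the login/session part is the piece that matters.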



It seems you prefer to read articles from [1]:

- WashingtonPost

- BusinessWeek

- MarginalRevolution

- NY Times

[1] Based on BuzzSumo's social data:

http://app.buzzsumo.com/#/influencers?q=@joelgrus&type=influ... (press "View Links Shared", then the "Analyze Links" tab)



From the first paragraph: "people vote [links] up or down." Um... can't links only be voted up?


Past a certain karmic threshold, both are allowed.


Are you sure? I have over 60k karma and still can't downvote links.


Nope, only comments. The only actions you can take on a link are upvoting or flagging it.


It seems there is a small but very strong subculture of Hacker News readers who enjoy reading and discussing mathematical things. I would love to have a separate feed of those stories (and then after I'm done I could browse the HN front page), and I have often thought about the possibility of writing a program to do that.

DataTau (the HN for data mining) seems to have failed, so I imagine a filter is the way to go rather than making a new website.
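
A crude version of that filter, piggybacking on the Algolia API mentioned elsewhere in this thread (the keyword list is just my guess at what counts as mathematical):

    import requests

    MATH_WORDS = {"math", "theorem", "proof", "topology", "algebra",
                  "geometry", "calculus", "prime", "number theory"}

    def math_stories():
        # front_page is one of the documented tags on the Algolia HN API
        r = requests.get("https://hn.algolia.com/api/v1/search",
                         params={"tags": "front_page", "hitsPerPage": 100})
        for hit in r.json()["hits"]:
            title = (hit.get("title") or "").lower()
            if any(w in title for w in MATH_WORDS):
                yield hit["title"], hit.get("url")

Not a substitute for a real classifier, but it would give you the separate math feed without building a whole site.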


I think one of the challenges of hosting a "sub-HN" is that the hosting costs are hard to justify.

This raises the question: how does YC justify the hosting costs? My completely-off-the-cuff-assumption-take-this-with-a-huge-grain-of-salt is that YC benefits by having a huge audience to make announcements to, like job postings at YC-funded companies, various pg essays, or just investing in overall goodwill from the HN audience. Probably the most likely reason is to increase deal flow to Y Combinator itself, though.


Why does it need to have some justification beyond being a fun hobby? HN is hosted on a single server, and it probably uses less than 1TB/month, so it's not that expensive for someone with a Bay Area tech salary, let alone the whole of YC.


I just use the 50 or 100 point minimum feed in my reader, and skip articles that don't look interesting based on how much time I want to spend. Sometimes I only read articles if they're a day old and the first comment makes them look interesting.


Cool story bro. You may want to check out DigitalOcean for hosting - their cheapest option is only $5 a month, about equivalent to the smallest $40/month AWS option. It's very simple as well.


I saw this the first time around. Cool, but I still have the same problem I had then: confirmation bias.


Interesting. This will go into today's Top5HN newsletter. Sign up at top5hn.launchrock.co


Wait, this is 2 years ago?



