This is totally cool, and kudos to you. But be aware there are limits to a personalized Hacker News. I think Bill Maher put it best when ranting just last night about Facebook's customized news feeds:
Newspapers may be old-fashioned, but here's what we're losing if you never see one. They are trying to tell you what's actually important, not just what's important to you. You may not read the whole paper, but you at least see headlines, making you aware that something's going on outside of your microtargeted world of fashion or music or Wiccans or zombies or whatever you're into.
Replace 'newspapers' with Hacker News and you get the point.
But in the case of filtering Hacker News, you're taking that pre-filtered list (filtered by geeks and startup folk and what have you) and filtering it even more.
Whether or not that bubble is a subset of some other bubble, it's still a bubble.
Newspapers are trying to show you what they think is actually important. But of course omnibus papers will never really tell you what is important, only what is current.
It's exactly the kind of thing I would build and abandon. So maybe the more interesting question is why you don't use it. I presume it's a UI thing, or the classifier is unreliable, or something?
I would love to hear why vanilla HN is better now.
I don't know that vanilla HN is better now. I abandoned it for two main reasons:
1. The Hacker News API I was using was very unreliable and would go down for days or weeks at a time, which broke the whole pipeline.
2. I was consuming this as an RSS feed, but when Google Reader shut down I abandoned my RSS habit cold turkey, so now I pretty much only read sites that I visit directly, or things people link to on FB / Twitter.
I implemented the same thing for reddit a while back, and learned the same lesson :)
However, I haven't completely given up on the idea. The point is that while limiting news to things you've liked in the past is a terrible idea, helping you find good articles isn't.
The question is, what constitutes a good news item? I think the key is that there are many answers to this question:
1. Something very specific to your interests (your classifier should be good at detecting this)
2. Something big that everyone should hear about (the HN frontpage is good for this)
3. Something about a new concept you've never heard of. I've implemented this by keeping track of popular words on reddit and giving a boost to words not seen in the past, with some success (see the sketch below).
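For concreteness, here's a minimal sketch of that novelty boost, assuming a bag-of-words approach. The names and the scoring formula are illustrative, not the actual implementation:

    from collections import Counter
    import re

    seen_words = Counter()  # running counts of words seen in past titles

    def tokenize(title):
        return re.findall(r"[a-z']+", title.lower())

    def novelty_boost(title):
        """Fraction of words in the title never seen before."""
        words = tokenize(title)
        if not words:
            return 0.0
        unseen = sum(1 for w in words if seen_words[w] == 0)
        return unseen / len(words)

    def record(title):
        """Call after scoring, so today's titles inform tomorrow's counts."""
        seen_words.update(tokenize(title))

A title full of familiar words scores 0, while one introducing brand-new vocabulary scores close to 1, which you can then blend into whatever ranking you already have.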
"The model can only get better with more training data, which requires me to judge whether I like stories or not. I do this occasionally when there’s nothing interesting on Facebook. Right now this is just the above command-line tool, but maybe I’ll come up with something better in the future."
If you let your program log into HN with your account, it should be able to tell which stories you've upvoted there. If you use that as the input to your classifier, then as you read stories on HN, you can simply mark the ones that interest you by upvoting them.
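A rough sketch of that, assuming HN's login form fields and the current page markup (the form fields and CSS selectors are guesses and may need adjusting):

    import requests
    from bs4 import BeautifulSoup

    session = requests.Session()
    # Log in so the 'upvoted' page is visible (it's private to your account).
    session.post("https://news.ycombinator.com/login",
                 data={"acct": "your_username", "pw": "your_password"})

    resp = session.get("https://news.ycombinator.com/upvoted",
                       params={"id": "your_username"})
    soup = BeautifulSoup(resp.text, "html.parser")

    # Each story row is a tr.athing; the title link sits in span.titleline.
    positives = [a.get_text()
                 for a in soup.select("tr.athing span.titleline > a")]

The resulting titles become positive training examples, so the labeling happens as a side effect of normal HN use.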
I'm also curious to know whether the stories are weighted by age to account for changes in what you find interesting.
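The post doesn't say, but a simple way to add that would be an exponential decay on each training example's weight, e.g. a 90-day half-life. The function and numbers below are illustrative, not from the original post:

    import time

    def example_weight(created_at_i, half_life_days=90.0):
        """Exponential decay: a vote half_life_days old counts half as much."""
        age_days = (time.time() - created_at_i) / 86400.0
        return 0.5 ** (age_days / half_life_days)

Most classifiers that accept per-sample weights can take this directly, so old upvotes fade out instead of dominating the model forever.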
FYI, the new Hacker News API allows easy programmatic access to stories and comments, plus infinite chronological paging. You could download every Hacker News story in less than 2 hours without breaking the API request limit.
Yup, I've switched over my Chrome extension [1] to use Algolia's API instead of HNSearch (which is shutting down), and so far, it seems to be working peachy.
"the new Hacker News API allows easy programmatic access to stories and comments, plus infinite chronological paging. You could download every Hacker News story in less than 2 hours without breaking the API request limit."
Using that API, you could download 48,000 Hacker News stories in 2 days (at one story per request and 1,000 requests per hour), so if there have been fewer than 48,000 submissions, then what minimaxir said is true. But first you'd need to generate a list of all 48,000 story IDs, and there seems to be no way to actually do that.
No need to generate all IDs beforehand. The search_by_date endpoint is fine. You have to paginate using the created_at_i parameter, not the page parameter.
You can paginate using the created_at_i parameter (edited OP)
Just pass created_at_i<X, where X is the timestamp of the earliest submission you've seen so far.
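For concreteness, here's a sketch of that crawl against Algolia's documented search_by_date endpoint. The 4-second sleep is my own conservative choice to stay under the 1,000 requests/hour limit:

    import time
    import requests

    BASE = "https://hn.algolia.com/api/v1/search_by_date"

    def all_stories():
        oldest = None  # created_at_i of the earliest story fetched so far
        while True:
            params = {"tags": "story", "hitsPerPage": 1000}
            if oldest is not None:
                params["numericFilters"] = "created_at_i<%d" % oldest
            hits = requests.get(BASE, params=params).json()["hits"]
            if not hits:
                return  # nothing older remains
            yield from hits
            oldest = hits[-1]["created_at_i"]
            time.sleep(4)  # ~900 requests/hour, safely under the limit

At 1,000 stories per request, 1.26M stories is roughly 1,300 requests; at 4 seconds apiece that's well under two hours, consistent with the estimate above.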
I was able to download 500k stories (i.e. about half of HN's 1.26M stories) before I ran into memory issues; I've fixed them and am downloading the rest.
That's incredible. Will you upload the raw database somewhere, please? If you make a torrent, I'll help seed it. Would you email me at sillysaurus3@gmail.com whenever it's ready?
RATE LIMITS
We are limiting the number of API requests from a single IP to 1000 per hour. If you or your application has been blacklisted and you think there has been an error, please contact us.
Yes, that was there, and the math is still correct.
You can query 1,000 stories per request, and do 1,000 requests per hour. That's 1M stories per hour. There are 1.26M Hacker News stories indexed by the API. :)
EDIT: Finished downloading all the entries (and can confirm that 1.26M is indeed all of them). Took 3 hours due to a conservative wait period between each request to make sure I stayed within the limits.
If this is a solution to the "not enough links that I personally like" problem - kudos to the author. It's nice to find a fun project to work on that also solves a problem for them.
I personally despise recommendation / personalization algorithms of any kind. I still have never found one that's actually better than myself at distinguishing articles that I'd like to read, music that I want to listen to, tweets I'd like to see, etc.
When reading HN, I'm constantly surprised by links that would not normally be on my radar for things I'm interested in. I think personalization algos, in general, are good at filtering those away.
Since the author mentioned HN being too much of a firehose, and this then also being a solution to the "too many links to keep up to date on" problem, the solution might be a bit simpler than the author suggested: the 'best' page (news.ycombinator.com/best). It's hard to find (it's in the 'Lists' section in the footer), but it's still there, and I use it all the time when I haven't been actively reading HN for a while.
It seems there is a small but very strong subculture of Hacker News readers who enjoy reading and discussing mathematical things. I would love to have a separate feed of those stories (and then, after I'm done, I could browse the HN front page), and I have often thought about writing a program to do that.
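As a crude starting point, assuming a hand-picked keyword list (a trained classifier would do better), the official Firebase API makes such a feed easy to sketch:

    import requests

    MATH_WORDS = {"theorem", "proof", "conjecture", "topology", "algebra",
                  "prime", "geometry", "calculus", "combinatorics"}

    ids = requests.get(
        "https://hacker-news.firebaseio.com/v0/topstories.json").json()

    for story_id in ids[:100]:  # check the current top 100 stories
        item = requests.get(
            "https://hacker-news.firebaseio.com/v0/item/%d.json"
            % story_id).json()
        title = (item or {}).get("title", "")
        if any(word in title.lower() for word in MATH_WORDS):
            print(title)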
DataTau (the HN for data mining) seems to have failed, so I imagine a filter is the way to go rather than making a new website.
I think one of the challenges of hosting a "sub-HN" is that the hosting costs are hard to justify.
This raises the question: How does YC justify hosting costs? My completely-off-the-cuff-assumption-take-this-with-a-huge-grain-of-salt is that YC benefits by having a huge audience to make announcements to, like job postings at YC funded companies, various pg essays, or just investing in overall goodwill from the HN audience. Probably the most likely reason is to increase deal-flow to YCombinator itself, though.
Why does it need to have some justification beyond being a fun hobby? HN is hosted on a single server, and it probably uses less than 1TB/month, so it's not that expensive for someone with a Bay Area tech salary, let alone the whole YC.
I just use the 50 or 100 point minimum feed in my reader, and skip articles that don't look interesting based on how much time I want to spend. Sometimes I only read articles if they're a day old and the first comment makes them look interesting.
Cool story bro. You may want to check out DigitalOcean for hosting - their cheapest option is only $5 a month, roughly equivalent to an AWS option that costs $40/month. It's very simple as well.
Here's the Bill Maher clip quoted above: https://www.youtube.com/watch?v=WohtmZDZCGM