Sure, I'm going to be writing a longer blog post on how I made it, but for now here's a short summary:
I made a script that scrapes all the links from Hacker News every 15 minutes. It opens each link and processes the text with Python's NLTK package to decide which words are important and useful. I then store the important words in a suffix tree backed by MongoDB, so that looking up a word returns the set of documents pertaining to it. This way the search time is linear in the length of the query rather than in the number of documents. The rest was just some jQuery AJAX calls and parsing of the search query.
I'll look into a new design, maybe make the oranges white and the whites orange.
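To illustrate why the lookup cost depends on the query length and not on the number of documents, here is a minimal sketch. It is not the author's code: it assumes a plain in-memory, uncompressed suffix trie (a simpler cousin of a suffix tree) instead of a MongoDB backend, and the class name, method names and document ids are all made up for the example.

```python
class SuffixTrieIndex:
    """Toy in-memory suffix trie: maps substrings of keywords to document ids.

    Hypothetical illustration only; a real suffix tree compresses chains of
    single-child nodes, and the original project stores its tree in MongoDB.
    """

    def __init__(self):
        # Each node holds its children (keyed by character) and the set of
        # documents whose keywords pass through that node.
        self.root = {"children": {}, "docs": set()}

    def add(self, word, doc_id):
        # Insert every suffix of the word so that any substring of the word
        # can be matched by walking down from the root.
        for start in range(len(word)):
            node = self.root
            for ch in word[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "docs": set()}
                )
                node["docs"].add(doc_id)

    def lookup(self, query):
        # One step per query character, regardless of how many documents
        # have been indexed.
        node = self.root
        for ch in query:
            node = node["children"].get(ch)
            if node is None:
                return set()
        return set(node["docs"])


index = SuffixTrieIndex()
index.add("analytics", "doc1")
index.add("analysis", "doc2")
index.add("python", "doc3")

print(index.lookup("analy"))
print(index.lookup("lytics"))
```

The trade-off in this naive version is memory: every suffix of every keyword is stored, which is what a compressed suffix tree (or a suffix array) is meant to reduce.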
Great and snappy! I would really love the longer blog post...! Two questions if you have a minute: 1) Why suffix trees and not suffix arrays? 2) How are you implementing them? Did you do the tree building yourself or is there a good library that you recommend? Thanks.
I used a suffix tree over a suffix array because I hadn't heard of suffix arrays, but after glancing at the Wikipedia page for suffix arrays it seems those might have been a good choice too. I'll look more into it. I did all the tree building myself, and I'll explain that in my post. The post should be ready by tomorrow.
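For comparison, a suffix array answers the same kind of substring query with a sorted list of suffix start positions plus binary search. The sketch below is only an illustration under simple assumptions: it uses a naive sort-based construction (fine for short strings, too slow for large corpora), and the function names are hypothetical, not taken from the project or from any library.

```python
def build_suffix_array(text):
    # Naive construction: sort the start positions by the suffix they begin.
    # Faster algorithms exist; this is only meant to show the idea.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions where pattern occurs in text."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix between the two bounds starts with the pattern.
    return sorted(suffix_array[start:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ana"))
```

Each lookup does O(log n) comparisons of at most len(pattern) characters, so it is slightly slower per query than walking a tree, but the array itself is much more compact.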
For this type of thing I've had much better results with LDA in the Python package gensim. It is less prone to mismatches based on similar keywords, since it is context-based. The problem with LDA is that, for it to be most effective, you need to have a taxonomy available for the documents, but you might be able to build a corpus or two out of sites like Stack Overflow.
Awesome. Love it.
One suggestion, as many have pointed out: swap the orange and white colors.
And one request: can you link to the HN page instead of directly to the article? I want to view the discussion on that article too.
I like it! Clean interface and quick search. The top articles feature is handy. Maybe make the search options more customizable though? Being able to tweak the parameters of the search would be awesome.
I noticed that the 'more' links are in there too. You can find one by searching for 'analytics' and then clicking the 'more' button. You'll notice a search result that links to "/news2" :)
My #1 use of hnsearch.com (and before it, searchyc.com) has been to find my own old comments. This has no results for my username, so it can't yet fill that role.
Yeah, I didn't handle any cases for other languages since most of the articles are in English. But I feel ya on making search engines for multiple languages; I had to do that this summer and it was a nightmare.