Awesome stuff!
I tried a couple of searches, and the search ranking looks decently good (at first glance, at least).
I also like the fading of text as the results become less relevant.
What kind of relevance algorithms are you using? Since this site is targeted at hackers, who tend to like more control, you could expose some of the parameters and allow users to tweak their relevance.
For instance, sliders that let you determine the importance of article age, # of comments in the article, karma points of article, avg karma points of readers, and of course the pattern match counts...
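To make the slider idea concrete, here is a minimal sketch of user-tunable ranking weights. All field names and default weights are hypothetical, purely for illustration:

```python
# Hypothetical sketch: combine per-article signals into one relevance
# score using user-adjustable weights (the "sliders"). Field names and
# default weights are made up for illustration.

def relevance(article, weights):
    score = 0.0
    for signal, weight in weights.items():
        score += weight * article.get(signal, 0.0)
    return score

# Default slider positions; a UI would let the user drag these.
weights = {
    "text_match": 1.0,    # pattern-match count for the query
    "recency": 0.5,       # inverse of article age
    "num_comments": 0.2,
    "karma": 0.3,
}

article = {"text_match": 4, "recency": 0.8, "num_comments": 10, "karma": 50}
print(relevance(article, weights))  # approximately 21.4
```

A user who cares only about text match could zero out the other weights; the ranking degrades gracefully to pure pattern matching.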
I tried a couple of searches, and the search ranking looks decently good (at first glance, at least).
It's based solely upon the text in the comment thread. I was actually pretty surprised this worked as well as it did. I am currently crawling outlinks, which should hopefully improve relevancy even more, as well as discover more topics.
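Since the author says ranking is based solely on thread text, here is a toy sketch of how text-only relevance can work, using plain tf-idf. This is a generic illustration of the idea, not the site's actual code:

```python
import math
from collections import Counter

# Toy tf-idf ranker over comment threads, to illustrate text-only
# relevance scoring (a sketch, not the site's implementation).

docs = [
    "lisp macros and functional programming",
    "startup funding and venture capital",
    "functional programming in haskell",
]

tokenized = [d.split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
N = len(docs)

def score(query, doc):
    tf = Counter(doc)
    # rare terms (low df) contribute more than common ones
    return sum(tf[t] * math.log(N / df[t]) for t in query.split() if t in df)

ranked = sorted(range(N), key=lambda i: score("functional programming", tokenized[i]),
                reverse=True)
print(ranked)  # [0, 2, 1]: the two programming docs outrank the funding doc
```

Crawled outlinks would simply extend the corpus, which tends to sharpen the document-frequency statistics and surface new terms.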
Since this site is targeted at hackers, who tend to like more control, you could expose some of the parameters and allow users to tweak their relevance.
Rather than drilling specifically into Hacker News, I'm more interested in exposing functionality to other hackers by building an API that will allow them to auto-organize their sites too. The one issue is that indexing is a batch, off-line process, and most APIs are built in a real-time, on-demand setting.
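One common way to bridge that gap is a job-based API: clients submit a site for indexing and poll for completion instead of expecting a real-time answer. The endpoint names and job structure below are hypothetical, just to sketch the pattern:

```python
import uuid

# Sketch of a batch-friendly API: submit a crawl/index job, run the
# offline pass on its own schedule, and let clients poll for results.
# All names here are illustrative, not a real API.

jobs = {}

def submit(site_url):
    """POST /jobs -- enqueue a site for batch indexing."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"site": site_url, "status": "queued", "result": None}
    return job_id

def run_batch():
    """The offline indexing pass, run on whatever schedule suits."""
    for job in jobs.values():
        if job["status"] == "queued":
            job["result"] = {"tags": ["example-tag"]}  # placeholder output
            job["status"] = "done"

def poll(job_id):
    """GET /jobs/<id> -- check status until the batch has run."""
    return jobs[job_id]["status"]

jid = submit("http://example.com")
print(poll(jid))  # queued
run_batch()
print(poll(jid))  # done
```

The client-facing API stays simple and on-demand even though the expensive work happens in batches.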
The other day I thought how cool it would be to have a Web service that could crawl your site and auto-categorize all of your pages (or at least help you to do it). As ever, it turns out someone is on the case ;-) Nice work! I think there's definitely a wider audience for this technology.
There are lots of uses for this, but my main advice is not to lose these three things:
1. The relevancy advantage within specific domains. PageRank was a huge value-add to relevancy over other search engines, but going internet-wide is now too ambitious. HN is a great corpus because the content is already vetted by a community. Integrating other specialized communities' content can add density and relevancy.
2. Ease of integration. The less configuration this API requires, the better. Autotagging, done well, is very useful. I have a lot of ideas around this if you'd like to chat some time.
3. An easy-to-use interface. Combining browsable, faceted search with NLP is, I think, the sweet spot: you get lots of relevant results while still allowing for discovery.
I think that'd be quite easy, to be honest :)
I've been working on the same concept as this app; it's hackable in 3-5 weeks :) especially with great tools such as Weka or Mahout available.
It's hard to build a completely custom solution, although it's possible to let people use certain wrapped components, with explanations of how to achieve the best results using certain methods.
Again, one person may be satisfied with a given result while it works less well for another person. Take clustering or topic extraction, for instance: you should have some idea of what you want to get and what the possible outcomes are, and then dig into the data to get what you need. A generic approach will give a broad result set that may require additional effort for real-world use.
Using Python's NLTK and Lucene can produce results like this. I wrote something similar using Wordnet, PHP/Zend Lucene, and Freeling (C++ NLP) for NewsCup.
I think what makes this project interesting to me is the interface and quality of search results. They show a really good understanding of how to use NLP and search in conjunction.
Very cool - can you add dates to the lists of articles? It'd be nice to see how old an article is (or maybe color-code the list to provide two axes of relevance).
Agreed on this point - it was the first thing I looked for.
I was hoping to use this site as an alternate view into HN to find semi-recent submissions covering specific topics.
In addition, it might be nice to expose the popularity of the article somehow. Something that got 24+ votes is probably going to be more relevant to me than something with only 2.
I love this. I am working on a related project (the result, not HN) inspired by Paul Graham's Naive Bayes Spam Filter.
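For readers unfamiliar with the approach being referenced, here is a minimal naive Bayes text classifier in the spirit of Paul Graham's "A Plan for Spam." This is a generic sketch, not the commenter's actual project:

```python
import math
from collections import Counter

# Minimal naive Bayes classifier in the spirit of "A Plan for Spam":
# score each class by log prior plus log word likelihoods, with
# add-one smoothing so unseen words don't zero out a class.

def train(labeled_docs):
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in labeled_docs:
        for word in text.lower().split():
            counts[label][word] += 1
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    vocab = len(set(counts["spam"]) | set(counts["ham"]))
    scores = {}
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))  # prior
        denom = sum(counts[label].values()) + vocab
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

data = [("spam", "buy cheap pills now"), ("ham", "lisp macros are elegant"),
        ("spam", "cheap offer buy now"), ("ham", "elegant functional code")]
counts, totals = train(data)
print(classify("buy pills", counts, totals))  # spam
```

The same machinery generalizes from spam/ham to any two-way labeling of comment threads.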
If you have a moment, I would like to hear more about your architecture and interface. It is responsive, clean, accurate, and multi-device ready, and clearly an implementation to be replicated.
I appreciate your work on this (in the past I've just used google with the site option). As a nitpick, one thing I'd do is remove the fluff from the bottom of the page and give an indicator that the large box is for a search term (as opposed to several lines of text which is what it looks like). In fact, a normal sized box would work just fine.
I realize that there is a line that instructs you to enter text in the box, but to me it got lost in the noise of the page (another reason for getting rid of the extra details at the bottom). On the results page, it'd be nice if you showed the results foremost and the related topics as a sidebar. I want my results, and only if I don't find what I'm looking for do I want to know the code's opinion of where I should try to go next.
Anyways, like I said those are just nitpicks. Nice job.
A couple of problems, though. For terms with spaces in them it creates a tag for each word. For example, "stack overflow" has tags for "stack" and "overflow", which aren't all that useful (although yes, the tag stackoverflow is created): http://metaoptimize.com/projects/autotag/hackernews/term/5a/...?
What would be a solution to this? Parse the text so each token includes the first space it encounters and see if that multi word token occurs frequently?
The other problem is stripping out punctuation. The tag c++ doesn't exist, neither does c#. Not sure how you'd include relevant punctuation and strip out the rest.
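Both problems raised above have reasonably standard fixes: recurring bigrams can be promoted to multi-word tags, and the tokenizer can use a pattern that keeps meaningful trailing punctuation like "c++" and "c#". A sketch (the threshold and regex are illustrative choices, not the site's code):

```python
import re
from collections import Counter

# Sketch of the two fixes discussed above: (1) promote frequent bigrams
# like "stack overflow" to candidate multi-word tags, and (2) tokenize
# with a pattern that preserves names such as "c++" and "c#".

TOKEN = re.compile(r"[a-z0-9]+(?:\+\+|#)?")  # word chars, optional trailing ++ or #

def tokenize(text):
    return TOKEN.findall(text.lower())

def candidate_tags(texts, min_count=2):
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        toks = tokenize(text)
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    tags = {t for t, c in unigrams.items() if c >= min_count}
    # promote bigrams that recur often enough to be real phrases
    tags |= {" ".join(b) for b, c in bigrams.items() if c >= min_count}
    return tags

texts = ["I asked on stack overflow about c++ templates",
         "stack overflow has good c++ answers"]
print(candidate_tags(texts))  # includes "stack overflow" and "c++"
```

A real system would likely weight bigrams by an association measure (e.g. pointwise mutual information) rather than a raw count, but the raw-count version already captures "stack overflow" as a phrase.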
Could be really interesting for creating silos within sites: automagically generating navigation bars in which all the nav links are relevant to the page currently being viewed, removing unrelated clutter and making a site's navigation much more relevant.
Mike Cheng (http://searchyc.com) gave me this data dump a year ago. I am currently crawling the rest of hacker news, as well as outlinks, to fill out this index.
Consider the site right now a proof-of-concept. I'm trying to gauge people's interest level, and get feedback.