Awesome stuff!
I tried a couple of searches, and the search ranking looks decently good (at first glance, at least).
I also like the fading of text as the results become less relevant.
What kind of relevance algorithms are you using? Since this site is targeted at hackers, who tend to like more control, you could expose some of the parameters and allow users to tweak their relevance.
For instance, sliders that let you determine the importance of article age, # of comments in the article, karma points of article, avg karma points of readers, and of course the pattern match counts...
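To make the slider idea concrete, here is a minimal sketch of user-tunable ranking weights. All field names and default weights are hypothetical, purely for illustration:

```python
# Hypothetical sketch: combine per-article signals into one relevance
# score using user-adjustable weights (the "sliders"). Field names and
# default weights are made up for illustration.

def relevance(article, weights):
    score = 0.0
    for signal, weight in weights.items():
        score += weight * article.get(signal, 0.0)
    return score

# Default slider positions; a UI would let the user drag these.
weights = {
    "text_match": 1.0,    # pattern-match count for the query
    "recency": 0.5,       # inverse of article age
    "num_comments": 0.2,
    "karma": 0.3,
}

article = {"text_match": 4, "recency": 0.8, "num_comments": 10, "karma": 50}
print(relevance(article, weights))  # approximately 21.4
```

A user who cares only about text match could zero out the other weights; the ranking degrades gracefully to pure pattern matching.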
I tried a couple of searches, and the search ranking looks decently good (at first glance, at least).
It's based solely upon the text in the comment thread. I was actually pretty surprised this worked as well as it did. I am currently crawling outlinks, which should hopefully improve relevancy even more, as well as discover more topics.
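Since the author says ranking is based solely on thread text, here is a toy sketch of how text-only relevance can work, using plain tf-idf. This is a generic illustration of the idea, not the site's actual code:

```python
import math
from collections import Counter

# Toy tf-idf ranker over comment threads, to illustrate text-only
# relevance scoring (a sketch, not the site's implementation).

docs = [
    "lisp macros and functional programming",
    "startup funding and venture capital",
    "functional programming in haskell",
]

tokenized = [d.split() for d in docs]
df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
N = len(docs)

def score(query, doc):
    tf = Counter(doc)
    # rare terms (low df) contribute more than common ones
    return sum(tf[t] * math.log(N / df[t]) for t in query.split() if t in df)

ranked = sorted(range(N), key=lambda i: score("functional programming", tokenized[i]),
                reverse=True)
print(ranked)  # [0, 2, 1]: the two programming docs outrank the funding doc
```

Crawled outlinks would simply extend the corpus, which tends to sharpen the document-frequency statistics and surface new terms.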
Since this site is targeted at hackers, who tend to like more control, you could expose some of the parameters and allow users to tweak their relevance.
Rather than drilling specifically into Hacker News, I'm more interested in exposing functionality to other hackers by building an API that will allow them to auto-organize their sites too. The one issue is that indexing is a batch, off-line process, and most APIs are built in a real-time, on-demand setting.
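One common way to bridge that gap is a job-based API: clients submit a site for indexing and poll for completion instead of expecting a real-time answer. The endpoint names and job structure below are hypothetical, just to sketch the pattern:

```python
import uuid

# Sketch of a batch-friendly API: submit a crawl/index job, run the
# offline pass on its own schedule, and let clients poll for results.
# All names here are illustrative, not a real API.

jobs = {}

def submit(site_url):
    """POST /jobs -- enqueue a site for batch indexing."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"site": site_url, "status": "queued", "result": None}
    return job_id

def run_batch():
    """The offline indexing pass, run on whatever schedule suits."""
    for job in jobs.values():
        if job["status"] == "queued":
            job["result"] = {"tags": ["example-tag"]}  # placeholder output
            job["status"] = "done"

def poll(job_id):
    """GET /jobs/<id> -- check status until the batch has run."""
    return jobs[job_id]["status"]

jid = submit("http://example.com")
print(poll(jid))  # queued
run_batch()
print(poll(jid))  # done
```

The client-facing API stays simple and on-demand even though the expensive work happens in batches.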
The other day I thought how cool it would be to have a Web service that could crawl your site and auto-categorize all of your pages (or at least help you to do it). As ever, it turns out someone is on the case ;-) Nice work! I think there's definitely a wider audience for this technology.
There are lots of uses for this, but my main advice is not to lose these three things:
1. The relevancy advantage within specific domains. PageRank was a huge value-add to relevancy over other search engines, but going internet-wide is now too ambitious. HN is a great corpus because the content is already vetted by a community. Integrating other specialized communities' content can add density and relevancy.
2. Ease of integration. The less configuration this API requires, the better. Autotagging, done well, is very useful. I have a lot of ideas around this if you'd like to chat some time.
3. An easy-to-use interface. Combining browsable, faceted search with NLP is, I think, the sweet spot: you get lots of relevant results while still allowing for discovery.
I think that'd be quite easy, to be honest :)
I've been working on the same concept as this app; it's hackable in 3-5 weeks :) especially with great tools such as Weka or Mahout available.
It's hard to build a completely custom solution, although it's possible to let people use certain wrapped components, with explanations of how to achieve the best results using certain methods.
Again, one person may be satisfied with a given result while it works less well for another person. Take clustering or topic extraction, for instance: you should have some idea of what you want to get and what the possible outcomes are, and then dig into the data to get what you need. A generic approach will give a broad result set that may require additional effort for real-world use.
Using Python's NLTK and Lucene can produce results like this. I wrote something similar using Wordnet, PHP/Zend Lucene, and Freeling (C++ NLP) for NewsCup.
I think what makes this project interesting to me is the interface and quality of search results. They show a really good understanding of how to use NLP and search in conjunction.
Very cool - can you add dates to the lists of articles? It'd be nice to see how old an article is (or maybe color-code the list to provide two axes of relevance).
Agreed on this point - it was the first thing I looked for.
I was hoping to use this site as an alternate view into HN to find semi-recent submissions covering specific topics.
In addition, it might be nice to expose the popularity of the article somehow. Something that got 24+ votes is probably going to be more relevant to me than something with only 2.
I love this. I am working on a related project (the result, not HN) inspired by Paul Graham's Naive Bayes Spam Filter.
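For readers unfamiliar with the approach being referenced, here is a minimal naive Bayes text classifier in the spirit of Paul Graham's "A Plan for Spam." This is a generic sketch, not the commenter's actual project:

```python
import math
from collections import Counter

# Minimal naive Bayes classifier in the spirit of "A Plan for Spam":
# score each class by log prior plus log word likelihoods, with
# add-one smoothing so unseen words don't zero out a class.

def train(labeled_docs):
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for label, text in labeled_docs:
        for word in text.lower().split():
            counts[label][word] += 1
        totals[label] += 1
    return counts, totals

def classify(text, counts, totals):
    vocab = len(set(counts["spam"]) | set(counts["ham"]))
    scores = {}
    for label in counts:
        score = math.log(totals[label] / sum(totals.values()))  # prior
        denom = sum(counts[label].values()) + vocab
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

data = [("spam", "buy cheap pills now"), ("ham", "lisp macros are elegant"),
        ("spam", "cheap offer buy now"), ("ham", "elegant functional code")]
counts, totals = train(data)
print(classify("buy pills", counts, totals))  # spam
```

The same machinery generalizes from spam/ham to any two-way labeling of comment threads.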
If you have a moment, I would like to hear more about your architecture and interface. It is responsive, clean, accurate, and multi-device ready, and clearly an implementation to be replicated.
I appreciate your work on this (in the past I've just used google with the site option). As a nitpick, one thing I'd do is remove the fluff from the bottom of the page and give an indicator that the large box is for a search term (as opposed to several lines of text which is what it looks like). In fact, a normal sized box would work just fine.
I realize that there is a line that instructs you to enter text in the box, but to me it got lost in the noise of the page (another reason for getting rid of the extra details at the bottom). On the results page, it'd be nice if you showed the results foremost and the related topics as a sidebar. I want my results, and only if I don't find what I'm looking for do I want to know the code's opinion of where I should try to go next.
Anyways, like I said those are just nitpicks. Nice job.
A couple of problems, though. For terms with spaces in them it creates a tag for each word. For example, "stack overflow" has tags for "stack" and "overflow", which aren't all that useful (although yes, the tag stackoverflow is created): http://metaoptimize.com/projects/autotag/hackernews/term/5a/...?
What would be a solution to this? Parse the text so each token includes the first space it encounters and see if that multi word token occurs frequently?
The other problem is stripping out punctuation. The tag c++ doesn't exist, neither does c#. Not sure how you'd include relevant punctuation and strip out the rest.
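Both problems raised above have reasonably standard fixes: recurring bigrams can be promoted to multi-word tags, and the tokenizer can use a pattern that keeps meaningful trailing punctuation like "c++" and "c#". A sketch (the threshold and regex are illustrative choices, not the site's code):

```python
import re
from collections import Counter

# Sketch of the two fixes discussed above: (1) promote frequent bigrams
# like "stack overflow" to candidate multi-word tags, and (2) tokenize
# with a pattern that preserves names such as "c++" and "c#".

TOKEN = re.compile(r"[a-z0-9]+(?:\+\+|#)?")  # word chars, optional trailing ++ or #

def tokenize(text):
    return TOKEN.findall(text.lower())

def candidate_tags(texts, min_count=2):
    unigrams, bigrams = Counter(), Counter()
    for text in texts:
        toks = tokenize(text)
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    tags = {t for t, c in unigrams.items() if c >= min_count}
    # promote bigrams that recur often enough to be real phrases
    tags |= {" ".join(b) for b, c in bigrams.items() if c >= min_count}
    return tags

texts = ["I asked on stack overflow about c++ templates",
         "stack overflow has good c++ answers"]
print(candidate_tags(texts))  # includes "stack overflow" and "c++"
```

A real system would likely weight bigrams by an association measure (e.g. pointwise mutual information) rather than a raw count, but the raw-count version already captures "stack overflow" as a phrase.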
Could be really interesting for creating silos within sites: automagically generating navigation bars in which all the nav links are relevant to the page currently being viewed, removing unrelated clutter and making a site's navigation much more relevant.
Mike Cheng (http://searchyc.com) gave me this data dump a year ago. I am currently crawling the rest of hacker news, as well as outlinks, to fill out this index.
Consider the site right now a proof-of-concept. I'm trying to gauge people's interest level, and get feedback.