Ask HN: Algorithm to identify the same events based on semantic similarity
16 points by btw0 on Sept 12, 2008 | 10 comments
I am working on a project that needs to identify the same events described in short phrases by different people. I think the workflow should be something like: 1. extract the meaningful words (nouns, verbs) from the phrases; 2. measure the semantic distances between these words; 3. cluster.
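For concreteness, here is a rough sketch of that workflow, assuming NLTK for part-of-speech tagging; the distance function is just a word-overlap placeholder, since the semantic measure is exactly the part I'm asking about.

    import nltk  # assumes the punkt tokenizer and POS tagger models are downloaded

    def content_words(phrase):
        """Step 1: keep only the nouns and verbs."""
        tagged = nltk.pos_tag(nltk.word_tokenize(phrase.lower()))
        return {word for word, tag in tagged if tag.startswith(("NN", "VB"))}

    def distance(a, b):
        """Step 2: crude word-overlap placeholder, not a real semantic measure."""
        return 1.0 - len(a & b) / max(len(a), len(b), 1)

    def cluster(phrases, threshold=0.5):
        """Step 3: naive greedy clustering by distance threshold."""
        word_sets = [content_words(p) for p in phrases]
        clusters = []
        for i, ws in enumerate(word_sets):
            for members in clusters:
                if all(distance(ws, word_sets[j]) < threshold for j in members):
                    members.append(i)
                    break
            else:
                clusters.append([i])
        return [[phrases[i] for i in members] for members in clusters]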

I am asking for your advice. Thank you.



The current consensus is to use statistics without trying to write a program that "understands" human language.

A common approach is to break your items down into a vector of "features" (words, phrases, tokens, etc) and apply standard IR techniques like kNN, singular value decomposition (SVD), tf/idf, etc.
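For instance, here is a bare-bones version of the feature-vector idea, with hand-rolled tf-idf weights and cosine similarity between phrases (toy data, nothing library-specific):

    import math
    from collections import Counter

    def tfidf_vectors(phrases):
        """One sparse tf-idf vector (dict of word -> weight) per phrase."""
        tokenized = [p.lower().split() for p in phrases]
        df = Counter(w for toks in tokenized for w in set(toks))
        n = len(phrases)
        return [{w: count * (math.log(n / df[w]) + 1.0)   # smoothed idf
                 for w, count in Counter(toks).items()}
                for toks in tokenized]

    def cosine(a, b):
        dot = sum(weight * b[w] for w, weight in a.items() if w in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    vecs = tfidf_vectors(["quake hits northern japan",
                          "strong quake strikes japan",
                          "stock market falls sharply"])
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))  # first pair scores higher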

One interesting thing about your problem is that people use different words and phrasing for the same thing, which means you need a way to equate different tokens with each other. SVD is pretty good because it finds degrees of similarity based on context: 'Pikachu' becomes strongly associated with 'Pokemon' and 'Squirtle', and also with misspellings like 'Picachoo'. SVD is tricky and expensive if you implement it wrong. The good news is that it recently came off-patent so there are a lot of new libraries popping up.
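A crude sketch of the SVD idea (numpy assumed; real LSI pipelines do a lot more, this just shows comparing phrases in a reduced space):

    import numpy as np

    def lsi_similarities(phrases, k=2):
        """SVD of a term-by-phrase count matrix, then cosine similarity
        between phrases in a k-dimensional reduced space."""
        vocab = sorted({w for p in phrases for w in p.lower().split()})
        counts = np.array([[p.lower().split().count(w) for p in phrases]
                           for w in vocab], dtype=float)      # terms x phrases
        U, s, Vt = np.linalg.svd(counts, full_matrices=False)
        docs = (np.diag(s[:k]) @ Vt[:k]).T                    # phrases x k
        norms = np.linalg.norm(docs, axis=1, keepdims=True)
        docs = docs / np.where(norms == 0.0, 1.0, norms)
        return docs @ docs.T                                  # cosine similarity matrix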

When I worked on a keyword search engine prototype for Brasil, I found that a mix of Tanimoto and Ferber similarity scores worked very well for finding similarity between short phrases, and I didn't even need to know how to read Portuguese.

PressFlip is using support vector machines for the similar task of clustering news items. Vowpal Wabbit is a stellar library used by Yahoo; I think they recently open-sourced it. allvoices.com is tackling the same area as well. (Disclaimer: I am a dev at allvoices, though I don't work on that part. :)


How big is the technology gap in this area between large players and startups? I've always wondered whether a more mathy/researchy feature is harder for a small team to handle. I guess if everyone is using OSS libraries it may even the score.


It's smaller than you may imagine. Google's great advantage is that they were one of the first to realize that data tends to beat clever algorithms at a certain scale.

But for restricted domains there are many techniques that are well-known in academic circles that are not yet mainstream. A "mathy/researchy" feature only requires people versed in that research and math. It is surprising how many jaded IR graduates there are in the valley, looking for an interesting problem to solve.


Statistical language processing is a good way to go, but he must have enough data for it to be useful. I'd recommend starting by gathering the data.


Amazon's Mechanical Turk seems like a good way to gather training data, and it might even be cheap enough to use instead of a trained model (at least at first).


What you are trying to do is very hard. Statistical NLP is the way to go for unbounded domains. You might be able to do better semantic analysis for smaller closed domains.

There is a great, relatively new book on NLP that I suggest you take a look at. The chapters on semantics in particular are very useful, and they should give you an idea of how incredibly difficult what you're trying to do is.

Book: http://www.amazon.com/Language-Processing-Prentice-Artificia...


I recommend an IR and/or NLP book.

Check into Jaccard distance. http://en.wikipedia.org/wiki/Jaccard_index
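E.g., treating phrases as word sets:

    def jaccard_distance(a, b):
        """Jaccard distance between two phrases treated as sets of words."""
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        if not (words_a | words_b):
            return 0.0
        return 1.0 - len(words_a & words_b) / len(words_a | words_b)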

You might consider looking at WordNet for generalizing words...
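For example, with NLTK's WordNet interface (naively taking the first sense, which is a big simplification):

    from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is downloaded

    def generalize(word, levels=2):
        """Replace a word with a hypernym a couple of levels up, if one exists."""
        synsets = wn.synsets(word)
        if not synsets:
            return word                   # unknown word: leave it alone
        synset = synsets[0]               # naive: take the most common sense
        for _ in range(levels):
            hypernyms = synset.hypernyms()
            if not hypernyms:
                break
            synset = hypernyms[0]
        return synset.lemmas()[0].name()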


There was a great Google tech talk posted here a while ago, "The Next Generation of Neural Networks" by Geoffrey Hinton (http://www.youtube.com/watch?v=AyzOUbkUf3M), that might be applicable.


Will these people be mentioning the same URLs, by chance? :-)


OK, there is a pretty "simple" way to do it, but I can't divulge all the details because we are using it for my semantic news startup. However, I can give you a hint about what you need to do: find a way to reduce different words to their "base" semantics and then compute a similarity measure of sorts between the different event descriptions.

Traditionally, people have broken text down into word vectors and then calculated a Pearson, Tanimoto, or similar similarity measure between the vectors. The problem with this approach is not the similarity measure but the fact that (as aristus points out) different people use different words to describe the same thing. Computing the Tanimoto score of two sentences like "Bush visits Iraq to encourage soldiers" and "US President arrives in Baghdad to raise troops' morale" will give a very low similarity score if you use just the literal words in the sentences, even though the two are almost identical in meaning. Now suppose there were a way to reduce the semantics of both sentences to their "base": sentence one would become something like "US President present in Middle Eastern country to increase spirit of soldiers", and sentence two would become "US President present in capital of Middle Eastern country to increase spirit of soldiers". If you then take the Tanimoto measure (or any other decent similarity measure) between the semantic "bases" of the two sentences, you get a very high score, which is what is needed.

Like I said, I can't divulge the method for reducing text to its semantic "base", but there is definitely at least one way, which is what we are using. If you are interested, send me an email at haidut@gmail.com, and when I launch my startup within the next couple of weeks I can hopefully divulge more. Sorry for the vagueness, but my lawyers will make mincemeat out of me if I say more :-)

Here is something I can divulge. You can use Pointwise Mutual Information (PMI) to find the semantic relatedness of words in both event descriptions, and if the overall cross-word similarity is above a certain threshold you can consider the two event descriptions similar. To find out how to calculate PMI and why it is useful, read this paper: http://cogprints.org/1796/0/ECML2001.pdf (a toy sketch of the PMI calculation is at the end of this comment).

Overall, what you are trying to do is simple in concept but not easy to implement. For semantic similarity to be computed automatically by a machine, you'd need a mapping of EVERY word in EVERY language to its semantic "base" so you could build a table and compute this efficiently. Such a mapping does not currently exist, and one reason is the constantly changing nature of the English language itself: new words come out all the time, and even if you could track them all, some of them don't have a semantic "base". Using the example above, the fact that Bush is currently President won't be true in a couple of months, so that huge word-to-meaning mapping table would have to account for that; mapping the English language to a semantic base is a (very fast) moving target.
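As promised, a back-of-the-envelope sketch of the PMI part. In the paper the counts come from search-engine hit counts; here it is just a toy in-memory corpus:

    import math

    def pmi(w1, w2, documents):
        """How much more often w1 and w2 co-occur than if they were independent."""
        n = len(documents)
        c1 = sum(1 for d in documents if w1 in d)
        c2 = sum(1 for d in documents if w2 in d)
        c12 = sum(1 for d in documents if w1 in d and w2 in d)
        if not (c1 and c2 and c12):
            return float("-inf")          # no evidence of relatedness
        return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))

    docs = [set(s.lower().split()) for s in [
        "bush visits iraq to encourage soldiers",
        "us president arrives at baghdad to raise troops morale",
        "president bush speaks to troops in iraq",
    ]]
    print(pmi("bush", "president", docs) > pmi("bush", "morale", docs))  # True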

That's all I have for now. Good luck in your endeavor.



