Ask HN: Algorithm to identify the same events based on semantic similarity
16 points by btw0 on Sept 12, 2008 | 10 comments
I am working on a project that needs to identify the same events described in short phrases by different people. I think the workflow should be something like: 1. extract the meaningful words (nouns, verbs) from the phrases; 2. measure the semantic distances between these words; 3. cluster.
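For concreteness, here is a rough sketch of that workflow, assuming NLTK for part-of-speech tagging; the distance function is just a word-overlap placeholder, since the semantic measure is exactly the part I'm asking about.

    import nltk  # assumes the punkt tokenizer and POS tagger models are downloaded

    def content_words(phrase):
        """Step 1: keep only the nouns and verbs."""
        tagged = nltk.pos_tag(nltk.word_tokenize(phrase.lower()))
        return {word for word, tag in tagged if tag.startswith(("NN", "VB"))}

    def distance(a, b):
        """Step 2: crude word-overlap placeholder, not a real semantic measure."""
        return 1.0 - len(a & b) / max(len(a), len(b), 1)

    def cluster(phrases, threshold=0.5):
        """Step 3: naive greedy clustering by distance threshold."""
        word_sets = [content_words(p) for p in phrases]
        clusters = []
        for i, ws in enumerate(word_sets):
            for members in clusters:
                if all(distance(ws, word_sets[j]) < threshold for j in members):
                    members.append(i)
                    break
            else:
                clusters.append([i])
        return [[phrases[i] for i in members] for members in clusters]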

I am asking for your advice. Thank you.



The current consensus is to use statistics without trying to write a program that "understands" human language.

A common approach is to break your items down into a vector of "features" (words, phrases, tokens, etc) and apply standard IR techniques like kNN, singular value decomposition (SVD), tf/idf, etc.
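For instance, here is a bare-bones version of the feature-vector idea, with hand-rolled tf-idf weights and cosine similarity between phrases (toy data, nothing library-specific):

    import math
    from collections import Counter

    def tfidf_vectors(phrases):
        """One sparse tf-idf vector (dict of word -> weight) per phrase."""
        tokenized = [p.lower().split() for p in phrases]
        df = Counter(w for toks in tokenized for w in set(toks))
        n = len(phrases)
        return [{w: count * (math.log(n / df[w]) + 1.0)   # smoothed idf
                 for w, count in Counter(toks).items()}
                for toks in tokenized]

    def cosine(a, b):
        dot = sum(weight * b[w] for w, weight in a.items() if w in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    vecs = tfidf_vectors(["quake hits northern japan",
                          "strong quake strikes japan",
                          "stock market falls sharply"])
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))  # first pair scores higher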

One interesting thing about your problem is that people use different words and phrasing for the same thing, which means you need a way to equate different tokens with each other. SVD is pretty good because it finds degrees of similarity based on context: 'Pikachu' becomes strongly associated with 'Pokemon' and 'Squirtle', and also with misspellings like 'Picachoo'. SVD is tricky and expensive if you implement it wrong. The good news is that it recently came off-patent so there are a lot of new libraries popping up.
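A crude sketch of the SVD idea (numpy assumed; real LSI pipelines do a lot more, this just shows comparing phrases in a reduced space):

    import numpy as np

    def lsi_similarities(phrases, k=2):
        """SVD of a term-by-phrase count matrix, then cosine similarity
        between phrases in a k-dimensional reduced space."""
        vocab = sorted({w for p in phrases for w in p.lower().split()})
        counts = np.array([[p.lower().split().count(w) for p in phrases]
                           for w in vocab], dtype=float)      # terms x phrases
        U, s, Vt = np.linalg.svd(counts, full_matrices=False)
        docs = (np.diag(s[:k]) @ Vt[:k]).T                    # phrases x k
        norms = np.linalg.norm(docs, axis=1, keepdims=True)
        docs = docs / np.where(norms == 0.0, 1.0, norms)
        return docs @ docs.T                                  # cosine similarity matrix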

When I worked on a keyword search engine prototype for Brasil, I found that a mix of Tanimoto and Ferber similarity scores worked very well for finding similarity between short phrases, and I didn't even need to know how to read Portuguese.

PressFlip is using support vector machines for the similar task of clustering news items. Vowpal Wabbit is a stellar library used by Yahoo; I think they recently open-sourced it. allvoices.com is tackling the same area as well. (Disclaimer: I am a dev at allvoices, though I don't work on that part. :)


How big is the technology gap in this area between large players and startups? I've always wondered whether a more mathy/researchy feature is harder for a small team to handle. I guess if everyone is using OSS libraries it may even the score.


It's smaller than you may imagine. Google's great advantage is that they were one of the first to realize that data tends to beat clever algorithms at a certain scale.

But for restricted domains there are many techniques that are well-known in academic circles that are not yet mainstream. A "mathy/researchy" feature only requires people versed in that research and math. It is surprising how many jaded IR graduates there are in the valley, looking for an interesting problem to solve.


Statistical language processing is a good way to go, but he must have enough data for it to be useful. I'd recommend starting by gathering the data.


Amazon's Mechanical Turk seems like a good way to gather training data, and it might even be cheap enough to use instead of a trained model (at least at first).


What you are trying to do is very hard. Statistical NLP is the way to go for unbounded domains. You might be able to do better semantic analysis for smaller closed domains.

There is a great, relatively new book on NLP that I suggest you take a look at. The chapters on semantics in particular are very useful, and they should give you an idea of how incredibly difficult what you're trying to do is.

Book: http://www.amazon.com/Language-Processing-Prentice-Artificia...


I recommend an IR and/or NLP book.

Check into Jaccard distance. http://en.wikipedia.org/wiki/Jaccard_index
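E.g., treating phrases as word sets:

    def jaccard_distance(a, b):
        """Jaccard distance between two phrases treated as sets of words."""
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        if not (words_a | words_b):
            return 0.0
        return 1.0 - len(words_a & words_b) / len(words_a | words_b)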

You might consider looking at WordNet for generalizing words...
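For example, with NLTK's WordNet interface (naively taking the first sense, which is a big simplification):

    from nltk.corpus import wordnet as wn  # assumes the WordNet corpus is downloaded

    def generalize(word, levels=2):
        """Replace a word with a hypernym a couple of levels up, if one exists."""
        synsets = wn.synsets(word)
        if not synsets:
            return word                   # unknown word: leave it alone
        synset = synsets[0]               # naive: take the most common sense
        for _ in range(levels):
            hypernyms = synset.hypernyms()
            if not hypernyms:
                break
            synset = hypernyms[0]
        return synset.lemmas()[0].name()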


There was a great Google tech talk posted here a while ago, "The Next Generation of Neural Networks" by Geoffrey Hinton (http://www.youtube.com/watch?v=AyzOUbkUf3M), that might be applicable.


Will these people be mentioning the same URLs, by chance? :-)


OK, there is a pretty "simple" way to do it, but I can't divulge all the details because we are using it for my semantic news startup. However, I can give you a hint about what you need to do: find a way to reduce different words to their "base" semantics and then compute a similarity measure of sorts between the different event descriptions.

Traditionally, people have broken text down into word vectors and then calculated a Pearson, Tanimoto, or similar similarity measure between the vectors. The problem with this approach is not the similarity measure but the fact that (as aristus points out) different people use different words to describe the same thing. Computing the Tanimoto score of two sentences like "Bush visits Iraq to encourage soldiers" and "US President arrives in Baghdad to raise troops' morale" will give a very low similarity score if you use just the literal words in the sentences, even though the two are almost identical in meaning. Now suppose there were a way to reduce the semantics of both sentences to their "base": sentence one would become something like "US President present in Middle Eastern country to increase spirit of soldiers", and sentence two would become "US President present in capital of Middle Eastern country to increase spirit of soldiers". If you then take the Tanimoto measure (or any other decent similarity measure) between the semantic "bases" of the two sentences, you get a very high score, which is what is needed.

Like I said, I can't divulge the method for reducing text to its semantic "base", but there is definitely at least one way, which is what we are using. If you are interested, send me an email at haidut@gmail.com, and when I launch my startup within the next couple of weeks I can hopefully divulge more. Sorry for the vagueness, but my lawyers will make mincemeat out of me if I say more :-)

Here is something I can divulge. You can use Pointwise Mutual Information (PMI) to find the semantic relatedness of words in both event descriptions, and if the overall cross-word similarity is above a certain threshold you can consider the two event descriptions similar. To find out how to calculate PMI and why it is useful, read this paper: http://cogprints.org/1796/0/ECML2001.pdf (a toy sketch of the PMI calculation is at the end of this comment).

Overall, what you are trying to do is simple in concept but not easy to implement. For semantic similarity to be computed automatically by a machine, you'd need a mapping of EVERY word in EVERY language to its semantic "base" so you could build a table and compute this efficiently. Such a mapping does not currently exist, and one reason is the constantly changing nature of the English language itself: new words come out all the time, and even if you could track them all, some of them don't have a semantic "base". Using the example above, the fact that Bush is currently President won't be true in a couple of months, so that huge word-to-meaning mapping table would have to account for that; mapping the English language to a semantic base is a (very fast) moving target.
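As promised, a back-of-the-envelope sketch of the PMI part. In the paper the counts come from search-engine hit counts; here it is just a toy in-memory corpus:

    import math

    def pmi(w1, w2, documents):
        """How much more often w1 and w2 co-occur than if they were independent."""
        n = len(documents)
        c1 = sum(1 for d in documents if w1 in d)
        c2 = sum(1 for d in documents if w2 in d)
        c12 = sum(1 for d in documents if w1 in d and w2 in d)
        if not (c1 and c2 and c12):
            return float("-inf")          # no evidence of relatedness
        return math.log2((c12 / n) / ((c1 / n) * (c2 / n)))

    docs = [set(s.lower().split()) for s in [
        "bush visits iraq to encourage soldiers",
        "us president arrives at baghdad to raise troops morale",
        "president bush speaks to troops in iraq",
    ]]
    print(pmi("bush", "president", docs) > pmi("bush", "morale", docs))  # True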

That's all I have for now. Good luck in your endeavor.



