I am working on a project that needs to identify the same events described in short phrases by different people. I think the workflow should be something like: 1. extract the meaningful words (nouns, verbs) from the phrases; 2. measure the semantic distances between those words; 3. cluster.
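Here is a rough sketch of what I mean, assuming spaCy (its medium English model supplies POS tags and word vectors) and scikit-learn; the phrases are made-up examples:

```python
import numpy as np
import spacy
from sklearn.cluster import AgglomerativeClustering

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

phrases = [
    "car crashed into a tree",
    "vehicle hit a tree",
    "dog bit the mailman",
]

# Step 1: keep only the meaningful words (nouns and verbs).
def content_words(text):
    return [t for t in nlp(text) if t.pos_ in ("NOUN", "VERB")]

# Step 2: represent each phrase as the mean vector of its content words,
# so semantically close words pull phrases together.
vectors = np.array([
    np.mean([t.vector for t in content_words(p)], axis=0) for p in phrases
])

# Step 3: cluster phrases by cosine distance.
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"
).fit_predict(vectors)
print(list(zip(phrases, labels)))  # first two phrases should share a cluster
```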
I'd appreciate any advice. Thank you.
A common approach is to break your items down into a vector of "features" (words, phrases, tokens, etc.) and apply standard IR techniques such as kNN, singular value decomposition (SVD), and tf/idf weighting.
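For instance, here's a minimal sketch of the tf/idf-plus-kNN part using scikit-learn (the phrase list is invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

phrases = [
    "earthquake hits city centre",
    "strong quake shakes downtown",
    "team wins championship final",
]

vectorizer = TfidfVectorizer()           # tokenize + tf/idf weighting
X = vectorizer.fit_transform(phrases)    # sparse document-term matrix

# kNN over cosine distance: near neighbours are candidate duplicates.
knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
distances, indices = knn.kneighbors(X)
print(indices)  # each row: the phrase itself plus its closest neighbour
```

Plain tf/idf only matches phrases that share literal tokens, which is exactly the limitation the next point addresses.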
One interesting thing about your problem is that people use different words and phrasing for the same thing, which means you need a way to equate different tokens with each other. SVD is pretty good at this because it finds degrees of similarity based on context: 'Pikachu' becomes strongly associated with 'Pokemon' and 'Squirtle', and also with misspellings like 'Picachoo'. SVD is tricky and expensive if you implement it wrong. The good news is that latent semantic indexing, the technique built on it, recently came off-patent, so a lot of new libraries are popping up.
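As a rough illustration of the SVD idea (latent semantic analysis), here's a toy sketch using scikit-learn's TruncatedSVD; the corpus and the rank are made-up, but it shows how terms that occur in similar contexts end up with similar low-rank vectors:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "pikachu is a pokemon",
    "squirtle is a pokemon",
    "picachoo pokemon battle",
    "stocks fell on wall street",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)       # documents x terms

# Factor the transposed matrix so each TERM gets a low-rank vector;
# terms used in similar contexts land close together in that space.
svd = TruncatedSVD(n_components=2, random_state=0)
term_vectors = svd.fit_transform(X.T)

terms = vectorizer.get_feature_names_out()
sim = cosine_similarity(term_vectors)
i = list(terms).index("pikachu")
for j in sim[i].argsort()[::-1][:4]:     # nearest terms to 'pikachu'
    print(terms[j], round(sim[i][j], 2))
```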
When I worked on a keyword search engine prototype for Brazil, I found that a mix of Tanimoto and Ferber similarity scores worked very well for finding similarity between short phrases, and I didn't even need to be able to read Portuguese.
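The Tanimoto part is easy to sketch: on token sets it is just the Jaccard coefficient (I'll leave the Ferber score out since I don't have a reference handy). The phrases below are invented:

```python
def tanimoto(a: str, b: str) -> float:
    """|A & B| / |A | B| over the sets of lowercased tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

print(tanimoto("carro bateu na arvore", "o carro bateu numa arvore"))  # -> 0.5
```

Being purely set-based, it needs no language knowledge at all, which is why it held up fine on Portuguese text.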
PressFlip is using support vector machines for a similar task, clustering news items. Vowpal Wabbit is a stellar library that came out of Yahoo; I think they recently open-sourced it. allvoices.com is tackling the same area as well. (Disclaimer: I am a dev at allvoices, though I don't work on that part :)