Hmm. It's a lovely idea but I find the results uninspiring. (Which isn't surprising; it's a very difficult problem. But I was hoping to be amazed.) Here are the examples I tried (all of them, no cherrypicking):
daughter + male - female -> { The Eldest (book), Songwriter, Granddaughter }
(Hopeless; should have had "son" in there)
pc - microsoft + apple -> { Olynssis The Silver Color (Japanese book), Burger Time (arcade game), Phantasy Star (series of games) }
(Hopeless; should have had "Mac" in there)
violin - string + woodwind -> { clarinet, oboe, flute }
(OK)
mccartney - beatles + stones -> { Rolling Stone (magazine), carvedilol (pharmaceutical), stone (geological term) }
(Poor; should have had Jagger or Richards or something of the kind in the top few results)
sofa - large + small -> { relaxing, asleep, cupboard }
(Poor; I'd have hoped for "armchair" or something of the kind)
Some of this is that the underlying model is insufficiently trained, but some of it is disambiguation. Disambiguation in text is a very, very hard problem.
I agree disambiguation is a tough problem, but I'm curious to see whether the vector representation learned by word2vec can help in that regard. Have you considered using clustering to attach a few possible cluster memberships to each word (which would hopefully each capture information about a different usage context), then selecting the meanings for each word that give the maximum overlap?
The idea being that the words used to search on your website are likely to be conceptually similar (one doesn't add apples to oranges), so you can assume that "queen (sovereign) + man (noun) - woman" makes more sense than "queen (band) + man (unix command) - woman", because "man (noun)" and "woman" are likely to belong to similar clusters, while "man (unix command)" is not.
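Something like this, as a very rough sketch, using scikit-learn's KMeans; "vectors" and "vocab" here are stand-ins for however the word2vec vectors and word-to-row mapping are stored, nothing from the actual site:

    # Rough sketch of the cluster-overlap idea with scikit-learn's KMeans.
    # Assumes `vectors` is an (n_words x dim) numpy array of word2vec vectors
    # and `vocab` maps each word to its row index -- both hypothetical names.
    import numpy as np
    from collections import Counter
    from sklearn.cluster import KMeans

    kmeans = KMeans(n_clusters=500, n_init=1).fit(vectors)

    def candidate_clusters(word, k=3):
        """Return the k nearest cluster ids for a word (its rough 'usage contexts')."""
        dists = np.linalg.norm(kmeans.cluster_centers_ - vectors[vocab[word]], axis=1)
        return set(np.argsort(dists)[:k])

    def shared_context(words):
        """Pick the cluster id that the most query words agree on."""
        votes = Counter(c for w in words for c in candidate_clusters(w))
        return votes.most_common(1)[0][0]

    # e.g. shared_context(["queen", "man", "woman"]) should land in a
    # royalty-ish cluster rather than a rock-band or unix-man-page one.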
Japan - Tokyo + History = Kyoto was what I was expecting, but I guess the corpus isn't quite there yet. This was the answer: http://en.wikipedia.org/wiki/Atpara_Upazila At least it's geographic!
maybe just let the user choose the meaning from a drop down list?
You can (probably) integrate Freebase's Suggest widget easily and use the FB API to get to the WP id.
writer + American + Russian + Great - Nobel Prize : expected Nabokov, got Meirkhaim Gavrielov + 1 nonsense result;
plant + illegal - addictive : expected cannabis, chronic, etc.; got "Plants" (thanks), "Nuclear Weapon" (?!?), and some Hungarian village.
EDIT: I thought maybe I wasn't being sufficiently imaginative, so I tried "Nixon + Clinton - JFK" and got nothing that looked interesting. Then I noticed that the "Nixon" part of my query was "disambiguated" to something like "non_film", and the word "Nixon" was just stripped out. I think this thing is just broken.
It's not perfect; the real limiting factor is the volume of text. For the research paper behind word2vec, getting accurate associations for common words (King, man, woman, etc.) required a news-text training corpus on the order of a billion words. This is roughly the size of all of Wikipedia, but Wikipedia's text has many more rare words, which spreads the number of training examples per word rather thin. To address that, I'm thinking of expanding the project to use Common Crawl data (http://commoncrawl.org/) to dramatically increase the available supply of text.
Hey juxtaposicion, fascinating work. I have many questions so I am just going to shoot them rapid fire.
What is the dimensionality of each word vector and what does a word's position in this space "mean"? What is this dimensionality determined by? Have you tried any dimensionality reduction algorithms like PCA or Isomap? It would be interesting to find the word vectors that contain the most variation across all of Wikipedia. Have you tried any nearest neighbor search methods other than a simple dot product, such as locality-sensitive hashing?
I guess most of those questions are about the word2vec algorithm, but you are probably in a good place to answer them after working with it. Anyways, cool work, and I am glad you did it in python so I can really dig in and understand it.
I'm not associated with the OP, but I have played with word2vec.
Word2vec can be viewed as dimensionality reduction. In some sense, the dimensionality of words is the same as the vocabulary size, if a one-hot encoding is used (which is most common). For word2vec specifically, the output vector dimension is a parameter specified during training, so you can choose it however you want. I suspect the OP chose something fairly large (>256) because of the performance boost.
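For example, in gensim's implementation (just one implementation; I have no idea what the OP used), the dimension is literally a constructor argument:

    # Minimal gensim sketch -- the vector dimensionality is a training
    # parameter you choose, not something dictated by the data.
    # (Older gensim releases call this parameter `size` instead of `vector_size`.)
    from gensim.models import Word2Vec

    sentences = [["king", "rules", "the", "kingdom"],
                 ["queen", "rules", "the", "kingdom"]]  # toy corpus

    model = Word2Vec(sentences, vector_size=300, window=5, min_count=1)
    print(model.wv["king"].shape)  # (300,)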
The paper associated with word2vec is fairly easy to read and understand, even if you don't have a background in NLP or neural networks: http://arxiv.org/pdf/1301.3781.pdf
>> What is the dimensionality of each word vector and what does a word's position in this space "mean"? What is this dimensionality determined by?
Each dimension is roughly a new way that words can be similar or dissimilar. I've got 1000-dimensional vectors, so words can be similar or dissimilar in only a thousand 'ways'. Associations like 'luxury', 'thoughtful', 'person', 'place' or 'object' are learned (roughly speaking). Of course, real words are far more diverse, so this is an approximation. The 1000 dimensions is configurable, and in theory more dimensions means more contrast is captured, but you need more training data. In practice, I chose 1000 because that maxes out the memory of my large-memory machine. That said, the word2vec paper shows good results with 1000D, so it doesn't seem to be a bad choice.
>> Have you tried any dimensionality reduction algorithms like PCA or Isomap?
Yes! I've tried PCA and some spectral biclustering using the off-the-shelf algorithms in scikit-learn. I only played around with this for an hour or so and got discouraging results. Nevertheless, the word2vec papers show that this works really well for projecting France, USA, Paris, DC, London, etc. onto a two-dimensional plane where the axes roughly correspond to countries & capitals -- exactly what you'd hope for! I wasn't able to replicate that, but Tomas Mikolov was!
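If anyone wants to try replicating that picture, the scikit-learn version is only a few lines. This is just a sketch, assuming a `vec(word)` lookup that returns a word's vector; whether it comes out as clean as Mikolov's figure is another question:

    # Project a handful of country/capital vectors onto 2 principal components.
    # `vec` is a stand-in for however you look up a word's vector.
    import numpy as np
    from sklearn.decomposition import PCA

    words = ["France", "Paris", "USA", "DC", "UK", "London"]
    X = np.vstack([vec(w) for w in words])

    xy = PCA(n_components=2).fit_transform(X)
    for w, (x, y) in zip(words, xy):
        print(f"{w}: ({x:.2f}, {y:.2f})")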
>> It would be interesting to find the word vectors that contain the most variation across all of wikipedia.
Hmm, interesting indeed! I'm not sure how I'd go about measuring 'variation' -- would this amount to isolating word clusters and finding the most dense ones? Something like finding a cluster with a hundred variations of the word 'snow' (if you're Inuit)? I'd be willing to part with the raw vector database if there's interest.
>> Have you tried any nearest neighbor search methods other than a simple dot product, such as locality-sensitive hashing?
Only a little bit, although I'm very interested in finding a faster approach than computing the whole damn dot product (see: https://news.ycombinator.com/item?id=6720359). I worry that traditional locality-sensitive hashes, kd-trees, and the like work well for 3D locations, but miserably for 1000D data like I have here.
I should reiterate that most of the hard work revolves around the word2vec algorithm, which I used but didn't write. It's awesome; check it and the papers out here: https://code.google.com/p/word2vec/
Build your kd-tree on the vectors expressed in the eigenvector basis. If the eigenvalues decrease fast enough, you can get bounds on the dot product while going only a few levels deep.
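Crudely, something like this; the sketch just searches in the leading components and re-ranks exactly afterwards rather than deriving the proper bounds, and `vectors` is a stand-in for the full (n_words x 1000) matrix:

    # Rotate vectors into the PCA/eigenvector basis and build a kd-tree on the
    # leading components only. The tail dimensions carry little variance, so
    # neighbors in the reduced space are good candidates for exact re-ranking.
    # (If the rows are unit-normalized, Euclidean and cosine ranking agree.)
    import numpy as np
    from sklearn.decomposition import PCA
    from scipy.spatial import cKDTree

    d = 50                                   # leading components to keep
    pca = PCA(n_components=d).fit(vectors)
    reduced = pca.transform(vectors)
    tree = cKDTree(reduced)

    def candidates(query_vec, k=100):
        """Return indices of likely nearest neighbors for exact re-ranking."""
        q = pca.transform(query_vec.reshape(1, -1))
        _, idx = tree.query(q, k=k)
        return idx[0]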
Don't most search engines use an inverted index to find the similarity between the query vector and the document vectors? (instead of doing the dot product with every document)
Hey SandB0x, thanks for the advice! I actually started off by using numpy.dot, which is precisely what's needed. The problem is that I need it to go even faster (currently it takes a few seconds), but that function is already heavily optimized and uses Intel MKL to accelerate the math. In fact, my Cython implementation would be slower than numpy.dot were it not for some embedded logic that breaks out of the dot product halfway through the calculation. As I'm computing row * row, going element by element, if the running sum of those products gets really negative (indicating that in the dimensions multiplied so far, the two rows are highly dissimilar) I stop calculating the rest of the row, thereby saving the cost of the full dot product. This is cheating, obviously, but it's 3-4x faster than numpy.dot. So, because there's branching logic in my implementation of the dot product, I can't express it in terms of Einstein summations.
Interesting. I've played around with words as vectors (with values) and the cosine similarity algorithm (http://en.wikipedia.org/wiki/Cosine_similarity). This is very cool stuff. I wonder how they're doing it in real time; it's heavy number crunching.
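The similarity measure itself is tiny in numpy; the crunch is doing it against a million rows at once:

    import numpy as np

    def cosine_similarity(a, b):
        # cos(theta) between two vectors: dot product over the product of norms
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))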
Thanks! The number crunching is indeed very intense. The full body of vectors has a million rows and a thousand columns, which fills up all of the 10GB of available memory. When you punch in a query, it adds or subtracts the requested vectors and takes an approximate dot product between every row in the table and the search query. This is about 10^9 operations. I'm only interested in the most similar (high cosine similarity) dot products, so to speed things up I wrote a Cython dot product implementation. It aborts the calculation if the sum starts to look like it'll be very dissimilar, essentially skipping lots of bad guesses. This speeds things up by a factor of ~5 or so. I'm debating offloading this computation to the GPU, which would be perfect for this.
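Roughly, the trick looks like this. This is a pure-Python sketch of the idea, not my actual Cython code; the bail-out threshold and check interval are made-up stand-ins:

    # Pure-Python sketch of the early-abort dot product.
    def truncated_dot(row, query, bail_threshold=-0.2, check_every=100):
        total = 0.0
        for i, (r, q) in enumerate(zip(row, query)):
            total += r * q
            # If the running sum is already very negative partway through,
            # this row is very unlikely to end up among the top matches,
            # so give up early (this is the 'cheating' part).
            if i % check_every == 0 and total < bail_threshold:
                return None
        return total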
Interesting concept, but how will it work with more dynamic content? You can train the model on a fairly static corpus such as Wikipedia, but what if your content changes with greater frequency?
Since MapReduce is used, perhaps the model is already being trained on small batches, making incremental updates possible.
Hi, creator here (Chris Moody).
Great question. The underlying algorithm, word2vec (https://code.google.com/p/word2vec/), isn't built for streaming data, which means that at the moment it assumes a fixed number of words from the beginning of the calculation. Unfortunately, until the state of the art advances to accepting streaming data, the whole corpus will have to be rescanned to incorporate dynamic content. Furthermore, word2vec doesn't scale past OpenMP, single-node, shared-memory resources. So while I used MapReduce, it's just for cleaning and preprocessing the text, not for training the vectors, which is the hard part.
So there's some exciting work to be done in parallelizing and streaming the word2vec algorithm!
Kind of. LSI is a dimensionality-reduction method, which means it takes really sparse "bag of words" vectors and compresses them in a way that approximates the original structure. Word2vec gives each word a dense vector based on the hidden units of a neural network language model. The model is trained to predict the middle word from its surroundings, which makes it learn how words are used in context.
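To make the contrast concrete, classic LSI is basically a truncated SVD of the sparse term-document counts, e.g. with scikit-learn (a toy example, nothing to do with the site itself):

    # LSI in miniature: sparse bag-of-words counts compressed with truncated SVD.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the king rules the kingdom",
            "the queen rules the kingdom",
            "the cat sat on the mat"]

    counts = CountVectorizer().fit_transform(docs)      # sparse bag-of-words
    lsi = TruncatedSVD(n_components=2).fit_transform(counts)
    print(lsi.shape)  # (3 documents, 2 latent dimensions)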
Hey, nice work! Can you explain the "comma delimited list" functionality any more? It seems (awesomely) similar to a hack I did a while back with word2vec that would pick out the word that didn't belong in a list.
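(For reference, gensim ships something very close to that odd-one-out trick as a built-in; the vectors file path below is just a placeholder:)

    # gensim's odd-one-out helper: the word whose vector is furthest from the
    # mean of the others is flagged as not belonging.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    print(wv.doesnt_match(["breakfast", "cereal", "dinner", "lunch"]))  # -> "cereal"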
I would guess that the "handheld" keyword is big with the Wii because of the new control scheme it introduced, which was the main (sometimes only) thing talked about with regard to the system.
Interesting.
Currently it generates garbage for a lot of queries, but some stuff is kinda fun.
Forrest Gump - comedy + romance gives pulp fiction (!), as good as it gets (match) and polar express (?)
Avatar - action + comedy gives The Office (haha!)
I know people like to keep things positive, but this is completely useless. Apart from a few cherry-picked examples, subtracting words makes no sense most of the time, and there is no clear advantage for their method when it comes to adding words.