Hmm. It's a lovely idea but I find the results uninspiring. (Which isn't surprising; it's a very difficult problem. But I was hoping to be amazed.) Here are the examples I tried (all of them, no cherrypicking):
daughter + male - female -> { The Eldest (book), Songwriter, Granddaughter }
(Hopeless; should have had "son" in there)
pc - microsoft + apple -> { Olynssis The Silver Color (Japanese book), Burger Time (arcade game), Phantasy Star (series of games) }
(Hopeless; should have had "Mac" in there)
violin - string + woodwind -> { clarinet, oboe, flute }
(OK)
mccartney - beatles + stones -> { Rolling Stone (magazine), carvedilol (pharmaceutical), stone (geological term) }
(Poor; should have had Jagger or Richards or something of the kind in the top few results)
sofa - large + small -> { relaxing, asleep, cupboard }
(Poor; I'd have hoped for "armchair" or something of the kind)
Some of this is that the underlying model is insufficiently trained, but some of it is disambiguation. Disambiguation in text is a very, very hard problem.
I agree disambiguation is a tough problem, but I'm curious to see whether the vector representation learned by word2vec can help in that regard. Have you considered using clustering to attach a few possible cluster memberships to each word (which would hopefully each capture information about a different usage context), then selecting the meanings for each word that give the maximum overlap?
The idea being that the words used to search on your website are likely to be conceptually similar (one doesn't add apples to oranges), so you can assume that "queen (sovereign) + man (noun) - woman" makes more sense than "queen (band) + man (unix command) - woman", because "man (noun)" and "woman" are likely to belong to similar clusters, while "man (unix command)" is not.
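Something like this, as a very rough sketch, using scikit-learn's KMeans; "vectors" and "vocab" here are stand-ins for however the word2vec vectors and word-to-row mapping are stored, nothing from the actual site:

    # Rough sketch of the cluster-overlap idea with scikit-learn's KMeans.
    # Assumes `vectors` is an (n_words x dim) numpy array of word2vec vectors
    # and `vocab` maps each word to its row index -- both hypothetical names.
    import numpy as np
    from collections import Counter
    from sklearn.cluster import KMeans

    kmeans = KMeans(n_clusters=500, n_init=1).fit(vectors)

    def candidate_clusters(word, k=3):
        """Return the k nearest cluster ids for a word (its rough 'usage contexts')."""
        dists = np.linalg.norm(kmeans.cluster_centers_ - vectors[vocab[word]], axis=1)
        return set(np.argsort(dists)[:k])

    def shared_context(words):
        """Pick the cluster id that the most query words agree on."""
        votes = Counter(c for w in words for c in candidate_clusters(w))
        return votes.most_common(1)[0][0]

    # e.g. shared_context(["queen", "man", "woman"]) should land in a
    # royalty-ish cluster rather than a rock-band or unix-man-page one.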
Japan - Tokyo + History = Kyoto was what I was expecting, but I guess the corpus isn't quite there yet. This was the answer: http://en.wikipedia.org/wiki/Atpara_Upazila At least it's geographic!
maybe just let the user choose the meaning from a drop down list?
You can (probably) integrate Freebase's Suggest widget easily and use the FB API to get to the WP id.
writer + American + Russian + Great - Nobel Prize : expected Nabokov, got Meirkhaim Gavrielov + 1 nonsense result;
plant + illegal - addictive : expected cannabis, chronic, etc.; got "Plants" (thanks), "Nuclear Weapon" (?!?), and some Hungarian village.
EDIT: I thought maybe I wasn't being sufficiently imaginative, so I tried "Nixon + Clinton - JFK" and got nothing that looked interesting. Then I noticed that the "Nixon" part of my query was "disambiguated" to something like "non_film", and the word "Nixon" was just stripped out. I think this thing is just broken.
It's not perfect; the real limiting factor is the volume of text. For the research paper behind word2vec, getting accurate associations for common words (King, man, woman, etc.) required a news-text training corpus on the order of a billion words. This is roughly the size of all of Wikipedia, but Wikipedia's text has many more rare words, which spreads the number of training examples per word rather thin. To address that, I'm thinking of expanding the project to use Common Crawl data (http://commoncrawl.org/) to dramatically increase the available supply of text.
Hey juxtaposicion, fascinating work. I have many questions so I am just going to shoot them rapid fire.
What is the dimensionality of each word vector and what does a word's position in this space "mean"? What is this dimensionality determined by? Have you tried any dimensionality reduction algorithms like PCA or Isomap? It would be interesting to find the word vectors that contain the most variation across all of Wikipedia. Have you tried any nearest neighbor search methods other than a simple dot product, such as locality-sensitive hashing?
I guess most of those questions are about the word2vec algorithm, but you are probably in a good place to answer them after working with it. Anyways, cool work, and I am glad you did it in python so I can really dig in and understand it.
I'm not associated with the OP, but I have played with word2vec.
Word2vec can be viewed as dimensionality reduction. In some sense, the dimensionality of words is the same as the vocabulary size, if a one-hot encoding is used (which is most common). For word2vec specifically, the output vector dimension is a parameter specified during training, so you can choose it however you want. I suspect the OP chose something fairly large (>256) because of the performance boost.
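For example, in gensim's implementation (just one implementation; I have no idea what the OP used), the dimension is literally a constructor argument:

    # Minimal gensim sketch -- the vector dimensionality is a training
    # parameter you choose, not something dictated by the data.
    # (Older gensim releases call this parameter `size` instead of `vector_size`.)
    from gensim.models import Word2Vec

    sentences = [["king", "rules", "the", "kingdom"],
                 ["queen", "rules", "the", "kingdom"]]  # toy corpus

    model = Word2Vec(sentences, vector_size=300, window=5, min_count=1)
    print(model.wv["king"].shape)  # (300,)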
The paper associated with word2vec is fairly easy to read and understand, even if you don't have a background in NLP or neural networks: http://arxiv.org/pdf/1301.3781.pdf
>> What is the dimensionality of each word vector and what does a word's position in this space "mean"? What is this dimensionality determined by?
Each dimension is roughly a new way that words can be similar or dissimilar. I've got 1000-dimensional vectors, so words can be similar or dissimilar in only a thousand 'ways'. Associations like 'luxury', 'thoughtful', 'person', 'place' or 'object' are learned (roughly speaking). Of course, real words are far more diverse, so this is an approximation. The 1000 dimensions is configurable, and in theory more dimensions means more contrast is captured, but you need more training data. In practice, I chose 1000 because that maxes out the memory of my large-memory machine. That said, the word2vec paper shows good results with 1000D, so it doesn't seem to be a bad choice.
>> Have you tried any dimensionality reduction algorithms like PCA or Isomap?
Yes! I've tried PCA and some spectral biclustering using the off-the-shelf algorithms in scikit-learn. I only played around with this for an hour or so and got discouraging results. Nevertheless, the word2vec papers show that this works really well for projecting France, USA, Paris, DC, London, etc. onto a two-dimensional plane where the axes roughly correspond to countries & capitals -- exactly what you'd hope for! I wasn't able to replicate that, but Tomas Mikolov was!
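If anyone wants to try replicating that picture, the scikit-learn version is only a few lines. This is just a sketch, assuming a `vec(word)` lookup that returns a word's vector; whether it comes out as clean as Mikolov's figure is another question:

    # Project a handful of country/capital vectors onto 2 principal components.
    # `vec` is a stand-in for however you look up a word's vector.
    import numpy as np
    from sklearn.decomposition import PCA

    words = ["France", "Paris", "USA", "DC", "UK", "London"]
    X = np.vstack([vec(w) for w in words])

    xy = PCA(n_components=2).fit_transform(X)
    for w, (x, y) in zip(words, xy):
        print(f"{w}: ({x:.2f}, {y:.2f})")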
>> It would be interesting to find the word vectors that contain the most variation across all of wikipedia.
Hmm, interesting indeed! I'm not sure how I'd go about measuring 'variation' -- would this amount to isolating word clusters and finding the most dense ones? Something like finding a cluster with a hundred variations of the word 'snow' (if you're Inuit)? I'd be willing to part with the raw vector database if there's interest.
>> Have you tried any nearest neighbor search methods other than a simple dot product, such as locality-sensitive hashing?
Only a little bit, although I'm very interested in finding a faster approach than computing the whole damn dot product (see: https://news.ycombinator.com/item?id=6720359). I worry that traditional locality-sensitive hashes, kd-trees, and the like work well for 3D locations, but miserably for 1000D data like I have here.
I should reiterate that most of the hard work revolves around the word2vec algorithm, which I used but didn't write. It's awesome; check it and the papers out here: https://code.google.com/p/word2vec/
Build your kd-tree on the vectors expressed in the eigenvector basis. If the eigenvalues decrease fast enough, you can get bounds on the dot product while going only a few levels deep.
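Crudely, something like this; the sketch just searches in the leading components and re-ranks exactly afterwards rather than deriving the proper bounds, and `vectors` is a stand-in for the full (n_words x 1000) matrix:

    # Rotate vectors into the PCA/eigenvector basis and build a kd-tree on the
    # leading components only. The tail dimensions carry little variance, so
    # neighbors in the reduced space are good candidates for exact re-ranking.
    # (If the rows are unit-normalized, Euclidean and cosine ranking agree.)
    import numpy as np
    from sklearn.decomposition import PCA
    from scipy.spatial import cKDTree

    d = 50                                   # leading components to keep
    pca = PCA(n_components=d).fit(vectors)
    reduced = pca.transform(vectors)
    tree = cKDTree(reduced)

    def candidates(query_vec, k=100):
        """Return indices of likely nearest neighbors for exact re-ranking."""
        q = pca.transform(query_vec.reshape(1, -1))
        _, idx = tree.query(q, k=k)
        return idx[0]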
Don't most search engines use an inverted index to find the similarity between the query vector and the document vectors? (instead of doing the dot product with every document)
Hey SandB0x, thanks for the advice! I actually started off by using numpy.dot, which is precisely what's needed. The problem is that I need it to go even faster (currently it takes a few seconds), but that function is already heavily optimized and uses Intel MKL to accelerate the math. In fact, my Cython implementation would be slower than numpy.dot were it not for some embedded logic that breaks out of the dot product halfway through the calculation. As I'm computing row * row, going element by element, if the running sum of those products gets really negative (indicating that in the dimensions multiplied so far, the two rows are highly dissimilar) I stop calculating the rest of the row, thereby saving the cost of the full dot product. This is cheating, obviously, but it's 3-4x faster than numpy.dot. So, because there's branching logic in my implementation of the dot product, I can't express it in terms of Einstein summations.
Interesting. I've played around with words as vectors (with values) and the cosine similarity algorithm (http://en.wikipedia.org/wiki/Cosine_similarity). This is very cool stuff. I wonder how they're doing it in real time; it's heavy number crunching.
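The similarity measure itself is tiny in numpy; the crunch is doing it against a million rows at once:

    import numpy as np

    def cosine_similarity(a, b):
        # cos(theta) between two vectors: dot product over the product of norms
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))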
Thanks! The number crunching is indeed very intense. The full body of vectors has a million rows and a thousand columns, which fills up all of the 10GB of available memory. When you punch in a query, it adds or subtracts the requested vectors and takes an approximate dot product between every row in the table and the search query. This is about 10^9 operations. I'm only interested in the most similar (high cosine similarity) dot products, so to speed things up I wrote a Cython dot product implementation. It aborts the calculation if the sum starts to look like it'll be very dissimilar, essentially skipping lots of bad guesses. This speeds things up by a factor of ~5 or so. I'm debating offloading this computation to the GPU, which would be perfect for this.
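Roughly, the trick looks like this. This is a pure-Python sketch of the idea, not my actual Cython code; the bail-out threshold and check interval are made-up stand-ins:

    # Pure-Python sketch of the early-abort dot product.
    def truncated_dot(row, query, bail_threshold=-0.2, check_every=100):
        total = 0.0
        for i, (r, q) in enumerate(zip(row, query)):
            total += r * q
            # If the running sum is already very negative partway through,
            # this row is very unlikely to end up among the top matches,
            # so give up early (this is the 'cheating' part).
            if i % check_every == 0 and total < bail_threshold:
                return None
        return total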
Interesting concept, but how will it work with more dynamic content? You can train the model on a fairly static corpus such as Wikipedia, but what if your content changes with greater frequency?
Since MapReduce is used, perhaps the model is already being trained on small batches, making incremental updates possible.
Hi, creator here (Chris Moody).
Great question. The underlying algorithm, word2vec (https://code.google.com/p/word2vec/), isn't built for streaming data, which means that at the moment it assumes a fixed number of words from the beginning of the calculation. Unfortunately, until the state of the art advances to accepting streaming data, the whole corpus will have to be rescanned to incorporate dynamic content. Furthermore, word2vec doesn't scale past OpenMP, single-node, shared-memory resources. So while I used MapReduce, it's just for cleaning and preprocessing the text, not for training the vectors, which is the hard part.
So there's some exciting work to be done in parallelizing and streaming the word2vec algorithm!
Kind of. LSI is a dimensionality-reduction method, which means it takes really sparse "bag of words" vectors and compresses them in a way that approximates the original structure. Word2vec gives each word a dense vector based on the hidden units of a neural network language model. The model is trained to predict the middle word from its surroundings, which makes it learn how words are used in context.
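To make the contrast concrete, classic LSI is basically a truncated SVD of the sparse term-document counts, e.g. with scikit-learn (a toy example, nothing to do with the site itself):

    # LSI in miniature: sparse bag-of-words counts compressed with truncated SVD.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the king rules the kingdom",
            "the queen rules the kingdom",
            "the cat sat on the mat"]

    counts = CountVectorizer().fit_transform(docs)      # sparse bag-of-words
    lsi = TruncatedSVD(n_components=2).fit_transform(counts)
    print(lsi.shape)  # (3 documents, 2 latent dimensions)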
Hey, nice work! Can you explain the "comma delimited list" functionality any more? It seems (awesomely) similar to a hack I did a while back with word2vec that would pick out the word that didn't belong in a list.
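(For reference, gensim ships something very close to that odd-one-out trick as a built-in; the vectors file path below is just a placeholder:)

    # gensim's odd-one-out helper: the word whose vector is furthest from the
    # mean of the others is flagged as not belonging.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    print(wv.doesnt_match(["breakfast", "cereal", "dinner", "lunch"]))  # -> "cereal"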
I would guess that the "handheld" keyword is big with the Wii because of the new control scheme it introduced, which was the main (sometimes only) thing talked about with regard to the system.
Interesting.
Currently it generates garbage for a lot of queries, but some stuff is kinda fun.
Forrest Gump - comedy + romance gives pulp fiction (!), as good as it gets (match) and polar express (?)
Avatar - action + comedy gives The Office (haha!)
I know people like to keep things positive, but this is completely useless. Apart from a few cherry-picked examples, subtracting words makes no sense most of the time, and there is no clear advantage for their method when it comes to adding words.