I tried it out on one of the Udacity Deep Learning assignments using the Wasserstein loss functions built into TensorFlow. I was unsuccessful in my limited use: the discriminator always ‘won out’ rather than the pair finding a saddle point. I eventually got my project to work without it, and didn't go back to compare against just swapping the EM loss back in.
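For reference, the plain WGAN objectives the built-in losses implement boil down to something like the sketch below. This is a minimal TensorFlow 2 version under my own assumptions, not necessarily the exact built-in functions referred to above:

```python
import tensorflow as tf

def critic_loss(real_scores, fake_scores):
    # The critic tries to maximise E[f(real)] - E[f(fake)],
    # i.e. minimise the negated difference of mean scores.
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores)

def generator_loss(fake_scores):
    # The generator tries to push the critic's scores on fakes up.
    return -tf.reduce_mean(fake_scores)

# Toy scores just to show the shapes; in training these come from the critic.
real = tf.constant([1.2, 0.8, 1.5])
fake = tf.constant([-0.3, 0.1, -0.7])
print(critic_loss(real, fake).numpy(), generator_loss(fake).numpy())
```

Training alternates critic and generator steps; if the critic is too strong (e.g. no weight clipping or gradient penalty, or too many critic steps per generator step), the pair can fail to settle near a saddle point, which matches the failure mode described above.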
It needs the Hungarian algorithm to solve it, and that's not the easiest algorithm to implement. In fact, it's by far the hardest algorithm I've ever implemented (I can't remember exactly why). I wrote it in Common Lisp and worked on the performance quite a bit. It's still an O(n^3) algorithm, though.
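For anyone who wants the result without hand-rolling the Hungarian algorithm: when both point sets are the same size and uniformly weighted, the transport problem reduces to an assignment problem, which SciPy solves with a Jonker-Volgenant-style O(n^3) routine. A minimal sketch under those assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def emd_equal_weights(x, y):
    """Earth mover's distance between two uniformly weighted point sets
    of equal size, solved as an assignment problem."""
    cost = cdist(x, y)                        # pairwise ground distances
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
    return cost[rows, cols].mean()            # average matched distance

# Toy usage: two small 2-D point clouds.
a = np.random.rand(5, 2)
b = np.random.rand(5, 2)
print(emd_equal_weights(a, b))
```

For unequal sizes or non-uniform weights you need a full transportation solver (e.g. the POT package) rather than a pure assignment.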
My explorations with WMD were for sentence similarity too. I reached a setup in Keras with a Siamese configuration and a Wasserstein + KL loss (with a known vocabulary, feeding in both the word-vector sequences and their LDA distributions as input). Post-training, the cosine distance between encodings of such sequences looks pretty decent, with one issue I've spotted though: WMD really seems to prefer about the same number of valid tokens in both sentences, which is not how the real world looks. Eager to see results of EM distance between image feature vectors, cheers.
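A quick way to probe that token-count sensitivity is gensim's WMD implementation. A rough sketch, assuming gensim with its WMD dependency (POT in recent versions) and the downloadable "glove-wiki-gigaword-50" vectors as an arbitrary embedding choice:

```python
import gensim.downloader as api

# Pretrained word vectors; any KeyedVectors model with wmdistance() works.
vectors = api.load("glove-wiki-gigaword-50")

short = "the cat sat on the mat".split()
long_ = "the cat sat on the mat in the corner of the old house".split()
other = "a dog slept on the rug".split()

# Compare WMD for a similar-length pair vs. a longer, padded paraphrase
# to see how much the length mismatch alone moves the distance.
print(vectors.wmdistance(short, other))
print(vectors.wmdistance(short, long_))
```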