The cognoscenti already know this, but word vectors are a game changer for multi-class text classification. Finding the right representation of the text makes the classification task so much easier.
For my problem (1000+ total classes, 1 class per input), I experimented with Naive Bayes + TFIDF (~50% accuracy, < 1 sec training), then Word2vec + CNN model on GPU (~70% accuracy, 6 hrs training), and finally FastText (99% accuracy, 10 minutes training).
FastText [0] in particular is quite impressive. It is essentially a variant of Word2Vec that also supports n-grams, and there is an official C++ implementation with a built-in classifier that runs on the command line (no need to set up TensorFlow or anything like that).
Despite it only running on plain CPUs and only supporting a linear classifier, it seems to beat GPU-trained Word2Vec CNN models in both accuracy and speed in my use cases. I later discovered this paper from the authors comparing CNNs (and other algorithms) to FastText, and their results track my experiences [1].
This goes to show that while GPU-accelerated models are cool, sometimes using a simpler, more suitable model can have a significantly better pay-off.
Thanks for this, I'll have to take a look at FastText. I've been using word2vec before turning it into a matrix and running it through a CNN, based on Yoon Kim's work [0]. I haven't had much luck though on my 92-class problem. Maybe FastText will work better, although I think there are still a lot of improvements I could make to my model.
I'm working on a somewhat similar problem, binary classification on a class-imbalanced dataset, but fastText and bidirectional LSTMs appear to work pretty terribly even with oversampling. Is there a better alternative?
Class imbalance is a tricky problem, but it is unrelated to FastText.
There is no silver bullet. The best solution is to collect more data to bring the classes into balance. The second-best approach is to try algorithms like SMOTE.
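If it helps, here's a minimal sketch using the SMOTE implementation from the imbalanced-learn package (the feature matrix X and labels y are assumed to already exist, e.g. TF-IDF features or averaged word vectors):

    # Minimal SMOTE sketch with imbalanced-learn (pip install imbalanced-learn).
    # Assumes X is a numeric feature matrix and y contains the (imbalanced) labels.
    from collections import Counter
    from imblearn.over_sampling import SMOTE

    def rebalance(X, y):
        sm = SMOTE(random_state=42)
        X_res, y_res = sm.fit_resample(X, y)
        print("before:", Counter(y), "after:", Counter(y_res))
        return X_res, y_res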
For the CNN model I was using Magpie [0] which takes a sum of the vectors (so yes, equivalent to unweighted average) to represent the document.
For the FastText model, it would be whatever FastText is doing under the hood. I haven't peeked at the code to see what it is doing, but Section 2 of the paper I cited above seems to imply that an average is taken.
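For anyone wondering what the unweighted-average representation looks like in code, here's a minimal sketch (assuming `wv` is a gensim KeyedVectors model loaded from pre-trained word2vec vectors):

    # Minimal sketch: represent a document as the unweighted average of its
    # word vectors, skipping out-of-vocabulary tokens. Assumes `wv` is a
    # gensim KeyedVectors object, e.g.:
    #   wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    import numpy as np
    from gensim.models import KeyedVectors

    def doc_vector(tokens, wv):
        vecs = [wv[t] for t in tokens if t in wv]
        if not vecs:
            return np.zeros(wv.vector_size)
        return np.mean(vecs, axis=0)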
Interesting. I had heard specifically that word vectors weren't a game-changer for document classification, because the averaging method didn't work well. But the difference between word2vec and FastText must be important. I think your baseline of Naive Bayes + TF-IDF would do a lot better if it were SVM + TF-IDF, but I didn't expect the big jump between word2vec and FastText.
> I had heard specifically that word vectors weren't a game-changer for document classification, because the averaging method didn't work well.
As with anything, your mileage may vary.
One aspect of FastText that definitely helped in my case was n-gram support (both word and character, tunable via command-line arguments). In my corpus, I have short sentence fragments containing misspelled words, incorrect grammar, etc., and my test set has out-of-vocabulary words.
n-grams are more robust to these than Word2Vec, which uses a static vocabulary.
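To make the n-gram knobs concrete, here's a rough sketch using the official fasttext Python bindings; the wordNgrams/minn/maxn arguments correspond to the -wordNgrams, -minn and -maxn command-line flags, and the values below are placeholders, not tuned settings:

    # Sketch of a fastText supervised run with word and character n-grams
    # enabled (pip install fasttext). train.txt is assumed to be in the usual
    # "__label__foo some text ..." one-example-per-line format.
    import fasttext

    model = fasttext.train_supervised(
        input="train.txt",
        wordNgrams=2,    # word bigrams
        minn=3, maxn=6,  # character n-grams; helps with misspellings/OOV
        epoch=25,
        lr=0.5,
    )
    print(model.predict("shrot fragmnt with typos"))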
I think it's important to use the right "average" operation. word2vec puts words on the surface of a hypersphere, so "adding" vectors means turning around the sphere.
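Just as an illustration of the idea (not a claim that this is the "right" average), you can normalize each vector to unit length before averaging and renormalize the result, so you stay on the sphere instead of drifting toward the origin:

    # Sketch: average directions on the unit sphere instead of raw vectors.
    import numpy as np

    def unit_mean(vectors):
        units = [v / np.linalg.norm(v) for v in vectors]
        m = np.mean(units, axis=0)
        return m / np.linalg.norm(m)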
Combining word embeddings to get a single representation of a sentence is in general a pretty hard problem. Point-wise averaging or summation [0] works alright, but it's not great. Some more recent work looks at using RNNs to combine them; most of the literature I've seen is pretty application-specific, but try [1][2] for flavour.
Often the w2v embedding layer will be the first layer of a network, through which a document of word representations will be passed. The output of the embedding layer will be a 2 dimensional tensor with the embedding dimension in one direction and the number of input words in the other. The next step is often to apply zero or more convolutional units over the word direction before some kind of pooling to get a 1 dimensional output. The output of the pooling layer will be a document representation based upon the word vectors. Other approaches like doc2vec are also used.
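A minimal sketch of that pipeline using the tf.keras API (all the sizes below are placeholders):

    # Sketch of embedding -> 1-D convolution over the word axis -> pooling,
    # which yields a fixed-size document representation.
    from tensorflow.keras import layers, models

    vocab_size, embed_dim, num_classes = 20000, 300, 10  # placeholder sizes

    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),              # (batch, words, embed_dim)
        layers.Conv1D(128, kernel_size=5, activation="relu"), # convolve over the word axis
        layers.GlobalMaxPooling1D(),                          # pool down to a 1-D doc vector
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")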
Thank you for this very helpful information, and I have just one more question:
"Often the w2v embedding layer will be the first layer of a network, through which a document of word representations will be passed. The output of the embedding layer will be a 2 dimensional tensor with the embedding dimension in one direction and the number of input words in the other. The next step is often to apply zero or more convolutional units over the word direction before some kind of pooling to get a 1 dimensional output. The output of the pooling layer will be a document representation based upon the word vectors.
Other approaches like doc2vec are also used."
I thought the approach you described was doc2vec. If not, then does it have a name/citation?
Doc2vec is the name of the gensim implementation of this paper [0]. Briefly, it creates document embeddings (where a document can be anything from a phrase to a sentence to a paragraph and beyond), either by predicting a target word from the context words plus the document vector (distributed memory -- DM), or by predicting the words of the document from the document vector alone (distributed bag of words -- DBOW).
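A minimal gensim sketch of the two modes on a toy corpus (parameter values are placeholders):

    # Sketch: train gensim Doc2Vec in both DM and DBOW modes on a toy corpus.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words=doc.split(), tags=[i])
              for i, doc in enumerate(["the cat sat on the mat",
                                       "dogs bark at the mailman"])]

    dm_model   = Doc2Vec(corpus, vector_size=50, min_count=1, dm=1, epochs=40)  # distributed memory
    dbow_model = Doc2Vec(corpus, vector_size=50, min_count=1, dm=0, epochs=40)  # distributed bag of words

    vec = dbow_model.infer_vector("a cat and a dog".split())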
In the "Doing bad digital humanities with color vectors", if you consider colors as 3D vectors, which they do, you'll see that summing enough uniformly sampled vectors always gives medium browns, because that's the color in the middle of the colorspace. Instead, you should model colors in a polar space and sum vectors in that space. This will prevent going inside the sphere and losing color saturation.
The original model comes from Kim 2014 (https://arxiv.org/abs/1408.5882). It's a very neat use of CNNs for language processing, instead of the more popular RNNs/LSTMs, and CNNs have the advantage of training much faster.
Yes, I've not looked at fastText, but word2vec is a simple one-hidden-layer network for learning word embeddings, which can then be used as pre-trained word embeddings in other tasks.
CNNs and RNNs would be used in the next stage of the pipeline for whatever your task is (machine translation etc.), probably as some way of combining the word vectors. RNNs are especially nice as they can deal with variable-length sentences. Note also that it's possible for systems to learn their own embeddings as part of training, rather than using pre-trained ones from word2vec etc.
Word vectors are at the same time amazing, because they contain a huge amount of latent information, and not good enough, because they collapse a space of very high dimensionality into ~300 dimensions, so there is a limit to how well they can discriminate between close topics. I have done a lot of experiments on classifying text into thousands of topics, and sometimes they work amazingly well, while other times they are really hard to use, depending on how close together the topics I want to discriminate between are.
Another problem of word vectors is that any word might actually have multiple senses, while vectors are just point estimates. If we wanted to be correct, we would first need to find the right sense for each word in a phrase and only then assign the vector. There is research in "on-the-fly" word vectors that adapt to context, but it's much harder to use.
A third problem of word vectors is out-of-vocabulary (OOV) words and words with low frequency. For OOV words, the usual solution is to create character or character-n-gram embeddings that can be used to compute embeddings for new words. Low-frequency words are usually just ignored (a frequency cutoff is applied).
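To illustrate the OOV point, here's a toy sketch with gensim's FastText implementation, which builds vectors from character n-grams and can therefore produce a vector for a word it never saw during training (corpus and parameters are placeholders):

    # Sketch: character n-gram embeddings let FastText produce vectors for
    # out-of-vocabulary (e.g. misspelled) words. Toy corpus, toy parameters.
    from gensim.models import FastText

    sentences = [["the", "government", "announced", "new", "regulations"],
                 ["the", "regulator", "announced", "an", "investigation"]]
    model = FastText(sentences, vector_size=32, min_count=1, min_n=3, max_n=6)

    vec = model.wv["goverment"]   # misspelled, not in the training vocabulary
    print(vec.shape)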
Then there is the problem of phrases and collocations - some words go together, such as "New York" and "give up". The meaning of the phrase is different from the sum of the meanings of the component words. In these cases we need to have lists of phrases and replace them in the original text before training the vectors, so we have proper vectors for phrases.
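gensim's Phrases/Phraser is one common way to do that phrase detection automatically before training vectors; a toy sketch (thresholds are placeholders):

    # Sketch: detect frequent collocations and rewrite the corpus so that
    # "new york" becomes the single token "new_york" before training vectors.
    from gensim.models.phrases import Phrases, Phraser

    sentences = [["he", "moved", "to", "new", "york", "last", "year"],
                 ["new", "york", "is", "expensive"],
                 ["do", "not", "give", "up"]]

    phrases = Phrases(sentences, min_count=1, threshold=1)  # toy thresholds
    bigram = Phraser(phrases)
    transformed = [bigram[s] for s in sentences]  # e.g. [..., "new_york", ...]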
By the way, one amazing tool that goes with word vectors is the Annoy library, which can do approximate similarity search in logarithmic time. So you can do roughly 1000 lookups per second per CPU core even if the database contains millions of vectors, which is pretty good. Annoy can be used to find similar articles or for music recommendations. Another remark: my preferred word vectors are computed with Doc2VecC (a variant of doc2vec with corruption). Doc2VecC seems better at discriminating between topics, but the secret is to feed it gigabytes of text.
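Roughly, using Annoy looks like this (the dimension, tree count and toy vectors below are placeholders):

    # Sketch: build an Annoy index over word vectors for fast approximate
    # nearest-neighbour queries.
    import numpy as np
    from annoy import AnnoyIndex

    dim = 300
    # toy random vectors stand in for real word embeddings here
    vectors = {w: np.random.rand(dim) for w in ["music", "song", "guitar", "piano", "car"]}

    index = AnnoyIndex(dim, "angular")   # angular ~ cosine distance
    words = list(vectors)
    for i, w in enumerate(words):
        index.add_item(i, vectors[w])
    index.build(50)                      # more trees: better recall, slower build

    neighbours = index.get_nns_by_item(words.index("music"), 3)
    print([words[i] for i in neighbours])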
Playing with word vectors has taught me intuitively how it is to navigate a space of high dimensionality. It feels different than 3d-space because each point has a shortcut to other points, each point leads to hundreds of other places which might be far apart. It's like a kaleidoscope where a small change can create a very different perspective.
>Another problem of word vectors is that any word might actually have multiple senses, while vectors are just point estimates. If we wanted to be correct, we would first need to find the right sense for each word in a phrase and only then assign the vector. There is research in "on-the-fly" word vectors that adapt to context, but it's much harder to use.
There has been work on representing words not as vectors but as multimodal Gaussian distributions in order to try to deal with polysemy, such as [0], of which an implementation (which I have not tried to use) is available on GitHub at [1].
>It feels different than 3d-space because each point has a shortcut to other points, each point leads to hundreds of other places which might be far apart. It's like a kaleidoscope where a small change can create a very different perspective.
I appreciate that the author of the parent comment has found some sort of intuition, but I would caution others against using the above quote to develop their own intuition, as it is meaningless in any rigorous sense.
> Another problem of word vectors is that any word might actually have multiple senses, while vectors are just point estimates. If we wanted to be correct, we would first need to find the right sense for each word in a phrase and only then assign the vector. There is research in "on-the-fly" word vectors that adapt to context, but it's much harder to use.
This is interesting, and there seems to be a bit of debate about it (at least with compositional distributional semantic models). [0][1] seem to show that sense disambiguation helps in some contexts, while [2] shows that it doesn't in others. It doesn't seem immediately clear who is right here. I agree with you, though, that it seems pretty likely that disambiguating would be helpful.
It seems obvious to me that disambiguating _correctly_ would necessarily improve performance. As an extreme example, homonyms like bark (the verb) and bark (on a tree) have nothing to do with each other, and ideally would be considered two different words.
This seems like it should obviously make sense, but in practice it doesn’t necessarily. Usually the sense is defined by the context, and so the vector can carry the meaning of multiple contexts at once.
There are some specific tasks where this breaks down for some words. I can’t remember one I’ve seen right now, but thinking about the “bark” example, it can cause a problem when the “woodiness” and the “noisiness” of a word lead to completely different results. In practice that’s pretty rare.
Generally though, the non-relevant meanings don’t have any negative effects as they can be ignored.
> we would first need to find the right sense for each word in a phrase and only then assign the vector
Interesting! I wonder if you could, e.g., arbitrarily split a word into some number of symbols -- say two -- and each time you're going to apply a training update, only apply the update to one of the symbols -- perhaps initially choosing the symbol facing the greatest loss (forcing the symbols apart in vector space), and then eventually switching over to picking the symbol with the smallest loss (letting each settle onto its own precise meaning)?
We just open-sourced an easy-to-use library called Magnitude that handles out-of-vocabulary words and uses Annoy indexing for fast most_similar queries for word2vec, GloVe, and fastText:
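A rough usage sketch (the .magnitude file path below is just a placeholder):

    # Rough sketch of Magnitude usage; the converted .magnitude file is assumed to exist.
    from pymagnitude import Magnitude

    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
    print("uncharacteristically" in vectors)      # membership test
    vec = vectors.query("uncharacteristically")   # returns a vector even for OOV words
    print(vectors.most_similar("cat", topn=5))    # Annoy-backed similarity search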
My mind was blown when I found out how easy it is to get started with pre-trained GloVe embeddings in Keras. Took my Kaggle game up a few notches overnight.
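For anyone curious, a minimal sketch of the pattern (the embeddings_index and word_index below are toy stand-ins for the parsed GloVe file and a fitted Tokenizer's word_index):

    # Sketch: build an embedding matrix from pre-trained GloVe vectors and
    # plug it into a frozen Keras Embedding layer.
    import numpy as np
    from tensorflow.keras.layers import Embedding
    from tensorflow.keras.initializers import Constant

    embed_dim = 100
    embeddings_index = {"cat": np.random.rand(embed_dim)}  # stand-in for parsed glove.6B.100d.txt
    word_index = {"cat": 1, "dog": 2, "aardvark": 3}        # stand-in for Tokenizer.word_index

    embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
    for word, i in word_index.items():
        vec = embeddings_index.get(word)
        if vec is not None:
            embedding_matrix[i] = vec   # words without a GloVe vector stay all-zero

    embedding_layer = Embedding(len(word_index) + 1, embed_dim,
                                embeddings_initializer=Constant(embedding_matrix),
                                trainable=False)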
This is a highly localized problem, but I've wanted to read the past couple articles by Allison and cannot, because my org has to block gists.
Allison, we know this is a lot of work for a very small group, but if you see this, a couple of us here would be super stoked if you could mirror your articles somewhere else as well!
Yeah, but we're part of a very small subset of places that genuinely need this level of security. I used to do counter-threat infosec here before I moved to data science, so I know how much work goes into our security.
Regular Github is free and clear, so we're good there.
You might be interested to know that Annoy is also integrated into Gensim, which lets you train, use and query word embeddings on your own data. Gensim implements fast Doc2Vec and FastText too, which are somewhat newer embedding techniques [0] :-)
Interesting effort :-) Unfortunately, your comparison table there is somewhere between misleading and insulting.
Almost all of the "unique" features listed there are in fact a standard part of Gensim. Fast approximative queries (using Annoy), memory mapping with lazy loading, ngrams features, format convertors, Python interface, parallelization, pre-trained models for download…
There is a way to promote cool new libraries, but this ain't it.
Shoot me an e-mail (link in HN profile). We just created this a few days ago, so it hasn't been up long, and I'm happy to fix any disagreements in the benchmarks. You're absolutely right about Annoy indexing, but to be fair, I don't think it was part of Gensim when I started using it :). Gensim's a great library and Magnitude's not meant to be an attack on it (in fact we use it for our own converter), and we provide zero-training, which Gensim does handle as well.
I'll remove the comparison to Gensim :). It really wasn't meant to be an attack on Gensim. I think it's a good and versatile library that handles a lot, but the aim of Magnitude was to be what Keras is to TensorFlow, a simpler interface.
For the record, the claim was "Pythonic interface" not "Python interface", because we support some Pythonic syntactic sugar like "cat in vectors" (via the "__contains__" method) and "for key, vector in vectors" (via the "__iter__" method). It wasn't meant in bad faith, but I can see how that claim could be misinterpreted, so I will remove it.
The interface is very similar to Gensim's, but Gensim is after all open source, and we made it very similar on purpose so it could be easily swapped out in our own internal codebase :).
Like I said, I think Gensim's a great library! Thanks for making us aware of your concerns; I also sent you an e-mail. I'll update the repository later today to remove the comparison.
I appreciate the offer, but I'm positive you can spot the issues yourself.
Gensim has many warts for sure, but a benchmark that gives a green tick to itself for "Simple Pythonic interface" and a red X to Gensim, while copying the Gensim interface almost verbatim, was not created in good faith.
It's a dimension reduction technique. For word vectors, the usual use case is to take a 400-dimensional vector and turn it into a 2-dimensional one that you can plot in a scatterplot. Similar things will be close together; dissimilar things will be far apart. It's kind of like principal components analysis, but quirkier.
For those who like t-SNE: check out the relatively new UMAP, which seems to be faster and better.
For anyone who had to google it like me, t-SNE is t-distributed Stochastic Neighbor Embedding, and the way I understand it is that it's essentially a statistical way to reduce data dimensionality while still preserving its structure.
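Concretely, with scikit-learn it's only a couple of lines (the random matrix below stands in for a real (n_words, 400) block of word vectors):

    # Sketch: project word vectors down to 2-D with t-SNE for plotting.
    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(500, 400)                      # stand-in for 500 real word vectors
    X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)
    print(X_2d.shape)                                 # (500, 2): ready for a scatterplot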
[0] https://fasttext.cc/docs/en/supervised-tutorial.html
[1] https://arxiv.org/abs/1607.01759