The cognoscenti already know this, but word vectors are a game changer for multi-class text classification. Finding the right representation of the text makes the classification task so much easier.
For my problem (1000+ total classes, 1 class per input), I experimented with Naive Bayes + TFIDF (~50% accuracy, < 1 sec training), then Word2vec + CNN model on GPU (~70% accuracy, 6 hrs training), and finally FastText (99% accuracy, 10 minutes training).
FastText [0] in particular is quite impressive. It is essentially a variant of Word2Vec that also supports n-grams, and there is an official C++ implementation with a built-in classifier that runs on the command line (no need to set up TensorFlow or anything like that).
Despite it only running on plain CPUs and only supporting a linear classifier, it seems to beat GPU-trained Word2Vec CNN models in both accuracy and speed in my use cases. I later discovered this paper from the authors comparing CNNs (and other algorithms) to FastText, and their results track my experiences [1].
This goes to show that while GPU-accelerated models are cool, sometimes using a simpler, more suitable model can have a significantly better pay-off.
Thanks for this, I'll have to take a look at FastText. I've been using word2vec before turning it into a matrix and running it through a CNN, based on Yoon Kim's work [0]. I haven't had much luck though on my 92-class problem. Maybe FastText will work better, although I think there are still a lot of improvements I could make to my model.
I'm working on a somewhat similar problem, binary classification on a class-imbalanced dataset, but fastText and bidirectional LSTMs appear to work pretty terribly even with oversampling. Is there a better alternative?
Class imbalance is a tricky problem, but it is unrelated to FastText.
There is no silver bullet. The best solution is to collect more data to bring the classes into balance. The second-best approach is to try algorithms like SMOTE.
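If it helps, here's a minimal sketch using the SMOTE implementation from the imbalanced-learn package (the feature matrix X and labels y are assumed to already exist, e.g. TF-IDF features or averaged word vectors):

    # Minimal SMOTE sketch with imbalanced-learn (pip install imbalanced-learn).
    # Assumes X is a numeric feature matrix and y contains the (imbalanced) labels.
    from collections import Counter
    from imblearn.over_sampling import SMOTE

    def rebalance(X, y):
        sm = SMOTE(random_state=42)
        X_res, y_res = sm.fit_resample(X, y)
        print("before:", Counter(y), "after:", Counter(y_res))
        return X_res, y_res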
For the CNN model I was using Magpie [0] which takes a sum of the vectors (so yes, equivalent to unweighted average) to represent the document.
For the FastText model, it would be whatever FastText is doing under the hood. I haven't peeked at the code to see what it is doing, but Section 2 of the paper I cited above seems to imply that an average is taken.
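For anyone wondering what the unweighted-average representation looks like in code, here's a minimal sketch (assuming `wv` is a gensim KeyedVectors model loaded from pre-trained word2vec vectors):

    # Minimal sketch: represent a document as the unweighted average of its
    # word vectors, skipping out-of-vocabulary tokens. Assumes `wv` is a
    # gensim KeyedVectors object, e.g.:
    #   wv = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    import numpy as np
    from gensim.models import KeyedVectors

    def doc_vector(tokens, wv):
        vecs = [wv[t] for t in tokens if t in wv]
        if not vecs:
            return np.zeros(wv.vector_size)
        return np.mean(vecs, axis=0)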
Interesting. I had heard specifically that word vectors weren't a game-changer for document classification, because the averaging method didn't work well. But the difference between word2vec and FastText must be important. I think your baseline of Naive Bayes + TF-IDF would do a lot better if it were SVM + TF-IDF, but I didn't expect the big jump between word2vec and FastText.
> I had heard specifically that word vectors weren't a game-changer for document classification, because the averaging method didn't work well.
As with anything, your mileage may vary.
One aspect of FastText that definitely helped in my case was n-gram support (both word and character, tunable via command-line arguments). In my corpus, I have short sentence fragments containing misspelled words, incorrect grammar, etc., and my test set has out-of-vocabulary words.
n-grams are more robust to these than Word2Vec, which uses a static vocabulary.
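To make the n-gram knobs concrete, here's a rough sketch using the official fasttext Python bindings; the wordNgrams/minn/maxn arguments correspond to the -wordNgrams, -minn and -maxn command-line flags, and the values below are placeholders, not tuned settings:

    # Sketch of a fastText supervised run with word and character n-grams
    # enabled (pip install fasttext). train.txt is assumed to be in the usual
    # "__label__foo some text ..." one-example-per-line format.
    import fasttext

    model = fasttext.train_supervised(
        input="train.txt",
        wordNgrams=2,    # word bigrams
        minn=3, maxn=6,  # character n-grams; helps with misspellings/OOV
        epoch=25,
        lr=0.5,
    )
    print(model.predict("shrot fragmnt with typos"))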
I think it's important to use the right "average" operation. word2vec puts words on the surface of a hypersphere, so "adding" vectors means turning around the sphere.
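Just as an illustration of the idea (not a claim that this is the "right" average), you can normalize each vector to unit length before averaging and renormalize the result, so you stay on the sphere instead of drifting toward the origin:

    # Sketch: average directions on the unit sphere instead of raw vectors.
    import numpy as np

    def unit_mean(vectors):
        units = [v / np.linalg.norm(v) for v in vectors]
        m = np.mean(units, axis=0)
        return m / np.linalg.norm(m)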
Combining word embeddings to get a single representation of a sentence is in general a pretty hard problem. Point-wise averaging or summation [0] works alright, but it's not great. Some more recent work looks at using RNNs to combine them; most of the literature I've seen is pretty application-specific, but try [1][2] for flavour.
Often the w2v embedding layer will be the first layer of a network, through which a document of word representations will be passed. The output of the embedding layer will be a 2 dimensional tensor with the embedding dimension in one direction and the number of input words in the other. The next step is often to apply zero or more convolutional units over the word direction before some kind of pooling to get a 1 dimensional output. The output of the pooling layer will be a document representation based upon the word vectors. Other approaches like doc2vec are also used.
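A minimal sketch of that pipeline using the tf.keras API (all the sizes below are placeholders):

    # Sketch of embedding -> 1-D convolution over the word axis -> pooling,
    # which yields a fixed-size document representation.
    from tensorflow.keras import layers, models

    vocab_size, embed_dim, num_classes = 20000, 300, 10  # placeholder sizes

    model = models.Sequential([
        layers.Embedding(vocab_size, embed_dim),              # (batch, words, embed_dim)
        layers.Conv1D(128, kernel_size=5, activation="relu"), # convolve over the word axis
        layers.GlobalMaxPooling1D(),                          # pool down to a 1-D doc vector
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")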
Thank you for this very helpful information, and I have just one more question:
"Often the w2v embedding layer will be the first layer of a network, through which a document of word representations will be passed. The output of the embedding layer will be a 2 dimensional tensor with the embedding dimension in one direction and the number of input words in the other. The next step is often to apply zero or more convolutional units over the word direction before some kind of pooling to get a 1 dimensional output. The output of the pooling layer will be a document representation based upon the word vectors.
Other approaches like doc2vec are also used."
I thought the approach you described was doc2vec. If not, then does it have a name/citation?
Doc2vec is the name of the gensim implementation of this paper [0]. Briefly, it creates document embeddings (where a document can be anything from a phrase to a sentence to a paragraph and beyond), either by predicting a target word from the context words plus the document vector (distributed memory -- DM), or by predicting the words of the document from the document vector alone (distributed bag of words -- DBOW).
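A minimal gensim sketch of the two modes on a toy corpus (parameter values are placeholders):

    # Sketch: train gensim Doc2Vec in both DM and DBOW modes on a toy corpus.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words=doc.split(), tags=[i])
              for i, doc in enumerate(["the cat sat on the mat",
                                       "dogs bark at the mailman"])]

    dm_model   = Doc2Vec(corpus, vector_size=50, min_count=1, dm=1, epochs=40)  # distributed memory
    dbow_model = Doc2Vec(corpus, vector_size=50, min_count=1, dm=0, epochs=40)  # distributed bag of words

    vec = dbow_model.infer_vector("a cat and a dog".split())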
In the "Doing bad digital humanities with color vectors", if you consider colors as 3D vectors, which they do, you'll see that summing enough uniformly sampled vectors always gives medium browns, because that's the color in the middle of the colorspace. Instead, you should model colors in a polar space and sum vectors in that space. This will prevent going inside the sphere and losing color saturation.
The original model comes from Kim 2014 (https://arxiv.org/abs/1408.5882). It's a very neat use of CNNs for language processing, instead of the more popular RNNs/LSTMs, and CNNs have the advantage of training much faster.
Yes, I've not looked at fastText, but word2vec is a simple one-hidden-layer network for learning word embeddings, which can then be used as pre-trained word embeddings in other tasks.
CNNs and RNNs would be used in the next stage of the pipeline for whatever your task is (machine translation etc.), probably as some way of combining the word vectors. RNNs are especially nice as they can deal with variable-length sentences. Note also that it's possible for systems to learn their own embeddings as part of training, rather than using pre-trained ones from word2vec etc.
Word vectors are at the same time amazing, because they contain a huge amount of latent information, and not good enough, because they collapse a space of very high dimensionality into ~300 dimensions, so there is a limit to how well they can discriminate between close topics. I have done a lot of experiments on classifying text into thousands of topics, and sometimes they work amazingly well, while other times they are really hard to use, depending on how close together the topics I want to discriminate between are.
Another problem of word vectors is that any word might actually have multiple senses, while vectors are just point estimates. If we wanted to be correct, we would first need to find the right sense for each word in a phrase and only then assign the vector. There is research in "on-the-fly" word vectors that adapt to context, but it's much harder to use.
A third problem of word vectors is out-of-vocabulary (OOV) words and words with low frequency. For OOV words, the usual solution is to create character or character-n-gram embeddings that can be used to compute embeddings for new words. Low-frequency words are usually just ignored (a frequency cutoff is applied).
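To illustrate the OOV point, here's a toy sketch with gensim's FastText implementation, which builds vectors from character n-grams and can therefore produce a vector for a word it never saw during training (corpus and parameters are placeholders):

    # Sketch: character n-gram embeddings let FastText produce vectors for
    # out-of-vocabulary (e.g. misspelled) words. Toy corpus, toy parameters.
    from gensim.models import FastText

    sentences = [["the", "government", "announced", "new", "regulations"],
                 ["the", "regulator", "announced", "an", "investigation"]]
    model = FastText(sentences, vector_size=32, min_count=1, min_n=3, max_n=6)

    vec = model.wv["goverment"]   # misspelled, not in the training vocabulary
    print(vec.shape)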
Then there is the problem of phrases and collocations - some words go together, such as "New York" and "give up". The meaning of the phrase is different from the sum of the meanings of the component words. In these cases we need to have lists of phrases and replace them in the original text before training the vectors, so we have proper vectors for phrases.
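gensim's Phrases/Phraser is one common way to do that phrase detection automatically before training vectors; a toy sketch (thresholds are placeholders):

    # Sketch: detect frequent collocations and rewrite the corpus so that
    # "new york" becomes the single token "new_york" before training vectors.
    from gensim.models.phrases import Phrases, Phraser

    sentences = [["he", "moved", "to", "new", "york", "last", "year"],
                 ["new", "york", "is", "expensive"],
                 ["do", "not", "give", "up"]]

    phrases = Phrases(sentences, min_count=1, threshold=1)  # toy thresholds
    bigram = Phraser(phrases)
    transformed = [bigram[s] for s in sentences]  # e.g. [..., "new_york", ...]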
By the way, one amazing tool that goes with word vectors is the Annoy library, which can do approximate similarity search in logarithmic time. So you can do roughly 1000 lookups per second per CPU core even if the database contains millions of vectors, which is pretty good. Annoy can be used to find similar articles or for music recommendations. Another remark: my preferred word vectors are computed with Doc2VecC (a variant of doc2vec with corruption). Doc2VecC seems better at discriminating between topics, but the secret is to feed it gigabytes of text.
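Roughly, using Annoy looks like this (the dimension, tree count and toy vectors below are placeholders):

    # Sketch: build an Annoy index over word vectors for fast approximate
    # nearest-neighbour queries.
    import numpy as np
    from annoy import AnnoyIndex

    dim = 300
    # toy random vectors stand in for real word embeddings here
    vectors = {w: np.random.rand(dim) for w in ["music", "song", "guitar", "piano", "car"]}

    index = AnnoyIndex(dim, "angular")   # angular ~ cosine distance
    words = list(vectors)
    for i, w in enumerate(words):
        index.add_item(i, vectors[w])
    index.build(50)                      # more trees: better recall, slower build

    neighbours = index.get_nns_by_item(words.index("music"), 3)
    print([words[i] for i in neighbours])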
Playing with word vectors has taught me intuitively how it is to navigate a space of high dimensionality. It feels different than 3d-space because each point has a shortcut to other points, each point leads to hundreds of other places which might be far apart. It's like a kaleidoscope where a small change can create a very different perspective.
>Another problem of word vectors is that any word might actually have multiple senses, while vectors are just point estimates. If we wanted to be correct, we would first need to find the right sense for each word in a phrase and only then assign the vector. There is research in "on-the-fly" word vectors that adapt to context, but it's much harder to use.
There has been work on representing words not as vectors but as multimodal Gaussian distributions in order to try to deal with polysemy, such as [0], of which an implementation (which I have not tried to use) is available on GitHub at [1].
>It feels different than 3d-space because each point has a shortcut to other points, each point leads to hundreds of other places which might be far apart. It's like a kaleidoscope where a small change can create a very different perspective.
I appreciate that the author of the parent comment has found some sort of intuition, but I would caution others against using the above quote to develop their own intuition, as it is meaningless in any rigorous sense.
> Another problem of word vectors is that any word might actually have multiple senses, while vectors are just point estimates. If we wanted to be correct, we would first need to find the right sense for each word in a phrase and only then assign the vector. There is research in "on-the-fly" word vectors that adapt to context, but it's much harder to use.
This is interesting, and there seems to be a bit of debate about it (at least with compositional distributional semantic models). [0][1] seem to show that sense disambiguation helps in some contexts, while [2] shows that it doesn't in others. It doesn't seem immediately clear who is right here. I agree with you, though, that it seems pretty likely that disambiguating would be helpful.
It seems obvious to me that disambiguating _correctly_ would necessarily improve performance. As an extreme example, homonyms like bark (the verb) and bark (on a tree) have nothing to do with each other, and ideally would be considered two different words.
This seems like it should obviously make sense, but in practice it doesn’t necessarily. Usually the sense is defined by the context, and so the vector can carry the meaning of multiple contexts at once.
There are some specific tasks where this breaks down for some words. I can’t remember one I’ve seen right now, but thinking about the “bark” example, it can cause a problem when the “woodiness” and the “noisiness” of a word lead to completely different results. In practice that’s pretty rare.
Generally though, the non-relevant meanings don’t have any negative effects as they can be ignored.
> we would first need to find the right sense for each word in a phrase and only then assign the vector
Interesting! I wonder if you could, e.g., arbitrarily split a word into some number of symbols -- say two -- and each time you're going to apply a training update, only apply the update to one of the symbols -- perhaps initially choosing the symbol facing the greatest loss (forcing the symbols apart in vector space), and then eventually switching over to picking the symbol with the smallest loss (letting each settle onto its own precise meaning)?
We just open-sourced an easy-to-use library called Magnitude that handles out-of-vocabulary words and uses Annoy indexing for fast most_similar queries for word2vec, GloVe, and fastText:
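A rough usage sketch (the .magnitude file path below is just a placeholder):

    # Rough sketch of Magnitude usage; the converted .magnitude file is assumed to exist.
    from pymagnitude import Magnitude

    vectors = Magnitude("GoogleNews-vectors-negative300.magnitude")
    print("uncharacteristically" in vectors)      # membership test
    vec = vectors.query("uncharacteristically")   # returns a vector even for OOV words
    print(vectors.most_similar("cat", topn=5))    # Annoy-backed similarity search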
My mind was blown when I found out how easy it is to get started with pre-trained GloVe embeddings in Keras. Took my Kaggle game up a few notches overnight.
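For anyone curious, a minimal sketch of the pattern (the embeddings_index and word_index below are toy stand-ins for the parsed GloVe file and a fitted Tokenizer's word_index):

    # Sketch: build an embedding matrix from pre-trained GloVe vectors and
    # plug it into a frozen Keras Embedding layer.
    import numpy as np
    from tensorflow.keras.layers import Embedding
    from tensorflow.keras.initializers import Constant

    embed_dim = 100
    embeddings_index = {"cat": np.random.rand(embed_dim)}  # stand-in for parsed glove.6B.100d.txt
    word_index = {"cat": 1, "dog": 2, "aardvark": 3}        # stand-in for Tokenizer.word_index

    embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
    for word, i in word_index.items():
        vec = embeddings_index.get(word)
        if vec is not None:
            embedding_matrix[i] = vec   # words without a GloVe vector stay all-zero

    embedding_layer = Embedding(len(word_index) + 1, embed_dim,
                                embeddings_initializer=Constant(embedding_matrix),
                                trainable=False)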
This is a highly localized problem, but I've wanted to read the past couple articles by Allison and cannot, because my org has to block gists.
Allison, we know this is a lot of work for a very small group, but if you see this, a couple of us here would be super stoked if you could mirror your articles somewhere else as well!
Yeah, but we're part of a very small subset of places that genuinely need this level of security. I used to do counter-threat infosec here before I moved to data science, so I know how much work goes into our security.
Regular Github is free and clear, so we're good there.
You might be interested to know that Annoy is also integrated into Gensim, which lets you train, use and query word embeddings on your own data. Gensim implements fast Doc2Vec and FastText too, which are somewhat newer embedding techniques [0] :-)
Interesting effort :-) Unfortunately, your comparison table there is somewhere between misleading and insulting.
Almost all of the "unique" features listed there are in fact a standard part of Gensim. Fast approximative queries (using Annoy), memory mapping with lazy loading, ngrams features, format convertors, Python interface, parallelization, pre-trained models for download…
There is a way to promote cool new libraries, but this ain't it.
Shoot me an e-mail (link in HN profile). We just created this a few days ago, so it hasn't been up long, and I'm happy to fix any disagreements in the benchmarks. You're absolutely right about Annoy indexing, but to be fair, I don't think it was part of Gensim when I started using it :). Gensim's a great library and Magnitude's not meant to be an attack on it (in fact we use it for our own converter), and we provide zero-training, which Gensim does handle as well.
I'll remove the comparison to Gensim :). It really wasn't meant to be an attack on Gensim. I think it's a good and versatile library that handles a lot, but the aim of Magnitude was to be what Keras is to TensorFlow, a simpler interface.
For the record, the claim was "Pythonic interface" not "Python interface", because we support some Pythonic syntactic sugar like "cat in vectors" (via the "__contains__" method) and "for key, vector in vectors" (via the "__iter__" method). It wasn't meant in bad faith, but I can see how that claim could be misinterpreted, so I will remove it.
The interface is very similar to Gensim's, but Gensim is after all open source, and we made it very similar on purpose so it could be easily swapped out in our own internal codebase :).
Like I said, I think Gensim's a great library! Thanks for making us aware of your concerns; I also sent you an e-mail. I'll update the repository later today to remove the comparison.
I appreciate the offer, but I'm positive you can spot the issues yourself.
Gensim has many warts for sure, but a benchmark that gives a green tick to itself for "Simple Pythonic interface" and a red X to Gensim, while copying the Gensim interface almost verbatim, was not created in good faith.
It's a dimension reduction technique. For word vectors, the usual use case is to take a 400-dimensional vector and turn it into a 2-dimensional one that you can plot in a scatterplot. Similar things will be close together; dissimilar things will be far apart. It's kind of like principal components analysis, but quirkier.
For those who like t-SNE: check out the relatively new UMAP, which seems to be faster and better.
For anyone who had to google it like me, t-SNE is t-distributed Stochastic Neighbor Embedding, and the way I understand it is that it's essentially a statistical way to reduce data dimensionality while still preserving its structure.
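Concretely, with scikit-learn it's only a couple of lines (the random matrix below stands in for a real (n_words, 400) block of word vectors):

    # Sketch: project word vectors down to 2-D with t-SNE for plotting.
    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(500, 400)                      # stand-in for 500 real word vectors
    X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)
    print(X_2d.shape)                                 # (500, 2): ready for a scatterplot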
[0] https://fasttext.cc/docs/en/supervised-tutorial.html
[1] https://arxiv.org/abs/1607.01759