Sense2vec – A Fast and Accurate Method for Word Sense Disambiguation (arxiv.org)
131 points by williamtrask on Nov 23, 2015 | 41 comments



Like a lot of work in NLP at the moment, this is a reasonably straightforward mash-up of existing techniques. This particular idea is pretty obvious. What wasn't obvious was whether it would work well, which is why nobody else got around to trying it yet. The experiments are nicely conducted, with strong baselines across multiple evaluations. The authors also include experiments with variations of the idea, to further validate the approach.

To understand the technique, first understand word2vec:

http://rare-technologies.com/word2vec-tutorial/

http://colah.github.io/posts/2014-07-NLP-RNNs-Representation...

Now understand part-of-speech tagging:

http://spacy.io/blog/part-of-speech-POS-tagger-in-python/

By default word2vec gives you clusters for each word; this paper gives you clusters for word_POS, e.g. The_DT apple_NNP employee_NN is_VBZ eating_VBG an_DT apple_NN. The same trick is done with named entity labels as well.
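To make that concrete, here is a minimal sketch (plain Python, not the authors' code) of the token-merging step, using the tagged sentence above:

    # Merge each token with its POS tag before training word2vec.
    # In practice the tags come from running a tagger over the corpus.
    tagged = [("The", "DT"), ("apple", "NNP"), ("employee", "NN"),
              ("is", "VBZ"), ("eating", "VBG"), ("an", "DT"), ("apple", "NN")]

    merged = ["%s_%s" % (word, tag) for word, tag in tagged]
    # ['The_DT', 'apple_NNP', 'employee_NN', 'is_VBZ',
    #  'eating_VBG', 'an_DT', 'apple_NN']
    # 'apple_NNP' (the company) and 'apple_NN' (the fruit) are now distinct
    # vocabulary items, so word2vec learns a separate vector for each.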

The following papers explain how the new word vectors are used in a dependency parser:

Collobert and Weston (2011): http://arxiv.org/pdf/1103.0398.pdf

Wang and Manning (2014): http://cs.stanford.edu/~danqi/papers/emnlp2014.pdf

Yoav Goldberg (2015): http://u.cs.biu.ac.il/~yogo/nnlp.pdf (survey/review, aimed at grad students)


Since this appears to be the 'related works' thread, I'll add my links here.

A reasonable amount of work has already been done in disambiguation of the meanings of words with word embeddings, some of it trained, some untrained.

The particular papers that have recently caught my attention, and that have the potential to bring significant gains over existing techniques, are these:

1. Infinite-Length Word Embeddings http://arxiv.org/pdf/1511.05392v2.pdf I am particularly excited about this paper, as the authors suggest a method to create variable-length word vectors, depending on the multiplicity of meanings a word can have (so theoretically, if a word had a _lot_ of meanings, its word vector would be very large. Reminds one of the word 'Aladeen'). If the authors happen to be hanging out here, the community is eagerly waiting for the reference implementation. : )

2. Joint Learning of Words and Meaning Representations for Open-Text Semantic Parsing http://jmlr.csail.mit.edu/proceedings/papers/v22/bordes12/bo...

3. Breaking Sticks And Ambiguities With Adaptive Skip-Gram http://arxiv.org/pdf/1502.07257v2.pdf

Edit: added paper 3.


If you really want to understand word2vec, watch the second lecture of CS224D. https://www.youtube.com/watch?v=T8tQZChniMk


"This particular idea is pretty obvious. What wasn't obvious was whether it would work well, which was why nobody else got around to trying it yet."

Ideas are often obvious after you hear them for the first time.


This one is actually kind-of, sort-of obvious. For NaNoGenMo this year I smushed together word2vec and a POS tagger. What the authors have done here is really cool and goes miles beyond my hacks, but the kernel of the idea should be obvious to anyone familiar with word2vec.


How does this help with words, for example substantives, that can have completely different meanings depending on the context? Ex:

The ball is bouncing on the floor

The ball takes place in the hotel


Not sure this will help much here, because in both of those cases "ball" is a noun.

However, plain "old" word2vec will handle it fine.

A round ball is associated with words like bounce, and various sporting terms. It will appear in the vector space alongside those words.

A dance ball is associated with dance, gowns, "go to", etc.

The example in the paper around "Washington" shows how this can be used to distinguish between the NER (named entity recognition) types "PERSON" and "GPE" (geo-political entity) - it disambiguates very well between Washington (George) and Washington (DC).


Hmm, but isn't word2vec a dictionary from word -> vector? If that's the case, the two - completely unrelated - meanings of the word "ball" will be represented by a single vector. Which would mean that in that representation, for example, the concept of a ball is near both to "sphere" and to "music" or "dancing". So that the perfect ball is a dancing sphere, or something like that.
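Right: plain word2vec keeps exactly one vector per surface form. A toy sketch with Gensim (current Gensim 4 API; the two-sentence corpus is made up and far too small for real training) shows the collapse:

    from gensim.models import Word2Vec

    sentences = [["the", "ball", "bounced", "on", "the", "floor"],
                 ["the", "ball", "took", "place", "in", "the", "hotel"]]
    model = Word2Vec(sentences, vector_size=16, window=2, min_count=1)

    vec = model.wv["ball"]  # one 16-dimensional vector, covering both senses
    # Neighbours of "ball" mix bouncing- and dancing-related words, because
    # both kinds of context were averaged into this single vector.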


> We demonstrate that these embeddings can disambiguate both contrastive senses such as nominal and verbal senses as well as nuanced senses such as sarcasm.

If this is true, that's astonishing. Also, this might allow us to build assistive technologies for people who are unable to perceive sarcasm.


Supervised disambiguation? Isn't the entire reason word2vec is exciting because it's unsupervised?


I have seen people claim many times that word2vec is unsupervised. But I think that is inaccurate.

word2vec uses a very weak form of supervision: the order in which words appear in a meaningful sentence. And I think it is fascinating that this kind of weak supervision suffices to build distributed embeddings for words.


This makes me wonder if the supervised/unsupervised distinction is useful at all. Maybe you could say instead that all learning algorithms simply need some way to measure similarity between the training examples, and it doesn't make a fundamental difference if you cluster examples by their target label ("supervised") or by their input vectors ("unsupervised") or by the context in which they appear (word2vec).


Skip-grams only take (focus word, context word) pairs, so those pairs do not take order into account.
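For illustration, a sketch of how skip-gram pairs are extracted (the window size and sentence here are made up):

    # Every (focus, context) pair inside the window is treated the same,
    # whether the context word came before or after the focus word.
    sentence = ["the", "ball", "bounced", "on", "the", "floor"]
    window = 2

    pairs = [(sentence[i], sentence[j])
             for i in range(len(sentence))
             for j in range(max(0, i - window),
                            min(len(sentence), i + window + 1))
             if i != j]
    # ("ball", "the") and ("ball", "bounced") look identical to the model;
    # the direction and position of the context word are discarded.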


Being in the same context is some kind of order.


That's still a bit far-fetched. "Weakly supervised" refers to using a small amount of labeled data. This is not the case for word2vec and similar embedding methods. The method presented here, sense2vec, would qualify, as it indeed is weakly supervised.


The setting where we use a small amount of labeled data is called "semi-supervised".


> word2vec is using a very weak supervision which is the order in which words appear in a meaningful sentence.

What's supervised about that? Can't this be produced by a large enough corpus?


Not the only reason though. It's exciting because it turns a word into a reasonably sized vector. After that, the vector can be fed into neural networks (neural networks take vectors as input), or operations can be performed on it, like Paris - France ~= Beijing - China.
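A quick way to try that arithmetic yourself, as a sketch using Gensim's downloader (the pretrained model is one of Gensim's stock datasets, not something from this paper):

    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-100")  # downloads on first use
    print(wv.most_similar(positive=["paris", "china"],
                          negative=["france"], topn=1))
    # Typically prints something like [('beijing', 0.8...)], i.e.
    # Paris - France + China ~= Beijing; the exact neighbour and score
    # depend on which embedding model you load.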


Word2Vec is exciting because of its result, not particularly its method. This paper is a pretty straightforward extension of word2vec, combining its 'unsupervised' result with various labels to make it more useful. You can think of word2vec as a kind of independent representation of language (which is a stretch for the purists, but for the sake of conversation I think it's okay) that can be applied to various domains by 'supervising' it.


I think this paper is interesting... The section on sarcasm especially. They got good results distinguishing apple the noun (which is similar to apples, pear, peach, blueberry) from Apple the proper noun (which is similar to Microsoft, iphone, ipad, samsung).

While they got good results telling bank the noun from bank the verb, they didn't differentiate bank the noun (financial) from bank the noun (the side of a river).

Or even more complicated... look at all the uses of bank in WordNet.

http://wordnetweb.princeton.edu/perl/webwn?s=bank
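For reference, those senses are easy to pull up with NLTK (assuming the WordNet data has been downloaded):

    from nltk.corpus import wordnet as wn

    for synset in wn.synsets("bank"):
        print(synset.name(), "-", synset.definition())
    # bank.n.01 - sloping land (especially the slope beside a body of water)
    # depository_financial_institution.n.01 - a financial institution ...
    # ...18 senses in all: 10 nouns and 8 verbs.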

I'm not sure I can grasp the implications of disambiguating some words but not others. For some applications it might make sense, but... whether to disambiguate words or leave them as single vectors is still, in my mind, an open research question.


Great point. In my opinion, for improving the quality of syntactic tasks such as POS / dependency parse, the difference between disambiguating and not disambiguating riverbank and financial bank will be minimal. However, for semantic tasks (perhaps NER, information extraction, question answering) the difference would be more profound. This paper is primarily focused on a much more efficient method to do the former.


For a layman's introduction to how (pardon the hyperbole) soul-crushingly difficult this problem is, have a look at this amateur attempt to process language inputted by players into a video game: https://www.youtube.com/watch?v=Ff6V1yFafW4


What always worries me with all WSD approaches is the performance tradeoff: how much more performance is gained from using more complex per-sense word vector designs vs. "standard" word embeddings? Setup complexity can often increase significantly for these models and training times are much longer, while the gains from these approaches are not terribly clear to date.


In this case, there is no performance tradeoff except that of running your core NLP pipeline... for which there are several very fast options


So your results demonstrate that directed dependency labeling works better with vectors learned from PoS-tagged words than with PoS-tagged vectors (learned from untagged words)? And if so, why are you sure you are not overfitting on the corpus or that the "unseen" (in your case: label +) word (pairs) issues will in the end do more harm than what you gain when using this approach on truly independent data/text?

EDIT: Sorry, this question above is probably too convoluted to understand. As I understand it, the evaluation of the UAS in the paper was made by letting the parser use the gold PoS labels from the UD treebank (plus either set of word embeddings). But what would happen if the PoS labels for evaluating the dependency parser came from a PoS tagger, as would be the case when working on unseen data? I might imagine that "plain" embeddings could maybe produce a better UAS in that case, because they are not as "overfitted" as the "enriched" embeddings (as those are derived from the PoS-tagger-labeled words in the first place).


Re-read this and perhaps identified a misunderstanding. Above you mentioned that "directed dependency labeling works better with vectors learned from PoS-tagged words than with PoS-tagged vectors (learned from untagged words)".

The method does not pos-label vectors or words in the model. Instead, it pos-labels the text that sense2vec is trained on. In this way, you get multiple vectors for words with multiple POS usages (as if they were different words altogether). The change to the syntactic parser was just to use the POS tag to select which word embedding (of the several available for each word) to use as input. Sense2vec was trained on predicted POS labels. The parser used gold standard POS tags for both normal word2vec and sense2vec, for an even comparison.
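In other words, the parser-side lookup is just keyed on word plus tag. A hypothetical sketch (the helper name and the fallback behaviour are my assumptions, not from the paper):

    def lookup(embeddings, word, pos):
        # Prefer the sense-specific vector, e.g. "apple_NNP" vs "apple_NN".
        key = "%s_%s" % (word, pos)
        if key in embeddings:
            return embeddings[key]
        # Hypothetical fallback for (word, tag) pairs unseen in pre-training.
        return embeddings.get(word)

    # lookup(vectors, "apple", "NNP") -> the company's vector
    # lookup(vectors, "apple", "NN")  -> the fruit's vector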


Perhaps, but seeing as POS taggers are ~97% accurate (at least in English), I'd expect this to be minimal. Furthermore, the baseline neural network also has access to the gold standard POS tags, so the comparison of adding POS-disambiguated embeddings is pretty clean. It's the difference between "words + POS tags" as features and "POS-disambiguated words + POS tags".


Using accuracy to measure PoS taggers makes results look good, but is misleading due to their huge bias: tagging every word with the majority tag found during training, and everything else as either NNP or NNPS (with suffix -s), already puts the statistical baseline well beyond 90% accuracy. However, my point was that, from the results shown, it's not clear to me whether the gains in attachment score you saw when using gold standard PoS tags would survive "real-world" usage, where you have to rely on the tagger's own PoS tags. In such a case, it could be that your embeddings contribute much less "new knowledge" than what you see in your results using independent (gold) PoS tags. This might be mitigated by using two independently trained and set-up PoS taggers, however. But this finally gets us back to my initial concern: how much performance gain really is in there from all this added complexity, and is it "worth the effort"?
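That baseline is easy to check. A sketch with NLTK and the Brown corpus (the split, the universal tagset, and the NOUN fallback standing in for the NNP/NNPS heuristic are all my choices):

    from collections import Counter, defaultdict
    from nltk.corpus import brown

    tagged = brown.tagged_words(tagset="universal")
    train, test = tagged[:900000], tagged[900000:]

    counts = defaultdict(Counter)
    for word, tag in train:
        counts[word][tag] += 1
    majority = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    # Unknown words fall back to NOUN (a stand-in for the NNP/NNPS rule).
    correct = sum(majority.get(w, "NOUN") == t for w, t in test)
    print(correct / len(test))  # lands around 0.9, give or take the split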


Generally, the industry benchmarks dependency parsing using gold standard POS tags. However, your point is well taken. Personally, I have little doubt that it would still yield the same level of improvement, but fortunately a bit of experimentation can settle it for sure :)

Perhaps also relevant to this conversation, the disambiguation for pre-training did in fact use "real-world" tags (not gold standard). Thus, sense2vec as an algorithm was able to sort through the noise generated by mistakes in the part-of-speech tagger to still generate meaningful embeddings.


Anyone know of an open source implementation? I've only had a chance to scan the document but it appears to only go into the theory.


We'll have this implemented in spaCy before too long. It's actually super easy to do: all you need is to merge the part-of-speech tags or entity labels onto the tokens before feeding the text to Gensim or another word2vec implementation.

I've wanted to do this for a while, so it's nice to see that it works well.
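A rough sketch of that merge with today's spaCy and Gensim APIs (the model name, vector size, and exact tag scheme here are placeholder choices, not the paper's):

    import spacy
    from gensim.models import Word2Vec

    nlp = spacy.load("en_core_web_sm")

    def merged_tokens(texts):
        for doc in nlp.pipe(texts):
            # Use the entity label when present, else the fine-grained POS
            # tag; this mirrors the word_POS scheme described upthread.
            yield ["%s_%s" % (tok.text, tok.ent_type_ or tok.tag_)
                   for tok in doc if not tok.is_space]

    corpus = list(merged_tokens(["The apple employee is eating an apple."]))
    model = Word2Vec(corpus, vector_size=128, window=5, min_count=1)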


Indeed, the implementation is very simple. We had considered this idea a few times as well, but like you we didn't get around to implementing/evaluating it yet. So, it's good to hear that it works.

One thing notably absent from the paper is a discussion of the trade-off between augmenting tokens with annotations in this way for sense disambiguation vs. data sparseness. Their approach may make the embeddings for frequent senses better, but the difficulty in WSD is typically in low-frequency senses. I think that particularly in disambiguation using part-of-speech tags, there is still a high semantic relatedness between senses, especially in languages with frequent nominalization or verbalization.

I can imagine that a model that predicts a target (or context) in decomposed form (token and label) might improve embeddings for low-frequency senses.


This isn't really WSD though, or at least, only very weakly.

Rare words are usually pretty unambiguous for part-of-speech. I would guess this mostly has an effect on the top 5,000 items of the vocabulary, and most of the rest of the lexicon only has a single "sense".


> This isn't really WSD though, or at least, only very weakly.

Sure. I was pointing to real WSD, where sparseness becomes an even stronger problem than when your definition of sense is restricted to part-of-speech tag or sentiment.

> Rare words are usually pretty unambiguous for part-of-speech.

I was talking about (possibly) frequent words where some parts-of-speech are infrequent, not about rare words. To take five more or less random examples from the Brown corpus (yes, we train on large corpora, but I think similar distributions could hold for less frequent forms in languages with e.g. frequent nominalization; not everyone speaks English!):

   mother NN 173 VB 1
   code NN 20 VB 1
   hanging NN 1 VBG 20
   level JJ 14 NN 172 VB 2
   services NNS 115 VBZ 1
If your learning method is as coarse-grained as simply throwing the token plus part-of-speech into word2vec or wang2vec, some senses will be below the frequency cut-offs (or will be too sparse to learn good embeddings), while other 'senses' may in reality be semantically similar.
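As an aside, counts like the table above can be reproduced with NLTK (a sketch; the raw Brown tagset differs slightly from what's shown, so the numbers may not match exactly):

    import nltk
    from nltk.corpus import brown

    cfd = nltk.ConditionalFreqDist(
        (word.lower(), tag) for word, tag in brown.tagged_words())

    for word in ["mother", "code", "hanging", "level", "services"]:
        print(word, dict(cfd[word]))
    # e.g. mother -> {'NN': 173, 'VB': 1, ...}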


Thanks for the explanation. I see what you're saying now.


This is really excellent for researchers approaching the field from more of an NLP angle than an ML/CompLing angle. It was only a few weeks (days?) ago that Chris Manning published a whitepaper arguing that linguists have just as much relevance now, maybe even more, and shouldn't worry that neural networks will eat their lunch, and then we get to see this. This is awesome! : )


Have you a link to that whitepaper? I found an uncorrected draft, but I'd like to read the finished thing. Cheers.



That's the draft one but thank you anyway :)


It appears they're only releasing the theory, and that the implementation is up to us.

Which, given the time they've put into this paper, is not an unreasonable thing.

Edit: just noticed their email addresses are on the http://www.digitalreasoning.com/ domain. It seems they're not in the business of releasing open-source versions, so to speak.

Open-source versions of similar tech exist, but this is new.


Working on it. We'll see.



