TL;DR: Hugging Face, the NLP research company known for its transformers library (DISCLAIMER: I work at Hugging Face), has just released a new open-source library for ultra-fast & versatile tokenization for NLP neural net models (i.e. converting strings into model input tensors).
Main features:
- Encode 1GB in 20sec
- Provide BPE/Byte-Level-BPE/WordPiece/SentencePiece...
- Compute exhaustive set of outputs (offset mappings, attention masks, special token masks...)
- Written in Rust with bindings for Python and node.js
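As a rough sketch of what the Python bindings look like (the corpus path and vocab size here are placeholders, not part of the announcement):

    from tokenizers import ByteLevelBPETokenizer  # pip install tokenizers

    # Train a byte-level BPE vocabulary on a local text corpus ("data.txt" is a placeholder path).
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(files=["data.txt"], vocab_size=30_000, min_frequency=2,
                    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])

    # Encode a string into model-ready ids, with offsets back into the original text.
    encoding = tokenizer.encode("Converting strings into model input tensors.")
    print(encoding.tokens)    # byte-level BPE pieces
    print(encoding.ids)       # integer ids for the model
    print(encoding.offsets)   # (start, end) character offsets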
I love the work done and made freely available by both spaCy and HuggingFace.
I had my own NLP libraries for about 20 years: the simple ones were examples in my books, and the more complex (and not so understandable) ones I sold as products and pulled in lots of consulting work with.
I have completely given up my own work developing NLP tools, and generally I use the Python bindings (via the Hy language (hylang) which is a Lisp that sits on top of Python) for spaCy, huggingface, TensorFlow, and Keras. I am retired now but my personal research is in hybrid symbolic and deep learning AI.
Hybrid symbolic and NN will be my next area of hobby research; I'm currently getting my master's degree in NLP. Do you have a few good resources to get started or read about?
I can't believe the level of productivity this Hugging Face team has.
They seem to have found the ideal balance of software engineering capability and neural network knowledge, in a team of highly effective and efficient employees.
Idk what their monetization plan is as a startup, but it is 100% undervalued at 20 million, and that is just the quality of that team. Now, if only I can figure out how to put a few thousand $ in a series-A startup as just some guy.
I see them as an acqui-hire target. Especially from Facebook, since they are so geographically close to FAIR labs in NY, or from Google, getting integrated into Google AI like DeepMind did (esp. since Google uses a ton of Transformers anyway).
I can't think of many small teams that can be acquired and can build a company's ML infrastructure as fast as this team.
If they have the money for it, OCI and Azure may also be keeping a lookout for them.
It used to be that pre-deep-learning tokenizers would extract ngrams (n-token-sized chunks), but this doesn't seem to exist anymore in the word embedding tokenizers I've come across.
Is this possible using HuggingFace (or another word embedding based library)?
I know that there are some simple heuristics like merging noun token sequences together to extract ngrams but they are too simplistic and very error prone.
Most implementations are actually moving in the opposite direction. Previously, there was a tendency to aggregate words into phrases to better capture the "context" of a word. Now, most approaches split words into sub-word parts or even characters. With networks that capture temporal relationships across tokens (as opposed to older, "bag of words" models), multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts.
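For a quick look at what the sub-word splitting does in practice, a minimal sketch with the transformers AutoTokenizer (bert-base-uncased is just a convenient checkpoint; the exact splits depend on its vocabulary):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Rare words get broken into smaller wordpieces the model has seen before,
    # e.g. something like ['token', '##izer', '##s'] (the exact split depends on the vocab).
    print(tok.tokenize("tokenizers convert strings into tensors"))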
> multi-word patterns can effectively be captured by attending to the temporal order of sub-word parts
Indeed. Do you have an example of a library or snippet that demonstrates this?
My limited understanding of BERT (and other) word embeddings was that they only contain the word's position in the 768-dimensional (I believe) space, but don't contain queryable temporal information, no?
I like ngrams as a sort of untagged / unlabelled entity.
When using BERT (and all the many things like it, such as the earlier ELMo and ULMFiT and the later RoBERTa/ERNIE/ALBERT/etc) as the 'embeddings', you provide all the tokens in a sequence as input. You don't get an 'embedding for word foobar in position 123'; you get an embedding for the whole sequence at once, so whatever corresponds to that token is a 768-dimensional 'embedding for word foobar in position 123 conditional on all the particular other words that were before and after it'. Including very long-distance relations.
One of the simpler ways to try that out in your code seems to be running BERT-as-a-service https://github.com/hanxiao/bert-as-service , or alternatively the huggingface libraries that are discussed in the original article.
It's kind of the other way around compared to word2vec-style systems; before that you used to have a 'thin' embedding layer that's essentially just a lookup table followed by a bunch of complex layers of neural networks (e.g. multiple Bi-LSTMs followed by CRF); in the 'current style' you have "thick embeddings" which is running through all the many transformer layers in a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.
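Rough sketch of that 'thick embeddings + thin head' pattern with the huggingface transformers library (the model name, the label count, and the last_hidden_state attribute assume a reasonably recent version):

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")   # the "thick embeddings"
    head = torch.nn.Linear(encoder.config.hidden_size, 5)      # thin task head, e.g. 5 tag classes

    enc = tok("Hugging Face makes fast tokenizers", return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state              # (1, seq_len, 768): one vector per token
    logits = head(hidden)                                       # per-token scores for the downstream task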
> in the 'current style' you have "thick embeddings" which is running through all the many transformer layers in a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.
Would you say they are still usually called "embeddings" when using this new style? This sounds more like just a pretrained network which includes both some embedding scheme and a lot of learning on top of it, but maybe the word "embedding" stuck anyway?
They do seem to still be called "embeddings", although yes, that's become a somewhat misleading name in some sense.
However, the analogy is still somewhat meaningful, because if you want to look at the properties of a particular word or token, it's not just a general pretrained network: it still preserves the one-to-one mapping between each input token and the output vector corresponding to it, which is very important for all kinds of sequence labeling or span/boundary detection tasks. So you can use them just like word2vec embeddings. For example, word similarity or word difference metrics computed on these 'transformer-stack embeddings' would work just as well as with word2vec (though you'd have to get to a word-level measurement instead of wordpiece or BPE subword tokens), with the added bonus of having done contextual disambiguation. You probably could build a decent word sense disambiguation system just by directly clustering these embeddings; mouse-as-animal and mouse-as-computer-peripheral should have clearly different embeddings.
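As a hedged illustration of the mouse example, something like the following sketch compares the contextual vectors of the same surface form in two sentences; it assumes 'mouse' survives as a single wordpiece in the bert-base-uncased vocab:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval()

    def token_vector(sentence, word):
        # Contextual vector for `word`, assuming it is kept as a single wordpiece.
        enc = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
        tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        return hidden[tokens.index(word)]

    animal = token_vector("the mouse hid under the couch", "mouse")
    device = token_vector("click the mouse to open the file", "mouse")

    # Same spelling, but the two senses should land in noticeably different places.
    print(torch.nn.functional.cosine_similarity(animal, device, dim=0))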
> Do you have an example of a library or snippet that demonstrates this?
All NLP neural nets (based on LSTM or Transformer) do this. It's their main function - to create contextual representations of the input tokens.
The word's 'position' in the 768-dimensional space is an embedding, and it can be compared with other words by dot product. There are libraries that can do dot-product ranking fast (such as Annoy).
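For example, a minimal Annoy sketch (assumes a recent Annoy version that supports the 'dot' metric; the vectors are random stand-ins for real embeddings):

    import numpy as np
    from annoy import AnnoyIndex   # pip install annoy

    dim = 768                              # e.g. BERT-base hidden size
    index = AnnoyIndex(dim, "dot")         # dot-product similarity
    vectors = np.random.rand(1000, dim)    # stand-ins for real embeddings

    for i, v in enumerate(vectors):
        index.add_item(i, v)
    index.build(10)                        # 10 trees: more trees = better recall, slower build

    print(index.get_nns_by_vector(vectors[42], 5))   # ids of the 5 nearest items by dot product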
Are there examples of how this can be used for topic modeling, document similarity, etc.? All the examples I've seen (gensim) use bag-of-words, which seems to be outdated.
Big transformer neural networks are probably overkill for topic modeling. More traditional methods implemented in Gensim or scikit-learn, such as TF-IDF vectors followed by SVD (aka LSI), or LDA, or NMF, are probably just fine for extracting topics (soft clustering).
The reason is that you do not need to finely understand the structure of individual sentences to group documents by similar topics. Word order does not matter much for this task. Hence the success of methods that use bag-of-words representations (e.g. TF-IDF) as their input.
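A small scikit-learn sketch of that recipe (TF-IDF followed by NMF; the 20 Newsgroups corpus and the choice of 10 topics are just placeholders):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    # Example corpus; swap in your own list of document strings.
    docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data

    tfidf = TfidfVectorizer(stop_words="english", max_features=10_000)
    X = tfidf.fit_transform(docs)                    # (n_docs, n_terms) sparse TF-IDF matrix

    nmf = NMF(n_components=10, random_state=0)
    doc_topics = nmf.fit_transform(X)                # (n_docs, n_topics) soft topic assignments

    terms = tfidf.get_feature_names_out()            # get_feature_names() on older scikit-learn
    for k, weights in enumerate(nmf.components_):
        top_terms = [terms[i] for i in weights.argsort()[::-1][:8]]
        print(f"topic {k}:", ", ".join(top_terms))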
It might be that the corpus I was trying to cluster needs better preprocessing, or perhaps better n-grams. Using bigrams only, I saw a lot of common words that were meaningless, but adding them as stop words made the results worse. Hence my wondering whether some other vectorization would produce better results.
On a related note, as a newcomer just trying to get things done (i.e. applied NLP), I find the whole ecosystem great but frustrating: so many frameworks and libraries, but no clear ways to compose them together. Any resources out there that help make sense of things?
They're not meaningless words - they're common English words that are overloaded, and I think considering their position in sentences instead would give better results.
I haven’t yet tried TFIDF though so I’ll see what that will do.
It's mostly understanding text and generating text. You can do named entity extraction, question answering, summarisation, dialogue bots, information extraction from semi-structured documents such as tables and invoices, spelling correction, typing auto-suggestions, document classification and clustering, topic discovery, part-of-speech tagging, syntactic trees, language modelling, image description and image question answering, entailment detection (whether two statements support one another), coreference resolution, entity linking, intent detection and slot filling, building large knowledge bases (databases of subject-relation-object triplets), spam detection, toxic message detection, ranking search results in search engines and many many more.
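If you just want to poke at a few of those tasks, the transformers pipeline API covers several of them out of the box; a hedged sketch (each call downloads a default pretrained checkpoint):

    from transformers import pipeline

    # Named entity extraction
    ner = pipeline("ner")
    print(ner("Hugging Face is based in New York City"))

    # Extractive question answering
    qa = pipeline("question-answering")
    print(qa(question="Where is Hugging Face based?",
             context="Hugging Face is a startup based in New York City."))

    # Abstractive summarisation
    summarizer = pipeline("summarization")
    print(summarizer("Very long article text goes here ..."))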
All of the above, it's like asking what problems can you solve with math? HuggingFace's transformers are said to be a swiss army knife for NLP. I haven't worked with them yet, but the main fundamental utility seems to be generating fixed-length vector representations of words. Word2vec started this, but the vectors have gotten much better with stuff like BERT.
There's a lot! Sentence detection and part-of-speech (POS) tagging, to name a couple. These can be used to determine key concepts in documents that lack metadata. For example, you could cluster on common phrases to identify relationships in data.
The sentence "Hello, y'all! How are you ?" is tokenized into words. Those words are then encoded into integers representative of the words' identity in the model's dictionary.
But there's also good detail in the source [2] which says, "A Tokenizer works as a pipeline, it processes some raw text as input and outputs an Encoding. The various steps of the pipeline are: ...."
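Roughly what that pipeline's output looks like through the Python bindings (assumes a locally downloaded BERT vocab file; the filename is a placeholder):

    from tokenizers import BertWordPieceTokenizer  # pip install tokenizers

    # "bert-base-uncased-vocab.txt" stands in for a locally downloaded vocab file.
    tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)

    output = tokenizer.encode("Hello, y'all! How are you ?")
    print(output.tokens)               # wordpieces, including [CLS]/[SEP] special tokens
    print(output.ids)                  # integer ids in the model's dictionary
    print(output.offsets)              # (start, end) character offsets back into the raw string
    print(output.attention_mask)       # attention mask; a special-tokens mask is also available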
Github repository and doc: https://github.com/huggingface/tokenizers/tree/master/tokeni...
To install:
- Rust: https://crates.io/crates/tokenizers
- Python: pip install tokenizers
- Node: npm install tokenizers