It should be noted that CNNs and LSTMs are an order of magnitude slower than approaches like bag-of-words or fastText unless you're using an expensive GPU, and the accuracy benefit, if any, may be marginal in practice.
Kaggle prioritizes chasing a metric, but real-world data science has more considerations.
I don't use NNs because they simply don't have great accuracy, and most importantly they have a huge amount of variance. This is mostly because the data on Kaggle is not very large. The GBM trifecta (xgboost, catboost, lgbm) also does really, really well.
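For anyone curious, that kind of baseline is only a few lines. A rough sketch on text features (the vectorizer settings, hyperparameters, and the texts/labels variables are placeholders, not from any particular competition):

    import xgboost as xgb
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # texts: list of raw documents, labels: integer class ids (0..k-1) -- placeholders
    baseline = make_pipeline(
        TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
        xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6),
    )
    print(cross_val_score(baseline, texts, labels, cv=5).mean())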
> "Kaggle prioritizes chasing a metric, but real-world data science has more considerations."
This is counter to your point. Most real-world considerations need things like model explainability.
I notice that things don't make Hacker News if they use anything other than an NN. A model with X% accuracy using xgboost: crickets. A model with X-Y% accuracy using a DNN: headline news.
I also notice that teams in industry tend to throw a DNN at a problem and never try something simpler like xgboost. I saw a team with an LSTM for text lament that they had 80% accuracy in training/evaluation, but when pushed to prod it dropped to 50%. I saw the errors they were getting and said: maybe it's too complicated and not generalizing well, have you tried xgboost?
They retorted that LSTMs with word2vec are very robust. I thought: obviously it's not, given your results. I tried to offer the idea that word2vec was trained on an entirely different kind of corpus (also a problem when trying to use word2vec on Kaggle).
I agree good models like xgboost get buried and inexperienced practitioners jump to deep models too quickly, often without understanding how to properly architect and tune them. Always start with a simple baseline (EDIT: and good process).
However, what's lacking in the ML practitioner community is nuance. Some applications need deep models; some problems need xgboost. There isn't a "best" model in text classification because it depends on your data and problem.
I'm not a huge fan of the chart, as it looks dated, circa 2014. SVMs are rarely used anymore, so perhaps all of them (including in the regression part) should be replaced with GBMs. I didn't notice any mention of logistic regression, which I still heavily use as well.
Dimensionality reduction is a step in any pipeline, and the right choice depends on the number of observations, the features, and the model. My favorites are L1 regularization and PCA, but I am not afraid to use stepwise regression or some tree-based method.
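For concreteness, the two I reach for first look roughly like this in scikit-learn (a minimal sketch; the component count and regularization strength are placeholders you'd tune):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # L1 penalty: feature selection happens inside the model itself
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)

    # PCA: explicit reduction step in front of whatever model comes next
    pca_model = make_pipeline(
        StandardScaler(),
        PCA(n_components=50),        # placeholder; pick by explained variance or CV
        LogisticRegression(max_iter=1000),
    )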
For most businesses a GPU is not really expensive, particularly if the results are important to the business.
Chasing a single metric is certainly too narrow, but getting the best results often does matter in a professional context too. While Kaggle can go overboard on massive ensembles, using a state-of-the-art approach to the problem is often warranted outside of Kaggle.
Electronic medical records. Anything in healthcare/medical, financial services, legal. There's a long way from the "worst text classification" to "sufficient text classification" in most real world use cases. With a reasonable budget you can relabel data and work around subjectivity.
Nobody cares how long it takes to train a model. What matters is prediction speed, which is comparable (and NLP is less likely to require high-frequency serving, where a few extra milliseconds matter).
Besides that, the accuracy gains are not marginal anymore (BoW can't compete like it used to, especially with pre-trained models).
> Nobody cares how long it takes to train a model.
This isn't true. It depends on your priorities and goals. Machine learning that spends most of its time unable to learn is not real AI. Some of us are interested in sample and energy efficient learning capable of on-line incremental updates immune to catastrophic forgetting. Not just because this is truer to actual learning but because it moves away from being dependent on a handful of companies to do the actual training.
Anticipating some replies: no, transfer learning and meta-learning methods don't really avoid this. In the case of transfer learning, you still have high coupling to a handful of sources; the downsides of that are their own discussion. In addition, there are times when the ability to extract local relations is dulled by the dominant Wikipedia and Common Crawl representations. Meta-learning gets you fast updates, but you still cannot stray too far from the domains seen at training time.
> What matters is prediction speeds
I'm not a fan of bag-of-words models either, but a simple dot product is always going to be faster than many matrix multiplies and/or convolutions. The implementer should always try these as a baseline and decide whether the performance/accuracy trade-off is worth it for them.
Nobody in business cares if you are doing proper AI or dumb curve fitting. What matters is the complexity (engineering debt) and performance (accuracy, robustness).
Online learning, sample efficiency, and energy efficiency are unrelated to training times. Like I said: nobody cares if you ran Vowpal Wabbit for 1 hour or 100 hours, as long as you are not constantly babysitting it and calling that paid work (or have the unusual requirement of daily retraining while using an online model).
> simple dot product is always going to be faster than many matrix multiplies
If you care about this (because it is profitable), you rewrite in a lower-level language or predict on a cloud GPU (which will be at least comparable to a simple dot product, while buying you better model performance).
You've clarified your stance from "nobody" to "nobody in business". That's good, although I think that is an opinion based on your experiences. I suspect that businesses will care if researchers can make it easy to learn on premise on their small datasets while maintaining high accuracy. The ability to easily update and adapt under non-stationarity without having to retrain from scratch benefits everyone. The same is true of models that maintain uncertainty or that can explain decision outputs. Tracking uncertainty, robustness to changes, online updatability, and explainability are all related in that they are examples of things that become easier under causal modeling.
A parallel discussion we are having is whether the gain in accuracy is always worth the gain in complexity and loss in speed. It's something to decide on a case by case basis. It's basic hygiene to reach for the simplest model first.
> Nobody cares how long it takes to train a model.
LOTS of people care how long it takes to train a model. A few minutes, vs. a day, vs. a week, vs. a month? Yea, that matters.
Think about how long it takes to try out different hyperparameters or make other adjustments while conducting research...
If you're Google maybe you don't care as much because you can fire off a hundred different jobs at once, but if you're a resource-limited mere mortal, yea, that wait time adds up.
If you are building large-scale systems that take weeks or months to train, you are at a point where you shouldn't care about this. Throw more compute at the problem, it will pay for itself.
If we are talking days or hours: start parameter search on Friday and return best parameters on Monday.
Do research and iteration on heavily subsampled datasets.
If you are building models for yourself, or for Kaggle, you may care in as much as your laptop gets uncomfortably hot.
Time to train a model matters for applications where you want to have end users training models on their own computers without spending so much CPU/GPU time that they have to plan their day around it.
Consider for instance an RSS reader that classifies articles to determine whether or not to interrupt the user with a notification. This should be fast to train and update the model on the fly every time the user enters a correction (e.g. 'this article actually isn't interesting', or 'interrupt me with articles like this in the future'.)
I would not retrain such a model on all the data, just do online updates. Also, I still think that for that use case training times and latency are negligible (nobody cares about, or even notices, any difference between training a BoW model and a bi-LSTM).
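A minimal sketch of what I mean, assuming scikit-learn and made-up labels for the RSS example (a hashing vectorizer is stateless, so there's nothing to refit):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**18)   # stateless, so nothing to refit
    clf = SGDClassifier(loss="log_loss")               # logistic regression trained incrementally ("log" in older scikit-learn)
    CLASSES = ["interrupt", "ignore"]                  # made-up labels for the RSS example

    def update(article_text, label):
        X = vectorizer.transform([article_text])
        clf.partial_fit(X, [label], classes=CLASSES)   # single-example online update, no full retrain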
If you are deploying on resource-constrained devices (i.e., low-end PCs without a GPU), it is not unusual to spend a lot of time training a model on a very powerful computer (which nobody cares about), then distilling or transferring the result for test time.
No. Resources are not infinite, and we were already on the edge of what the resources at most sites where training would be done could be expected to have.
Thanks. I think you are a correct exception to what I said. I should have known that using words like "nobody" would not go over well on HN (but tedious to type "a very large percentage"), despite that statement being a verbatim quote from one of the world's leading ML engineers and, to me, not controversial.
I do consider the cloud both widely available and nearly infinite in its ability to add resources.
If it is really not economically feasible to add resources, then the performance gains were not as promising as thought (whether cloud or on-site).
> Thanks. I think you are a correct exception to what I said. I should have known that using words like "nobody" would not go over well on HN (but tedious to type "a very large percentage")
1) The ML experts in the field have pretty much all settled on the need for a uniform method of training models, but with each model needing to be trained on-site.
2) While the cloud might be near infinite in terms of adding capacity, "Hey guys, let's spin up some health-data-compliant AWS instances for a side project we're not even sure will work" in what is always a cash-starved part of healthcare is... well... a pretty big ask.
> Nobody cares how long it takes to train a model.
That's a reckless generalization. I care.
My thesis would take forever if I didn't do any optimization. Also my data is 20 rows with ~6000 predictors.
There are models out there that can take months! I worked on one that took months. We had to tweak and optimize it to see if we could get it down to an acceptable training time.
> "Nobody cares how long it takes to train a model."
In some Kaggle competitions it takes over 7 hours to train a model, and I can generally think of 10 things a day to try. Prediction only takes about a minute.
> "especially with pre-trained models"
If the corpora are different, pre-trained models do not help much, if they don't outright hurt.
Do you know of a good way to combine fasttext with non-text features?
Let's say I know seasonality is a strong feature in classifying my text, how can I add this? With a BOW I can literally just add SEASON_AUTUMN as a word to the text, and I'll get that extra feature as a dummy variable in my feature vector.
But for fasttext, if I add such a word, it will just be averaged out in the final document feature vector.
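The best I've come up with is to drop down to the document vector myself and bolt the extra features on before a downstream classifier. A rough sketch (the model path and the texts/seasons/labels variables are made up):

    import numpy as np
    import fasttext
    from sklearn.linear_model import LogisticRegression

    ft = fasttext.load_model("my_model.bin")   # made-up path to a trained fastText model

    def featurize(text, season_onehot):
        doc_vec = ft.get_sentence_vector(text)             # averaged word/char-ngram vector
        return np.concatenate([doc_vec, season_onehot])    # append the non-text features

    # texts, seasons (one-hot arrays), labels: assumed parallel lists
    X = np.vstack([featurize(t, s) for t, s in zip(texts, seasons)])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)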
Not sure why anyone would use 2D CNNs for processing text when there is no spatial correlation across the embedding dimensions. Recent work such as https://arxiv.org/abs/1803.01271 shows that for most tasks, 1D CNNs outperform recurrent architectures while being faster to train.
That blog used a 2D CNN because TensorFlow didn't have a 1D version at the time of writing, so he just created a dummy second dimension of length 1 and called it a day.
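These days the 1D version is only a few lines in Keras. A rough sketch with arbitrary sizes (not the setup from that blog post):

    from tensorflow.keras import layers, models

    VOCAB, EMB_DIM, N_CLASSES = 20000, 100, 5   # arbitrary sizes

    model = models.Sequential([
        layers.Embedding(VOCAB, EMB_DIM),
        layers.Conv1D(128, 5, activation="relu"),   # convolve over the token axis only
        layers.GlobalMaxPooling1D(),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])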
We have done extensive testing in the context of chatbot intent classification, and in our particular problem nothing (including CNNs, LSTMs, and fasttext, plus LUIS, Watson, and other proprietary classifiers) has been able to beat a simple linear model trained on char n-gram features.
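For reference, the kind of pipeline I mean is tiny. A minimal sketch (the vectorizer settings are illustrative, and utterances/intents stand in for our training data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # utterances, intents: parallel lists of user messages and intent labels (placeholders)
    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), sublinear_tf=True),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(utterances, intents)
    print(clf.predict(["what's the weather tomorrow"]))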
Chatbot intent would be a good use case for a linear model, as a single word/n-gram has a high impact on the result (in contrast to advanced architectures, which try to account for ambiguity/contradictions in documents).
I've seen the same things in the models I've built. For basic intent classification simpler models seem to be more accurate, not to mention they train faster and require less memory. There seems to be a lot of emphasis on shiny complex neural network architectures, even when simple models work just fine.
> There seems to be a lot of emphasis on shiny complex neural network architectures, even when simple models work just fine.
It's resume-driven-development for data scientists.
I've never seen an interviewer impressed with the fact that a job was performed using not-deep learning, but say that you used deep learning (despite how spurious it might be) and they light up like it's Christmas.
This isn't that surprising. I think the reason for this is that, even though the model is linear, the space of n-grams is so large that there usually is a line that separates any two classes.
I wonder how FastText (essentially word2vec + word & char n-grams + other stuff) stacks up against these algorithms.
In my own tests on my own corpora, CPU-based FastText is faster to train and produces significantly better results (precision/recall) than the GPU-bound CNN algorithms I've tried, but I have not compared it against RNN techniques.
I've found that some CNNs consistently beat fasttext in terms of model quality. But I've beaten those CNNs and fasttext by doing transfer learning with ULMFiT and fast.ai. If we're talking training speed, though, fasttext is indeed aptly named.
FastText is essentially a linear classifier and it's not surprising that it would train quickly. As for the prediction metrics, I imagine that will depend on the type of data and problem you're working with. Linear models perform very well on certain types of problems (I've had great success with SVMs on text classification problems personally) but for more complicated tasks, I imagine that the deep learning models would perform very, very well (relatively).
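For anyone who hasn't tried it, fastText's supervised mode is about this much code (a sketch; the file path and hyperparameters are placeholders, and the __label__ prefix is the library's expected training format):

    import fasttext

    # train.txt lines look like: "__label__billing how do I update my card"
    model = fasttext.train_supervised(input="train.txt", wordNgrams=2, epoch=25, lr=0.5)
    print(model.predict("I want to cancel my subscription"))   # (labels, probabilities)
    model.save_model("intent_model.bin")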
"The information you'd need to choose is included in there. If you're doing this professionally, you should strive to have enough of a high-level understanding of NLP to be able to make these decisions without having a rubric handed to you on a silver platter.
In a nutshell, though: Strive to use the simplest model that will get the job done. Less elaborate models are easier to understand and (usually) less prone to things like overfitting, so they'll be more tractable to work with in a business context. To that end: Use a convolutional net when you can get away with a small, fixed-size context window. Use an LSTM when you need long-term memory. Attention can be expensive, so you use it when you have cause to believe you can gain a lot by giving selective attention to features, and have both a lot of training data and a lot of computing resources.
It's also worth considering that you might be best off going with none of these options. Cool as deep learning is, I've personally never actually been able to justify using it in a professional setting. Simpler models such as logistic regression and decision trees have characteristics that are near-useless for getting you to the top of a Kaggle leaderboard, but can be indispensable when working on many real-world business problems"
This is unfortunately almost always true; you just have to try every possible combination of everything with lots of hyperparameters. Nothing makes any sense, it's total chaos, and we are wandering blind in the wastelands.
minimaxir's other comment was helpful by not trying to guess an outcome but by explaining things we do know about, like training/estimation cost and order of complexity, which were topics absent from the article.
I was really hoping to see a summary comparison of the performance(s) of the different models at the end, e.g. accuracy vs. complexity vs. execution time, etc.
Here's a summary from the end of each section...
1. TextCNN: "This kernel scored around 0.661 on the public leaderboard."
Kaggle is a pretty serious natural-selection environment for machine learning algorithms. Basically, if bag-of-words worked better, the contest winners would still use it.
I remember getting 95%-ish accuracy with BoW and the SVM circa 2004 when it came to questions like "is this paper about astrophysics or organic chemistry?"
In that case you have a distinct vocabulary for different topics and it is hard to beat BoW.
Sentiment analysis, on the other hand, is where BoW goes to die, since "not good" means something very different from "good", and even simple heuristics like treating "not X" as a term distinct from "X" give limited gains, because negation is expressed with constructions like "I don't believe that is good" and there is no k-word window that reliably catches negation, since there is no limit on how complex sentences can be.
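Concretely, the heuristic I mean looks something like this, and the toy example shows exactly where the fixed window falls over:

    NEGATORS = {"not", "no", "never", "don't", "isn't", "can't", "won't"}

    def mark_negation(tokens, window=3):
        # prefix up to `window` tokens after a negator with NOT_
        out, remaining = [], 0
        for tok in tokens:
            if tok.lower() in NEGATORS:
                out.append(tok)
                remaining = window
            elif remaining > 0:
                out.append("NOT_" + tok)
                remaining -= 1
            else:
                out.append(tok)
        return out

    print(mark_negation("i don't believe that is good".split()))
    # ['i', "don't", 'NOT_believe', 'NOT_that', 'NOT_is', 'good'] -- the window runs out before "good"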
There is also the question of "is the improvement between method A and method B worth it?" For instance, the Netflix Prize was much celebrated because some brilliant people busted their asses to go from 92% to 95% accuracy on movie recommendations. In the end the algorithm proved too complex for the value it created. (E.g., who would notice that they got 8 bad recommendations instead of 5 out of a hundred? Roughly a third of an extra bad recommendation out of 10?)
The real "Netflix optimization problem" is how to spend as little on acquiring content as possible while motivating people to keep their subscriptions and that is something Netflix will keep closer to their chest and not promote a public competition on. (eg. if it were valuable why would they let competitors know about it?)
It is important to unpack "worked better". In practice, I would trade a few points of precision/recall/your metric of choice for a simpler method.
You're assuming that they aren't trying: it's not too hard to try out bag-of-words and see what happens. But things like Attention and LSTMs are pretty good, but not without their costs.
I'm assuming the opposite: bag-of-words is the go-to baseline, but as another commenter pointed out, just separating instances based on vocabulary is not sufficient in today's sophisticated text classification problems.