It should be noted that CNNs and LSTMs are an order of magnitude slower than approaches like bag-of-words or fastText unless you're using an expensive GPU, and the accuracy benefit, if any, may be marginal in practice.
Kaggle prioritizes chasing a metric, but real-world data science has more considerations.
I don't use NNs because they simply don't have great accuracy, and most importantly they have a huge amount of variance. This is mostly because the data on Kaggle is not very large. The GBM trifecta (xgboost, catboost, lgbm) also does really, really well.
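For anyone curious, that kind of baseline is only a few lines. A rough sketch on text features (the vectorizer settings, hyperparameters, and the texts/labels variables are placeholders, not from any particular competition):

    import xgboost as xgb
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # texts: list of raw documents, labels: integer class ids (0..k-1) -- placeholders
    baseline = make_pipeline(
        TfidfVectorizer(max_features=20000, ngram_range=(1, 2)),
        xgb.XGBClassifier(n_estimators=500, learning_rate=0.05, max_depth=6),
    )
    print(cross_val_score(baseline, texts, labels, cv=5).mean())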
> "Kaggle prioritizes chasing a metric, but real-world data science has more considerations."
This is counter to your point. Most real-world considerations need things like model explainability.
I notice that things don't make Hacker News if they use anything other than an NN. A model with X% accuracy using xgboost: crickets. A model with X-Y% accuracy using a DNN: headline news.
I also notice that teams in industry tend to throw a DNN at a problem and never try something simpler like xgboost. I saw a team with an LSTM for text lament that they had 80% accuracy in training/evaluation, but when pushed to prod it dropped to 50%. I saw the errors they were getting and said: maybe it's too complicated and not generalizing well, have you tried xgboost?
They retorted that LSTMs with word2vec are very robust. I thought: obviously it's not, given your results. I tried to offer the idea that word2vec was trained on an entirely different kind of corpus (also a problem when trying to use word2vec on Kaggle).
I agree good models like xgboost get buried and inexperienced practitioners jump to deep models too quickly, often without understanding how to properly architect and tune them. Always start with a simple baseline (EDIT: and good process).
However, what's lacking in the ML practitioner community is nuance. Some applications need deep models; some problems need xgboost. There isn't a "best" model in text classification because it depends on your data and problem.
I'm not a huge fan of the chart, as it looks dated, circa 2014. SVMs are rarely used anymore, so perhaps all of them (including in the regression part) should be replaced with GBMs. I didn't notice any mention of logistic regression, which I still heavily use as well.
Dimensionality reduction is a step in any pipeline, and the right choice depends on the number of observations, the features, and the model. My favorites are L1 regularization and PCA, but I am not afraid to use stepwise regression or some tree-based method.
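For concreteness, the two I reach for first look roughly like this in scikit-learn (a minimal sketch; the component count and regularization strength are placeholders you'd tune):

    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # L1 penalty: feature selection happens inside the model itself
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)

    # PCA: explicit reduction step in front of whatever model comes next
    pca_model = make_pipeline(
        StandardScaler(),
        PCA(n_components=50),        # placeholder; pick by explained variance or CV
        LogisticRegression(max_iter=1000),
    )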
For most businesses a GPU is not really expensive, particularly if the results are important to the business.
Chasing a single metric is certainly too narrow, but getting the best results often does matter in a professional context too. While Kaggle can go overboard on massive ensembles, using a state-of-the-art approach to the problem is often warranted outside of Kaggle.
Electronic medical records. Anything in healthcare/medical, financial services, legal. There's a long way from the "worst text classification" to "sufficient text classification" in most real world use cases. With a reasonable budget you can relabel data and work around subjectivity.
Nobody cares how long it takes to train a model. What matters is prediction speed, which is comparable (and NLP is less likely to require high-frequency serving, where a few extra milliseconds matter).
Besides that, the accuracy gains are not marginal anymore (BoW can't compete like it used to, especially with pre-trained models).
> Nobody cares how long it takes to train a model.
This isn't true. It depends on your priorities and goals. Machine learning that spends most of its time unable to learn is not real AI. Some of us are interested in sample and energy efficient learning capable of on-line incremental updates immune to catastrophic forgetting. Not just because this is truer to actual learning but because it moves away from being dependent on a handful of companies to do the actual training.
Anticipating some replies: no, transfer learning and meta-learning methods don't really avoid this. In the case of transfer learning, you still have high coupling to a handful of sources; the downsides of that are their own discussion. In addition, there are times when the ability to extract local relations is dulled by the dominant Wikipedia and Common Crawl representations. Meta-learning gets you fast updates, but you still cannot stray too far from the domains seen at training time.
> What matters is prediction speeds
I'm not a fan of bag-of-words models either, but a simple dot product is always going to be faster than many matrix multiplies and/or convolutions. The implementer should always try these as a baseline and decide whether the performance/accuracy trade-off is worth it for them.
Nobody in business cares if you are doing proper AI or dumb curve fitting. What matters is the complexity (engineering debt) and performance (accuracy, robustness).
Online learning, sample efficiency, and energy efficiency are unrelated to training times. Like I said: nobody cares if you ran Vowpal Wabbit for 1 hour or 100 hours, as long as you are not constantly babysitting it and calling that paid work (or have the unusual requirement of daily retraining while using an online model).
> simple dot product is always going to be faster than many matrix multiplies
If you care about this (because it is profitable), you rewrite in a lower-level language or predict on a cloud GPU (which will be at least comparable to a simple dot product, while buying you better model performance).
You've clarified your stance from "nobody" to "nobody in business". That's good, although I think that is an opinion based on your experiences. I suspect that businesses will care if researchers can make it easy to learn on premise on their small datasets while maintaining high accuracy. The ability to easily update and adapt under non-stationarity without having to retrain from scratch benefits everyone. The same is true of models that maintain uncertainty or that can explain decision outputs. Tracking uncertainty, robustness to changes, online updatability, and explainability are all related in that they are examples of things that become easier under causal modeling.
A parallel discussion we are having is whether the gain in accuracy is always worth the gain in complexity and loss in speed. It's something to decide on a case by case basis. It's basic hygiene to reach for the simplest model first.
> Nobody cares how long it takes to train a model.
LOTS of people care how long it takes to train a model. A few minutes, vs. a day, vs. a week, vs. a month? Yea, that matters.
Think about how long it takes to try out different hyperparameters or make other adjustments while conducting research...
If you're Google maybe you don't care as much because you can fire off a hundred different jobs at once, but if you're a resource-limited mere mortal, yea, that wait time adds up.
If you are building large-scale systems that take weeks or months to train, you are at a point where you shouldn't care about this. Throw more compute at the problem, it will pay for itself.
If we are talking days or hours: start parameter search on Friday and return best parameters on Monday.
Do research and iteration on heavily subsampled datasets.
If you are building models for yourself, or for Kaggle, you may care in as much as your laptop gets uncomfortably hot.
Time to train a model matters for applications where you want to have end users training models on their own computers without spending so much CPU/GPU time that they have to plan their day around it.
Consider for instance an RSS reader that classifies articles to determine whether or not to interrupt the user with a notification. This should be fast to train and update the model on the fly every time the user enters a correction (e.g. 'this article actually isn't interesting', or 'interrupt me with articles like this in the future'.)
I would not retrain such a model on all the data, just do online updates. Also, I still think that for that use case training times and latency are negligible (nobody cares about, or even notices, any difference between training a BoW model and a bi-LSTM).
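A minimal sketch of what I mean, assuming scikit-learn and made-up labels for the RSS example (a hashing vectorizer is stateless, so there's nothing to refit):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**18)   # stateless, so nothing to refit
    clf = SGDClassifier(loss="log_loss")               # logistic regression trained incrementally ("log" in older scikit-learn)
    CLASSES = ["interrupt", "ignore"]                  # made-up labels for the RSS example

    def update(article_text, label):
        X = vectorizer.transform([article_text])
        clf.partial_fit(X, [label], classes=CLASSES)   # single-example online update, no full retrain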
If you are deploying on resource-constrained devices (i.e., low-end PCs without a GPU), it is not unusual to spend a lot of time training a model on a very powerful computer (which nobody cares about), then distilling or transferring the result for test time.
No. Resources are not infinite, and we were already on the edge of what the resources at most sites where training would be done could be expected to have.
Thanks. I think you are a correct exception to what I said. I should have known that using words like "nobody" would not go over well on HN (but tedious to type "a very large percentage"), despite that statement being a verbatim quote from one of the world's leading ML engineers and, to me, not controversial.
I do consider the cloud both widely available and nearly infinite in its ability to add resources.
If it is really not economically feasible to add resources, then the performance gains were not as promising as thought (whether cloud or on-site).
> Thanks. I think you are a correct exception to what I said. I should have known that using words like "nobody" would not go over well on HN (but tedious to type "a very large percentage")
1) The ML experts in the field have pretty much all settled on the need for a uniform method of training models, but with each model needing to be trained on-site.
2) While the cloud might be near infinite in terms of adding capacity, "Hey guys, let's spin up some health-data-compliant AWS instances for a side project we're not even sure will work" in what is always a cash-starved part of healthcare is... well... a pretty big ask.
> Nobody cares how long it takes to train a model.
That's a reckless generalization. I care.
My thesis would take forever if I didn't do any optimization. Also my data is 20 rows with ~6000 predictors.
There are models out there that can take months! I worked on one that took months. We had to tweak and optimize it to see if we could get it down to an acceptable training time.
> "Nobody cares how long it takes to train a model."
In some Kaggle competitions it takes over 7 hours to train a model, and I can generally think of 10 things a day to try. Prediction only takes about a minute.
> "especially with pre-trained models"
If the corpora are different, pre-trained models do not help much, if they don't outright hurt.
Do you know of a good way to combine fasttext with non-text features?
Let's say I know seasonality is a strong feature in classifying my text, how can I add this? With a BOW I can literally just add SEASON_AUTUMN as a word to the text, and I'll get that extra feature as a dummy variable in my feature vector.
But for fasttext, if I add such a word, it will just be averaged out in the final document feature vector.
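The best I've come up with is to drop down to the document vector myself and bolt the extra features on before a downstream classifier. A rough sketch (the model path and the texts/seasons/labels variables are made up):

    import numpy as np
    import fasttext
    from sklearn.linear_model import LogisticRegression

    ft = fasttext.load_model("my_model.bin")   # made-up path to a trained fastText model

    def featurize(text, season_onehot):
        doc_vec = ft.get_sentence_vector(text)             # averaged word/char-ngram vector
        return np.concatenate([doc_vec, season_onehot])    # append the non-text features

    # texts, seasons (one-hot arrays), labels: assumed parallel lists
    X = np.vstack([featurize(t, s) for t, s in zip(texts, seasons)])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)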
Not sure why anyone would use 2D CNNs for processing text when there is no spatial correlation across the embedding dimensions. Recent work such as https://arxiv.org/abs/1803.01271 shows that for most tasks, 1D CNNs outperform recurrent architectures while being faster to train.
That blog used a 2D CNN because TensorFlow didn't have a 1D version at the time of writing, so he just created a dummy second dimension of length 1 and called it a day.
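These days the 1D version is only a few lines in Keras. A rough sketch with arbitrary sizes (not the setup from that blog post):

    from tensorflow.keras import layers, models

    VOCAB, EMB_DIM, N_CLASSES = 20000, 100, 5   # arbitrary sizes

    model = models.Sequential([
        layers.Embedding(VOCAB, EMB_DIM),
        layers.Conv1D(128, 5, activation="relu"),   # convolve over the token axis only
        layers.GlobalMaxPooling1D(),
        layers.Dense(N_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])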
We have done extensive testing in the context of chatbot intent classification, and in our particular problem nothing (including CNNs, LSTMs, and fasttext, plus LUIS, Watson, and other proprietary classifiers) has been able to beat a simple linear model trained on char n-gram features.
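For reference, the kind of pipeline I mean is tiny. A minimal sketch (the vectorizer settings are illustrative, and utterances/intents stand in for our training data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # utterances, intents: parallel lists of user messages and intent labels (placeholders)
    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), sublinear_tf=True),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(utterances, intents)
    print(clf.predict(["what's the weather tomorrow"]))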
Chatbot intent would be a good use case for a linear model, as a single word/n-gram has a high impact on the result (in contrast to advanced architectures, which try to account for ambiguity/contradictions in documents).
I've seen the same things in the models I've built. For basic intent classification simpler models seem to be more accurate, not to mention they train faster and require less memory. There seems to be a lot of emphasis on shiny complex neural network architectures, even when simple models work just fine.
> There seems to be a lot of emphasis on shiny complex neural network architectures, even when simple models work just fine.
It's resume-driven-development for data scientists.
I've never seen an interviewer impressed with the fact that a job was performed using not-deep learning, but say that you used deep learning (despite how spurious it might be) and they light up like it's Christmas.
This isn't that surprising. I think the reason for this is that, even though the model is linear, the space of n-grams is so large that there usually is a line that separates any two classes.
I wonder how FastText (essentially word2vec + word & char n-grams + other stuff) stacks up against these algorithms.
In my own tests on my own corpora, CPU-based FastText is faster to train and produces significantly better results (precision/recall) than the GPU-bound CNN algorithms I've tried, but I have not compared it against RNN techniques.
I've found that some CNNs consistently beat fasttext in terms of model quality. But I've beaten those CNNs and fasttext by doing transfer learning with ULMFiT and fast.ai. If we're talking training speed, though, fasttext is indeed aptly named.
FastText is essentially a linear classifier and it's not surprising that it would train quickly. As for the prediction metrics, I imagine that will depend on the type of data and problem you're working with. Linear models perform very well on certain types of problems (I've had great success with SVMs on text classification problems personally) but for more complicated tasks, I imagine that the deep learning models would perform very, very well (relatively).
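For anyone who hasn't tried it, fastText's supervised mode is about this much code (a sketch; the file path and hyperparameters are placeholders, and the __label__ prefix is the library's expected training format):

    import fasttext

    # train.txt lines look like: "__label__billing how do I update my card"
    model = fasttext.train_supervised(input="train.txt", wordNgrams=2, epoch=25, lr=0.5)
    print(model.predict("I want to cancel my subscription"))   # (labels, probabilities)
    model.save_model("intent_model.bin")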
"The information you'd need to choose is included in there. If you're doing this professionally, you should strive to have enough of a high-level understanding of NLP to be able to make these decisions without having a rubric handed to you on a silver platter.
In a nutshell, though: Strive to use the simplest model that will get the job done. Less elaborate models are easier to understand and (usually) less prone to things like overfitting, so they'll be more tractable to work with in a business context. To that end: Use a convolutional net when you can get away with a small, fixed-size context window. Use an LSTM when you need long-term memory. Attention can be expensive, so you use it when you have cause to believe you can gain a lot by giving selective attention to features, and have both a lot of training data and a lot of computing resources.
It's also worth considering that you might be best off going with none of these options. Cool as deep learning is, I've personally never actually been able to justify using it in a professional setting. Simpler models such as logistic regression and decision trees have characteristics that are near-useless for getting you to the top of a Kaggle leaderboard, but can be indispensable when working on many real-world business problems"
This is unfortunately almost always true; you just have to try every possible combination of everything with lots of hyperparameters. Nothing makes any sense, it's total chaos, and we are wandering blind in the wastelands.
minimaxir's other comment was helpful by not trying to guess an outcome but by explaining things we do know about, like training/estimation cost and order of complexity, which were topics absent from the article.
I was really hoping to see a summary comparison of the performance(s) of the different models at the end, e.g. accuracy vs. complexity vs. execution time, etc.
Here's a summary from the end of each section...
1. TextCNN: "This kernel scored around 0.661 on the public leaderboard."
Kaggle is a pretty serious natural-selection environment for machine learning algorithms. Basically, if bag-of-words worked better, the contest winners would still use it.
I remember getting 95%-ish accuracy with BoW and the SVM circa 2004 when it came to questions like "is this paper about astrophysics or organic chemistry?"
In that case you have a distinct vocabulary for different topics and it is hard to beat BoW.
Sentiment analysis, on the other hand, is where BoW goes to die, since "not good" means something very different from "good", and even simple heuristics like treating "not X" as a term distinct from "X" give limited gains, because negation is expressed with constructions like "I don't believe that is good" and there is no k-word window that reliably catches negation, since there is no limit on how complex sentences can be.
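Concretely, the heuristic I mean looks something like this, and the toy example shows exactly where the fixed window falls over:

    NEGATORS = {"not", "no", "never", "don't", "isn't", "can't", "won't"}

    def mark_negation(tokens, window=3):
        # prefix up to `window` tokens after a negator with NOT_
        out, remaining = [], 0
        for tok in tokens:
            if tok.lower() in NEGATORS:
                out.append(tok)
                remaining = window
            elif remaining > 0:
                out.append("NOT_" + tok)
                remaining -= 1
            else:
                out.append(tok)
        return out

    print(mark_negation("i don't believe that is good".split()))
    # ['i', "don't", 'NOT_believe', 'NOT_that', 'NOT_is', 'good'] -- the window runs out before "good"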
There is also the question of "is the improvement between method A and method B worth it?" For instance, the Netflix Prize was much celebrated because some brilliant people busted their asses to go from 92% to 95% accuracy on movie recommendations. In the end the algorithm proved too complex for the value it created. (E.g., who would notice that they got 8 bad recommendations instead of 5 out of a hundred? Roughly a third of an extra bad recommendation out of 10?)
The real "Netflix optimization problem" is how to spend as little on acquiring content as possible while motivating people to keep their subscriptions and that is something Netflix will keep closer to their chest and not promote a public competition on. (eg. if it were valuable why would they let competitors know about it?)
It is important to unpack "worked better". In practice, I would trade a few points of precision/recall/your metric of choice for a simpler method.
You're assuming that they aren't trying: it's not too hard to try out bag-of-words and see what happens. But things like Attention and LSTMs are pretty good, but not without their costs.
I'm assuming the opposite: bag-of-words is the go-to baseline, but as another commenter pointed out, just separating instances based on vocabulary is not sufficient in today's sophisticated text classification problems.