> So, you’re working on a new ML project. The first thing you do, model-wise, is implement a simple heuristic baseline and see where you stand. Second, you try a simple ML solution and analyze how much it improves on the baseline. One thing you can do after this stage, at least what I like to do, is estimate your upper bound in terms of predictive performance by letting an AutoML solution squeeze the most out of your data and preprocessing.
100% agreed! And yet, most data scientists skip right over the first two steps described here because they are not "sexy" enough and not "valuable".
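For anyone who wants to see how little it takes, here's a minimal sketch of those first two steps with scikit-learn; the dataset and model choices below are just placeholders:

```python
# Step 1: heuristic baseline. Step 2: simple ML model.
# Dataset and models are placeholders, not a recommendation.
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: heuristic baseline -- always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline:", accuracy_score(y_test, baseline.predict(X_test)))

# Step 2: a simple ML solution -- how much does it beat the baseline by?
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print("simple model:", accuracy_score(y_test, model.predict(X_test)))
```

Ten minutes of work, and now every fancier model has a number to beat.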
I find having a simple baseline, and ideally a human baseline, especially important for a real-world project. I once had the experience of working on a complicated project without any human baseline, even after requesting one. It was like walking through a dark forest without any light. A lot of bruises.
I like your "dark forest" analogy! With students I used to talk about "fog" or simply "doing things blindly", but the "dark forest" image makes everyone want to get out of there ASAP!
Basically, I assume a good AutoML solution, given enough time and compute, can come up with a very good model, much better than what I could build in a few weeks, assuming no feature engineering. That way I can focus on what might be missing, and I don't have to keep wondering "what if I used more estimators in that random forest?" and so on.
I’ve used AutoKeras (one of the frameworks he mentions in the article) and, as a machine learning amateur, found it super helpful. It still took a lot of work to optimize for my use case, but it was nice to get decent results nearly “out of the box.”
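For anyone curious, here's roughly what getting started looks like; the dataset, trial count, and epochs below are placeholders, not my actual setup:

```python
# Rough sketch of AutoKeras on tabular data.
# Dataset, max_trials, and epochs are placeholder values.
import autokeras as ak
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# AutoKeras searches over architectures and hyperparameters for you;
# max_trials bounds how many candidate models it tries.
clf = ak.StructuredDataClassifier(max_trials=5, overwrite=True)
clf.fit(X_train, y_train, epochs=10)
print(clf.evaluate(X_test, y_test))
```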
What about data? Does using AutoML increase the amount of data needed, due to needing more complex cross-validation setups, or can you get away with roughly the same split?
I ask because I find that domain knowledge matters most when less labeled data is available, and that's also where I'd assume AutoML doesn't perform as well.
Some, if not all, AutoML solutions will require a minimum amount of data to try certain methods.
But you are entirely correct. The last commercial use of AutoML I saw demoed had very limited data, and while the metrics were OK, I could have made predictions just as good using linear regression in Excel, or even just a calculator and some basic heuristics.
That's not to say AutoML is bad; I have used H2O with success. It just replaces a small part of the data science pipeline.
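For reference, a typical H2O AutoML run looks roughly like this; the file path, target column, runtime budget, and fold count are placeholders. The `nfolds` setting is also where the cross-validation question above bites: each candidate model is cross-validated k times, so every fold needs enough rows to be meaningful.

```python
# Rough sketch of an H2O AutoML run; path, column name, and
# budgets are placeholders, not from an actual project.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
df = h2o.import_file("train.csv")  # placeholder path
train, test = df.split_frame(ratios=[0.8], seed=1)

# nfolds controls the cross-validation used for each candidate model;
# with very little data, 5 folds may leave too few rows per fold.
aml = H2OAutoML(max_runtime_secs=600, nfolds=5, seed=1)
aml.train(y="target", training_frame=train)  # "target" is a placeholder column

print(aml.leaderboard.head())
print(aml.leader.model_performance(test))
```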