AutoML Solutions: What I Like and Don’t Like About AutoML as a Data Scientist (alexandruburlacu.github.io)
73 points by aburla on Sept 19, 2022 | 10 comments



> So, you’re working on a new ML project. The first thing you do, model-wise – you implement a simple heuristic baseline and see where you stand. Second, you try a simple ML solution and analyze how much it improves the baseline. One thing you can try to do after this stage, at least what I like to do, is to try to estimate what would be your upper bound in terms of predictive performance, and let an AutoML solution squeeze the most out of your data and preprocessing.

100% agreed! And yet, most data scientists skip right over the first two steps described here because they are not "sexy" enough and not "valuable".
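For anyone following along, those first two steps are only a few lines each. A minimal sketch, assuming scikit-learn and a toy dataset (my choice of tools, not something from the article):

    from sklearn.datasets import load_breast_cancer
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Step 1: heuristic baseline -- always predict the majority class.
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    print("baseline:", accuracy_score(y_test, baseline.predict(X_test)))

    # Step 2: a simple ML model -- how much does it improve on the baseline?
    simple = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print("simple model:", accuracy_score(y_test, simple.predict(X_test)))

Everything after that, AutoML included, should justify itself against those two numbers.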


I find having a simple baseline, and ideally a human baseline, extra important for a real-world project. I once had the experience of working on a complicated project without any human baseline, even when one was requested. It was like walking through a dark forest without any light. A lot of bruises.


I like your "dark forest" analogy! With students I used to talk about "fog" or simply "doing things blindly", but the "dark forest" image makes everyone want to get out of there ASAP!


> estimate what would be your upper bound in terms of predictive performance

Could you share how you approach this step of estimating predictive performance?


Basically, I assume a good AutoML solution, given enough time and compute, can come up with a very good model, much better than what I could build in a few weeks, assuming no feature engineering. This way I can focus more on what might be missing, and I don't have to keep thinking "what if I used more estimators in that random forest" and so on.

There's no formal "estimation".
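For concreteness, here's roughly what that upper-bound run looks like; a sketch using H2O AutoML as an example framework (the file path and target column are placeholders, and H2O is my pick for illustration, not necessarily what the parent uses):

    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file("train.csv")  # placeholder path
    target = "label"                      # placeholder target column

    # Fix a compute budget and let the search run; the best leaderboard
    # score acts as a rough ceiling for the current features.
    aml = H2OAutoML(max_runtime_secs=3600, seed=1)
    aml.train(y=target, training_frame=train)
    print(aml.leaderboard.head())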


I’ve used AutoKeras (one of the frameworks he mentions in the article) and, as a machine learning amateur, I found it super helpful. It still took a lot of work to optimize for my use case, but it was nice to get decent results nearly “out of the box.”
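For anyone curious what "out of the box" means here, it really is just a few lines. A minimal sketch on MNIST (an illustrative dataset and trial budget, not my actual use case):

    import autokeras as ak
    from tensorflow.keras.datasets import mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # AutoKeras searches over architectures within the trial budget.
    clf = ak.ImageClassifier(max_trials=3, overwrite=True)
    clf.fit(x_train, y_train, epochs=5)
    print("test accuracy:", clf.evaluate(x_test, y_test))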


> the correct way to think of AutoML is as an enabler that lets you focus more on the data side of things

Yes! This is exactly how we are starting to use it and it makes a ton of sense.

It also enables very rapid POCs just to flesh out a problem space; a data scientist can then come in later and refine.


Happy to sell the automl.io domain to anyone interested :)


What about data? Does using AutoML increase the amount of data needed, because it requires more complex cross-validation setups, or can you get away with roughly the same split?

I ask because I find that the most domain knowledge is needed where less labeled data is available, and that's also where I'd assume AutoML doesn't perform as well.
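To make the question concrete, by "more complex cross-validation setups" I mean something like nested CV; a rough sketch with a scikit-learn grid search standing in for the AutoML search (the parameter grid is hypothetical):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Inner loop: hyperparameter search (a stand-in for an AutoML search).
    inner = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [50, 200], "max_depth": [None, 10]},
        cv=3,
    )

    # Outer loop: unbiased estimate of the whole search procedure.
    scores = cross_val_score(inner, X, y, cv=5)
    print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

The outer folds eat data, which is why I wonder whether AutoML effectively raises the data requirement.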


Some, if not all, AutoML solutions will require a minimum amount of data to try certain methods.

But you are entirely correct. The last commercial use of AutoML I saw demoed had very limited data, and while the metrics were OK, I could have made equally good predictions using linear regression in Excel, or even just a calculator and some basic heuristics.

That's not to say AutoML is bad (I have used H2O with success); it just replaces only a small part of the data science pipeline.



