
>"when you want to classify a new sample, you take a model trained on the complete labeled data you have and use the prediction of that."

Using which set of features? You have 34 different models with different features...

You run the whole training process on the complete data, including feature selection.
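
For concreteness, a minimal sketch (assuming a scikit-learn-style setup; the data and step names are illustrative, not from the thread). Because feature selection is a step in the pipeline, refitting on the complete data reruns the selection as well:

    # Feature selection is a pipeline step, so fitting on the complete
    # data reruns the entire training process, selection included.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=500, n_features=50, random_state=0)

    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=10)),   # feature selection
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    pipe.fit(X, y)  # whole training process on the complete data
    # pipe.predict(X_new) is the single model used for new samples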


I see. So usually what you would do is run the CV a bunch of times to test various features/hyperparameters, knowing this will overfit to the data used for the CV.

After deciding on features/hyperparameters (based on the overfit CV), you train the model on all the data used for the CV at once. Then you test the resulting model on a holdout set that was not used for the CV. The accuracy on that holdout is the accuracy to report.
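
A sketch of that workflow (again scikit-learn, with an illustrative estimator and grid):

    # 1) carve off a holdout the CV never sees
    # 2) tune hyperparameters with CV on the rest (this overfits the CV data)
    # 3) refit the best model on all CV data, report the holdout score
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
    X_cv, X_hold, y_cv, y_hold = train_test_split(
        X, y, test_size=0.2, random_state=0)

    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
    search.fit(X_cv, y_cv)   # refit=True: best model retrained on all CV data

    print(search.best_score_)            # optimistic: was used for selection
    print(search.score(X_hold, y_hold))  # the accuracy to report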

This sounds much like what you are describing, except that you only do one CV and do not use it to decide anything; the CV is only there to give an estimate of accuracy.
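
In code, that single estimation-only CV might look like the following (illustrative names as before). Feature selection sits inside the pipeline, so it is rerun within every fold, which is what avoids the leakage:

    # One CV pass, used only to estimate accuracy; nothing is selected
    # from these scores. Feature selection runs inside each fold.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=500, n_features=50, random_state=0)
    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=10)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

    scores = cross_val_score(pipe, X, y, cv=5)
    print(scores.mean())  # the reported estimate; the deployed model is
                          # pipe.fit(X, y) on the complete data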

Is that correct? It does seem to legitimately avoid leakage. However, it seems implausible that anything close to an optimal feature-generation process or set of hyperparameters was known beforehand. Do you just use defaults here?
