Select a training set, leaving out one sample for validation. For each feature, train a classifier on the training set using that feature alone, and keep the feature that gives the highest discrimination score on the training set. Repeat, adding one feature at a time. Then evaluate the final classifier on the validation sample, which has not been seen in any of the preceding steps. The result provides an estimate of the risk on unseen data from the same distribution.
To bring the estimation variance down, you can repeat this for every possible choice of validation sample. That means you restart the feature selection process from scratch on each new training set and obtain another risk estimate. If you kept the features selected earlier, that estimate would be "contaminated" and not independent, but if you correctly start over, the procedure is valid.
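For concreteness, here is a minimal sketch of that procedure in Python with scikit-learn, assuming a logistic regression classifier, training-set accuracy as the "discrimination score", and a cap of three selected features; the helper names (`select_features`, `loo_risk_estimate`) and those choices are mine, not part of the original description.

```python
# Minimal sketch: leave-one-out risk estimation with greedy forward feature
# selection redone from scratch inside every fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

def select_features(X_train, y_train, n_keep=3):
    """Greedy forward selection using only the training fold."""
    selected, remaining = [], list(range(X_train.shape[1]))
    while remaining and len(selected) < n_keep:
        scores = {}
        for j in remaining:
            cols = selected + [j]
            clf = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
            scores[j] = clf.score(X_train[:, cols], y_train)  # training-set score
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

def loo_risk_estimate(X, y):
    """Leave-one-out estimate; feature selection is restarted in each fold."""
    errors = []
    for train_idx, val_idx in LeaveOneOut().split(X):
        cols = select_features(X[train_idx], y[train_idx])        # from scratch
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx][:, cols], y[train_idx])
        errors.append(clf.predict(X[val_idx][:, cols])[0] != y[val_idx][0])
    return np.mean(errors)   # estimated risk on unseen data
```

The important point is that `select_features` sees only the training fold, so the held-out sample never influences which features are kept.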
My understanding is that you are saying to create N (N=34 in this case) different parallel models that may use different features, etc. Then take the average (or whatever summary statistic) of their accuracies to get the predictive skill.
When we want to use these models, we run new/test data through all N=34 models in parallel and calculate a prediction from each. Then these predictions somehow need to be combined (once again an average, etc.). This is an average of the predictions, not of the accuracies.
Where was the step that combines these predictions present during training? It seems your scheme necessarily calculates an accuracy based on a different process than the one that would be applied to new data.
No, when you want to classify a new sample, you take a model trained on the complete labeled data you have and use its prediction. The validation procedure using those 34 models trained on subsets of the data just tells you how accurate you should expect that result to be. Afterwards, you can throw those models away.
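A small sketch of that split of responsibilities, reusing the hypothetical `select_features` and `loo_risk_estimate` helpers from the earlier snippet; the toy data is purely illustrative, and rerunning the feature selection on the full data for the final model is just one reasonable choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the 34 labeled samples (purely illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(34, 10))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)

# 1) The LOO procedure only yields the accuracy to report.
expected_accuracy = 1.0 - loo_risk_estimate(X, y)

# 2) The model used for new samples is trained once on ALL labeled data;
#    the 34 fold models can be thrown away afterwards.
cols = select_features(X, y)
final_model = LogisticRegression(max_iter=1000).fit(X[:, cols], y)

X_new = rng.normal(size=(1, 10))                 # a fresh, unlabeled sample
prediction = final_model.predict(X_new[:, cols])
```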
Of course you could build an ensemble model, but if you want to know the expected accuracy of doing that, you need to include the ensemble building in your validation procedure. (Or use some theorem that lets you estimate the ensemble performance from that of the individual models, if that is possible.)
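If you did want to validate an ensemble, the same pattern applies: build the ensemble inside each fold and score it on the held-out sample. A sketch, using scikit-learn's `BaggingClassifier` merely as a convenient stand-in for whatever ensemble you would actually construct:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

def loo_ensemble_risk(X, y):
    """LOO risk estimate where the ensemble is rebuilt inside every fold."""
    errors = []
    for train_idx, val_idx in LeaveOneOut().split(X):
        # The ensemble is constructed only from the fold's training data.
        ensemble = BaggingClassifier(
            LogisticRegression(max_iter=1000),   # bagging is just an example
            n_estimators=25,
            random_state=0,
        ).fit(X[train_idx], y[train_idx])
        errors.append(ensemble.predict(X[val_idx])[0] != y[val_idx][0])
    return np.mean(errors)
```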
I see. So usually what you would do is run the CV a bunch of times to test various features/hyperparameters, knowing that this will overfit to the data used for the CV.
After deciding on features/hyperparameters (based on the overfit CV), you train the model on all the data used for the CV at once. Then you test the resulting model on a holdout set that was not used for the CV. The accuracy on that holdout would then be the accuracy to report (see the sketch at the end of this post).
This sounds much like what you are describing, except that you only do one CV and do not use it to decide anything; the CV is only there to give an estimate of accuracy.
Is that correct? It does seem to legitimately avoid leakage. However, it seems impossible that anything close to an optimal feature generation process or optimal hyperparameters would be known beforehand. Do you just use defaults here?
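Here is a rough sketch of the scheme described above, using scikit-learn's `GridSearchCV` and an arbitrary hyperparameter grid as stand-ins for whatever feature/hyperparameter search is actually run; the toy data is only there to make the snippet runnable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Holdout split made before any model selection.
X_dev, X_hold, y_dev, y_hold = train_test_split(X, y, test_size=0.25, random_state=0)

# CV on the development data only, used to choose hyperparameters
# (accepting that this overfits the development data).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},   # illustrative grid
    cv=5,
).fit(X_dev, y_dev)

# GridSearchCV refits the best configuration on all development data;
# the single holdout evaluation is the accuracy to report.
reported_accuracy = search.score(X_hold, y_hold)
```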