
I was expecting this to be more about running inference in production, though the information in the article itself was interesting on its own.

There does seem to be a dearth of writing on the actual topic of deploying models as prediction APIs, however. I work on an open source ML deployment platform ( https://github.com/cortexlabs/cortex ), and the problems we spend the most time on, and that teams struggle with the most, don't seem to be written about very often, at least not in depth (e.g. how do you optimize inference costs? when should you use batch vs. realtime? how do you integrate retraining, validation, and deployment into a CI/CD pipeline for your ML service?).

Not taking away from the article, of course; it is well written and interesting imo.



There seems to be the idea that training an ML model is like compiling code, but every "compile" leaks information into the training pipeline. Repeated testing and choosing (unless it is on a fresh draw) is an optimization step: you are optimizing on the test set.
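
To make that concrete, here is a toy sketch (nothing from the article; all the numbers are made up): fifty candidate "models" that are all pure coin flips, with the winner chosen on the same reused test set. The winner's score looks well above chance, but on a fresh draw it falls back to roughly 0.5.

  # Toy illustration: every "model" is a coin flip (true accuracy 0.5), yet
  # picking the best score on a reused test set makes the winner look skilled.
  import numpy as np

  rng = np.random.default_rng(0)
  n_test, n_candidates = 200, 50

  y_test = rng.integers(0, 2, n_test)    # the test set you keep reusing
  y_fresh = rng.integers(0, 2, n_test)   # a fresh draw, never used for selection

  preds = [rng.integers(0, 2, n_test) for _ in range(n_candidates)]
  scores = [np.mean(p == y_test) for p in preds]

  best = int(np.argmax(scores))          # "choosing" is itself an optimization step
  print("winner on the reused test set:", scores[best])
  print("same winner on a fresh draw:  ", np.mean(preds[best] == y_fresh))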

Using a fresh draw is difficult and expensive, especially since the labels may not be available. A/B testing is expensive; multi-armed bandits are more efficient, but again there is an optimisation element there (waits for the shouting to start).
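
For what it's worth, a minimal epsilon-greedy sketch of the bandit idea (the variant names and reward rates are invented): traffic shifts toward whichever model variant is observed to perform better instead of being split 50/50 for the whole experiment, which is where the efficiency comes from, and also where the optimisation element sneaks in.

  # Hypothetical epsilon-greedy routing between two deployed model variants.
  import random

  class EpsilonGreedyRouter:
      def __init__(self, variants, epsilon=0.1):
          self.variants = list(variants)
          self.epsilon = epsilon
          self.counts = {v: 0 for v in self.variants}
          self.values = {v: 0.0 for v in self.variants}  # running mean reward per variant

      def choose(self):
          if random.random() < self.epsilon:             # explore occasionally
              return random.choice(self.variants)
          return max(self.variants, key=lambda v: self.values[v])  # exploit the best so far

      def record(self, variant, reward):
          self.counts[variant] += 1
          self.values[variant] += (reward - self.values[variant]) / self.counts[variant]

  router = EpsilonGreedyRouter(["model_a", "model_b"])
  for _ in range(1000):
      v = router.choose()
      # pretend model_b converts slightly better; both rates are made up
      reward = 1 if random.random() < (0.12 if v == "model_b" else 0.10) else 0
      router.record(v, reward)
  print(router.counts, router.values)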

Additionally, surely there is a really significant qualitative judgement step for any model that is going to be used to make real-world decisions?


You don’t typically perform optimizations iteratively with feedback from the final test set. Instead you split your training data into training and validation sets and iterate on those, leaving your true hold-out test set completely unexamined all along.

You would do model comparisons, quality checks, ablation studies, goodness of fit tests and so forth only using the training & validation portions.

Finally, you test the chosen models (in their fully optimized states) on the test set. If performance is not sufficient to solve the problem, then you do not deploy that solution. If you want to continue the work, you must now collect enough data to constitute, at minimum, an entirely new test set.
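
A minimal sklearn sketch of that protocol (the dataset and the two candidate models are just placeholders): everything iterative happens on train + validation, and the hold-out test set is scored exactly once at the end.

  # Placeholder sketch: iterate on train/validation, touch the test set once.
  from sklearn.datasets import load_breast_cancer
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X, y = load_breast_cancer(return_X_y=True)

  # 60% train, 20% validation, 20% hold-out test
  X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
  X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.25, random_state=0)

  candidates = {
      "logreg": LogisticRegression(max_iter=5000),
      "forest": RandomForestClassifier(random_state=0),
  }

  # Model comparison / selection uses only the train and validation portions.
  val_scores = {name: m.fit(X_train, y_train).score(X_val, y_val)
                for name, m in candidates.items()}
  best = max(val_scores, key=val_scores.get)

  # One final, unrepeated evaluation on the untouched test set.
  print(best, val_scores, candidates[best].score(X_test, y_test))

(In practice you'd often refit the winner on train + validation before that last step, but the key constraint is the same: the test set is consulted once.)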


I agree with the process you describe, but the traps are things like running a beauty contest (Kaggle?) of n models against the final test set...



