> If you repeatedly re-measure for 95% confidence, you strongly select for random noise.

You can't use standard fixed-sample methods for early stopping. As you rightly point out, you get gibberish if you naively keep peeking at a growing data set and stop the first time it looks significant, because every extra look is another chance for noise to cross the threshold. Instead, you have to use statistical methods that explicitly adjust for the repeated looks, such as group-sequential designs. This does make early stopping trials more complicated to analyse than a trial with a pre-determined duration.
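To make that concrete, here is a rough simulation sketch, not anyone's production method. It assumes a null effect, unit-variance normal data, and five equally spaced looks. The function name and parameters are made up for illustration. The 2.413 cutoff is the commonly tabulated Pocock constant for five looks at a two-sided 5% level, which is just one of several possible adjustments.

```python
import random
from math import sqrt


def false_positive_rate(threshold, looks=5, batch=20, trials=5000, seed=0):
    """Estimate how often a trial with NO real effect gets declared
    'significant' when we test after every batch and stop at the first hit.

    Hypothetical helper for illustration: data are N(0, 1) with known
    variance, so the z statistic is sum / sqrt(n).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = 0.0
        n = 0
        for _ in range(looks):
            # Collect one more batch, then peek at the accumulated data.
            for _ in range(batch):
                total += rng.gauss(0.0, 1.0)
            n += batch
            z = total / sqrt(n)
            if abs(z) > threshold:
                hits += 1  # stopped early on pure noise
                break
    return hits / trials


# Naive peeking: reuse the fixed-sample 95% cutoff at every look.
naive = false_positive_rate(1.96)

# Adjusted peeking: one stricter cutoff at every look (Pocock-style,
# assuming five equally spaced looks and a two-sided 5% level).
pocock = false_positive_rate(2.413)

print(f"naive 1.96 cutoff:   {naive:.3f}")
print(f"Pocock 2.413 cutoff: {pocock:.3f}")
```

Under these assumptions the naive rule should come out well above the nominal 5%, while the adjusted cutoff should land close to it. The price is that each individual look needs stronger evidence before you are allowed to stop.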