This is an important thing to be aware of, but I wouldn't take the numbers strictly at face value. Repeated peeks at the same running experiment are not independent of each other. Furthermore, once the underlying difference between A and B starts asserting itself statistically, it doesn't stop. And finally, a chance fluctuation in the opposite direction from the underlying difference has to be much larger to reach statistical significance than one in the same direction. These are massive complications that make the statistics very hard to calculate.
I addressed this in my 2008 tutorial on A/B testing at OSCON. What I did was run Monte Carlo simulations of an A/B test whose results were followed continuously, across various sets of parameters and various confidence levels. In that model I peeked at every single data point. You can find the results starting at http://elem.com/~btilly/effective-ab-testing/#slide59. (See http://meyerweb.com/eric/tools/s5/features.html#controlchart for the keyboard shortcuts to navigate the slides.)
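A stripped-down sketch of the same idea (not the code behind those slides, and it assumes a simple two-proportion z-test as the significance check) looks roughly like this. Both arms share the same true conversion rate, so every "winner" it declares is a false positive, and peeking after every visitor pushes the false positive rate far above the nominal 5%:

    import math
    import random

    def z_test_p_value(conv_a, n_a, conv_b, n_b):
        # Two-sided two-proportion z-test, approximate p-value.
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        if se == 0:
            return 1.0
        z = abs(conv_a / n_a - conv_b / n_b) / se
        return math.erfc(z / math.sqrt(2))

    def peeking_false_positive_rate(true_rate=0.05, max_n=10_000,
                                    alpha=0.05, runs=500):
        # Fraction of A/A experiments that ever look "significant"
        # when we peek after every single visitor.
        hits = 0
        for _ in range(runs):
            conv_a = conv_b = 0
            for n in range(1, max_n + 1):
                conv_a += random.random() < true_rate
                conv_b += random.random() < true_rate
                if n >= 100 and z_test_p_value(conv_a, n, conv_b, n) < alpha:
                    hits += 1
                    break
        return hits / runs

    print(peeking_false_positive_rate())  # far above the nominal 5%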
My advice? Wait until you have at least a certain minimum sample size before deciding. At that point, only decide with high certainty. Then, the longer the experiment runs, the lower the confidence you should be willing to accept. This procedure lets you stop most tests relatively fast while still avoiding serious mistakes.
Just wanted to let you know that that slideshow changed my life. It made the company I'm with (FreshBooks) truly shine while doing split tests, which made me look good. Anyways, thanks dude.
I usually use a G-test, which is known to be inaccurate unless A and B each have at least 10 failures and 10 successes. So that sets a minimum sample size right there.
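For reference, a G-test on a 2x2 table of successes and failures is simple enough to do with just the standard library. This is my own rough sketch, not anyone's production code, with the 10-successes-and-10-failures rule of thumb as a guard:

    import math

    def g_test(success_a, fail_a, success_b, fail_b):
        # Returns (G statistic, approximate p-value) for a 2x2 table, 1 df.
        observed = [success_a, fail_a, success_b, fail_b]
        if min(observed) < 10:
            raise ValueError("G-test is unreliable with fewer than 10 in any cell")
        n_a, n_b = success_a + fail_a, success_b + fail_b
        p_success = (success_a + success_b) / (n_a + n_b)
        expected = [n_a * p_success, n_a * (1 - p_success),
                    n_b * p_success, n_b * (1 - p_success)]
        g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
        # Chi-square survival function with 1 degree of freedom.
        p_value = math.erfc(math.sqrt(max(g, 0.0) / 2))
        return g, p_value

    g, p = g_test(success_a=100, fail_a=900, success_b=140, fail_b=860)
    print(f"G = {g:.2f}, p = {p:.4f}")  # significant at 99%, not at 99.9%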
After that it is all a question of how many trials you were prepared to run and how urgently you need the answer. I tend to push to be conservative because the product manager is guaranteed to push the other way. For instance, if you've got fewer than 500 trials I like to push for 99.9% confidence because, "If the effect is really this strong, we'll get there pretty quickly." After that I ease off. I've never tried to sit down and calculate any kind of optimal way to do so.
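To make the "ease off" part concrete, a sliding schedule could look something like this. The 99.9% bar below 500 trials is the figure above; the later breakpoints are placeholders, not a calibrated recommendation:

    def required_confidence(trials_per_arm):
        # Confidence level to demand before calling the test at this size.
        # The 99.9% figure matches the comment above; the later
        # breakpoints are made-up placeholders.
        if trials_per_arm < 500:
            return 0.999   # only stop this early for very strong effects
        elif trials_per_arm < 5_000:
            return 0.99
        elif trials_per_arm < 50_000:
            return 0.95
        return 0.90        # tests that drag on get a lower bar

    def can_call_winner(trials_per_arm, p_value):
        # Call the test only once the p-value clears the sliding bar.
        return p_value < 1 - required_confidence(trials_per_arm)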
If I were to try to formalize it, I'd be inclined to set things up so that, for some underlying bias smaller than what we're hoping to measure (for instance a 2% bias), our odds of picking the wrong answer are the same each time. I don't have any analysis behind that idea; it is just something that sounds reasonable to me. If I were to head down that road, I'd probably run some Monte Carlo simulations to show that I'd be highly likely to wind up with test results in some vaguely reasonable time frame at whatever volume the website had. But again, I haven't tried to do that analysis.
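A back-of-the-envelope version of that Monte Carlo check might look like the sketch below. It gives B a small true edge (a 2% relative lift, purely for illustration), follows the test with a sliding confidence bar like the one above, and reports how often the wrong winner gets called and how long tests take. It reuses the same two-proportion z-test approximation as the earlier peeking sketch, which is a stand-in rather than the test anyone necessarily uses in practice:

    import math
    import random

    def z_test_p_value(conv_a, n_a, conv_b, n_b):
        # Two-sided two-proportion z-test, approximate p-value.
        p_pool = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        if se == 0:
            return 1.0
        return math.erfc(abs(conv_a / n_a - conv_b / n_b) / se / math.sqrt(2))

    def simulate(base_rate=0.05, relative_lift=0.02, max_n=20_000,
                 peek_every=100, runs=200):
        # B truly converts at base_rate * (1 + relative_lift). Count how
        # often the procedure calls A the winner (the wrong answer) and
        # how long tests run before stopping. Sizes are kept small so the
        # pure-Python loop finishes quickly.
        wrong, lengths = 0, []
        for _ in range(runs):
            conv_a = conv_b = 0
            for n in range(1, max_n + 1):
                conv_a += random.random() < base_rate
                conv_b += random.random() < base_rate * (1 + relative_lift)
                if n % peek_every == 0 and min(conv_a, conv_b) >= 10:
                    # Sliding confidence bar: 99.9%, then 99%, then 95%.
                    alpha = 0.001 if n < 500 else 0.01 if n < 5_000 else 0.05
                    if z_test_p_value(conv_a, n, conv_b, n) < alpha:
                        if conv_a > conv_b:   # called A, but B is truly better
                            wrong += 1
                        lengths.append(n)
                        break
            else:
                lengths.append(max_n)         # inconclusive at the cap
        return wrong / runs, sum(lengths) / len(lengths)

    print(simulate())  # (wrong-call rate, average stopping time)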
Came here to say this. G-tests are the way to go; with any other method you end up with problems. I usually go with 98 to 99% confidence because, really, the odds of the loser being much worse than the winner are small enough that it doesn't really matter, and I would rather iterate quickly.