This is an important thing to be aware of, but I wouldn't take the numbers strictly at face value. Repeated peeks at the same running experiment are not independent of each other. Furthermore, once the underlying difference between A and B starts asserting itself statistically, it doesn't stop. And finally, a chance fluctuation in the opposite direction from an underlying difference has to be much larger to reach statistical significance than one in the same direction. These are massive complications that make the statistics very hard to calculate.
I addressed this in my 2008 tutorial on A/B testing at OSCON. What I did was run Monte Carlo simulations of an A/B test while continuously following the results, with various sets of parameters and with the test run to different confidence levels. In that model I peeked at every single data point. You can find the results starting at http://elem.com/~btilly/effective-ab-testing/#slide59. (See http://meyerweb.com/eric/tools/s5/features.html#controlchart for the keyboard shortcuts to navigate the slides.)
My advice? Wait until you have at least a certain minimum sample size before deciding, and only decide at that point with high certainty. Then, the longer the experiment runs, the lower the confidence you should be willing to accept. This procedure will let you stop most tests relatively fast while still avoiding significant mistakes.
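If you want to reproduce the flavor of those slides, here is a minimal sketch (not the original OSCON code, and the parameters are arbitrary) of an A/A simulation: both arms share the same true conversion rate, a naive 95% z-test is checked after every pair of visitors, and we count how often a "winner" gets declared anyway.

    import math
    import random

    def naive_peeking_false_positive_rate(p=0.05, n_visitors=2000, n_runs=500,
                                          z_threshold=1.96, seed=42):
        rng = random.Random(seed)
        false_positives = 0
        for _ in range(n_runs):
            a_succ = b_succ = 0
            for n in range(1, n_visitors + 1):
                a_succ += rng.random() < p   # arm A, same true rate as B
                b_succ += rng.random() < p   # arm B
                pooled = (a_succ + b_succ) / (2 * n)
                if not 0 < pooled < 1:
                    continue
                se = math.sqrt(2 * pooled * (1 - pooled) / n)
                if abs(a_succ - b_succ) / n > z_threshold * se:
                    false_positives += 1     # declared a "winner" on a fluke
                    break
        return false_positives / n_runs

    print(naive_peeking_false_positive_rate())   # comes out well above 0.05

The spurious-winner rate comes out well above the nominal 5%, which is exactly the peeking effect being discussed.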
Just wanted to let you know that that slideshow changed my life. It made the company I'm with (FreshBooks) truly shine while doing split tests, which made me look good. Anyways, thanks dude.
I'm usually using a G-test, which is known to be inaccurate unless A and B each have at least 10 failures and 10 successes. So that puts a minimum size right there.
After that it is all a question of how many trials you were prepared to run, and how aggressively you need the answer. I tend to push to be conservative because the product manager is guaranteed to push the other way. For instance if you've got less than 500 trials I like to push for 99.9% confidence because, "If the effect is really this strong, we'll get there pretty quickly." After that I ease off. I've never tried to sit down and calculate any kind of optimal way to do so.
If I were to try to formalize it, I'd be inclined to set things up so that for some underlying bias smaller than what we're hoping to measure (for instance a 2% bias), our odds of picking the wrong answer are the same each time we look. I don't have any analysis behind that idea; it is just something that sounds reasonable to me. If I were to head down that road, I'd probably run some Monte Carlo simulations to show that I'd be highly likely to wind up with test results in some vaguely reasonable time frame given whatever volume the website had. But again, I haven't tried to do that analysis.
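For concreteness, here is a rough sketch of that kind of check (not anyone's production code; the helper name and example numbers are placeholders). SciPy's chi2_contingency with lambda_="log-likelihood" computes a G-test, and the guard enforces the 10-successes-and-10-failures minimum mentioned above.

    from scipy.stats import chi2_contingency

    def g_test_p_value(a_successes, a_trials, b_successes, b_trials, min_count=10):
        table = [
            [a_successes, a_trials - a_successes],
            [b_successes, b_trials - b_successes],
        ]
        if min(min(row) for row in table) < min_count:
            return None   # too early: the G-test is unreliable below ~10 per cell
        g_stat, p_value, dof, _ = chi2_contingency(table, lambda_="log-likelihood")
        return p_value

    # Compare the returned p-value against whatever confidence level
    # fits your sample size (stricter while the test is small).
    print(g_test_p_value(48, 1000, 70, 1000))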
Came here to say this. G-tests are the way to go; with any other method you end up with problems. I usually go with 98 to 99% confidence because, really, the odds that the loser is much worse than the winner are small enough that it doesn't really matter, and I would rather iterate quickly.
It's a good article, and a good intro to the pitfalls of statistical interpretation, but I think it reaches the wrong conclusion. Yes, when one has a very limited data set and needs to draw a conclusion in a hurry, and one has full confidence that there are no confounding variables in one's experiment, then paying very close attention to small differences in p-values can make sense. But how often is this the case when testing a new logo or signup page?
I'm less mathematically sophisticated than the author, and would choose a simpler approach: ignore weak results. If one determines that there is a 95% chance that 51% of people prefer Logo A, either stick with what you have, go with the one you like, or keep searching for a better logo. If you can't see the effect in the raw data without rigorous mathematical analysis, it's probably not a change worth spending much time on.
Instead of adjusting your significance test for each 'peek', simply ignore anything less than 99.9% 'significant'. And while you are at it, ignore anything that's less than a 10% improvement, on the assumption that structural errors in your testing are likely to overwhelm any effects smaller than this. Drug trials and the front page of Google aside, if the effect is so small that it flips into and out of 'significance' each time you peek, it's probably not the answer you want.
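As a sketch, that filtering rule is only a few lines (the thresholds are the ones suggested above; the helper name is made up):

    def worth_acting_on(p_value, rate_a, rate_b,
                        max_p=0.001, min_relative_lift=0.10):
        # Only act when the result is both very "significant" and a large
        # enough improvement to swamp structural errors in the test setup.
        if rate_a == 0:
            return False
        lift = (rate_b - rate_a) / rate_a
        return p_value < max_p and abs(lift) >= min_relative_lift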
This is important enough of a usage note that I'm probably going to mention it in my software's documentation. I personally largely ignore this issue and think I'm probably safe doing so with my usual testing workflow, but it is an easy thing to burn yourself on if you sit and watch your dashboard all day.
Crikey, how did that fall off my list? I have a half-written analysis sitting in my home directory. It will probably take me a few days to finish the other half: I have a couple things ahead of it in the queue and this next application isn't going to write itself.
When I do stuff like this I purposefully ignore the results until the time I've set for the experiment has elapsed. It's very easy to fall prey to thinking you have a result that will not change in the longer term. Things like daily and weekly cycles, for instance, can really throw off your analysis.
The only danger is having a 'hidden variable' influence your results, where averaging over the longer term masks that influence. For example, if you are not geo-targeting your content, you could conclude after a long run of testing that a certain page performs better than another, only to throw away the averaged-out effect of having the different pages up at different times of the day, one of them performing significantly better for one audience and vice versa.
So you should keep all your data in order to figure out if such masking is happening and giving you results that are good but that could be even better.
This is an interesting issue, and I have seen users of my app (Visual Website Optimizer) complaining that their results were statistically significant a day before but now aren't. Justifiably, they expect significance to freeze in time once it has been achieved. However, as you say, significance is itself a random quantity, not necessarily monotonically increasing or decreasing.
The constraint here is not the math or the technology; rather, it is users' needs. They want data, reporting, and significance calculation to be done in real time. And even though we have a test duration calculator, I haven't seen any user actually make use of it. Plus many users will not even wait for statistical significance to be achieved.
Though in VWO we would love to wait until the end of an experiment to calculate significance, I'm sure the users wouldn't like that at all.
While I understand the merit of what this article is saying, I really want to caution against always requiring a strict high confidence level when making decisions in start-ups. Requiring a strict confidence level does make sense for a company like Zynga, which has a nearly limitless supply of users to run tests on, but for a start-up the value of being able to make a decision quickly often outweighs the value of being '95% confident'. And let's not forget the time wasted worrying about the details of all this math.
In my opinion, peek early and often, and when your gut tells you something is true, it probably is.
It is actually a mathematical fact that if at any point in your A/B test A is ahead of B, then based on that data alone there is at least a 50% probability that A is asymptotically better than B.
This article neither advocates a high nor low confidence level. The point is that your confidence level is meaningless if you don't fix the sample size.
"...when your gut tells you something is true, it probablly is."
If you run a data-driven business with a philosophy like that, you've rewound management science to about 1700 AD. Human "guts" aren't evolved for evaluating UX effectiveness from sparse data.
The first calculation that the author is setting up is a power calculation, which is a strong start. Based on your expectations about the effect size of the treatment (in this case, the difference between A and B) and your desired probability of correctly identifying a true difference, you can figure out how large a sample you need to see an effect. (That probability is the power, which equals 1 minus beta, where beta is the chance of missing a real effect.)
If you're going to take several peeks as you run your trial and you want to be particularly rigorous, consider alpha-spending functions. In medicine, alpha-spending functions are often used to take early looks at trial results. 'Alpha' is what you use to determine which p-values you will consider significant. To oversimplify a bit, early peeks (before you've got your full sample size) use very extreme alphas. If your trial ultimately uses an alpha of 0.05, a prespecified early look may use an alpha of 0.001. (There are principled ways of calculating meaningful alpha values; these are just examples drawn from a hat.)
By setting useful alphas and betas, you can benefit from true, potent treatment effects (if present) earlier than you might otherwise, without too much risk of identifying spurious associations.
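For anyone who wants to play with these ideas, here is a hedged sketch of both pieces: the standard sample-size formula for comparing two proportions, and the Lan-DeMets O'Brien-Fleming alpha-spending function, which is one common choice and produces exactly the kind of very strict early-look alphas described above. Function names and example numbers are illustrative only.

    from math import sqrt
    from scipy.stats import norm

    def sample_size_per_arm(p_base, p_treat, alpha=0.05, power=0.80):
        # Two-sided test of two proportions; power = 1 - beta.
        z_a = norm.ppf(1 - alpha / 2)
        z_b = norm.ppf(power)
        p_bar = (p_base + p_treat) / 2
        num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
               + z_b * sqrt(p_base * (1 - p_base) + p_treat * (1 - p_treat))) ** 2
        return num / (p_base - p_treat) ** 2

    def obrien_fleming_alpha_spent(t, alpha=0.05):
        # t = fraction of the planned sample seen so far, 0 < t <= 1.
        return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / sqrt(t)))

    print(sample_size_per_arm(0.050, 0.055))   # visitors needed per arm
    print(obrien_fleming_alpha_spent(0.25))    # tiny alpha at an early look
    print(obrien_fleming_alpha_spent(1.0))     # back to ~0.05 at the end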
Great point, do you have any recommendation for a paper on alpha spending functions? This looks interesting way to compensate for early peeking into significance.
Argh, having a hard time finding the paper I was thinking of. I thought it was in JAMA in 2009 but perhaps not. It had a lot of this information nicely graphed, but alas. I'll keep digging around and reply again if I find it.
Why are we still using p < 0.05 for web A/B testing? p < 0.05 made sense when each individual data point cost real money to generate: grad students interviewing participants or geologists making individual measurements. p < 0.05 was a good tradeoff between certainty and cost.
Now, in the world of the web, where measurement has an upfront cost but zero incremental cost, why not move to p < 0.001 or p < 0.0001? Sure, you need to increase the amount of data you're gathering by a factor of 2 or 3, but that's so much easier than delving into the epistemological complexities of p < 0.05.
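A quick back-of-the-envelope check of that "2 or 3" factor, assuming a simple two-sided z-test with a fixed effect size and ignoring the power term: the required sample scales roughly with the square of the normal quantile for the chosen threshold.

    from scipy.stats import norm

    def relative_sample_size(alpha_strict, alpha_loose=0.05):
        # Ratio of required sample sizes for the stricter vs. looser threshold.
        return (norm.ppf(1 - alpha_strict / 2) / norm.ppf(1 - alpha_loose / 2)) ** 2

    print(relative_sample_size(0.001))    # ~2.8x the data for p < 0.001
    print(relative_sample_size(0.0001))   # ~3.9x for p < 0.0001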
While interesting, I think that this is more of a mathematical proof for something that people doing any sort of testing should remember:
Don't stop before the test is complete, just because you've gotten an answer.
I generally leave my A/B tests up well after I've gotten a significance report, mostly because I'm lazy but also because I know that given enough time and enough entries, the significance reports can change.
Especially in the multi-variate tests that Evan wrote about, getting one result flagged as significant doesn't preclude other variants from also being significant.
That would be a solution to using pairwise statistics to come up with an answer to an A/B/C/D test. That is not a solution to the challenge of evaluating an A/B test at multiple time points.
Interesting that frequentist A/B software packages let you essentially break the test without telling you. Are there Bayesian A/B testers that give you a likelihood ratio instead?
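For reference, the usual Bayesian summary is easy to sketch yourself (this is a generic example, not any particular product): put Beta(1, 1) priors on each arm's conversion rate and report the posterior probability that B beats A, which is a posterior probability rather than a strict likelihood ratio. Whether this fully sidesteps the peeking problem is itself debated, but it is the kind of number a Bayesian dashboard shows.

    import random

    def prob_b_beats_a(a_succ, a_trials, b_succ, b_trials,
                       n_samples=50_000, seed=0):
        # Monte Carlo estimate of P(rate_B > rate_A | data) under Beta(1, 1) priors.
        rng = random.Random(seed)
        wins = 0
        for _ in range(n_samples):
            pa = rng.betavariate(1 + a_succ, 1 + a_trials - a_succ)
            pb = rng.betavariate(1 + b_succ, 1 + b_trials - b_succ)
            wins += pb > pa
        return wins / n_samples

    print(prob_b_beats_a(48, 1000, 70, 1000))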