[dupe] On the pitfalls of A/B testing (stavros.io)
24 points by loarake on July 14, 2013 | hide | past | favorite | 6 comments


tl;dr: Don't bother with confidence intervals. Use a G-test instead.

Calculate it here: http://elem.com/~btilly/effective-ab-testing/g-test-calculat...

Read more here: http://en.wikipedia.org/wiki/G-test

And plain English here: http://en.wikipedia.org/wiki/Likelihood_ratio_test
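For anyone who wants to see what that G-test looks like in practice, here is a minimal sketch for a 2x2 A/B table (conversions vs. non-conversions per variant) in Python, standard library only. The counts are made-up example numbers; 3.841 is the chi-square critical value for 1 degree of freedom at the 5% level.

    from math import log

    def g_test(conv_a, total_a, conv_b, total_b):
        # Observed cell counts: conversions and non-conversions for A and B.
        observed = [conv_a, total_a - conv_a, conv_b, total_b - conv_b]
        total = total_a + total_b
        conv = conv_a + conv_b
        # Expected counts under the null hypothesis of equal conversion rates.
        expected = [
            total_a * conv / total, total_a * (total - conv) / total,
            total_b * conv / total, total_b * (total - conv) / total,
        ]
        # G = 2 * sum(O * ln(O / E)); empty cells contribute nothing.
        return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

    g = g_test(conv_a=120, total_a=10_000, conv_b=160, total_b=10_000)
    print(g, "significant at 5%" if g > 3.841 else "not significant at 5%")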


    When A/B testing, you need to always remember three things:

    The smaller your change is, the more data you need to be sure 
    that the conclusion you have reached is statistically significant.
Is that a mathematically provable result? It seems hard to conceptualize what a 'small' or 'big' change is. I would have expected another argument along the lines of "If you make more than one change at a time, you are not going to be able to know which one of your changes caused the result".


This property is quite intuitive. Small and big here are relative to the variance of the underlying distributions.

Simple case: think about trying to decide whether a normal distribution has mean 0 or mean 1. If the std dev is 0.001, it won't take very many samples to be fairly confident at that resolution, but if the std dev is 1000, you'll need a lot of samples.

Similarly, if the std dev is only 1 but you are trying to decide whether the mean is 0 or 0.001, far more samples are needed.

The intuition generalizes quite well. In the OP's case, the required sample size will typically be proportional to the square of the ratio between the standard deviation and the effect size you want to measure.
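To make that scaling concrete, here is a rough sketch assuming a two-sided z-test on the mean of a normal with known standard deviation; the alpha/power values are just the usual illustrative 0.05 / 0.8 choices:

    from statistics import NormalDist

    def required_n(effect, sigma, alpha=0.05, power=0.8):
        # n grows with the square of sigma / effect.
        z = NormalDist()
        z_alpha = z.inv_cdf(1 - alpha / 2)
        z_beta = z.inv_cdf(power)
        return ((z_alpha + z_beta) * sigma / effect) ** 2

    print(required_n(effect=1, sigma=0.001))   # a tiny fraction of one sample
    print(required_n(effect=1, sigma=1000))    # roughly 7.8 million samples
    print(required_n(effect=0.001, sigma=1))   # same ratio, same ~7.8 million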


His choice of words is not very fortunate. If webpage A has an underlying conversion ratio of X and webpage B has an underlying conversion ratio of Y, then as long as X is not equal to Y, in theory we could always find evidence, given enough data, that X is "significantly" different from Y.

If X is close to Y, we need a very large sample to achieve statistical significance on the hypothesis that they're not equal, whereas if X and Y are far apart, it is likely that a small sample will already indicate that fact.
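A quick way to see the size of that effect is the standard normal-approximation formula for comparing two proportions; the conversion rates below are invented purely for illustration:

    from statistics import NormalDist

    def n_per_group(p1, p2, alpha=0.05, power=0.8):
        z = NormalDist()
        z_a = z.inv_cdf(1 - alpha / 2)
        z_b = z.inv_cdf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_a + z_b) ** 2 * variance / (p1 - p2) ** 2

    print(round(n_per_group(0.010, 0.015)))  # close rates: roughly 7,700 per arm
    print(round(n_per_group(0.010, 0.030)))  # far apart: roughly 770 per arm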


I think the big issue people see in A/B testing comes down to a fairly tricky reason: the underlying distribution of the data. The usual ways of estimating how big your sample size needs to be have one huge giraffe of a problem hiding in them: they assume the underlying distribution is normal.

The correct way to estimate your sample size is to use the cumulative distribution function of your underlying distribution. See a brief explanation from Wikipedia here: http://en.wikipedia.org/wiki/Sample_size_determination#By_cu...

Now what's the problem with A/B testing? Most of the stuff we test A/B for is incredibly non-normal. Often 99% of visits do not convert. We're looking at extremely skewed data here. Generally the more skewed the distribution, the more samples we need.

For a very basic understanding of why: consider a very simple distribution where 99.99% of the time you get $0 and 0.01% of the time you get $29 - fairly similar to what we A/B test. Do you think a sample of 1,000 or 10,000 is going to be anywhere near enough here? Of course not.
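If you want to convince yourself, here is a quick simulation of that $0 / $29 example (the numbers are just the ones from the comment above):

    import random

    def sample_mean(n, p=0.0001, payout=29.0):
        # Average revenue per visit in a sample of n visits.
        return sum(payout if random.random() < p else 0.0 for _ in range(n)) / n

    random.seed(0)
    for n in (1_000, 10_000, 1_000_000):
        print(n, [round(sample_mean(n), 4) for _ in range(5)])
    # At n = 1,000 most runs see zero conversions, so the estimate jumps
    # between 0 and about 0.029; only at much larger n does it settle
    # near the true mean of 0.0029.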


In statistics there is a "golden rule" (really a rule of thumb) that when np > 5 and n(1-p) > 5, the normal distribution is a good approximation to the binomial distribution. Here n is the number of trials and p is the conversion rate.

Our A/B testing data comes from Bernoulli trials, so the number of conversions is binomially distributed. So indeed, if we use tests that assume a normal distribution, then with a conversion rate of p = 0.0001 the rule says n needs to be roughly 50,000, since np > 5 requires n > 5/p.
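Here is a small check of that rule of thumb, comparing the exact binomial CDF at the mean count to its normal approximation (with continuity correction); the n values are just examples around n = 5/p:

    from math import comb, sqrt
    from statistics import NormalDist

    def binom_cdf(k, n, p):
        return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    def normal_approx_cdf(k, n, p):
        mu, sigma = n * p, sqrt(n * p * (1 - p))
        return NormalDist(mu, sigma).cdf(k + 0.5)

    p = 0.0001
    for n in (5_000, 50_000, 500_000):
        k = int(n * p)  # evaluate both CDFs at the expected count np
        print(n, round(binom_cdf(k, n, p), 3), round(normal_approx_cdf(k, n, p), 3))
    # The gap is large when np = 0.5 (about 0.61 vs 0.50) and shrinks
    # once np reaches 5 and beyond (about 0.62 vs 0.59, then closer).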



