This is a great book. I read it a couple years ago and I remember a couple takeaways that apply well to AB testing:
1 - Monitoring tests on an ongoing basis and then calling them as soon as they hit some confidence threshold (like 95%) will give you biased results (there's a quick simulation of this sketched below). It's important to determine your sample size up front and then let the test run all the way through, or at least be aware that the results are less reliable if you stop early.
2 - Testing for multiple metrics requires a much larger sample. If you run a test and then compare conversion rate, purchase amount, pageviews per session, retention, etc. etc., you'll have a much higher error rate since the more things you measure, the more likely you are to get an outlier (see the quick arithmetic below). You either need to run a separate test for each metric or increase your sample size a lot to account for this effect (iirc the math for exactly how much is in the book).
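On point 1, here's a minimal simulation of the peeking problem (my own sketch, not the book's example): both arms draw from the same distribution, so any "significant" result is a false positive, yet checking after every batch and stopping at |z| > 1.96 inflates the error rate far above the nominal 5%.

    # A/A test under the null: any "significant" result is a false positive.
    # Peeking every `batch` observations and stopping at |z| > 1.96 inflates
    # the error rate well above the nominal 5%.
    import numpy as np

    rng = np.random.default_rng(0)

    def run_experiment(n_max=10_000, batch=500, z_crit=1.96):
        a = rng.normal(size=n_max)   # arm A, true effect = 0
        b = rng.normal(size=n_max)   # arm B, identical distribution
        for n in range(batch, n_max + 1, batch):
            diff = a[:n].mean() - b[:n].mean()
            se = np.sqrt(a[:n].var(ddof=1) / n + b[:n].var(ddof=1) / n)
            if abs(diff / se) > z_crit:
                return True          # stopped early -- a false positive here
        return False

    trials = 2_000
    false_positives = sum(run_experiment() for _ in range(trials))
    print(f"false positive rate with peeking: {false_positives / trials:.1%}")
    # Roughly 20-25% with this many looks, versus ~5% if you only test once at n_max.

On point 2, the usual back-of-the-envelope (again mine, not necessarily the book's exact math) is that with k independent metrics each tested at alpha = 0.05, the chance of at least one false positive is 1 - (1 - alpha)^k, and a Bonferroni correction tests each metric at alpha / k instead, which is what drives the sample size up:

    alpha, k = 0.05, 5
    print(f"P(at least one false positive across {k} metrics): "
          f"{1 - (1 - alpha) ** k:.1%}")                          # ~22.6%
    print(f"Bonferroni-corrected per-metric alpha: {alpha / k:.3f}")  # 0.010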
Thanks, I'm glad you enjoyed the book! (Author here -- the website got its first publicity here on HN.)
Regarding AB testing, you might be interested in this recent research, which uses real data from Optimizely to estimate how often people get AB test false positives because they stopped as soon as they hit significance: https://ssrn.com/abstract=3204791
> Specifically, about 73% of experimenters stop the experiment just when a positive effect reaches 90% confidence. Also, approximately 75% of the effects are truly null. Improper optional stopping increases the false discovery rate (FDR) from 33% to 40% among experiments p-hacked at 90% confidence
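For intuition on where a number like 33% comes from, here's a rough Bayes'-rule calculation (my own sketch, not from the paper; the 60% average power figure is purely an assumption for illustration):

    # FDR = P(null | significant), from the base rate of nulls (pi0),
    # the significance threshold (alpha), and average power.
    # The power = 0.60 value is assumed here, not taken from the paper.
    def fdr(pi0, alpha, power):
        false_pos = pi0 * alpha
        true_pos = (1 - pi0) * power
        return false_pos / (false_pos + true_pos)

    print(f"{fdr(pi0=0.75, alpha=0.10, power=0.60):.0%}")   # ~33%
    # Optional stopping effectively inflates alpha above its nominal 0.10,
    # which pushes the FDR higher -- the paper estimates roughly 40%.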
While it may be possible to take the frequentist approach to AB testing, Bayesian inference is becoming the way to go with this.[1] Instead of directly setting up a yes-no hypothesis test with p-values, which are nearly impossible to use correctly[2], Bayesian approaches aim to directly estimate whatever quantity you want. With the Bayesian approach, you get estimates for A, B, and A-B (or whatever combination you want, e.g. (A-B)/A). Each of those estimates is properly called a posterior probability distribution and describes the range of plausible values. The end result is that instead of saying "A is better than B (p < 0.05)" you get a probability distribution of A-B. From that probability distribution you can answer any question you want: the expected difference between A and B (the posterior mean), the probability that A is better than B (just integrate the area above 0), or whatever is needed to make a decision.
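For a concrete (if simplified) picture of what that looks like for conversion rates, here's a minimal Beta-Binomial sketch; the visitor and conversion counts are made up for illustration:

    # Bayesian A/B sketch for conversion rates (illustrative numbers).
    # With a uniform Beta(1, 1) prior, each arm's posterior is
    # Beta(1 + conversions, 1 + non-conversions); the posterior of the
    # difference is approximated by Monte Carlo sampling.
    import numpy as np

    rng = np.random.default_rng(0)

    visitors_a, conversions_a = 10_000, 520
    visitors_b, conversions_b = 10_000, 480

    samples = 200_000
    post_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, samples)
    post_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, samples)
    diff = post_a - post_b

    print(f"expected lift A - B: {diff.mean():.4f}")
    print(f"P(A better than B): {(diff > 0).mean():.1%}")
    print(f"95% credible interval for A - B: "
          f"{np.quantile(diff, [0.025, 0.975]).round(4)}")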
> Monitoring tests on an ongoing basis and then calling them as soon as they hit some confidence threshold (like 95%) will give you biased results ...
The Bayesian method doesn't really solve this so much as it answers a fundamentally different question -- modeling how your personal belief changes. Those posteriors typically will not have an interpretation as normalized long-run frequencies. As long as the Bayesian posteriors are not interpreted as frequentist probabilities, they are perfectly acceptable.
That said, this 'peeking' problem can be easily resolved in the frequentist setting, and this is well known in the stats, probability, and (hopefully) ML literature. The core results are really old -- in fact they were classified information during World War II. If you are interested, search for sequential hypothesis tests. They are actually more efficient than their batch cousins.
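For the curious, here's a toy sketch of Wald's sequential probability ratio test (SPRT) for a conversion rate, just to show the shape of the idea; the hypothesized rates and error levels below are arbitrary choices of mine:

    # Toy Wald SPRT for Bernoulli data: test H0: p = p0 vs H1: p = p1 by
    # accumulating the log-likelihood ratio one observation at a time and
    # stopping as soon as it crosses either threshold.
    import math
    import random

    def sprt(observations, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
        upper = math.log((1 - beta) / alpha)   # accept H1 above this
        lower = math.log(beta / (1 - alpha))   # accept H0 below this
        llr = 0.0
        for n, x in enumerate(observations, start=1):
            llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
            if llr >= upper:
                return "accept H1 (p = p1)", n
            if llr <= lower:
                return "accept H0 (p = p0)", n
        return "no decision yet", len(observations)

    random.seed(0)
    data = [random.random() < 0.06 for _ in range(50_000)]  # true rate 6%
    print(sprt(data))

The punchline is that, on average, a sequential test like this reaches a decision with far fewer samples than a fixed-horizon test with the same error rates.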
Bayesian vs frequentist is an axis orthogonal to sequential/online vs batch. Think of a 2 x 2 grid: you can choose to be in any quadrant you want.
Hi, I didn't see anything on the page about the intended audience -- would you say this is appropriate for someone who has done the basic statistics classes at uni but is now pretty rusty, or would you need a more solid foundation to grasp the content fully?