> Nooo! First, if one actually works, you’ve massively increased the “noise” for...

yorwba · on Sept 14, 2023

The more experiments you run in parallel, the more likely it becomes that at least one experiment's branches do not have an even distribution across all branches of all (combinations of) other experiments.

And the more experiments you run, whether in parallel or sequentially, the more likely you're to get at least one false positive, i.e. p-hacking. XKCD is using "find a subgroup that happens to be positive" to make it funnier, but it's simply "find an experiment that happens to be positive". To correct for p-hacking, you would have to lower your threshold for each experiment, requiring a larger sample size, negating the benefits you thought you were getting by running more experiments with the same samples.

lukego · on Sept 14, 2023

... and one such correction is the (simple, conservative, underused) Bonferroni Correction.

AlexeyMK · on Sept 14, 2023

Super helpful - looked it up, will aim to apply next time!

Curious how the bonferroni correction applies in cases where the overlap is partial - IE, experiment A ran from Day 1 to 14, and experiment B ran (on the same group) from days 8 to 21. Do you just apply the correction as if there was full overlap?

lukego · on Sept 14, 2023

I believe you would apply the correction for every comparison you make regardless of the conditions. It's a conservative default to avoid accidentally p-hacking.

There might be other more specific corrections that give you power in a specific case. I don't know about that, I went Bayesian somewhere around this point myself.

RandomLensman · on Sept 14, 2023

There are a bunch of procedures under the label Family-wise Error Correction, some have issues in situations with non-independence (Bonferoni can handle any dependency structure, I think).

If there are a lot of tests/comparisons could also look at controlling for the False Discovery Rate (usually increases power at the expense of more type I errors).

AlexeyMK · on Sept 14, 2023

Thanks, that is a well reasoned argument!

My take is for small n (say 5 experiments at once) with lots of subjects (>10k participants per branch) and a decent hashing algorithm, the risk of uneven bucketing remains negligible. Is my intuition off?

False positives for experiments is definitely something to keep an eye on. The question to ask is what is our comfort level for trading-off between false positives and velocity. This feels similar to the IRB debate to me, where being too restrictive hurts progress more than it prevents harm.

bertil · on Sept 14, 2023

No, the risk of uneven bucketing of more than 1% is minimal, and even when it’s the case, the contamination is much smaller than other factors. It’s also trivial to monitor at small scales.

False positives do happen (Twyman's law is the most common way to describe the problem: underpowered experiment with spectacular results). The best solution is to ask if the results make sense using product intuition and continue running the experiment if not.

They are more likely to happen with very skewed observations (like how much people spend on a luxury brand), so if you have a goal metric that is skewed at the unit level, maybe think about statistical correction, or bootstrapping confidence intervals.

bertil · on Sept 14, 2023

You are confusing:

a. the Family-Wise Error Rate (FWER what xkcd 882 is about) and the many solutions of Multiple Comparison Correction (MCC: Bonferoni, Homes-Sidak, Benjamini-Hochberg, etc.) with

b. Contamination or Interaction: your two variants are not equivalent because one has 52% of its members part of Control from another experiment, while the other variant has 48%.

FWER is a common concern among statisticians when testing, but one with simple solutions. Contamination is a frequent concern among stakeholders, but very rare to observe even with a small sample size, and that even more rarely has a meaningful impact on results. Let’s say you have a 4% overhang, and the other experiment has a remarkably large 2% impact on a key metric. The contamination is only 4% * 2% = 0.08%.

It is a common concern and, therefore, needs to be discussed, but as Lukas Vermeer explained here [0], the solutions are simple and not frequently needed.

[0] https://www.lukasvermeer.nl/publications/2023/04/04/avoiding...