My take is that for small n (say 5 experiments at once) with lots of subjects (>10k participants per branch) and a decent hashing algorithm, the risk of uneven bucketing is negligible. Is my intuition off?
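To make the intuition concrete, here's a rough sketch of what I mean by hash-based bucketing (hypothetical salts and user ids, assuming SHA-256 stands in for the "decent" hash):

```python
import hashlib

def assign_branch(user_id: str, experiment_salt: str, n_branches: int = 2) -> int:
    """Deterministically bucket a user by hashing their id with a per-experiment salt."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_branches

# Rough check: 5 concurrent experiments, 20k users, independent salts.
# Each experiment's split should land very close to 50/50.
users = [f"user_{i}" for i in range(20_000)]
for salt in ("exp_a", "exp_b", "exp_c", "exp_d", "exp_e"):
    treated = sum(assign_branch(u, salt) for u in users)
    print(salt, treated / len(users))
```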
False positives for experiments are definitely something to keep an eye on. The question to ask is what our comfort level is for trading off false positives against velocity. This feels similar to the IRB debate to me, where being too restrictive hurts progress more than it prevents harm.
No, the risk of a bucketing imbalance of more than 1% is minimal, and even when it happens, the contamination it introduces is much smaller than other sources of error. It's also trivial to monitor at small scales.
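For the monitoring part, a sample ratio mismatch check is usually enough. A minimal sketch with made-up counts, assuming a 50/50 intended split and scipy available:

```python
from scipy.stats import chisquare

# Hypothetical observed counts per branch; the intended split is 50/50.
observed = [10_180, 9_820]
total = sum(observed)
expected = [total / 2, total / 2]

stat, p_value = chisquare(observed, f_exp=expected)
# A very small p-value (e.g. < 0.001) flags a sample ratio mismatch
# worth investigating before trusting the experiment's results.
print(f"chi2={stat:.2f}, p={p_value:.4f}")
```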
False positives do happen (Twyman's law is the usual shorthand for the problem: an underpowered experiment with spectacular results). The best defence is to ask whether the results make sense given your product intuition, and to keep the experiment running if they don't.
They are more likely with very skewed observations (like how much people spend at a luxury brand), so if your goal metric is skewed at the unit level, consider a statistical correction or bootstrapping the confidence intervals.
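A minimal sketch of the bootstrap option, with synthetic lognormal spend data standing in for a skewed metric (all numbers here are made up):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical heavily skewed spend data: most users spend little, a few spend a lot.
control = rng.lognormal(mean=3.0, sigma=1.5, size=10_000)
treatment = rng.lognormal(mean=3.05, sigma=1.5, size=10_000)

# Bootstrap the difference in means: resample each branch with replacement
# and take the 2.5th / 97.5th percentiles as a 95% confidence interval.
diffs = [
    rng.choice(treatment, size=treatment.size, replace=True).mean()
    - rng.choice(control, size=control.size, replace=True).mean()
    for _ in range(2_000)
]
low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for lift in mean spend: [{low:.2f}, {high:.2f}]")
```

If the interval comfortably excludes zero even under resampling, the spectacular result is less likely to be an artifact of a few whale spenders.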