I've seen procedural errors over and over. They often come down to our temptation to treat numbers as objective truth that doesn't require deeper thought. In very narrowly defined scopes, that might be true. But in complicated matters it rarely holds, and it's up to us to keep ourselves honest.
For example, if an experiment runs for a while and there is no statistically significant difference between cohorts, what do we do? It's a tie, but so often the question gets asked: "which cohort is 'directionally' better?" The idea is that we don't know how much better it is, but whichever is ahead must still be better. That reasoning doesn't work: a non-significant result means the data are consistent with the true difference falling on either side of zero, so "ahead" tells us nothing unless there is something special going on at exactly zero difference in that specific case. Many of us are just not comfortable with the idea of a statistical tie (e.g., the 2000 US Presidential election: the outcome was within the margin of error for counting votes, there was no procedure for handling that, and irrationality ensued). So the cohort that's ahead must be better, even if it's not statistically significant, right? We don't know whether it is or not, but declaring it so satisfies our need for simplicity and order.
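To make the "tie" concrete, here's a rough sketch with entirely made-up numbers (cohort sizes, conversion counts, and the 95% cutoff are just illustrative): cohort B looks "directionally" ahead, but the confidence interval on the difference straddles zero.

```python
# A minimal sketch of a statistical tie between two cohorts (made-up data).
from math import sqrt
from scipy.stats import norm

# Hypothetical data: (conversions, users) per cohort.
conv_a, n_a = 1000, 50_000
conv_b, n_b = 1040, 50_000

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a

# Standard error of the difference between two independent proportions.
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

# 95% confidence interval and two-sided p-value.
lo, hi = diff - 1.96 * se, diff + 1.96 * se
z = diff / se
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"B - A = {diff:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}], p = {p_value:.2f}")
# The interval includes 0: B "looks" ahead, but the data are consistent with
# B being worse, identical, or better. That's a tie, not a win.
```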
Ties should be acknowledged, and tie breakers should be used, and they should be something of value. Which cohort is easier to build future improvements upon? Which is easier to maintain? Easier to understand? Cheaper to operate? Things like that make good tie breakers. And it's worth checking that there wasn't a bug that made the cohorts behave identically.
Another example of a procedural mistake is shipping a cohort the moment a significant difference is seen. Take the case of changing a button location. It's possible that the new location is much better. But the first week of the experiment might show that the status quo is better. Why? Users had long ago memorized the old location and expect it to be there; now they have to find the new location. So the new location might perform worse initially but be better in the steady state. If you aren't thinking things like that through (e.g., "Does one cohort have a short-term advantage for some reason? Is that advantage a good thing?") and you move too quickly, you'll be led astray.
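A toy sketch of that dynamic, with entirely made-up click-through rates and an assumed relearning window, just to show how a week-1 readout can point the wrong way:

```python
# Toy model of the "novelty penalty": the new button location is better in
# steady state, but users who memorized the old location fumble at first.
# All rates and the adjustment window below are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
users_per_day = 5_000          # per cohort, hypothetical traffic
old_rate = 0.030               # status quo click-through rate
new_rate_initial = 0.024       # new location while users relearn it
new_rate_steady = 0.033        # new location once users have adjusted

def clicks(rate, days):
    # Total clicks over `days` days of binomial traffic at the given rate.
    return rng.binomial(users_per_day, rate, size=days).sum()

# Week 1 readout: the new location still carries the relearning penalty.
week1_old = clicks(old_rate, 7)
week1_new = clicks(new_rate_initial, 7)

# Week 4 readout: users have long since adjusted.
week4_old = clicks(old_rate, 7)
week4_new = clicks(new_rate_steady, 7)

print(f"week 1: old={week1_old}, new={week1_new}  -> old 'wins'")
print(f"week 4: old={week4_old}, new={week4_new}  -> new wins in steady state")
# Shipping on the week-1 readout would lock in the worse long-run option.
```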
To bring this last one back to the math a little more closely: a core issue here is that we have to use imperfect proxy metrics. We want to measure which button location is "better," but "better" isn't directly quantifiable. It has side effects, so we try to measure those. Does the user click more quickly, or more often, or buy more stuff, or ...? None of that means better; we just hope it's caused by better, and maybe only in the long run, as outlined above. Many experiments in for-profit corporate settings would ideally be measured in infinite-time-horizon profit. But we can't measure or act on that, so we have to pick proxies, and proxies have issues that require careful thought and careful handling.
> Take the case of changing a button location. It's possible that the new location is much better. But the first week of the experiment might show that the status quo is better.
In the 2000s I used this effect to make changes to AdSense text colors. People ignored the ads, but if I changed the colors on some cadence, more people clicked on them. Measurable difference in income.