If you'd rather go through some of this live, we have a section on Stats for Growth Engineers in the Growth Engineering Course on Reforge (course.alexeymk.com). We talk through stat sig, power analysis, common experimentation footguns and alternate methodologies such as Bayesian, Sequential, and Bandits (which are typically Bayesian). Running next in October.
Other than that, Evan's stuff is great, and the Ron Kohavi book gets a +1, though it is definitely dense.
Yes, but it is _rough_. What actually hurts is "browse on mobile, buy on desktop" type behavior.
Still worth doing, but you end up needing more black magic than you'd like (IP-based assignment, Ad Network-sourced assignment, CDN proxies for Analytics tools, etc).
I myself am a rather recent convert to using Bayesian statistics, for the simple reason that I was trained in and have used frequentist statistics extensively in the past, and had no experience with Bayesian statistics. Once you take the time to master the basic tools, it becomes quite straightforward to use. I am currently away from my computer and resources, which makes it difficult to suggest them. As a somewhat shameless plug, you could check the https://www.frontiersin.org/articles/10.3389/fpsyg.2020.0094... paper and the related R package https://cran.r-project.org/web/packages/bayes4psy/index.html and GitHub repository https://github.com/bstatcomp/bayes4psy, which were made to be accessible to users with frequentist statistics experience.
To brutally simplify the distinction: using frequentist statistics and testing, you are addressing the question of whether, based on the results, you can reject the hypothesis that there is no difference between two conditions (e.g., A and B in A/B testing). The p-value broadly gives you the probability of observing data at least as extreme as yours if A and B were in fact sampled from the same distribution. If this is really low, then you can reject the null hypothesis and claim that there are statistically significant differences between the two conditions.
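In symbols, with T the test statistic, t_obs its observed value, and H_0 the no-difference null hypothesis (my notation, just the textbook definition):

p = P\big(\,|T| \geq |t_{\text{obs}}| \;\big|\; H_0\,\big) \quad \text{(two-sided test)}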
In comparison, using Bayesian statistics, you can estimate the probability of a specific hypothesis, e.g. the hypothesis that A is better than B. You start with a prior belief (prior) in your hypothesis and then compute the posterior probability, which is the prior adjusted for the additional empirical evidence that you have collected. The results that you get can help you address a number of questions. For instance, (i) what is the probability that in general A leads to better results than B? Or, related (but substantially different), (ii) what is the probability that in any specific case using A gives you a higher chance of success than using B? To illustrate the difference, the probability that men in general are taller than women approaches 100%. However, if you randomly pick a man and a woman, the probability that the man will be taller than the woman is substantially lower.
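To make "prior", "posterior", and "probability that A is better than B" concrete for conversion-style data, here is the textbook Beta-Binomial version (my notation, not the commenter's: s and n are conversions and users per variant, and Beta(a_0, b_0) is the prior):

p_A \mid \text{data} \sim \mathrm{Beta}(a_0 + s_A,\; b_0 + n_A - s_A), \qquad
p_B \mid \text{data} \sim \mathrm{Beta}(a_0 + s_B,\; b_0 + n_B - s_B)

P(\text{A better than B} \mid \text{data}) \;=\; P(p_A > p_B \mid \text{data})
\;=\; \int_0^1 \int_{p_B}^{1} f_A(p_A)\, f_B(p_B)\; dp_A\, dp_B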
In your A/B testing, if the cost of A is higher, addressing question (ii) would be more informative than question (i). You can be quite sure that A is in general better than B; however, is the difference big enough to offset the higher cost?
Related to that, in Bayesian statistics you can define the Region of Practical Equivalence (ROPE): in short, the difference between A and B that could be due to measurement error, or that would in practice be of no use. You can then check in what proportion of cases the difference would fall within the ROPE. If the proportion of cases is high enough (e.g. 90%), then you can conclude that in practice it makes no difference whether you use A or B. In frequentist terms, Bayes allows you to confirm a null hypothesis, something that is impossible using frequentist statistics.
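In the same notation as above, the ROPE check is just the posterior probability that the difference lands inside a band of practically irrelevant size, with the half-width epsilon chosen from domain knowledge rather than from the data:

P\big(\,|p_A - p_B| < \varepsilon \;\big|\; \text{data}\,\big) \;\geq\; 0.9
\;\;\Rightarrow\;\; \text{treat A and B as practically equivalent}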
In regards to priors, which another person has mentioned: if you do not have a specific reason to believe beforehand that A might be better than B or vice versa, you can use a relatively uninformative prior, basically saying, “I don’t really have a clue which might be better”. So the issue of priors should not discourage you from using Bayesian statistics.
> Nooo! First, if one actually works, you’ve massively increased the “noise” for the other experiments
I get that a bunch at some of my clients. It's a common misconception. Let's say experiment B is 10% better than control but we're also running experiment C at the same time. Since C's participants are evenly distributed across B's branches, by default they should have no impact on the other experiment.
If you do a pre/post comparison, you'll notice that for whatever reason, both branches of C are doing 5% better than prior time periods, and this is because half of them are in the winner branch of B.
NOW - imagine that the C variant is only an improvement _if_ you also include the B variant. That's where you need to be careful about monitoring experiment interactions, as I called out in the guide. But better to spend a half day writing an "experiment interaction" query than two weeks waiting for the experiments to run in sequence.
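For what it's worth, the interaction query I have in mind is roughly the sketch below. The assignments and conversions tables and their columns are made up for illustration (one conversions row per converted user), so adapt to your own event schema.

-- Hypothetical schema: assignments(user_id, experiment, variant), conversions(user_id).
-- Break the goal metric down by the cross of B's and C's variants; if C's lift looks very
-- different inside B-treatment vs B-control, the two experiments interact.
SELECT
    b.variant AS exp_b_variant,
    c.variant AS exp_c_variant,
    COUNT(*) AS users,
    AVG(CASE WHEN conv.user_id IS NOT NULL THEN 1.0 ELSE 0.0 END) AS conversion_rate
FROM assignments b
JOIN assignments c
    ON c.user_id = b.user_id AND c.experiment = 'C'
LEFT JOIN conversions conv
    ON conv.user_id = b.user_id
WHERE b.experiment = 'B'
GROUP BY 1, 2
ORDER BY 1, 2;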
That is a valid concern to be vigilant for. In this case, XKCD is calling out the "find a subgroup that happens to be positive" hack (also here, https://xkcd.com/1478/). However, here we're testing (a) 3 different ideas and (b) only testing each of them once on the entire population. No p-hacking here (far as I can tell, happy to learn otherwise), but good that you're keeping an eye out for it.
The more experiments you run in parallel, the more likely it becomes that at least one experiment's branches do not have an even distribution across all branches of all (combinations of) other experiments.
And the more experiments you run, whether in parallel or sequentially, the more likely you are to get at least one false positive, i.e. p-hacking. XKCD is using "find a subgroup that happens to be positive" to make it funnier, but it's simply "find an experiment that happens to be positive". To correct for p-hacking, you would have to lower your threshold for each experiment, requiring a larger sample size, negating the benefits you thought you were getting by running more experiments with the same samples.
Super helpful - looked it up, will aim to apply next time!
Curious how the Bonferroni correction applies in cases where the overlap is partial - i.e., experiment A ran from day 1 to 14, and experiment B ran (on the same group) from days 8 to 21. Do you just apply the correction as if there was full overlap?
I believe you would apply the correction for every comparison you make regardless of the conditions. It's a conservative default to avoid accidentally p-hacking.
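For reference, plain Bonferroni just divides the significance threshold by the number of comparisons m, regardless of how the experiments overlap (this is the textbook form, not something spelled out in the comments above):

\alpha_{\text{per comparison}} \;=\; \frac{\alpha}{m}, \qquad \text{e.g. } \frac{0.05}{5} = 0.01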
There might be other more specific corrections that give you power in a specific case. I don't know about that, I went Bayesian somewhere around this point myself.
There are a bunch of procedures under the label of family-wise error rate correction; some have issues in situations with non-independence (Bonferroni can handle any dependency structure, I think).
If there are a lot of tests/comparisons, you could also look at controlling the False Discovery Rate (usually increases power at the expense of more type I errors).
My take is for small n (say 5 experiments at once) with lots of subjects (>10k participants per branch) and a decent hashing algorithm, the risk of uneven bucketing remains negligible. Is my intuition off?
False positives for experiments is definitely something to keep an eye on. The question to ask is what is our comfort level for trading-off between false positives and velocity. This feels similar to the IRB debate to me, where being too restrictive hurts progress more than it prevents harm.
No, the risk of uneven bucketing of more than 1% is minimal, and even when it’s the case, the contamination is much smaller than other factors. It’s also trivial to monitor at small scales.
False positives do happen (Twyman's law is the most common way to describe the problem: underpowered experiment with spectacular results). The best solution is to ask if the results make sense using product intuition and continue running the experiment if not.
They are more likely to happen with very skewed observations (like how much people spend on a luxury brand), so if you have a goal metric that is skewed at the unit level, maybe think about statistical correction, or bootstrapping confidence intervals.
a. the Family-Wise Error Rate (FWER, what xkcd 882 is about) and the many solutions for Multiple Comparison Correction (MCC: Bonferroni, Holm-Šidák, Benjamini-Hochberg, etc.) with
b. Contamination or Interaction: your two variants are not equivalent because one has 52% of its members part of Control from another experiment, while the other variant has 48%.
FWER is a common concern among statisticians when testing, but one with simple solutions. Contamination is a frequent concern among stakeholders, but very rare to observe even with a small sample size, and it even more rarely has a meaningful impact on results. Let’s say you have a 4% overhang, and the other experiment has a remarkably large 2% impact on a key metric. The contamination is only 4% * 2% = 0.08%.
It is a common concern and, therefore, needs to be discussed, but as Lukas Vermeer explained here [0], the solutions are simple and not frequently needed.
Yes: you could use Bayesian priors and a custom model to give yourself more confidence from less data. But...
Don't: for most businesses that are so early they can't get enough users to hit stat-sig, you're likely better off leveraging your engineering efforts towards making the product better instead of building custom statistical models. This is nerd-sniping-adjacent (https://xkcd.com/356/), a common trap engineers can fall into: it's more fun to solve the novel technical problem than the actual business problem.
Though: there are a small set of companies with large scale but small data, for whom the custom stats approaches _do_ make sense. When I was at Opendoor, even though we had billions of dollars of GMV, we only bought a few thousand homes a month, so the Data Science folks used fun statistical approaches like Pair Matching (https://www.rockstepsolutions.com/blog/pair-matching/) and CUPED (now available off the shelf - https://www.geteppo.com/features/cuped) to squeeze a bit more signal from less data.
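For the curious, the core of CUPED is a one-line covariate adjustment; this is the standard form (X being a pre-experiment metric for the same unit, e.g. the same metric measured before the experiment started), not anything specific to Opendoor or Eppo:

\hat{Y}_i^{\text{cuped}} \;=\; Y_i - \theta\,(X_i - \bar{X}), \qquad
\theta \;=\; \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}

This cuts the variance of the estimate by a factor of 1 - Corr(X, Y)^2, so a strongly predictive pre-period metric buys you a meaningfully smaller required sample.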
Another option is monetizing via a niche job board for your readers. Pallet.com is good for that - will find companies, etc. Disclosure, tiny investor.
You can get around the "check in every transaction" problem with an ORM, but now you're (more) coupled to your ORM, which you will occasionally and inevitably need to circumvent for something or other. And now you've made it (more of) a leaky abstraction.
Or you can create a view for every table that supports soft deletion and ensure all of your read-only queries use those views:
CREATE VIEW current_customers AS
SELECT * FROM customers WHERE deleted_at IS NULL;

SELECT * FROM current_customers JOIN ...
Of course, this comes with its downsides, e.g. views need to be recreated in every migration, and some complex join operations might not work.
It may be extra work sometime in the future, but having to prefix every query with a null check sounds bonkers.
A view can be optimised to do all that for you.
I made a "what the flip" expression reading that they prefix everything.
It can be the other way around: the table has the prefix/suffix, and the view doesn't. Alternatively, the view can be created with the same name but in a different schema, which is set with higher priority for the user (e.g. via search_path in PG).
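A rough sketch of that Postgres setup, with made-up schema and role names:

-- Real tables stay in public; filtered views get the same names in a "live" schema.
CREATE SCHEMA live;

CREATE VIEW live.customers AS
    SELECT * FROM public.customers WHERE deleted_at IS NULL;

-- Make the view win name resolution for this role.
ALTER ROLE app_readonly SET search_path = live, public;

-- For app_readonly, an unqualified query now hits the view:
SELECT * FROM customers;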
In postgres you cannot make a foreign-key reference to a view, which effectively means you can't prevent another table from referencing a soft-deleted record.
I wish Postgres had something like a "CHECK constraint" on a foreign key.
There is a hack of sorts. You create a duplicate of the primary key column, named e.g. id_active. You create a CHECK constraint which says something like "(status = 'Deleted' AND id_active IS NULL) OR (status <> 'Deleted' AND id_active = id)". You create a unique index on id_active, and point your foreign key to that. When you create a record, populate both id and id_active with the same value; when you soft-delete it, set id_active to NULL. Actually, maybe a simpler solution is to make id_active a "GENERATED ALWAYS AS ... STORED" column, although I'm not sure if Postgres supports them for foreign keys? That's a relatively new Postgres feature and I haven't done much yet with the more recent versions in which that feature was added.
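Spelled out, a sketch of that hack (table and column names invented; the generated-column variant at the end is the part the comment above isn't sure can be an FK target):

CREATE TABLE items (
    id        bigint PRIMARY KEY,
    status    text   NOT NULL DEFAULT 'Active',
    id_active bigint UNIQUE,
    CHECK (
           (status =  'Deleted' AND id_active IS NULL)
        OR (status <> 'Deleted' AND id_active = id)
    )
);

-- Children point at the "active only" column instead of the primary key.
CREATE TABLE order_lines (
    id      bigint PRIMARY KEY,
    item_id bigint NOT NULL REFERENCES items (id_active)
);

INSERT INTO items (id, id_active) VALUES (42, 42);

-- Soft delete; this fails if any order_lines row still references item 42.
UPDATE items SET status = 'Deleted', id_active = NULL WHERE id = 42;

-- The alternative mentioned above would declare id_active as
--   GENERATED ALWAYS AS (CASE WHEN status <> 'Deleted' THEN id END) STORED
-- and whether that can be an FK target is the open question.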
I’m also not sure about the foreign key restrictions on generated columns (and glancing at the docs I don’t see anything about it there), but for all intents and purposes they are real columns, so I’d imagine it probably works. Apparently they run after the BEFORE triggers; I’m not totally sure where foreign keys are checked, but probably after that?
As an aside, they’re a great feature. We’re using them to generate columns that we can index for efficient joins between tables and also for creating text strings for searching over using trigram indexes. The whole thing is really seamless.
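A sketch of that pattern (names invented; needs the pg_trgm extension):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE TABLE people (
    id          bigint PRIMARY KEY,
    first_name  text NOT NULL,
    last_name   text NOT NULL,
    -- Search string kept in sync by Postgres itself.
    search_text text GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED
);

-- Trigram GIN index makes substring/ILIKE searches over the generated column cheap.
CREATE INDEX people_search_trgm_idx ON people USING gin (search_text gin_trgm_ops);

SELECT * FROM people WHERE search_text ILIKE '%smith%';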
I haven't tested it, but it might be the case that it'd need to be a STORED generated column to be referenced like that; that shouldn't be a big deal, though.
I'm not trying to shill for the company I work at, but in Hasura you can make FKs and relationships between views and treat them like regular insertable/updatable tables.
It was one of the many things that impressed me so much when I was a user that made me want to hack on the tool for a living.
It's an article on a personal blog targeting a specific readership familiar with the term.
More broadly, I think the level of granularity at which it makes sense to define terms (Mesos, Kubernetes, even Uber) has to roughly match the level of familiarity of the reader that will get something meaningful from the piece.
> It's an article on a personal blog targeting a specific readership familiar with the term.
He's the CTO of a "wellness" tech company now and writes broadly on the topic of engineering and engineering management. The folks who will read this likely are familiar enough with these technical terms to glean knowledge, or at least to gloss over them.
https://playbooks.hypergrowthpartners.com/p/picking-your-lif...
I wrote this up about a year ago for a more comprehensive perspective for companies at Series A and beyond.