This isn't just a startup thing; it's common at FAANG too.
Not only are experiments commonly multi-arm, you also repeat an experiment (usually after making some changes) if the previous run failed or did not pass the launch criteria.
This is further complicated by the fact that launch criteria are usually not well defined ahead of time. Unless it's a complete slam dunk, you won't know until your launch meeting whether the experiment will be approved for launch. It's mostly vibes-based: the decision rests on tens or hundreds of "relevant" metric movements, often on the whim of whichever stakeholder is sitting at the launch meeting.
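The "tens or hundreds of relevant metric movements" point can be made concrete: if each metric is tested at the 5% level and a case for launch can be built on any one of them, a spurious "significant" movement becomes nearly certain. A minimal sketch (the metric counts are illustrative, not from the thread):

```python
# Familywise false-positive rate when a launch decision can lean on
# any one of k independently tested metrics, each at alpha = 0.05.
# Under the null (the change does nothing), each metric still has a
# 5% chance of showing a "significant" movement by luck alone.

alpha = 0.05

for k in (1, 10, 20, 100):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:3d} metrics -> P(>=1 spurious significant move) = {p_at_least_one:.3f}")
```

At 100 metrics the probability of at least one fluke movement is above 99%, which is why "some relevant metric moved" is such weak evidence on its own.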
The idea is not to do science. The idea is to loosely systematize and conceptualize innovation: to generate options and create a failure-tolerant system.
I'm sure improvements could be made... but this isn't about the experiment being valid or invalid.
The standard for science is much higher: for an academic, publishing an effect that arose by chance is a genuine failure.
When you A/B test, mistakes are generally reversible and won't bankrupt the company or cost you your job. A 1-in-20 fluke is an acceptable risk; you'll get most decisions right. Compare this to hairy decisions like entering a new market or creating a new product line: there are no A/B tests or scientific frameworks there. You gather all the evidence you can, estimate the risk, and make a decision.
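The "1 in 20 fluke" compounds with the repeat-until-it-passes loop described upthread: a change that truly does nothing, rerun (with cosmetic tweaks) after each failure, eventually clears a 5% bar. A Monte Carlo sketch, with an illustrative retry count:

```python
import random

random.seed(0)

ALPHA = 0.05      # per-experiment false-positive rate
RETRIES = 3       # illustrative: times a failed experiment is rerun after tweaks
TRIALS = 100_000  # simulated null changes

# A null change: each (re)run independently "passes" with probability ALPHA,
# and the team launches as soon as any run passes.
launched = sum(
    any(random.random() < ALPHA for _ in range(1 + RETRIES))
    for _ in range(TRIALS)
)

print(f"Effective false-launch rate after {RETRIES} retries: {launched / TRIALS:.3f}")
# Analytically: 1 - (1 - 0.05)**4 ~= 0.185, nearly 4x the nominal 5%.
```

Whether that inflated rate is acceptable is exactly the business-risk judgment the comment describes; the point is only that the nominal 5% understates it.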
You're describing conditioning analyses on data. Gelman and Loken (2013) put it like this:
> The problem is there can be a large number of potential comparisons when the details of data analysis are highly contingent on data, without the researcher having to perform any conscious procedure of fishing or examining multiple p-values. We discuss in the context of several examples of published papers where data-analysis decisions were theoretically-motivated based on previous literature, but where the details of data selection and analysis were not prespecified and, as a result, were contingent on data.