I usually use a G-test, which is known to be inaccurate unless A and B each have at least 10 failures and 10 successes. That puts a floor on the sample size right there.
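Concretely, the G-test is what scipy's chi2_contingency does when you pass lambda_="log-likelihood"; here's a rough sketch in Python with my 10-successes/10-failures rule of thumb bolted on (the function name and example counts are just for illustration):

    # Rough sketch: G-test on a 2x2 A/B table via scipy.  Only trust the
    # result once each arm has at least 10 successes and 10 failures.
    from scipy.stats import chi2_contingency

    def g_test_p_value(succ_a, fail_a, succ_b, fail_b):
        if min(succ_a, fail_a, succ_b, fail_b) < 10:
            return None  # too little data for the G-test to be accurate
        table = [[succ_a, fail_a],
                 [succ_b, fail_b]]
        # lambda_="log-likelihood" turns the chi-square test into a G-test;
        # correction=False skips Yates' continuity correction.
        g, p, dof, expected = chi2_contingency(table,
                                               lambda_="log-likelihood",
                                               correction=False)
        return p

    # e.g. A converted 40/400, B converted 62/400 (made-up numbers)
    print(g_test_p_value(40, 360, 62, 338))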
After that it's all a question of how many trials you're prepared to run and how urgently you need the answer. I tend to push to be conservative, because the product manager is guaranteed to push the other way. For instance, if you've got fewer than 500 trials I like to push for 99.9% confidence because, "If the effect is really this strong, we'll get there pretty quickly." After that I ease off. I've never tried to sit down and calculate any kind of optimal way to do so.
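If you wanted to encode that ease-off habit it might look something like this; the only cutoff that's actually mine is 99.9% under 500 trials, the later numbers are invented purely to show the shape:

    # Illustrative only: the 99.9%-under-500-trials rule is the one I
    # actually use; the later cutoffs are made up for the sake of example.
    def required_confidence(total_trials):
        if total_trials < 500:
            return 0.999   # tiny sample: demand a very strong effect
        elif total_trials < 5000:
            return 0.99    # hypothetical easing-off point
        else:
            return 0.95    # hypothetical floor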
If I were to try to formalize it, I'd be inclined to set things up so that, for some underlying bias smaller than what we're hoping to measure (for instance a 2% bias), our odds of picking the wrong answer are the same each time. I don't have any analysis behind that idea; it's just something that sounds reasonable to me. If I were to head down that road, I'd probably run some Monte Carlo simulations (roughly like the sketch below) to show that I'd be highly likely to get an answer in a vaguely reasonable time frame at whatever volume the website had. But again, I haven't tried to do that analysis.
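Something like this is what I have in mind, assuming "2% bias" means a 2% relative lift and using scipy's chi2_contingency with lambda_="log-likelihood" as the G-test; the base rate, batch size, traffic cap, and number of runs are all placeholders:

    # Monte Carlo sketch: simulate a small true lift, peek after every
    # batch of traffic the way you would in practice, and count how often
    # the procedure calls the wrong winner or never reaches a verdict.
    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)

    def simulate_one(base_rate=0.05, lift=0.02, batch=1000,
                     max_trials=200_000, confidence=0.99):
        rate_b = base_rate * (1 + lift)   # assume "2% bias" = 2% relative lift
        succ_a = succ_b = trials = 0
        while trials < max_trials:
            succ_a += rng.binomial(batch, base_rate)
            succ_b += rng.binomial(batch, rate_b)
            trials += batch
            fail_a, fail_b = trials - succ_a, trials - succ_b
            if min(succ_a, fail_a, succ_b, fail_b) < 10:
                continue   # G-test not trustworthy yet
            _, p, _, _ = chi2_contingency([[succ_a, fail_a], [succ_b, fail_b]],
                                          lambda_="log-likelihood",
                                          correction=False)
            if p < 1 - confidence:
                # B is the truly better arm, so calling A the winner is wrong
                return trials, succ_a > succ_b
        return trials, None   # hit the traffic cap with no verdict

    results = [simulate_one() for _ in range(100)]
    wrong = sum(1 for _, bad in results if bad)
    undecided = sum(1 for _, bad in results if bad is None)
    print(f"wrong calls: {wrong}/100, never significant: {undecided}/100")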
Came here to say this. G-tests are the way to go; with any other method you end up with problems. I usually go with 98 to 99% confidence because, really, the odds of the loser being much worse than the winner are small enough that it doesn't really matter, and I would rather iterate quickly.