With 56 tests each read along 2 dimensions (112 comparisons in all), seeing ~6 false positives certainly isn't surprising even at a 95% confidence level: 112 × 0.05 ≈ 5.6 expected by chance alone.
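To make that arithmetic concrete, here's a back-of-the-envelope sketch in Python, assuming (as a simplification) that the 112 comparisons are independent:

```python
from scipy.stats import binom

n_comparisons = 56 * 2  # 56 tests, each read along 2 dimensions
alpha = 0.05

# Expected false positives if every null hypothesis were true:
print(n_comparisons * alpha)  # 5.6

# Probability of seeing 6 or more false positives by chance alone:
print(binom.sf(5, n_comparisons, alpha))  # ~0.5, roughly a coin flip
```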
But despite our focus on the null hypothesis, it is really just the break-even case: when two variants are equally effective, a "false" conclusion costs you nothing. So in a real A/B test, the larger and more costly a mistake would be, the less likely you are to make it, because a bigger true difference is that much easier to detect.
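A quick Monte Carlo sketch of that point, with illustrative numbers of my own choosing (5,000 visitors per variant, a 5% baseline conversion rate):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000      # visitors per variant
base = 0.05    # variant A's true conversion rate
trials = 2_000 # simulated experiments per effect size

for lift in [0.000, 0.005, 0.010, 0.020]:
    a = rng.binomial(n, base, trials) / n
    b = rng.binomial(n, base + lift, trials) / n
    mistake = np.mean(b < a)  # fraction of experiments where A looks better
    print(f"true lift {lift:.3f}: picked worse variant {mistake:.1%} of the time")
```

At zero lift you "err" about half the time, but the error is free; as the true (and costly-to-miss) gap grows, the mistake rate collapses.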
Crudely emulating statistical theory is the essence of many AI methods which, despite occasionally ridiculous results, are quite powerful.
If you instead simply measure the conversion rate of each campaign, you cannot tell whether your content is trending worse, because outside factors drive user engagement, especially if you lack an adequate source of new subscribers. Only the most engaged humans will pull enough signal out of that noise to get better over the years without an external system. (If I were not going to A/B test, I would at least first estimate each campaign's conversion rate and then track my Brier score over time.)
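For anyone who wants to try that fallback, a minimal sketch follows; the campaign numbers are made up, and it treats each recipient as an independent binary outcome (converted or not), so the Brier score is just the mean squared error of your predicted probability:

```python
def brier_score(predicted_rate: float, conversions: int, sends: int) -> float:
    """Mean squared error of one probability forecast applied to
    every recipient of a single campaign. Lower is better."""
    hits = conversions
    misses = sends - conversions
    return (hits * (predicted_rate - 1) ** 2 + misses * predicted_rate ** 2) / sends

# Hypothetical campaign log: (my prediction, actual conversions, sends)
campaigns = [(0.04, 180, 5000), (0.06, 220, 5000), (0.05, 310, 5000)]
for predicted, conv, sends in campaigns:
    print(f"predicted {predicted:.0%}, actual {conv/sends:.1%}, "
          f"Brier {brier_score(predicted, conv, sends):.4f}")
```

If that score trends down over time, your model of your audience is improving even when the raw conversion rates bounce around.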
I don't have any experience with email campaigns, but what I see with web ones is that most organisations are fairly delusional about understanding their actual audience, and collectively they are rarely open to accepting corrections from people with less domain knowledge (though domain knowledge is primarily a fancy form of bias).
For example, if your product carries risk, many of the customers who contact you will be the most risk-averse ones looking for security, but 95% of your customers will probably be risk tolerant. I've repeatedly corrected this via an A/B test where the B version's conversion rate continued to hold long after it became the only version, yet their staff will implement A-style changes whenever given an opportunity to make changes without testing. I'd guess the customers they deal with directly give them a false impression, and our repeated demonstrations that significantly more customers hold a different view are not an adequate correction.
Without outsiders it is harder to get such a full correction of your biases, but you can at least get a new employee's (or spouse's, or critical customer's) idea tried out in a way where it is much more likely to quantitatively show promise in a way that demands acceptance, rather than being chalked up to noise by the human emotional system.
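If it helps, the kind of quantitative demonstration I mean can be as simple as a two-proportion z-test on the A and B conversion counts (the numbers below are illustrative only):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """z statistic and one-sided p-value for H0: the two rates are equal."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # one-sided p-value: chance of a gap this large if A and B were equal
    return z, 1 - NormalDist().cdf(z)

z, p = two_proportion_z(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")  # z ~ 2.9, p ~ 0.002
```

A result like that is much harder to wave away as noise than "the new hire's version felt better."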