Why You Don't A/B Test, and How You Can Start This August (kalzumeus.com)
119 points by dangrossman on Aug 5, 2013 | 56 comments



I hope no-one minds me touting a free course I'm offering on bandit algorithms:

http://noelwelsh.com/data/2013/08/05/why-you-should-know-abo...

I was literally working on the post when Patrick's email arrived. It's more focused on the algorithms and will, I think, be a good complement to what Patrick is offering.


I upvoted you from zero, in the interest of HN remaining a supportive and congenial place for people who are doing stuff to move the industry's state of play forward.


Thanks! I appreciate your support, especially given you don't agree with me on the benefits of bandit testing.


I signed up as someone who has a deep need for this. We are trying to run through 4 months of cold calling data to identify the ideal customer profile / call frequency etc. that has the highest odds of converting into a sale. I vaguely remember learning some of it in my Marketing Research class but have since forgotten.

From your post it sounded like a free online course but it seems like the free course will be presented in person? I wish you had a more formal online offering.


The free course is online.

Let me give the backstory, which I hope will clarify things:

I'm giving a talk at Strata London (a "Big Data" conference) in a few months. I won't be able to go into much detail in the talk itself, so I want to create supplementary material. It seems sharing this material online will be good for others (and for me). I don't want to plonk it all on the web because 1) I don't want to make all my material available till after the talk and 2) I want to build a mailing list of people interested in this kind of stuff.

I may build the material into a more formal course if there is sufficient interest.


I've been doing some simulations with bandit algorithms. Early results make them look pretty good (http://www.eanalytica.com/blog/split-test-methodologies-prev...).

However, this result could also be explained by the bandit having a lot more variations to play with. I have a lot more work to do on this before I feel sure of an answer.
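For anyone who wants to poke at the idea themselves, here is a minimal Thompson sampling sketch in R (this is not the linked simulator's code, and the conversion rates are made up for illustration):

  # Thompson sampling over Bernoulli arms: keep a Beta posterior per variation,
  # sample once from each posterior, and show the winning variation to the next visitor.
  set.seed(1)
  true_rates <- c(0.10, 0.11, 0.09)              # made-up conversion rates
  successes <- failures <- rep(0, length(true_rates))

  for (visitor in 1:10000) {
    draws <- rbeta(length(true_rates), 1 + successes, 1 + failures)
    arm <- which.max(draws)                      # play the arm with the best draw
    converted <- rbinom(1, 1, true_rates[arm])
    successes[arm] <- successes[arm] + converted
    failures[arm] <- failures[arm] + (1 - converted)
  }

  successes + failures   # most traffic ends up on the best variation

Comparing the total conversions from a run like this against a fixed even split is roughly the kind of comparison the simulations above are making.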


Interesting. I'd like to know more about how you set up your experiments.


I've chucked the simulation code up on GitHub: https://github.com/richardfergie/split-test-simulator

Current things that I think are wrong with it:

* Successive tests should not be independent; picking the wrong option should have repercussions beyond the next test

* The distribution of conversion rates from which samples are drawn was picked because it is convenient not because it is correct

* Takes no account of resource issues around creating new variations e.g. it is unlikely that people can create a new variation every day but some strategies assume they can.

Please let me know your thoughts


I've now started flagging these kinds of submissions on HN, for the simple reason that they are little more than ads for something vaguely specified that may or may not even exist yet. There is absolutely no directly actionable content in this article that I can see, nor any other original ideas that anyone following HN doesn't see several times a week on the front page alone.

I post this only so I can note that this is not intended as a criticism of Patrick, just a view that the hero worship and endless reposting of A/B evangelism has far overstayed its welcome IMHO. If people want to read the fluff as well, then as we get told every five minutes they can sign up for Patrick's newsletter. Please, for the love of all that is holy, spare those of us who have chosen not to do so the endless join-my-mailing-list spam that is starting to dominate HN, and save the HN submissions for substantial content.


This email is more infomercial than the emails from the past few weeks. I reread the last two emails and I think they are more interesting, but this one got many more upvotes here.

Anyway, this email convinced me to try to run an A/B test. I have a button that I don't like and I got negative feedback about that button in the past. So I will (probably) try to optimize the microcopy, but I must understand and change the PHP code that renders that button. I'll report the result when the experiment finishes. Wish me luck!


Flagging is for stuff that's inappropriate for this site, like political articles, the latest injustice, news stories, conspiracy theories and so on.

For stuff like this that you may not happen to like, just don't vote for it, and ignore it. I tend not to vote up patio11's stuff these days even though it's good, just because it gets so much attention.

But it is 100% on topic for this site and does not deserve to be flagged.

How much time has patio11 spent giving free advice and information here? So what if he also sells some stuff? That's what businesses do, sheez!


> Flagging is for stuff that's inappropriate for this site

I don't think almost completely content-free promotional material is appropriate for this site, whoever writes it and whatever it is advertising.

I don't flag many submissions, and following the FAQ, I don't normally comment at all when I do. However, I don't know what actually happens when something gets flagged, so in this case I made an exception because I'm aware that Patrick himself is wary of submitting pieces like this. I didn't want anything to imply a personal attack, nor general criticism of links about A/B testing if they do offer new/interesting material.


It's worth mentioning that that's exactly why Patrick doesn't normally submit this stuff himself. In fact, I've seen him apologize and ask people not to just re-post stuff here but to subscribe to his newsletter/blog/etc. instead.

It's also worth noting that you don't HAVE to click through to something on the kalzumeus domain if you already subscribe to the newsletter or just don't want to.

Personally, I think anything that generates a reasonable number of good comments is great (not necessarily this.) That's why HN is great.


One thing that unnerves me about A/B testing is that you can't test for evilness. How much of this glossary[1] do you think was intentionally designed? I doubt all of these changes were intentional; it's just that those changes showed objective results.

A/B testing can't test for morality, and you may very well be implying something with your B design that you didn't mean to imply, which nonetheless rates higher in your test.

In a business setting, it becomes awfully hard to argue morality against objective numbers. It's hard to do that anyway. Many businesses operate with a profit-first motive. So once a dark pattern is in, how are you going to get it out?

Don't get me wrong, A/B testing is a tool, and like all good tools it can be used for good or evil. It just worries me that A/B testing, despite good intentions, can lead to evil results. I don't see anything on this page about how to avoid evil results.

1. http://darkpatterns.org/


I can see how you don't want to be in a situation where you are holding in your hand evidence that Shady Marketing Tactic X is proven to convert better. But I think the solution is simple: Don't A/B test shady features. If something makes you uncomfortable or seems like a dark pattern, don't test it, because you would never want it on your site anyway.

I don't think this is a failing of A/B testing. That someone can get numbers that show "evil" tactics can make money is not materially different from someone coding "evil" HTML in Notepad and getting results. As you said, it is just a tool.


The problem is you might not be aware you're making a shady feature. As an example, say you are making an ad for your app, Fooer. You A/B test some banners on an ad distribution network and find that a simpler ad with a "download now" hyperlink graphic proves effective.

No big deal, right? People are going to be seeing these banners on places like news sites, right? It's not tricking people, and it's also pretty reasonable: people are used to clicking on hyperlinks, and it overcomes the resistance to clicking on images. A reasonable move for a banner ad to make.

Except the ad distribution network serves content sites like softpedia as well.

You saw the spike because you emulated the download button right under the mirror list, and people were confused. They downloaded your application thinking they were getting something else.

Or what if I want a "Hey, you should sign up for notifications for other job offerings in this field" message box? I'd love to have one of those in my application. I A/B test it and see that this message box drastically increases signups.

Except it was because people thought it was a paywall.

The road to hell is paved with good intentions.

Consider this case about the Weebly blogging service: http://minimaxir.com/2013/05/overly-attached-startup/

They were A/B testing messaging for first-week signups. This hyper-aggressive and insane route was actually considered. Do you think they're black hat? I don't think so, but what if this had become standard practice?


I know exactly why we[1] don't A/B test.

It is because I don't want to serve the people that it would be effective on.

If you're marginally buying / not buying my product based on color scheme or (buzz)word order or some other piece of puff, we're all better served if you just moved along.

[1] You know who we are.


So you think you and your "target customers" are immune to the effects of subtle changes? You and your customers are the lucky ones with complete, unfiltered, and unbiased access to the conscious and subconscious (well, not to you) processes going on in your brain that influence your day-to-day life?

Please.


Just a thought -- this POV implies that there is a single way to explain every piece of information on your site that is optimal for each person who might be a customer. (The corollary is that all your customers are the same.) A person doesn't have to be swayed by a buzzword, color scheme, or fluff to respond to different presentation of facts. You know this is true if you've ever had to explain anything to anyone. Some examples of how different presentations of your content might affect an individual:

- English: first language?

- Country/region of origin?

- What did they read before coming to your site?

A lot of things influence human behavior, and A/B testing isn't just to tease out the fluffy bits you can use to sell to the gullible.


I'm a customer of yours. I'm in your target market (guessing: technical person who wants a proper backup solution).

However, I would never have found or bought your product if I hadn't seen your link on a thread on HN about git-annex a few days ago. If I hadn't followed that link specifically from talking about a project I'm really enthusiastic about I doubt I would have bought.

> It is because I don't want to serve the people that it would be effective on.

If you remove the set of all people it will be effective on from the set of all humans then you will get ∅.


If your A/B testing is about color schemes and buzzword order, you're doing it wrong. Random spaghetti testing is a surefire way to waste your time.

“Green vs orange” is not the essence of A/B testing. It’s about understanding the target audience. This starts with research, and your hypotheses are validated with split testing. Doing research and analysis can be tedious and it’s definitely hard work, but it’s something you need to do.

Serious gains in conversions don’t come from psychological trickery, but from analyzing what your customers really need, the language that resonates with them and how they want to buy it. It’s about relevancy and perceived value of the total offer.

Unless you have the ability to foresee the future, it's impossible to know in advance which language, content and layout will resonate the best with your target audience.


Most of the case studies around A/B testing are very misleading. The value you get from testing color schemes is likely to be insignificant - unless you have a really horrible color scheme as one of the options or you have traffic comparable to the Googles and Facebooks of the world.

A/B testing is overrated, imho. We have blogged about it here -> http://blog.nudgespot.com/2013/06/time-to-rethink-ab-testing.... Customers are a lot more intelligent about their purchase decisions than these blogs would like us to believe.

That does not mean you shouldn't test. You definitely need to test and learn from your customers. There are other, better ways of doing it than just the A/B testing that is so often advocated.


Is your desire to only serve certain people based in the long run on profit motive (i.e. you think your fewer customers will be more valuable and stick around longer), or is it just out of principle even if it means less profit in the long run?


It is done out of principle.


How do you know your current design is excluding the right people? The people who are buying your product now might just be reacting to your current color scheme (or lack thereof) or word order.


I don't mean this to be snarky so please don't take it that way. I don't see where color scheme, button text or carousel slide order will make a difference if you're selling something someone really needs, is informed about, etc. Maybe the argument is less "red buttons get 0.8% more people to buy" and more "red buttons get people to read our copy 40 seconds longer and that means 0.8% more people will understand what we're selling," but it's always been presented as "convince those on the edge to buy whether they understand/need the product or not."


One thing I learned early on from working at a college computer lab in my late teens is that people very often don't even see certain elements on the screen. The post I responded to made it sound like people are deciding not to buy because they consciously evaluated the button color and didn't like it.

Maybe people didn't notice the button when it was a different color. An A/B test couldn't tell you that but having watched a lot of people using unfamiliar software, that seems a lot more plausible.

If the point of a website is to communicate then A/B tests are a good way of making sure that you're communicating effectively.


> I don't see where color scheme, button text or carousel slide order will make a difference if you're selling something someone really needs, is informed about, etc

This is true if you are unique in the market or a lot better than competitors. But if you are only 5% better, even people who know quite a lot about your services and niche will not always be able to tell.

Agreed that "white hat" conversion rate optimisation should be about getting people to understand the product.


If you are a consumer facing site, every single customer of yours will have a point where they know nothing about you and need to decide whether to go with you or a competitor. That's the kind of thing that A/B testing is good for.

However if you've got an "enterprise product" that people have to use because their management tells them that they have to, then this does not apply to you.

So what you say can be true. But it is probably true for fewer companies than think it applies to them.


If you're looking for things to A/B test, please check out my post:

https://news.ycombinator.com/item?id=6163397

It details 19 A/B test ideas that have worked for us in the past. I hope HN finds it useful!


I'm interested to see the sample size calculations. Those numbers don't jibe with any calculations I've ever done.


Nor me. To detect a 1% absolute change:

   > power.prop.test(p1=0.1, p2=0.11, power=0.8,sig.level=0.05)
     Two-sample comparison of proportions power calculation

              n = 14750.79
             p1 = 0.1
             p2 = 0.11
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

   NOTE: n is number in *each* group
Edit: so that's roughly 30,000 samples in total. To do the same with a 5% absolute change in the same area is still about 1,300 samples (power=0.8, sig.level=0.05)
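For reference, here is a sketch of the normal-approximation formula that produces that per-group figure (the helper name n_per_group is made up):

  # Two-sample proportions, two-sided test: sample size per group needed to
  # detect a change from p1 to p2 at the given significance level and power.
  n_per_group <- function(p1, p2, power = 0.8, sig.level = 0.05) {
    pbar <- (p1 + p2) / 2
    za <- qnorm(1 - sig.level / 2)   # ~1.96 for a two-sided 5% test
    zb <- qnorm(power)               # ~0.84 for 80% power
    (za * sqrt(2 * pbar * (1 - pbar)) +
     zb * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))^2 / (p1 - p2)^2
  }
  n_per_group(0.1, 0.11)   # ~14751, matching the power.prop.test output above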


I normally consider the relative minimum detectable effect. Your 5% absolute change is a 50% increase in the base rate. I also typically go for a higher power (e.g. 0.9). Under these conditions 60K samples is more typical.

Sample size calculator here: http://www.evanmiller.org/ab-testing/sample-size.html

You'll need to change the defaults to match the above to get the figures I mention.


You're right. That's a bit more reasonable, so back to my 10% relative change in base rate (1% absolute) but with 90% power:

  > power.prop.test(p1=0.1, p2=0.11, power=0.9,sig.level=0.05)

     Two-sample comparison of proportions power calculation

              n = 19746.62
             p1 = 0.1
             p2 = 0.11
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

  NOTE: n is number in *each* group
Requires about 40,000 samples per test. I would strongly recommend anyone serious about doing this look into MAB (multi-armed bandit) testing, as A/B testing is way too expensive for testing at reasonable scale (unless you have a strong a priori hypothesis to test).


If you have any questions, feel free to ask.


Very often your current users get used to your design and hate it when you change something. Is there a way to do A/B testing on your current users, or should you try it only on new ones?


You can, theoretically, run A/B tests against new users only. Most A/B testing tools do not support this behavior out of the box.
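As a sketch of the kind of thing a tool would need to support (the field names and dates here are made up; the CRAN digest package supplies the hash):

  # Only bucket users first seen after the experiment started; everyone else
  # stays on the control. The hash makes the assignment deterministic, so a
  # given user always sees the same variant.
  library(digest)

  assign_variant <- function(user_id, first_seen, experiment_start) {
    if (first_seen < experiment_start) {
      return("control")                # existing users never enter the test
    }
    h <- strtoi(substr(digest(user_id, algo = "md5"), 1, 7), base = 16L)
    if (h %% 2 == 0) "A" else "B"      # deterministic 50/50 split
  }

  assign_variant("user-123", as.Date("2013-08-10"), as.Date("2013-08-01"))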

The one time I made a change which was drastic enough to consider doing that, I just put a revert button on the interface (on both sides) and got ready to tell people how to use it. It turns out nobody asked.

In general, though, many of the things which are most valuable to test border on imperceptible for long-term users of the site. For example, do you use Dropbox? (Picking a well-known example locally.) Can you identify the H1 on dropbox.com? Can you identify the button copy on dropbox.com? Most seasoned users can't, and won't notice changes to these elements, yet they're strongly influential on free trial signups.

Can you freehand sketch the Dropbox credit card form? How about identifying the button copy on it? These have substantial impact on purchases at the margins. People only see them once (well, typical case for software), and nobody remembers them for more than a few minutes.

Think back to your first run experience with Dropbox. How many steps did it have? What were their names? I'm going to bet you that the onboarding process for Dropbox has had more dedicated optimization effort than everything else the company does combined, but the median number of exposures per user is one.


Great insight! Though I'd expand on your bet and claim that at almost every major site, a majority of testing is channeled into onboarding. It's certainly the case at Twitter and Facebook that a lot of work goes into optimizing a user's first steps.


Why are you suggesting A/B instead of multi armed bandit approaches?


Great question. It's a long answer and it gets sort of involved.

1) It is easier to get adoption of A/B testing -- which many people have heard of, which many agree that they should be doing, and which captures substantially all of the benefits of bandit testing -- than bandit testing, at the typical company. e.g. If I go to your software company CMO and say "Do you know what A/B testing is?", if the answer is "No", then the CMO is not quite top drawer. "No, and there is no reason why I should know that" is a perfectly acceptable answer for bandit testing.

2) There are some subtleties about actually administering bandit tests, for example in how tests interact with each other or with exogenous trends in your traffic mix, which sound like they could cause operational nightmares. A/B testing does not have 1-to-1 analogues to these problems, and many of the theoretical problems with A/B testing are addressable in practice via e.g. good software and good implementation practices, both of which exist in quantity.

3) A/B testing has vastly better tool support than bandit testing, which currently has one SaaS startup and zero OSS frameworks which I am personally aware of.

4) On a purely selfish note which I'd be remiss in not mentioning, I'm personally identified with A/B testing in a way that I am not with bandit testing.

5) Again, convincing people to start A/B testing will be better 100 out of 100 times than failing to convince people to start bandit testing, which is the default result. Consider the operational superiority of A/B testing for software companies in August 2013, then look at the empirical results: very few companies actually test every week.

(There is also a zeroth answer, which is "I have reviewed the arguments for doing bandit algorithms over A/B testing and frankly don't find them all that credible" but for the purpose of the above answers I assumed that we both agreed bandit was theoretically superior.)


A/B testing is a catch-all term for multi-variant experimentation. Multi-armed bandit is a specific approach to testing[1], and even though most frameworks provide A/B/N testing -- that is, not necessarily just two variants -- it is easier to say 'A/B' instead of 'A/B/N'.

[1] http://analytics.blogspot.ca/2013/01/multi-armed-bandit-expe...


Indeed, Google Analytics has added the multi-armed bandit approach to Content Experiments. It's quite slick, btw, but definitely more difficult to implement than traditional split testing. My 2 cents:

[1] the sample size calculator (How many subjects are needed for an A/B test?): http://www.evanmiller.org/ab-testing/sample-size.html

[2] <unashamed plug> easyAB: a jQuery plugin to easily start A/B testing using Google Analytics. http://srom.github.io/easyAB/ </plug>


Accessibility, I'd guess. Same reason "lift heavy three times a week" is a perfectly good training recommendation for 90% of the U.S.


Why the throwaway account to ask that?


Do you have something that isn't quite as deep technically as the slides you linked to but still explains the math behind A/B testing? Statistics were never my strong suit.

Also, where did you learn all this A/B testing stuff? I had never heard of it before I started reading^H^H^H^H^H^H^H^H stalking you.


Like most things, I learned the first 10% by reading on the Internet (A/B testing is very much not the new hotness that was just discovered by software companies in 2008) and the next 90% by throwing stuff at the wall and taking notes on what sort of stuff tended to stick.


> 60% yearly revenue increase in 2012 on the strength of a brief series of A/B tests...

I appreciate your transparency about Bingo Card Creator, but sometimes you make it sound like such easy money. Do you ever worry about your openness encouraging direct competitors?


It had competitors before it launched (about a dozen of them) and has been cloned at least 3 times due to my forum participation over the years.

At the risk of stating the obvious:

a) If one is sufficiently skilled to duplicate e.g. the Bingo Card Creator SEO strategy, all one has to do is apply it to a higher-value niche like e.g. distressed real estate, like one of the gents at the Bootstrapped With Kids podcast did, and one will make radically more money.

b) There are probably easier competitors in the world to take money from than me.

c) Even supposing BCC revenue were to be materially impaired, I'd barely notice that. In point of fact, it is (down 40% or so this year) and I am still not more than peripherally aware of that.


"down 40% or so this year"

What is your best guess as to what is causing that? Is it primarily penguin/panda related?


The bit about 37signals reminded me of their timelapse video, "evolution of a homepage"[1], which shows just how many things they tried before sticking with something (albeit briefly).

[1] http://vimeo.com/29088090


All very good stuff, but this line stood out in particular:

"Video makes Waterfall software development look like bugs-in-your-teeth speed, though"

As someone who is coming to the close of a 4-year work stint on a short film, I can tell you, you ain't wrong.

Indeed, it has gotten to the point that I've started a computer game design project as light relief.


> It was a preventable accident.

Was it really preventable? How should we know when to make a quick decision and when to A/B test?

[Clarification: it seems to me that it's better to decisively get things done]

Considering that the most widely known A/B tests are based on 41 shades of blue and pixel-perfection, I'm not sure that this is as obvious as the article claims.


> the most widely known A/B tests are based on 41 shades of blue and pixel-perfection

This is not true among people who actually A/B test for a living. (The 41 shades of blue thing is a test cherry picked to suggest that testing is not material. The only reason the world knows about that test, as opposed to others conducted by Google/MSN, is because someone who believed they didn't fit in a culture of testing called out that test as the reductio ad absurdum of that culture.)

Without saying exactly what it was that the client didn't know at the time, suffice it to say that if you got five A/B testing practitioners in a room and asked them for the top five things to try on that client's site, every last one would have listed the problematic area as something to test. I mean, it wasn't the H1 on the front page, but it could have been.

This is similar to "How do we make our pages load faster?" Are there large amounts of subjectivity and risk involved here? Yes. Trying to outguess your favorite SQL query optimizer sometimes feels like reading chicken entrails. But, if you're not using gzip yet, then you should turn on gzip, because gzip always wins.


I accept that you are an expert and that there is value in this expertise, but I remain unconvinced that it's so obvious where to execute A/B tests (in contrast to page load time, where bottlenecks can be measurably identified).


What the hell is S/J testing? Never heard of it, and apparently neither have the googles.


Steve Jobs.



