Erik Bernhardsson

In defense of false positives (why you can’t fail with A/B tests)

Many years ago, I used to think that A/B tests were foolproof and all you need to do is compare the metrics for the two groups. The group with the higher conversion rate wins, right?

Then, for a long period, I ran a lot of tests. I started using confidence intervals and learned about all the pitfalls of A/B testing: what to think about when running many A/B tests, why you shouldn’t check your metrics every day, why you shouldn’t optimize for a local maximum, and so on. I started becoming paranoid.

There’s about a million blog posts out there saying how everyone’s doing A/B testing wrong and how you should do it instead. It’s like there’s some secret society of people and the only way to join this club is to sacrifice a limb. You clearly have no idea what you’re doing… why are you even thinking about A/B testing? Go back and hack on your cousin’s webshop, or else pick up this 1,500-page book about Bayesian statistics and come back in two years.

The other side of this is about half a gazillion people who argue that A/B testing is inherently flawed. What I used to say was: Don’t throw out the baby with the bathwater. As long as you’re aware of the pitfalls, it’s a great tool. I thought for many years that running A/B tests without a thorough understanding of all their shortcomings was dangerous.

But then I changed my mind. Here’s the thing: Even if you’re doing A/B testing completely wrong, you are probably benefiting from it. Even if you don’t care about confidence intervals or multiple comparison corrections, or if you are simply too impatient to wait, you are probably still doing the right thing. The reason is that user metrics optimization is not a drug trial.

What do I mean by this? There are just a few things that govern the success of an A/B test:

  1. The impact of a true positive. Assuming you end up deploying the right thing, what’s the business impact?
  2. The cost of a false positive. Assuming you end up deploying the wrong thing, what’s the business impact?
  3. The prior probability of success. Before you start running the test, what’s the probability of success? In the long run, what’s the success rate of testing?
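One way to see how these three factors interact is to fold them into a single expected-value number. The sketch below is only an illustration, not a formula from this post: the function name, the `power` and `alpha` parameters (how often the decision rule catches a real win, and how often it flags a non-win), and every number in the scenarios are assumptions made up for the example.

```python
def expected_gain_per_test(prior, true_positive_impact, false_positive_cost,
                           power=0.8, alpha=0.05):
    """Expected business impact of running one test and shipping whenever it "wins".

    prior                -- probability the variant is genuinely better (factor 3)
    true_positive_impact -- gain from shipping a real improvement (factor 1)
    false_positive_cost  -- loss from shipping a non-improvement (factor 2)
    power, alpha         -- assumed hit rate and false-alarm rate of the decision rule
    """
    p_true_positive = prior * power
    p_false_positive = (1 - prior) * alpha
    return (p_true_positive * true_positive_impact
            - p_false_positive * false_positive_cost)


# Hypothetical scenarios in arbitrary units, chosen only for illustration.
high_stakes_low_prior = expected_gain_per_test(
    prior=0.01, true_positive_impact=100, false_positive_cost=1000)
low_stakes_high_prior = expected_gain_per_test(
    prior=0.5, true_positive_impact=1, false_positive_cost=0.2)
# Same low-stakes test, but run sloppily: a 40% false-alarm rate.
low_stakes_sloppy = expected_gain_per_test(
    prior=0.5, true_positive_impact=1, false_positive_cost=0.2, alpha=0.4)

print(high_stakes_low_prior, low_stakes_high_prior, low_stakes_sloppy)
```

With these made-up inputs, the high-stakes, low-prior scenario comes out negative, while the low-stakes scenario stays positive even when the false-alarm rate is pushed to 40%.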

For a drug trial, the impact of a true positive is huge. You found a cure for baldness! But the cost of a false positive is even bigger: it turns out your drug doesn’t work, and it’s also causing hallucinations. Finally, if you’re a drug company, you probably evaluated 100 different drugs before finding one that seems to work, meaning the success rate of any specific drug is minimal.

This is why drug trials are subject to such intense scrutiny by government agencies. It’s also why most published research findings are false.

But you’re not a drug company, nor are you trying to find the Higgs boson. You’re basically evaluating whether a bigger “sign up” button leads to more conversions. In fact, most of your tests are driven by strong hypotheses with a large prior belief. You have a clever idea of how to impact users and historically few A/B tests show negative results.
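The role of the prior can be made concrete with Bayes' rule: given a "significant" result, how likely is it that the effect is real? This is a rough sketch under stated assumptions. The 1% prior is a loose reading of "evaluated 100 different drugs", the 50% prior for a product test is a placeholder for "a large prior belief", and the `power` and `alpha` defaults are conventional values, not numbers from this post.

```python
def positive_predictive_value(prior, power=0.8, alpha=0.05):
    """Probability that a "significant" result reflects a real effect (Bayes' rule).

    prior -- probability, before the test, that the effect is real
    power -- assumed probability of detecting a real effect
    alpha -- assumed probability of flagging an effect that is not there
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Assumed priors, for illustration only.
ppv_low_prior = positive_predictive_value(prior=0.01)   # long-shot hypothesis
ppv_high_prior = positive_predictive_value(prior=0.5)   # well-motivated product idea

print(f"{ppv_low_prior:.0%} vs {ppv_high_prior:.0%}")
```

Under these assumptions, a significant result is real only about 14% of the time for the long-shot hypothesis, but about 94% of the time for the well-motivated one.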

The cost of deploying the wrong thing (false positives) is also low. You might end up with the wrong color button or some extra code that adds small tech debt. But not more than that. After all, a feature can’t be horrible if metrics aren’t tanking.

The other thing people argue a lot about is which success metric matters. In my experience, it rarely matters. I’ve very rarely seen statistically significant impacts going in two directions (one metric going up, the other going down) as long as you pick metrics in a sensible way (e.g. avoid ratio metrics). But what I have seen is insignificant tests. Lots of them. So if you have to pick a metric, the most important thing is _you should just pick the one with the largest signal-to-noise ratio._ Just don’t cherry-pick a metric after the test is run.
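One possible way to compare metrics by signal to noise before a test starts is to use historical per-user data: divide the difference in means you hope to see by the standard error you would expect. The sketch below is a simplification with loud assumptions: the metric names and data are invented, the same relative lift is assumed for every metric, and both groups are assumed to have the same size and variance.

```python
import math
import statistics


def signal_to_noise(historical_values, assumed_relative_lift, users_per_group):
    """Expected difference in means divided by its expected standard error."""
    mean = statistics.mean(historical_values)
    stdev = statistics.stdev(historical_values)
    expected_difference = assumed_relative_lift * mean
    # Standard error of a difference between two equally sized groups.
    standard_error = stdev * math.sqrt(2 / users_per_group)
    return expected_difference / standard_error


# Made-up per-user history for two hypothetical metrics.
signed_up = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # binary conversion per user
revenue = [0, 0, 0, 120, 0, 0, 0, 5, 0, 0]    # heavy-tailed revenue per user

candidates = {
    name: signal_to_noise(values, assumed_relative_lift=0.05, users_per_group=10_000)
    for name, values in {"signed_up": signed_up, "revenue": revenue}.items()
}

# Choose the metric up front, before any test results exist.
best_metric = max(candidates, key=candidates.get)
print(best_metric, candidates)
```

On this toy data the binary conversion metric scores higher than the heavy-tailed revenue metric, because a few large purchases inflate revenue's variance. Since the choice uses only historical data, it is made before the test runs, which keeps it from turning into cherry-picking.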

Conclusion: Don’t listen to all the haters. Do more A/B testing.

Erik Bernhardsson

... is the CTO at Better, which is a startup changing how mortgages are done. I write a lot of code, some of which ends up being open sourced, such as Luigi and Annoy. I also co-organize the NYC Machine Learning meetup. You can follow me on Twitter or see some more facts about me.