We run a ton of A/B tests at Spotify and we look at a ton of metrics. Defining metrics is a little bit of an art form. Ideally you want to define success metrics before you run a test, to avoid cherry-picking metrics after the fact. You also want to define a metric that has as high a signal-to-noise ratio as possible. And most importantly, your metric should ideally correlate with high-level business impact as much as possible.
One pet peeve I have is metrics defined as ratios. While some of them are useful, there are usually severe caveats that you can spot by just thinking about what goes in the numerator and what goes into the denominator.
Example 1: Average session length
Bad metric. What happens if you add a couple of short sessions on top of your existing numbers without changing anything else? Eg. you could increase the number of sessions by 10% while increasing total session time by only 5%. Users are listening more overall, which is a good thing, but the average session length goes down, so your metric would tell another story.
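A quick numeric sketch of this (all numbers made up for illustration):

```python
# Baseline: 1000 sessions totaling 30,000 minutes (avg 30 min/session).
baseline_sessions = 1000
baseline_total_minutes = 30_000.0

# Treatment adds a bunch of short sessions: +10% sessions, but only +5% total time.
treatment_sessions = int(baseline_sessions * 1.10)        # 1100
treatment_total_minutes = baseline_total_minutes * 1.05   # 31,500

avg_before = baseline_total_minutes / baseline_sessions    # 30.0
avg_after = treatment_total_minutes / treatment_sessions   # ~28.6

# More sessions AND more total listening time, yet the ratio metric went down.
print(avg_before, avg_after)
```

Both the numerator and the denominator grew, but the denominator grew faster, so the ratio dropped despite the product getting better.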
Example 2: Number of clicks per user
What if you launch a feature that sucks and you churn out a bunch of low-intent users? You're left with mostly high-intent users, who drive this metric up, even though losing users goes against what you mean by "success".
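The survivorship effect in miniature (a toy population with assumed numbers):

```python
# 100 high-intent users with 20 clicks each, 100 low-intent users with 2 each.
high_intent = [20] * 100
low_intent = [2] * 100

before = high_intent + low_intent
after = high_intent  # the bad feature churned out all the low-intent users

clicks_per_user_before = sum(before) / len(before)  # 11.0
clicks_per_user_after = sum(after) / len(after)     # 20.0

# Total clicks fell from 2200 to 2000, but the per-user ratio "improved".
print(clicks_per_user_before, clicks_per_user_after)
```

The denominator (users) shrank faster than the numerator (clicks), so the ratio goes up while the business goes down.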
Example 3: Repeat consumption (or bounce rate)
If you encourage content discovery, you might hope that people enjoy new content so much they come back to it. But you might grow superficial, one-off discovery even faster, meaning this metric goes down even if repeat consumption went up in absolute terms.
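Same denominator trap, sketched with assumed numbers:

```python
# Before the discovery feature: 600 of 1000 plays are repeat plays (60%).
repeat_before, total_before = 600, 1000

# The feature adds 500 new discovery plays, and 100 of those become repeats.
repeat_after = repeat_before + 100   # 700 repeat plays: MORE than before
total_after = total_before + 500     # 1500 total plays

ratio_before = repeat_before / total_before  # 0.60
ratio_after = repeat_after / total_after     # ~0.47

# Repeat consumption rose in absolute terms, but the rate fell.
print(ratio_before, ratio_after)
```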
Example 4: Skip rate
Imagine Spotify Radio. Same thing as 2: churning out a bunch of low-intent users may actually improve the skip rate, even though this is a bad thing. Conversely, building a better product might paradoxically increase the skip rate, because of an influx of low-intent users who dig the feature but skip a lot.
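The "better product, worse skip rate" direction, with assumed numbers:

```python
# 100 high-intent users: 100 plays each, 10% skipped.
before_plays = 100 * 100   # 10,000 plays
before_skips = 100 * 10    # 1,000 skips

# A better product attracts 50 new low-intent users: 40 plays each, 50% skipped.
new_users = 50
after_plays = before_plays + new_users * 40   # 12,000
after_skips = before_skips + new_users * 20   # 2,000

rate_before = before_skips / before_plays  # 0.10
rate_after = after_skips / after_plays     # ~0.17

# More users and more plays (good), but the skip rate got worse.
print(rate_before, rate_after)
```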
So what metric should you use?
In general, unless you have a really good reason for it, avoid metrics that look at the ratio between two quantities. Instead, I prefer metrics such as total time, number of daily active users, total number of clicks. These metrics are pretty uninteresting in themselves (what does it tell you that the user base spent 10,000 years listening to playlists yesterday?) but they let you draw conclusions about the differences. Eg. if the total number of clicks went up by +5%, then that’s a good thing.
For A/B tests where you have uneven proportions between groups, you can simply extrapolate to the whole population by dividing by the ratios. Eg. if 1% of the users are in test group A, and 2% in group B, multiply the metrics by 100x and 50x, respectively. Alternatively, just divide them by the total number of registered users in each bucket. That’s a static denominator, so it’s totally cool to do so.
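The extrapolation above, as a sketch (group shares and click totals are made up):

```python
registered_users = 1_000_000
share = {"A": 0.01, "B": 0.02}            # 1% of users in A, 2% in B
total_clicks = {"A": 5_000, "B": 10_400}  # observed totals per group

# Option 1: scale each group's total by the inverse of its allocation share.
extrapolated = {g: total_clicks[g] / share[g] for g in share}
# A -> 500,000 clicks; B -> 520,000 clicks if rolled out to everyone.

# Option 2: divide by registered users per bucket (a static denominator,
# so it doesn't suffer from the moving-denominator problems above).
per_registered_user = {
    g: total_clicks[g] / (registered_users * share[g]) for g in share
}

print(extrapolated, per_registered_user)
```

Either way, B comes out about 4% ahead of A, and the comparison is valid because the denominators can't be moved by the treatment itself.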
There are a million pitfalls with A/B testing and metrics. This is not an argument against any of it per se. Don’t throw out the baby with the bathwater, just stay cool and make sure you do the right thing :)