Say you build a machine learning model, like a movie recommender system. You need to optimize for something. You have 1-5 stars as ratings so let’s optimize for mean squared error. Great.

Then let’s say you build a new model. It turns out to have an even lower mean squared error, so you deploy it. You roll it out to users and the metrics are tanking. Crap! Ok so maybe mean squared error isn’t the right thing to optimize for.

The way you solve this, of course, is you start A/B testing your changes. But what metric to choose? People often ask me why we use one or another metric. We typically look at numbers like *Daily active users*, *Day 2 retention*, etc. But what if optimizing too hard for one hurts the other? What if you’re driving day 2 retention but screwing up month 2 retention?

What I like to remind myself is that **everything is a proxy metric**. We really want to maximize shareholder value or something similar (let’s not get into a debate about this; for the purpose of this blog post I’m going to assume that’s our goal).

The problem is, you can’t take all your hyperparameters and calculate the gradient of shareholder value with respect to them.

That’s silly for many reasons, but let’s break it down. First of all, the functional relationship is highly stochastic, and depends on all kinds of external factors. Second of all, there’s no way we can even evaluate multiple values of this function, and it won’t be possible until we invent time machines. Third of all, there’s no way we can extract gradient information *at all.*

So what are we going to do? We’re going to invent a new function that we think is **highly correlated** with shareholder value. We could even define multiple functions if we want to, just to cross check different ones. But we will *never be able to establish the correlation* because of the reasons I just mentioned.

So that’s usually why metrics like daily active users are great. If you are trying to grow your company, it’s reasonable to assume that ultimately user growth will lead to success.

But in general, what properties should such a function have? Ideally as many as possible of these:

- Should be highly correlated with shareholder value. For a company focusing on growth, the number of active users is probably a good one.
- Should be possible to measure separate independent outcomes, using A/B tests, blind tests, or something else. For instance, number of signups is tricky to test.
- Should be fast to measure. We don’t want to launch an A/B test and have to wait many months to get an answer.
- Should have a high signal to noise ratio. You want to extract as much value from it as possible. If you’re running an A/B test you want to reach statistical significance quickly.

One thing I’ve learned the hard way is that it’s often useful to pick a more biased metric if that means you can get results faster or get more of them. For instance, we can roll out an A/B test with two different recommendation algorithms. We probably won’t see an impact on high level metrics such as retention, so we can pick a feature-specific metric instead.
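To make the trade-off concrete, here is a rough sketch of how many users per test group you would need before a difference becomes detectable. It uses the textbook normal-approximation formula for comparing two proportions. The baseline rates and lifts are made-up numbers for illustration, and `samples_per_group` is a hypothetical helper, not something from a real library.

```python
import math

def samples_per_group(p_control, p_treatment, z_alpha=1.96, z_power=0.84):
    """Approximate users needed per group to tell two conversion rates apart.

    Normal-approximation formula for two proportions. The defaults assume
    5% two-sided significance (z_alpha=1.96) and 80% power (z_power=0.84).
    """
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Hypothetical numbers, chosen only for illustration.
# High level metric: retention moves from 40.0% to 40.4%.
retention_n = samples_per_group(0.40, 0.404)
# Feature-specific metric: click rate on recommendations moves from 10% to 11%.
feature_n = samples_per_group(0.10, 0.11)

print("users per group, retention:", retention_n)
print("users per group, feature metric:", feature_n)
```

Under these assumptions the feature-specific metric needs far fewer users, because the change is much larger relative to the noise. That is the whole argument for accepting a bit more bias.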

But we can go even further. Even if we’re A/B testing, we probably have to run that test for two weeks to get any meaningful numbers. At the end of the day we obtain **1 bit of information** (roughly) at best after two weeks, which tells us which test group won.
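The "1 bit at best" claim can be checked with a few lines of Shannon entropy. This is only a sketch: it treats the test outcome as a clean "A won" or "B won" answer, which is a simplification, and `entropy_bits` is a hypothetical helper name.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# If we have no idea which group will win, the answer is worth one full bit.
no_prior = entropy_bits([0.5, 0.5])

# If we were already 90% sure that B would win, the answer is worth less.
strong_prior = entropy_bits([0.9, 0.1])

print(no_prior)      # 1.0
print(strong_prior)  # roughly 0.47
```

So one bit is the ceiling, and you only reach it when you truly had no idea which variant would win.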

If we want to iterate quickly, sometimes it makes a lot more sense to just take the output of the recommendation algorithms and let people go through a blind test. This is slightly more biased because we’re not using real users, but usually a lot quicker, and it also lets us extract a *lot more information*. Not only do we learn which algorithm is better, we often end up with lots of (noisy) anecdotal information, such as “algorithm A sucks for popular content” or “algorithm B is less diverse”.

At the lowest level, if you have a recommender algorithm, the model’s objective (e.g. mean squared error) is another great proxy. Turns out it’s extremely easy to try multiple parameters and learn from them: all you need to do is retrain the model. It totally dominates points 2, 3 and 4, but doesn’t do a great job on point 1.
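Here is a minimal sketch of what that loop can look like. The "model" is a deliberately trivial one that I picked for illustration (a damped per-item mean rating, not any particular production recommender), the ratings are made up, and the names `fit_item_means`, `mse` and `damping` are all hypothetical.

```python
from collections import defaultdict

def fit_item_means(train, damping):
    """Damped per-item mean: items with few ratings shrink toward the global mean."""
    global_mean = sum(rating for _, rating in train) / len(train)
    sums, counts = defaultdict(float), defaultdict(int)
    for item, rating in train:
        sums[item] += rating
        counts[item] += 1
    means = {
        item: (sums[item] + damping * global_mean) / (counts[item] + damping)
        for item in sums
    }
    return global_mean, means

def mse(model, data):
    """Mean squared error; unseen items fall back to the global mean."""
    global_mean, means = model
    errors = [(means.get(item, global_mean) - rating) ** 2 for item, rating in data]
    return sum(errors) / len(errors)

# Made-up (item, stars) pairs, purely for illustration.
train = [("a", 5), ("a", 4), ("a", 5), ("b", 1), ("b", 2), ("c", 5)]
valid = [("a", 4), ("b", 2), ("c", 3), ("d", 3)]

# Retrain once per hyperparameter value and score each on held-out ratings.
results = {d: mse(fit_item_means(train, d), valid) for d in (0.0, 1.0, 5.0)}
best = min(results, key=results.get)
print(results)
print("best damping:", best)
```

Each evaluation takes a fraction of a second and you can run as many as you like, which is exactly why this proxy is so convenient. Whether a lower number here means happier users is a separate question.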

**TL;DR any metric you’re using is just a proxy. Pick the right one for your task.**