# Momentum strategies

Haven’t posted anything in ages, so here’s a quick hack I threw together in Python on a Sunday night. Basically I wanted to know whether momentum strategies work well for international stock indexes. I spent a bit of time putting together a strategy that buys the stock index if the return during the previous n days was positive, otherwise doesn’t do anything. I ran this strategy for a basket of approximately 20 stock markets.
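
For the curious, here’s a minimal sketch of the rule (not the exact script I used; the price-loading part with ystockquote is omitted and the function name is made up). It holds the index on days where the trailing n-day return is positive, otherwise sits in cash:

```python
# Hedged sketch of the momentum rule described above.
# Assumes `prices` is a 1-D numpy array of daily closing prices for one index.
import numpy as np

def momentum_returns(prices, n=20):
    daily = np.diff(prices) / prices[:-1]             # daily returns
    trailing = prices[n:-1] / prices[:-(n + 1)] - 1   # return over the previous n days
    signal = trailing > 0                             # long only when momentum is positive
    return daily[n:] * signal                         # zero return on days we sit out
```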

Anyway, disregarding transaction costs, yada yada, historical returns are not a guarantee for future returns, blah blah, here are the results. It doesn’t matter which window size we use, we still make a little bit more money from a momentum strategy:

Returns for each strategy.

Minor note: the returns above ignore volatility, so they actually understate the benefit. Looking at Sharpe ratios instead, the momentum strategies come in between 1.08 and 1.40, whereas the “Buy and Hold” strategy has a Sharpe ratio of 0.49.
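
For reference, the Sharpe ratio here is just annualized mean return over annualized volatility. A sketch, assuming daily returns and a risk-free rate of roughly zero:

```python
import numpy as np

def sharpe(returns, periods_per_year=252):
    # annualized mean over annualized standard deviation (risk-free rate assumed ~0)
    return returns.mean() / returns.std() * np.sqrt(periods_per_year)
```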

This whole exercise was mostly for fun, but it was a delight how easy it was. I just installed ystockquote and was able to put everything together within an hour. I’ve worked in finance, where this was almost harder to do. With such a low barrier, I almost consider it a duty for any hacker with a trading account to spend some time hacking on their finances. I will certainly spend more time analyzing my trades. Not necessarily engaging in anything crazy though, and I certainly discourage anyone from doing so.

# Ratio metrics

We run a ton of A/B tests at Spotify and we look at a ton of metrics. Defining metrics is a bit of an art form. Ideally you want to define success metrics before you run a test, to avoid cherry-picking metrics afterwards. You also want a metric with as high a signal-to-noise ratio as possible. And most importantly, your metric should correlate with high-level business impact as much as possible.

One pet peeve I have is metrics defined as ratios. While some of them are useful, there are usually severe caveats that you can spot by just thinking about what goes in the numerator and what goes into the denominator.

Example 1: Average session length

Bad metric. What happens if you add a couple of short sessions on top of your existing numbers without changing anything else? E.g. you could increase the number of sessions by 10% while total session time goes up by 5%. That’s a good thing, but average session length goes down, so the metric tells the opposite story.

Example 2: Number of clicks per user

What if you launch a feature that sucks and you churn out a bunch of low-intent users? You might end up with mostly high-intent users, who drive this metric up, going against what you mean by “success”.

Example 3: Repeat consumption (or bounce rate)

If you encourage content discovery, you might hope that people enjoy new content so much that they come back to it. But you might also increase superficial, one-off discovery even more, meaning this metric goes down even though discovery improved overall.

Example 4: Skip rate

Imagine Spotify Radio. Same problem as Example 2: churning out a bunch of low-intent users may actually improve the skip rate, even though that’s a bad thing. Conversely, building a better product might paradoxically increase the skip rate, because of an influx of low-intent users who dig the feature.

So what metric should you use?

In general, unless you have a really good reason, avoid metrics that are ratios of two quantities. Instead, I prefer metrics such as total time, number of daily active users, or total number of clicks. These metrics are pretty uninteresting in themselves (what does it tell you that the user base spent 10,000 years listening to playlists yesterday?) but they let you draw conclusions about differences. E.g. if the total number of clicks went up by 5%, that’s a good thing.

For A/B tests where you have uneven proportions between groups, you can simply extrapolate to the whole population by dividing by the group proportions. E.g. if 1% of the users are in test group A and 2% in group B, multiply the metrics by 100 and 50, respectively. Alternatively, just divide them by the total number of registered users in each bucket. That’s a static denominator, so it’s totally fine.
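
A tiny illustration of that extrapolation (the numbers here are made up):

```python
# Scale per-group totals to the whole population, or equivalently divide by a
# static denominator such as registered users per bucket.
clicks = {"A": 12_000, "B": 25_000}   # total clicks observed per group
share = {"A": 0.01, "B": 0.02}        # fraction of all users assigned to each group

extrapolated = {g: clicks[g] / share[g] for g in clicks}   # clicks at 100% rollout
```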

There are a million pitfalls with A/B testing and using metrics, and this is not an argument against any of it per se. Don’t throw out the baby with the bathwater; just stay cool and make sure you do the right thing.

# Benchmarking nearest neighbor libraries in Python

Radim Rehurek has put together an excellent summary of approximate nearest neighbor libraries in Python. This is exciting, because one of the libraries he’s covering, annoy, was built by me.

After introducing the problem, he goes through the list of contestants and narrows it down to five. Finally, the benchmark pits annoy against FLANN. Although FLANN seems to have roughly 4x better performance, somewhat surprisingly, Radim concludes annoy is the “winner”. Yay!

1000 nearest neighbors, performance vs accuracy, stolen from Radim’s blog

I’m not surprised that FLANN is a bit faster. Annoy was built with other goals in mind, primarily being able to keep the index in a file and mmap it quickly. I also think the current algorithm (random hyperplanes) works best up to a hundred dimensions or so. That being said, there are a lot of optimizations to make to Annoy some day, when the urge to write messy C++ comes back.
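
To illustrate the mmap point, here’s roughly how annoy is meant to be used (the vectors here are random, and the metric argument to the constructor may or may not be required depending on the annoy version):

```python
# Build an index once, save it to disk, then mmap it from any number of
# processes almost instantly.
import random
from annoy import AnnoyIndex

f = 40                                   # vector dimensionality
index = AnnoyIndex(f, 'angular')
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)                          # 10 random-hyperplane trees
index.save('vectors.ann')                # file-based index on disk

index2 = AnnoyIndex(f, 'angular')
index2.load('vectors.ann')               # mmaps the file, so this is fast
print(index2.get_nns_by_item(0, 10))     # 10 nearest neighbors of item 0
```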

Radim does give some fair criticism of the state of these libraries (including annoy). It can be a pain to install any of them, and in annoy’s case there seem to be problems on certain architectures where it basically returns bogus data. Given that nearest neighbor search is such a fundamental problem, it’s a little depressing how hard these libraries are to use. Hopefully this changes soon, with more and more competing libraries out there.

# More recommender algorithms

I wanted to share some more insight into the algorithms we use at Spotify. One matrix factorization algorithm we have used for a while assumes that we have user vectors $\mathbf{a}_u$ and item vectors $\mathbf{b}_i$. The next track $i$ for a user is now given by the relation

$P(i \mid u) = \exp(\mathbf{a}_u^T \mathbf{b}_i) / Z_u$

where $Z_u$ is a normalization constant to make sure the probabilities add up to one. This essentially means we treat the track choices as outputs of a softmax. You can think of it as a series of $n_u$ choices for each user, where $n_u$ is the number of tracks they played. Every track is a choice where the probability of each track is given by a softmax distribution. We don’t care about the order of these choices, but we assume each choice is independent of the others.

Assuming we have a ton of historical data where user $u$ listened $n_{ui}$ times to track $i$, we get the total log-likelihood (after simplifying a bit):

$\log L = \sum_{u, i} n_{ui}\left(\mathbf{a}_u^T \mathbf{b}_i - \log Z_u\right) = \left(\sum_{u, i} n_{ui}\mathbf{a}_u^T \mathbf{b}_i\right) - \left( \sum_u n_u \log Z_u\right)$

Note that $Z_u$ only has to be calculated once for every user. Although it’s a sum over all tracks, in practice we can estimate it by sampling random tracks and extrapolating.
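
A sketch of that estimate (the names here are mine, not production code): sample a subset of item vectors, average $\exp(\mathbf{a}_u^T \mathbf{b}_i)$ over the sample, and scale up by the catalog size.

```python
import numpy as np

def estimate_Z(a_u, B, n_samples=1000, seed=0):
    """Monte Carlo estimate of Z_u = sum_i exp(a_u^T b_i), where B holds all item vectors."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(B), size=min(n_samples, len(B)), replace=False)
    return np.exp(B[idx] @ a_u).mean() * len(B)   # sample mean * number of items
```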

It turns out you can implement this as a Hadoop job (this paper is quite helpful) doing alternating gradient ascent. Convergence isn’t super fast, but the results look qualitatively better than other methods like PLSA and the algorithm by Koren et al. that seems to be pretty popular.
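
For intuition, here is a single-machine toy version of the alternating gradient ascent, with exact softmax normalization over a tiny catalog (everything here is a simplification; the real thing estimates $Z_u$ by sampling and runs on Hadoop):

```python
# Toy softmax matrix factorization: counts[u, i] = n_ui play counts.
import numpy as np

def train_softmax_mf(counts, n_factors=40, n_iters=50, lr=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = counts.shape
    A = rng.normal(scale=0.1, size=(n_users, n_factors))   # user vectors a_u
    B = rng.normal(scale=0.1, size=(n_items, n_factors))   # item vectors b_i
    n_u = counts.sum(axis=1, keepdims=True)                 # plays per user

    def softmax_rows(scores):
        P = np.exp(scores - scores.max(axis=1, keepdims=True))
        return P / P.sum(axis=1, keepdims=True)              # P(i | u)

    for _ in range(n_iters):
        # fix items, take a gradient step on the user vectors
        P = softmax_rows(A @ B.T)
        A += lr * (counts - n_u * P) @ B        # d logL / d a_u = sum_i (n_ui - n_u P(i|u)) b_i
        # fix users, take a gradient step on the item vectors
        P = softmax_rows(A @ B.T)
        B += lr * (counts - n_u * P).T @ A      # d logL / d b_i = sum_u (n_ui - n_u P(i|u)) a_u
    return A, B
```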

Measured as a pure predictive model, the model described above doesn’t perform super well. In the graph above, we look at how well the algorithm ranks future items in a validation set, including items the user didn’t play. We use the Gini coefficient; the higher, the better. The model described above is “User MF beta=1.0, f=40” (I’ll follow up and explain beta in a later post). HKV refers to the model by Koren et al. (the authors of that paper were Hu, Koren, and Volinsky).

In the plot above, the x-axis represents the number of iterations and the y-axis gives the value of the Gini coefficient.

However, it turns out that if we look only at items the user played at least once, the algorithm kicks ass (especially if we try different values for beta, which I’ll get back to at some point). It also makes sense that this is a more useful metric to look at: if a user played a track zero times, it’s likely they didn’t know about the track in the first place.

We also have a lot of other internal metrics that show that this model outperforms pretty much anything else. It is substantially better at predicting radio skips and thumbs, it’s much better at ranking related artists, etc.

Here are some examples of related artists as ranked by the cosine between the vectors:

| Artist | Most related artists by this algo | Most related artists by Koren et al.’s algo |
| --- | --- | --- |
| Daft Punk | Justice, Gorillaz, The Chemical Brothers, Fatboy Slim, Florence + The Machine | Red Hot Chili Peppers, Kings of Leon, Muse, Lana Del Rey |
| Beach Boys | Simon & Garfunkel, The Rolling Stones, The Mamas & The Papas, The Monkees, Elton John | Simon & Garfunkel, Bob Dylan, David Bowie, Billy Joel, Fleetwood Mac |
| Eminem | Dr. Dre, 50 Cent, B.o.B, Ludacris, The Black Eyed Peas | Rihanna, David Guetta, Maroon 5, Katy Perry, Beyoncé |
| Beyoncé | Rihanna, Alicia Keys, Christina Aguilera, Usher, Bruno Mars | Rihanna, Katy Perry, David Guetta, Maroon 5, The Black Eyed Peas |
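
The ranking itself is straightforward once you have the item vectors. A minimal sketch, assuming a matrix of artist vectors and a name-to-row mapping (both names are mine, for illustration):

```python
import numpy as np

def most_related(artist, vectors, index, top_n=5):
    """Rank artists by cosine similarity between their latent vectors.

    vectors: (n_artists, n_factors) array; index: dict mapping artist name -> row.
    """
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[index[artist]]          # cosine against every artist
    best = np.argsort(-sims)                       # highest similarity first
    names = {row: name for name, row in index.items()}
    return [names[i] for i in best if i != index[artist]][:top_n]
```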

# Microsoft’s new marketing strategy: give up

I think it’s funny how MS at some point realized they’re not the cool kids and there’s no reason to appeal to that target audience. Their new marketing strategy finally admits what’s long been known: the correlation between “business casual” and using Microsoft products:

Apparently it’s also for people in ties:

And let’s add a (beige?) cardigan on top of that:

On top of that, let’s be sexist and add a woman to the campaign who doesn’t care about work but likes to take photos!

Yay forward thinking company!

# Bagging as a regularizer

One thing I encountered today was a trick that uses bagging to go beyond a point estimate and get an approximation of the full distribution. This can then be used to penalize predictions with larger uncertainty, which helps reduce false positives.

To me it sounds like a useful trick that I found roughly 0 hits on Google for, so I thought I’d share it. Of course, it might be that I’ve completely gotten something backwards (my ML skills have some notable gaps), so let me know if that’s the case.

Here’s a little toy model to illustrate the idea. Let’s assume we have observations $x_i$ which are $\mathcal{N}(0, 1)$. We also have $y_i = 0.2 + 0.3 / (1 + x_i^2)$ and labels $z_i$ sampled from the Bernoulli distribution given by $P(z_i=1) = y_i$ (i.e. just flipping a weighted coin where the odds are determined by $y_i$).
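
In code, generating the toy data looks something like this (a sketch, with 1,000 points as in the experiment below):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                  # x_i ~ N(0, 1)
y = 0.2 + 0.3 / (1 + x**2)                 # true probability of a "click"
z = (rng.random(1000) < y).astype(int)     # z_i ~ Bernoulli(y_i)
```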

Gradient Boosted Decision Trees are among the state of the art in regression and classification and can have amazing performance. They are available in scikit-learn and easy to plug into an existing script.

Let’s say we want to find the choices of $x$ that give large values of $y$. For instance, maybe we have click data and we want to find which items will get the most clicks. We fit a GBDT to the $(x_i, z_i)$ observations to train a model that predicts whether an item is going to get clicked, based on the value of $x$.

Unfortunately, we might end up getting wild predictions. In this particular case, a single noisy point in the training data around $x_i=-4$ makes the model believe that large negative values of $x$ are good for maximizing $y$.

Here’s the trick. Instead of training a single GBDT, we train 100 smaller GBDTs on strongly subsampled data (instead of 1,000 data points, we sample 100 with replacement). Now we can use the predicted values of all the GBDTs to get an idea of the uncertainty of $y$. This is awesome, because (among other things) we can penalize uncertain estimates. In this case, I just picked the value at the 20th percentile. I have to confess I’m still a little unsure whether the distribution of $y$ represents a probability distribution, but I think this is just an example of bootstrapping, and you could also use Bayes’ rule with a prior to derive a posterior distribution.
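
Here’s a sketch of the whole trick using scikit-learn (the hyperparameters are arbitrary, and it assumes the x, z arrays from the snippet above):

```python
# Bagging as a regularizer: 100 small GBDTs on bootstrap subsamples of 100
# points, then take the 20th percentile of their predictions as a pessimistic
# estimate of y.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

grid = np.linspace(-5, 5, 200).reshape(-1, 1)           # where we want predictions
preds = []
for seed in range(100):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x), size=100, replace=True)     # subsample with replacement
    clf = GradientBoostingClassifier(n_estimators=20, max_depth=2)
    clf.fit(x[idx].reshape(-1, 1), z[idx])
    preds.append(clf.predict_proba(grid)[:, 1])          # P(z=1 | x) for this bag

y_pessimistic = np.percentile(np.vstack(preds), 20, axis=0)   # 20th percentile estimate
```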

Why could this be useful? For instance, when we recommend music at Spotify, it’s much more important to err on the safe side and remove false positives at the cost of false negatives. If we can explicitly penalize uncertainty, then we can focus on recommendations that are safe bets and have more support in historical data.

# Model benchmarks

A lot of people have asked me what models we use for recommendations at Spotify, so I wanted to share some insights. Here are benchmarks for some models. Note that we don’t use all of them in production.

Performance for recommender models

This particular benchmark looks at how well we are able to rank “related artists”. More info about models:

- vector_exp: Our own method, a latent factor model trained on all log data using Hadoop (50B+ events).
- word2vec: Google’s open-sourced word2vec. We train a model on subsampled (5%) playlist data using skip-grams and 40 factors.
- rnn: Recurrent Neural Networks trained on session data (users playing tracks in a sequence), with 40 nodes in each layer, using Hierarchical Softmax for the output layer and dropout for regularization.
- koren: Collaborative Filtering for Implicit Feedback Datasets. Trained on the same data as vector_exp, running in Hadoop, 40 factors.
- lda: Latent Dirichlet Allocation using 400 topics, same dataset as above, also running in Hadoop.
- freebase: A latent factor model trained on artist entities in the Freebase dump.
- plsa: Probabilistic Latent Semantic Analysis, using 40 factors and the same dataset/framework as above. More factors give significantly better results, but still nothing that can compete with the other models.

Again, not all of these models are in production, and conversely, we have other algorithms in production that aren’t included above. This is just a selection of things we’ve experimented with. In particular, I think it’s interesting to note that neither PLSA nor LDA performs very well. Taking sequence into account (rnn, word2vec) seems to add a lot of value, but our best model (vector_exp) is a pure bag-of-words model.

# statself.com

Btw, I just put something up online that I spent a couple of evenings on my couch putting together: it’s a website where you can track any numerical data on the web. Want to know how many Twitter followers you have? The temperature in NYC? Go to statself.com and start tracking it.

Actually statself.com was just a domain name I had lying around for something else, but it turned out to fit pretty well.

I’m not a web developer, but sometimes it’s nice to see how easy building complex web apps has become. I put together something in a few afternoons using Tornado, Bootstrap, etc., something that would have taken weeks a few years ago.

# Implicit data and collaborative filtering

A lot of people these days know about collaborative filtering. It’s that Netflix Prize thing, right? People rate things 1-5 stars and then you have to predict missing ratings.

While there’s no doubt that the Netflix Prize was successful, I think it created an illusion that all recommender systems care about explicit 1-5 ratings and RMSE as the objective. Some people even distrust me when I talk about the approach we take at Spotify.

Misconception 1: Recommender systems are about predicting missing ratings. This is not true. In our case at Spotify, we have a huge matrix with users and items and each element containing the number of times user u played track i. Note that all of the matrix entries are known. Zero is a zero, and it means that user u actually played track i exactly 0 times.

Actually, even Netflix themselves have stated that there’s much more information in the implicit data than the explicit. Using implicit data has received a lot less attention, probably because the Netflix Prize was so successful.

Misconception 2: Recommender systems use squared loss. This is one of my biggest pet peeves. Think about it: what does squared loss mean, from a Bayesian perspective? It means you assume that the errors all come from a normal distribution. This is a reasonable approximation for 1-5 star ratings (although questionable even there), but it’s definitely a horrible way to fit play-count data (a Poisson model would be a more reasonable approximation). Some people’s reaction to this is to transform the data to a more reasonable scale before taking the squared difference, but then your model gets even harder to interpret.
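
To make the point concrete, minimizing squared loss is the same as maximizing a Gaussian likelihood, whereas count data like plays is more naturally modeled with a Poisson likelihood (a sketch of the correspondence, up to constants, with $\hat{r}_{ui}$ and $\lambda_{ui}$ denoting the model’s predictions):

$-\log p_{\mathcal{N}}(r_{ui} \mid \hat{r}_{ui}, \sigma^2) = \frac{(r_{ui} - \hat{r}_{ui})^2}{2\sigma^2} + \text{const}, \qquad -\log p_{\text{Pois}}(n_{ui} \mid \lambda_{ui}) = \lambda_{ui} - n_{ui} \log \lambda_{ui} + \text{const}$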

If there’s one lesson here, it’s that every loss function is an assumption about how the data is generated. That’s why I prefer generative models in the first place, such as PLSA or LDA. These methods were originally developed for text modeling, but the “bag of words” approach turns out to work great for implicit collaborative filtering. Note that there are some algorithms that use squared loss even for implicit collaborative filtering, but I’m not sure what they really assume about the data.

Misconception 3: Recommender systems are predictive models. This is a subtle one. You can look at the Netflix Prize as a challenge to predict unknown values, and in the same way you can look at implicit collaborative filtering as essentially a predictive model where you are trying to predict what the user is going to do in the future. But just because you can predict that user u is going to play track i, does that mean it’s a good recommendation? After all, there might be some super obscure track j that user u would love if they had actually found it. Just recommending the most likely track i introduces a strong popularity bias.

This relates to the previous point. Even if we find a loss function that corresponds to the generative model, it doesn’t mean we have a way of optimizing recommendation quality by minimizing that loss. So what should we do? Luckily, it turns out there are some tricks you can do, like normalizing for popularity, that work reasonably well in practice.

Misconception 4: Recommender systems are all about recommending items to users. I would actually argue that detecting item similarity using collaborative filtering is more important. This is another thing where I’ve hardly seen any research, and I don’t really have a lot of good ideas, although for some reason item-item cosine works really well in latent factor models.

# Vote for our SXSW panel!

If you have a few minutes, you should check out my and Chris Johnson‘s panel proposal. Go here and vote: http://panelpicker.sxsw.com/vote/24504

Algorithmic Music Discovery at Spotify

Spotify crunches hundreds of billions of streams to analyze users’ music taste and provide music recommendations for its users. We will discuss how the algorithms work, how they fit within the products, what the problems are, and where we think music discovery is going. The talk will be quite technical, with a focus on the concepts and methods, mainly how we use large-scale machine learning, but we will also cover some aspects of music discovery from a user perspective that greatly influenced the design decisions.