The power of ensembles

From my presentation at MLConf, one of the points I think is worth stressing again is how extremely well combining different algorithms works.

In this case, we’re training machine learning algorithms on different data sets (playlists, play counts, sessions) and different objectives (least squares, max likelihood). Then we combine all the models using gradient boosted decision trees training on a smaller but higher quality data set. Finally, we validate on a third data set, in this case looking at recall for a ground truth data set of related artists.

The ensemble was common knowledge throughout the Netflix prize, and the winner ended up having many hundreds of models in the ensemble. But measured in pure RMSE score the results weren’t mind blowing – the best models weren’t far from the best ensemble.

I think in a real world setting, chances are bigger that you have lots of models working on different data sets optimizing for different things. The bias of each model will be much larger, so the gains from combining all models are even larger.

Another difference in our case is we basically don’t know what to optimize for. There’s no single golden metric such as RMSE. Instead, we train things on different data sets. Eg individual models may be trained on playlists data, and the ensemble on thumbs. We then have a bunch of offline metrics as well as A/B test metrics we use to determine success.

Some people have suggested that these large ensembles are not practical and have tried to find a single good algorithm. I think this is a bad idea. Ensembles scale really well because you can parallelize all the work, and building a good framework it’s easy to add or remove new models. Data changes all the time and a good ensemble with a bunch of models makes recommendations robust and less sensitive to shift in the underlying data.

Our ensemble looks quite different (and much bigger) than the picture above. We are thinking a lot about adding diversified signals from various machine learning algorithms applied to different data sets. These signals can be anything from collaborative filtering to dummy things, content-based, or editorial data. Pretty much anything that can be converted into a numerical signal can be used.

MLConf 2014

Just spent a day at MLConf where I was talking about how we do music recommendations. There was a whole range of great speakers (actually almost 2/3 women which was pretty cool in itself).

Here are my slides:

Justin Basilico from Netflix talked about how they deliver a personalized start page using lots of ranking

Claudia Perlich also had a great presentation about ad targeting. A few things were really interesting: (a) using transfer learning to learn from web site visiting and apply it to add targeting (b) causality tests to figure out if a campaign actually made an impact (c) how they intentionally introduce bias in their classifiers by skewing subsampling. I think she also mentioned throwing in artificial negative data, which is something we do too.

Josh Wills had a hilarious slide about “ML as a tool” companies. His analogy was waiting two hours in a line for a roller coaster ride, and then someone jumps in front of you, rides the roller coaster, and gives you a video of the ride. His point was that ML is the 1% fun part, and 99% plumbing is the boring part. If you want to start a company, focus on the boring stuff no one else wants to deal with.

Music recommendations using cover images (part 1)

Scrolling through the Discover page on Spotify the other day it occurred to me that the album is in fact a fairly strong visual proxy for what kind of content you can expect from it. I started wondering if the album cover can in fact be used for recommendations. For many obvious reasons this is a kind of ridiculous idea, but still interesting enough that I just had to explore it a bit. So, I embarked on a journey to see how far I could get in a few hours.

First thing I did was to scrape all album covers. We have a few million of them (I don’t think I could give you the exact number, or I would have to kill you). A full set of 64x64px images is 10 GB roughly, so not an insane amount.

1024 random cover images in 16×16 px

My first attempt at defining “similarity” was simply to resize all images to 16×16, convert to grayscale, subtract the mean and normalize by the variance. Each cover image is then essentially a 256-vector in Euclidean space. Load those vectors into Annoy and Bob’s your uncle! Nice!

These recommendations turn out to be pretty horrible, at best.

Classification score on predicting thumbs, measured by Area Under Curve (AUC). Random algorithm = 50%, higher numbers are better.

In the chart above, “covers” is the image-based method. “rnn” is recurrent neural networks, “koren” is this paper and “vector_exp” is a model I’ve described before.

This image similarity algorithm gave some fun results, like finding out that the most similar album to the left one was the right one:

In general, the only type of music I could reliably see this working on was minimal techno. Pretty much all minimal techno albums have a completely identical layout: a white (or light) circle on a black background:

Most similar covers for Kassem Mosse – Workshop 12: http://open.spotify.com/album/0NQvm5y6CLtjtbyJNZltFg

This wasn’t enough to resolve the question, so I kept searching for the answer. I stumbled across pyleargist by Oliver Grisel and to my delight it was trivial to install. Essentially pyleargist is a wrapper around a C library that takes any image and generates an image descriptor, which is a vector with 960 elements. Evaluating it using the same metric actually yields some fairly decent results:

Classification score on predicting thumbs, measured by Area Under Curve (AUC). Random algorithm = 50%, higher numbers are better.

The results are already not that horrible (although with a very biased and quite unreliable metric). At least it’s definitely better than pure random.

Most similar album covers to Daft Punk’s – Random Access Memories

At this point I decided the next step on this journey of overengineering would be either, or possibly both out of

1. Learning a mapping from image space to collaborative filtering space. That way we learn which features in the picture are relevant for the music. This is a similar idea to this paper.
2. Venture into the realms of deep learning. I went to the Twitters for some assistance and got a ton of response back.

This is an ongoing project (and an ongoing source of frustrations) so I’ll defer the updates to a second part. Stay tuned!

Edit: Was linking to the wrong paper by Sander Dieleman’s paper!

Luigi success

So Luigi, our open sourced workflow engine in Python, just recently passed 1,000 stars on Github, then shortly after passed mrjob as (I think) the most popular Python package to do Hadoop stuff. This is exciting!

A fun anecdote from last week: we accidentally deleted roughly 10TB of data on HDFS, and the output of 1,000s of jobs. This could have been a disaster, but luckily most of the data was intermediate, and luckily everything we do is powered by Luigi meaning it’s encoded as a big huge dependency graph in Python. Some of it is Hadoop jobs, some of it inserts data in Cassandra, some of it trains machine learning models, and much more. The Hadoop jobs are a happy mixture between inline Python jobs and jobs using Scalding.

So anyway, Luigi happily picked up that a bunch of data was missing, traversed the dependency graph backwards, and scheduled everything it needed. A few hours (and a heavly loaded cluster) later, everything was recreated.

Welcome Echo Nest!

In case you missed it, we just acquired a company called Echo Nest in Boston. These people have been obsessed with understanding music for the past 8 years since it was founded by Brian Whitman and Tristan Jehan out of MIT Medialab.

We think this is such a great fit in a lot of ways. In particular, they have focused on very complementary things to what we have. While we have spent a lot of time on things like collaborative filtering, A/B testing, learning from user feedback, and scalability, they have spent time on audio analysis, cultural understanding through web scraping, playlisting, and building a kick ass API.

For some info about what Echo Nest has been up to, check out Paul Lamere’s Music Machinery blog as well as Brian Whitman’s blog.

We’re super excited on starting to incorporate Echo Nest’s technology into our own product and we already have a couple of things in the pipe we might launch shortly. Stay tuned!

Momentum strategies

Haven’t posted anything in ages, so here’s a quick hack I threw together in Python on a Sunday night. Basically I wanted to know whether momentum strategies work well for international stock indexes. I spent a bit of time putting together a strategy that buys the stock index if the return during the previous n days was positive, otherwise doesn’t do anything. I ran this strategy for a basket of approximately 20 stock markets.

Anyway, disregarding transaction costs, yada yada, historical returns are not a guarantee for future returns, blah blah, here are the results. It doesn’t matter which window size we use, we still make a little bit more money from a momentum strategy:

Returns for each strategy. Click picture to see source code

Minor note: the returns above ignore volatility, actually understating the impact. Looking at Sharpe ratios instead, the momentum strategies have ratios between 1.08-1.40, whereas the “Buy and Hold” strategy has a Sharpe ratio of 0.49.

This whole exercise was mostly for fun, but it was a delight how easy it was. I just installed ystockquote and was able to put everything together within an hour. I’ve worked in finance, where this was almost harder to do. With such a low barrier, I almost consider it a duty for any hacker with a trading account to spend some time hacking on their finances. I will certainly spend more time analyzing my trades. Not necessarily engaging in anything crazy though, and I certainly discourage anyone from doing so.

Ratio metrics

We run a ton of A/B tests at Spotify and we look at a ton of metrics. Defining metrics is a little bit of an art form. Ideally you want to define success metrics before you run a test to avoid cherry picking metrics. You also want to define a metric that has as high signal to noise ratio. And of course, most importantly, your metric should ideally correlate to high level business impact as much as possible.

One pet peeve I have is metrics defined as ratios. While some of them are useful, there are usually severe caveats that you can spot by just thinking about what goes in the numerator and what goes into the denominator.

Example 1: Average session length

Bad metric. What happens if you add a couple of short-term sessions on top of your existing numbers without changing anything else? Eg. you could improve the number of sessions by 10% but the total session time by 5%. This is a good thing, but your metric would tell another story.

Example 2: Number of clicks per user

What if you launch a feature that sucks and you churn out a bunch of low-intent users? You might end up with high-intent users who drive this up, going against what you mean by “success”.

Example 3: Repeat consumption (or bounce rate)

If you encourage content discovery, you might hope that people enjoy new content so much they come back to it. But you might also improve superficial discovery even more so, meaning this metric goes down.

Example 4: Skip rate

Imagine Spotify Radio. Same thing as 2: churning out a bunch of low-intent users may actually improve the skip rate, although this is a bad thing. Conversely, building a better product might paradoxically increase skip rate, because an influx of low-intent users who dig the feature.

So what metric should you use?

In general, unless you have a really good reason for it, avoid metrics that look at the ratio between two quantities. Instead, I prefer metrics such as total time, number of daily active users, total number of clicks. These metrics are pretty uninteresting in themselves (what does it tell you that the user base spent 10,000 years listening to playlists yesterday?) but they let you draw conclusions about the differences. Eg. if the total number of clicks went up by +5%, then that’s a good thing.

For A/B tests where you have uneven proportions between groups, you can simply extrapolate to the whole population by dividing by the ratios. Eg. if 1% of the users are in test group A, and 2% in group B, multiply the metrics by 100x and 50x, respectively. Alternatively, just divide them by the total number of registered users in each bucket. That’s a static denominator, so it’s totally cool to do so.

There’s of a million pitfalls with A/B testing and using metrics. This is not an argument against any of it per se. Don’t throw out the baby with the bathwater, just stay cool and make sure you do the right thing

Benchmarking nearest neighbor libraries in Python

Radim Rehurek has put together an excellent summary of approximate nearest neighbor libraries in Python. This is exciting, because one of the libraries he’s covering, annoy, was built by me.

After introducing the problem, he goes through the list of contestants and sticks with five remaining ones. Finally, the benchmarks pits annoy against FLANN. Although FLANN seems to have roughly 4x better performance, somewhat surprisingly, Radim concludes annoy is the “winner”. Yay!

1000 nearest neighbors, performance vs accuracy, stolen from Radim’s blog

I’m not surprised that FLANN is a bit faster. Annoy was mainly built having other goals in mind, primarily being able to use a file based memory range and mmap this quickly. I also think the current algorithm (random hyperplanes) works better with at most a hundred dimensions or so. That being said, at some point I think there’s a lot of optimizations to do to Annoy some day when the urge to write messy C++ comes back.

Radim does give some fair criticism of the state of these libraries (including annoy). It can be a pain to install any of them, and in annoy’s case there seems to be some problems with certain architectures where it basically returns bogus data. Given that this is such a fundamental problem, it’s a little depressing how hard it is to use. Hoping this can change soon with more and more competing libraries out there.

More recommender algorithms

I wanted to share some more insight into the algorithms we use at Spotify. One matrix factorization algorithm we have used for a while assumes that we have user vectors $\bf{a}_u$ and item vectors $\bf{b}_i$. The next track $i$ for a user is now given by the relation

$P(i | u) = \exp(\bf{a}_u^T \bf{b}_i) / Z_u$

Where $Z_u$ is a normalization constant to make sure the probabilities add up to one. This essentially means we treat the tracks choices as outputs of a softmax. You can think of it as a series of $n_u$ choices for each user, where $n_u$ is the number of tracks they played. Every track is a choice where the probability of each track is given by a softmax distribution. We don’t care about the order of these choices, but we assume each choice is independent of the other.

Assuming we have a ton of historical data where user $u$ listened $n_{ui}$ times to track $i$, we get a total log-likelihood (after simplifying a bit).

$\log L = \sum_{u, i} n_{ui}\left(\bf{a}_u^T \bf{b}_i - \log Z_u\right) = \left(\sum_{u, i} n_{ui}\bf{a}_u^T \bf{b}_i\right) - \left( \sum_u n_u \log Z_u\right)$

Note that $Z_u$ only has to be calculated once for every user. Although it’s a sum over all tracks, in practice we can estimate it by sampling random tracks and extrapolating.

Turns out you can implement this as a Hadoop job (this paper is quite helpful) doing alternate gradient ascent. Convergence isn’t super fast, but the results look qualitatively better than other methods like PLSA and the algorithm by Koren et al that seems to be pretty popular.

Measured as a pure predictive model, the model described above doesn’t perform super well. In the graph above, we look at how well the algorithm ranks future items in a validation set, including items the user didn’t play. We use the Gini coefficient, the higher, the better. The model described above is “User MF beta=1.0, f=40″ (I’ll follow up and explain beta in a later post). HKV refers to the model by Koren et al (because the authors of that paper were Hu, Koren, and Volinsky).

In the plot above, the x axis represents the number of iterations, whereas the y-axis gives the value of the Gini coefficient.

However, turns out if we look only at items that the user played at least once, the algorithm kicks ass (especially if try different values for beta, which I’ll get back to at some point). It also makes sense that this is a more useful metric to look at, since if a user plays a track zero times, it’s probably likely they didn’t actually know about the track in the first place.

We also have a lot of other internal metrics that show that this model outperforms pretty much anything else. It is substantially better at predicting radio skips and thumbs, it’s much better at ranking related artists, etc.

Here are some examples of related artists as ranked by the cosine between the vectors:

Artist Most related artists by this algo Most related artists by Koren et al’s algo
Daft Punk Justice
Gorillaz
The Chemical Brothers
Fatboy Slim
Florence + The Machine
Red Hot Chili Peppers
Kings of Leon
Muse
Lana Del Rey
Beach Boys Simon & Garfunkel
The Rolling Stones
The Mamas & The Papas
The Monkees
Elton John
Simon & Garfunkel
Bob Dylan
David Bowie
Billy Joel
Fleetwood Mac
Eminem Dr. Dre
50 Cent
B.o.B
Ludacris
The Black Eyed Peas
Rihanna
David Guetta
Maroon 5
Katy Perry
Beyoncé
Beyoncé Rihanna
Alicia Keys
Christina Aguilera
Usher
Bruno Mars
Rihanna
Katy Perry
David Guetta
Maroon 5
The Black Eyed Peas

Microsoft’s new marketing strategy: give up

I think it’s funny how MS at some point realized they are not the cool kids and there’s no reason to appeal to that target audience. Their new marketing strategy finally admits what’s been long known: the correlation between “business casual” and using Microsoft products:

Apparently it’s also for people in ties:

And let’s add a (beige?) cardigan on top of that:

On top of that, let’s be sexists and add a woman to the campaign who doesn’t care about work but likes to take photos!

Yay forward thinking company!