Running Theano on EC2

Inspired by Sander Dieleman’s internship at Spotify, I’ve been playing around with deep learning using Theano. Theano is this Python package that lets you define symbolic expressions (cool), does automatic differentiation (really cool), and compiles them down to native code that runs on a CPU/GPU (super cool). It’s built by Yoshua Bengio’s deep learning team up in Montreal.

This isn’t going to be a long blog post – I just wanted to share two pro tips:

1. I was messing around for hours trying to get Theano running on the GPU instances in EC2. Turns out Andreas Jansson, a coworker at Spotify, has already built a ready-to-use AMI. When you start an EC2 instance, search for the gpu_theano AMI. (AMIs are Amazon’s virtual machine images that you boot your system from.) The gpu_theano AMI runs Ubuntu 14.04 and comes with a bunch of stuff pre-installed. Andreas also has a tool to spin it up from the command line, but I couldn’t get it working (somehow the instances it created weren’t accessible over SSH), so I ended up just booting machines from the AWS Management Console.
2. The list price for the g2.2xlarge instances (the ones with GPUs) is $0.65/h. If you end up running something for a week, that’s just above $100. The spot instance price, however, is (currently) only $0.0641/h – less than 10% of list. The downside with spot instances is that you’re using excess capacity of EC2, so there’s a small likelihood your machine will be taken down at any point. But so far supply generally seems to outstrip demand. The price looks fairly stable, and you can always checkpoint data to S3 to persist it.
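Quick sanity check on those numbers (prices as quoted at the time, which will of course drift):

```python
on_demand = 0.65            # $/hour, g2.2xlarge list price
spot = 0.0641               # $/hour, spot price quoted above
hours_per_week = 24 * 7

print(round(on_demand * hours_per_week, 2))   # 109.2 -> "just above $100" for a week
print(round(spot / on_demand, 3))             # 0.099 -> spot is less than 10% of list
```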
My deep learning model is about 50x faster on a g2.2xlarge (which has 1,536 GPU cores) compared to a c3.4xlarge (which has 16 CPU cores) so the speedup is pretty substantial.

In defense of false positives (why you can’t fail with A/B tests)

Many years ago, I used to think that A/B tests were foolproof and all you need to do is compare the metrics for the two groups. The group with the highest conversion rate wins, right?

Then, for a long period, I ran a lot of tests. I started using confidence intervals, and learned about all the pitfalls of A/B testing. What to think about when running many A/B tests, why you shouldn’t check your metrics every day, why you shouldn’t optimize for a local maximum, and so on. I started becoming paranoid.

There’s about a million blog posts out there saying how everyone’s doing A/B testing wrong and how you should do it instead. It’s like there’s some secret society of people and the only way to join this club is to sacrifice a limb. You clearly have no idea what you’re doing… why are you even thinking about A/B testing? Go back and hack on your cousin’s webshop, or else pick up this 1,500 page book about Bayesian statistics and come back in two years.

The other side of this is about half a gazillion people who argue that A/B testing is inherently flawed. What I used to say was: Don’t throw out the baby with the bathwater. As long as you’re aware of the pitfalls, it’s a great tool. I thought for many years that running A/B tests without a thorough understanding of all their shortcomings was dangerous.

But then I changed. Here’s the thing: Even if you’re doing A/B testing completely wrong, you are probably benefitting from it. Even if you don’t care about confidence intervals, multiple comparison corrections, or if you are basically too impatient to wait, you probably are still doing the right thing. The reason is that user metrics optimization is not a drug trial.

What do I mean by this? There are just a few things that govern the success of an A/B test:

1. The impact of a true positive. Assuming you end up deploying the right thing, what’s the business impact?
2. The cost of a false positive. Assuming you end up deploying the wrong thing, what’s the business impact?
3. The prior probability of success. Before you start running the test, what’s the probability of success? In the long run, what’s the success rate of testing?

For a drug trial, the impact of a true positive is huge. You found a cure for baldness! But the cost of a false positive is even bigger: it turns out your drug doesn’t work, and it’s also causing hallucinations. Finally, if you’re a drug company, you probably evaluated 100 different drugs before finding one that seems to work, meaning the success rate of any specific drug is minimal.

This is why drug trials are subject to such intense scrutiny by government agencies. It’s also why most published research findings are false.

But you’re not a drug company, nor are you trying to find the Higgs boson. You’re basically evaluating whether a bigger “sign up” button leads to more conversions. In fact, most of your tests are driven by strong hypotheses with a large prior belief. You have a clever idea of how to impact users and historically few A/B tests show negative results.

The cost of deploying the wrong thing (false positives) is also low. You might end up with the wrong color button or some extra code that adds small tech debt. But not more than that. After all, a feature can’t be horrible if metrics aren’t tanking.

The other thing people argue a lot about is which success metric matters. In my experience, it almost never matters. I’ve very rarely seen statistically significant impacts going in two directions (one metric going up, the other going down) as long as you pick metrics in a sensible way (e.g. avoid ratio metrics). But what I have seen is insignificant tests. Lots of them. So if you have to pick a metric, the most important thing is to pick the one with the largest signal-to-noise ratio. Just don’t cherry-pick metrics after the test is run.

Conclusion: Don’t listen to all the haters. Do more A/B testing.

Recurrent Neural Networks for Collaborative Filtering

I’ve been spending quite some time lately playing around with RNN’s for collaborative filtering. RNN’s are models that predict a sequence of something. The beauty is that this something can be anything really – as long as you can design an output gate with a proper loss function, you can model essentially anything.

In the case of collaborative filtering, we will predict the next item given the previous items. More specifically, we will predict the next artist, album, or track, given the history of streams. Without loss of generality, let’s assume we want to predict tracks only.

Note that we’re not trying to predict ratings or any explicit information – just what track the user chose to play.

The data

We use playlist data or session data, because it has an inherent sequence to it. Removing consecutive duplicates improves performance a lot, since otherwise the network just learns to predict the same item as it just predicted.

In our case, we use a few billion playlists/sessions, and in total about ten to a hundred billion “words”.

The model

Recurrent neural networks are a simple model that tries to predict the next item given all previous ones. After predicting an item, the network gets to “know” what item it actually was, and incorporates this into its next hidden state.

More formally, let’s assume we have time steps $0 \ldots t-1$. The model has a “hidden” internal state $h_0 \ldots h_{t-1}$. These are generally vectors of some dimension $k$. At every time step, two things happen:

• Predict the output given the hidden state. We need to model a $P(y_i | h_i)$ for this.
• Observe the output $y_i$ and feed it back into the next hidden state $h_{i+1}$. In the most general form, $h_{i+1} = f(a(h_i) + b(y_i))$. In practice, $f$ is generally some nonlinear function like sigmoid or tanh, whereas $a$ and $b$ are usually simple linear transforms. It depends a bit on the structure of the output.

Now, all we need to do is write down the total likelihood and optimize for it!

Wait a minute?

Sorry about the extremely superficial introduction without much detail. Here’s a more specific example:

Let’s say we want to predict a sequence of daily stock returns. In that case, $y_i$ is a vector of stock returns – maybe containing three values with the daily return for Apple, Google, and Microsoft. To get from the hidden state $h_i$ to $y_i$ let’s just use a simple matrix multiplication: $y_i = W h_i$

We can assume $P(y_i | h_i)$ is a normal distribution, because then the log-likelihood is just the (negative) L2 loss: $-(y_i - Wh_i)^2$

We can specify that $h_{i+1} = \tanh(Uy_i + Vh_i)$ and that $h_0 = 0$ (remember it’s still a vector). If we want to be more fancy we could add bias terms and stuff, but let’s ignore that for the purpose of this example. Our model is now completely specified, and we have $k^2 + 6k$ unknown parameters: $U$ ($k \times 3$), $V$ ($k \times k$), and $W$ ($3 \times k$).
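To make this concrete, here is a minimal numpy sketch of the forward pass, with made-up dimensions ($k=4$, three stocks, six time steps) and random data:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, t = 4, 3, 6                 # hidden dimension, number of stocks, time steps
U = rng.normal(0, 0.1, (k, d))    # output-to-hidden
V = rng.normal(0, 0.1, (k, k))    # hidden-to-hidden
W = rng.normal(0, 0.1, (d, k))    # hidden-to-output
ys = rng.normal(0, 0.01, (t, d))  # observed daily returns

h = np.zeros((t, k))              # h_0 = 0
for i in range(t - 1):
    h[i + 1] = np.tanh(U @ ys[i] + V @ h[i])

preds = h @ W.T                   # predicted y_i = W h_i, one row per time step
```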

What are we optimizing?

We want to find $U$, $V$, and $W$ that maximize the log-likelihood over all examples:

$\log L = \sum\limits_{\mbox{all examples}} \left( \sum\limits_{i=0}^{t-1} -(y_i - Wh_i)^2\right)$

Backprop

The way to maximize the log-likelihood is through back-propagation. This is a well-known method, and there are so many resources online that I’ll be a bit superficial about the details.

Anyway, we need to do two passes through each sequence. First, forward propagation: for $i=0\ldots t-1$, calculate all hidden states $h_i$. Nothing magic going on here, we’re just applying our update rule.

Now it’s time for backprop. This is essentially just the chain rule taken to the extreme. Remember that the total log-likelihood is the sum of all the individual log-probabilities of observing each output:

$L = \sum\limits_{i=0}^{t-1} \log P(y_i | h_i)$

We define the derivatives  $\delta_i$ as the partial derivatives of the log-likelihood with respect to the hidden state:

$\delta_i = \frac{\partial \log L}{\partial h_i}$

Since each hidden state only influences the current and later outputs, $\delta_i$ only picks up contributions from steps $j = i\ldots t-1$:

$\delta_i = \sum\limits_{j=i}^{t-1}\frac{\partial \log P(y_j | h_j)}{\partial h_i}$

We can now rewrite $\delta_i$ as a function of $\delta_{i+1}$ and use some chain rule magic:

• $\frac{\partial \log L}{\partial h_i} = \delta_i = \frac{\partial}{\partial h_i} \log P(y_i | h_i) + \frac{\partial h_{i+1}}{\partial h_{i}} \delta_{i+1}$

We can evaluate this backwards from $t-1 \ldots 0$. Both $P(y_i | h_i)$ and $\frac{\partial h_{i+1}}{\partial h_{i}}$ are specified by our model, so we just need to plug in the expressions.

For the stock price example, we have the unknown parameters $U, V, W$ where we can derive the gradients like this:

• $\frac{\partial \log L}{\partial U} = \sum\limits_i \frac{\partial \log L}{\partial h_{i+1}}\frac{\partial h_{i+1}}{\partial U} = \sum\limits_i \delta_{i+1} \frac{\partial}{\partial U} \tanh(Uy_i + Vh_i) = \sum\limits_i \delta_{i+1} \tanh'(Uy_i + Vh_i)\, y_i^T$

This looks intimidating, but it’s really just a lot of chain rule applications and fairly straightforward math. You get similar gradients for $V$ and $W$. Now that we have the gradients, we can optimize using stochastic gradient descent over lots of examples.
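One way to convince yourself the recursion is right is to check the analytic gradient against a finite difference. Here is a self-contained numpy sketch (toy dimensions, random data; not the actual implementation) that runs the backward pass and compares one entry of $\partial \log L / \partial U$ to a numerical estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, t = 4, 3, 6
U = rng.normal(0, 0.1, (k, d))
V = rng.normal(0, 0.1, (k, k))
W = rng.normal(0, 0.1, (d, k))
ys = rng.normal(0, 1.0, (t, d))

def loglik(U):
    """Forward pass: hidden states plus log-likelihood (negative L2 loss)."""
    h = np.zeros((t, k))
    for i in range(t - 1):
        h[i + 1] = np.tanh(U @ ys[i] + V @ h[i])
    return h, -sum((ys[i] - W @ h[i]) @ (ys[i] - W @ h[i]) for i in range(t))

h, _ = loglik(U)

# backward pass: delta_i = d logL / d h_i, evaluated from i = t-1 down to 0
delta = np.zeros((t, k))
dU = np.zeros_like(U)
for i in reversed(range(t)):
    delta[i] = 2 * W.T @ (ys[i] - W @ h[i])        # local term d log P(y_i|h_i)/dh_i
    if i + 1 < t:
        back = (1 - h[i + 1] ** 2) * delta[i + 1]  # tanh'(x) = 1 - tanh(x)^2
        delta[i] += V.T @ back                     # dh_{i+1}/dh_i contribution
        dU += np.outer(back, ys[i])                # accumulate d logL / dU

# central finite difference on U[0, 0]
eps = 1e-6
Up, Um = U.copy(), U.copy()
Up[0, 0] += eps
Um[0, 0] -= eps
numeric = (loglik(Up)[1] - loglik(Um)[1]) / (2 * eps)
print(abs(numeric - dU[0, 0]))   # should be tiny
```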

How does this relate to Hidden Markov Models?

The nice thing about RNN’s is that the relation between $h_{i+1}$ and $h_i$ is exact rather than being some probabilistic relationship. This means that the $h$‘s are not parameters in themselves, so we don’t have to solve for them at all. This is usually the slow part of HMM’s, since inferring the hidden values takes some slow iterative procedure like the forward-backward algorithm. For RNN’s, we only need two passes through each sequence, rather than iterating lots of times until the hidden states converge.

The other thing is that RNN’s need some kind of nonlinearity, or else the magnitude of the hidden states will explode. This nonlinearity is usually taken to be sigmoid or tanh. I guess in theory HMM’s could also use nonlinearities, but I’ve never heard of this.

Predicting other things

Let’s focus on the collaborative filtering example. Given a sequence of watched movies, or tracks that the user has listened to, predict what the next one is going to be. Now $y_i$ is not a scalar, but one out of many items. We need some kind of distribution $P(y_i | h_i)$. The one I’ve seen being used is the softmax distribution over all possible outputs. This means we have to learn a vector $a_j$ for each item. The probability $P(y_i = j | h_i)$ is now proportional to $\exp(h_i^Ta_j)$:

$P(y_i = j | h_i) = \frac{\exp(h_i^Ta_j)}{\sum_k \exp(h_i^Ta_k)}$

Notice that the summation part in the denominator is over all items – something that is pretty slow to compute. I’ll get back to that.
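For concreteness, here is what the naive softmax looks like (a sketch, with `A` holding one row $a_j$ per item):

```python
import numpy as np

def softmax_prob(h, A, j):
    """P(y = item j | h), where A has one row a_j per item.
    The denominator sums over every item -- O(m) work per prediction."""
    logits = A @ h
    logits -= logits.max()        # subtract max for numerical stability
    e = np.exp(logits)
    return e[j] / e.sum()
```

With five million items, every single prediction (and every gradient step) pays for that full sum in the denominator.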

We also need something linking the output back to the next hidden state. In fact, we will learn another set of vectors $b_j$ – one for each item.

Since we are learning a bunch of vectors, we don’t need the matrices $U$ and $V$. Our model now becomes:

$h_{i+1} = \tanh(W h_i + b_i)$

with some slight abuse of notation since the $b$‘s are shared for each item $y_i$. Again, we want to maximize the total log-likelihood

$\log L = \sum\limits_{\mbox{all examples}} \sum\limits_{i=0}^{t-1} \log P(y_i | h_i)$

We now end up with a ton of parameters because we have the unknowns $a_j$‘s and $b_j$‘s for each item.

Let’s just pause here and reflect a bit: so far, this is essentially a model that works for any sequence of items. There’s been some research on how to use this for natural language processing. In particular, check out Tomas Mikolov’s work on RNN’s (this is the same guy that invented word2vec, so it’s pretty cool).

The gnarly details

If you have 10 different items, you can evaluate $P(y_i | h_i)$ easily, but not if you have five million items. But do not despair! There’s a lot of ways you can attack this:

• Take the right output and sample some random items from the entire “vocabulary”. Train the model to classify which one is the right output. This is sort of like a police lineup: one is the right suspect, the remaining people are just a random sample.
• Hierarchical softmax: Put all items in a binary tree, and break it up into roughly $\log m$ binary classification problems (where m is the size of the vocabulary).

Instead of messing around with Huffman trees and the other things recommended in the literature, I ended up implementing a much simpler version of hierarchical softmax. Internally, all items are described as integers, so I build the tree implicitly. The root node is a binary classifier for the last bit of the item. The next level classifies the second-to-last bit, and so on.

The idea is that you can calculate $P(y_i | h_i)$ as the product of all the probabilities for each individual node on the path from the root to the leaf.

Instead of learning an $a_j$ for every item, we just have to learn one vector for every internal node in the tree – $2^{\left \lceil{\log_2 m}\right \rceil} - 1$ vectors in total. It doesn’t really matter that much how we build the tree – we don’t need to enforce that similar items are close to each other in the tree (although that would probably improve performance a bit).
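Here is a sketch of the implicit-tree idea (my reconstruction in Python, not the actual C++ code): walk from the root following the item’s bits, multiplying in one binary classifier per level. The leaf probabilities automatically sum to one over all items.

```python
import numpy as np

def item_logprob(item, h, node_vecs, bits):
    """log P(item | h) via an implicit binary tree over the item's integer id.

    node_vecs: one vector per internal node, laid out as a heap
    (root = 0, children of n are 2n+1 and 2n+2). The root classifies
    the lowest bit of the id, the next level the next bit, and so on.
    """
    logp, node = 0.0, 0
    for level in range(bits):
        bit = (item >> level) & 1
        p = 1.0 / (1.0 + np.exp(-node_vecs[node] @ h))   # P(bit = 1 | h)
        logp += np.log(p if bit else 1.0 - p)
        node = 2 * node + 1 + bit
    return logp
```

Evaluating one item now costs about $\log_2 m$ dot products instead of $m$.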

But in general, this problem pops up in a lot of places, and here are some other crazy ideas I’ve thought about:

• If generating random “negative” samples, one way to better mine random examples would be to sample some items from the $y_i$‘s themselves. Since those values are probably pretty similar, that would force the model to discriminate between more “hard” cases.
• Assume the $a_j$‘s have some kind of distribution like a Normal distribution, and calculate an expectation. This is just a bunch of integrals. We’ve had some success using this method in other cases.
• Don’t use softmax, but instead use L2 loss on $m$ binary classification problems. All entries would be $0$ except the right one, which is $1$. You can put more weight on the right one to address the class imbalance. The cool thing is that with L2 loss everything becomes linear, and you can compute the sum over all items in constant time. This is essentially how Yehuda Koren’s 2008 paper on implicit collaborative filtering works.

Implementing it

I ended up building everything in C++ because it’s fast and I’m pretty comfortable with it. It processes about 10k words per second on a single thread. A multi-threaded version I’ve built can handle 10x that, so we can get through 10B “words” in about a day.
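The throughput arithmetic checks out:

```python
words_per_sec = 10_000 * 10          # multi-threaded: 10x the single-thread rate
per_day = words_per_sec * 24 * 3600
print(per_day)                       # 8_640_000_000 -> roughly 10B "words" per day
```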

Hyperparameters

A beautiful thing about this model is that there are basically no hyperparameters. A larger number of factors is better – we typically use 40-200. With dropout (see below), overfitting is not a concern. It takes a little trial and error to get the step sizes right, though.

Initialization

As with most latent factor models, you need to initialize your parameters to random noise. Typically, small Gaussian noise like $\mathcal{N}(0, 0.1^2)$ works well.

Nonlinearity

I tried both sigmoid and tanh. Tanh makes more sense to me, because it’s symmetric around 0, so you don’t have to think too much about the bias term. Looking at some offline benchmarks, it seemed like tanh was slightly better than sigmoid.

Dropout

I also added dropout on the hidden values, since it seemed to improve the model’s predictions. After each $h_i$ is calculated, I set half of its entries to zero. This also seems to help with exploding gradients. What happens is that the $W$ matrix essentially learns how to recombine the features.

Gradient clipping wasn’t actually needed for sigmoid, but for tanh I had to add it. Basically, during backprop I cap the magnitude of the gradient at 1. This also helps equalize the impact of different examples, since otherwise longer sequences produce bigger gradients.
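Both tricks amount to a couple of lines each; a sketch of what I mean, not the exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p=0.5):
    """Zero out a random fraction p of the hidden units."""
    return h * (rng.random(h.shape) >= p)

def clip_gradient(g, max_norm=1.0):
    """Cap the magnitude of a gradient vector at max_norm."""
    n = np.linalg.norm(g)
    return g if n <= max_norm else g * (max_norm / n)
```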

I use Adagrad on the item vectors, but a plain fixed learning rate on the shared parameters $W$ and the bias terms. Adagrad is fairly simple: in addition to each vector $a_j$ and $b_j$, you just store a single scalar with the sum of the squared magnitudes of the gradients. You then use that to normalize the gradient.

For a vector $x$ with gradient $d$, Adagrad can be written as:

$x^{(n+1)} = x^{(n)} + \eta \frac{d^{(n)}}{\sqrt{\sum\limits_{i=1\ldots n} \left( d^{(i)} \right)^2}}$

$\eta$ is a hyperparameter that should be set to about half the final magnitude these vectors will have. I’ve usually had success just setting $\eta = 1$.
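In code, the update above looks like this (a sketch; note the single scalar accumulator per vector, as described, rather than the per-coordinate accumulator of the original Adagrad paper). The step is an ascent step, since we are maximizing the log-likelihood:

```python
import numpy as np

def adagrad_step(x, grad, sq_sum, eta=1.0):
    """One ascent step on vector x. sq_sum is the running sum of
    squared gradient magnitudes for this vector (a single scalar)."""
    sq_sum += grad @ grad
    x = x + eta * grad / np.sqrt(sq_sum)
    return x, sq_sum
```

A nice property: the very first step always has magnitude $\eta$, regardless of the gradient scale.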

Results

Offline results show that the RNN is one of the best performing algorithms for collaborative filtering and A/B tests confirm this.

More resources

I can recommend A tutorial on training recurrent neural networks as another starting point to read more about recurrent neural networks.

(Edit: fixed some minor errors in the math and a wrong reference to Baum-Welch)

Where do locals go in NYC?

One obvious thing to anyone living in NYC is how tourists cluster in certain areas. I was curious about the larger patterns around this, so I spent some time looking at data. The thing I wanted to understand is: what areas are dominated by tourists? Or conversely, what areas are dominated by locals?

After some looking around, I found this Foursquare data dump and analyzed about 200,000 check-ins in NYC. Time to crunch some data…

First of all, I split up check-ins into those done by (a) people living within 10,000 feet of the check-in, and (b) people living further away. As the next step, I broke up the check-ins by the 2,166 census areas of New York and calculated the ratio of locals. I color-coded each census area from green (100% locals) to purple (0% locals). Here is the result, for the five boroughs:

Some obvious ones stand out, like the airports: JFK and LaGuardia are completely dominated by “non-locals”. Interestingly, Red Hook in Brooklyn is also dominated by non-locals (maybe because of IKEA?), as are Prospect Park and some other areas in downtown Brooklyn. In the Bronx, the area surrounding Yankee Stadium is also nearly 100% non-locals.
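The per-area bookkeeping behind these maps is straightforward; here is a plain-Python sketch of the idea (the actual analysis used geopandas and the census shapefiles):

```python
from collections import defaultdict

def local_ratio(checkins):
    """checkins: iterable of (census_area, is_local) pairs.
    Returns census_area -> fraction of check-ins made by locals."""
    total = defaultdict(int)
    local = defaultdict(int)
    for area, is_local in checkins:
        total[area] += 1
        local[area] += int(is_local)
    return {area: local[area] / total[area] for area in total}
```

Each area’s ratio then maps to a color between green (1.0) and purple (0.0).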

Maybe not surprisingly, there seems to be some truth to the joke that real New Yorkers do not go above 14th Street. Zooming into lower Manhattan, you can clearly see a sharp dividing line cutting across Manhattan:

Another thing that stands out is how North Williamsburg is completely dominated by non-locals, maybe because of the huge influx of people on weekends going out partying.

I then split up into (a) people who live in the five boroughs (b) tourists, and color-coded into blue (0% tourists) and red (100% tourists):

Airports are slightly biased towards tourists. Interestingly Red Hook now becomes a New York affair.

Zooming in on Lower Manhattan again:

Another thing that stands out is how North Williamsburg is dominated by New Yorkers, meaning that even though most people are not local, they are still from the city.

In contrast, most of Lower Manhattan (East Village, West Village, etc.) isn’t just dominated by locals – it also has a very low ratio of tourists to New Yorkers.

The areas most dominated by tourists are, maybe not surprisingly

• Central Park
• Times Square and parts of Midtown
• Financial district
• Liberty Island and Ellis Island

I did this in Python, mostly using geopandas, Matplotlib, and a bunch of open source data sets. It was a fun weekend project and it ended up taking way too much time. And since I live in Lower East Side myself, I’m probably pretty biased…

How to build up a data team (everything I ever learned about recruiting)

During my time at Spotify, I’ve reviewed thousands of resumes and interviewed hundreds of people. Lots of them were rejected, but lots of them also got offers. Finally, I’ve also had my share of offers rejected by candidates.

Recruiting is one of those things where the Dunning-Kruger effect is most pronounced: the more you do it, the more you realize how bad you are at it. Every time I look back a year, I realize 10 things I did wrong. Extrapolating, I know that in another year I’ll realize all the stupid mistakes I’m making right now. Anyway, that being said, here are some things I learned from recruiting.

Getting the word out

Depending on where you work, people might have no clue about your company. Why would they work for something they have never heard of? Or alternatively – something they know of, but don’t necessarily associate with cutting-edge tech? There’s a million companies out there doing cool stuff, so make sure people know your company stands out. Blog, talk at meetups, open source stuff, go to conferences. I honestly don’t know what works – I don’t have any numbers. But you need to hedge your bets by attacking all angles at the same time.

I think developers have a hard time justifying this just because success is not easily quantifiable – this is a branding exercise, and it’s super hard to find out if you’re doing the right thing. But over time if you do this right, you will get anecdotal feedback from candidates coming in saying they saw your presentation or read this cool story on Hacker News, or what not.

Finding the people

I don’t think there’s anything magic about this – just go through external recruiters, internal recruiters, job postings, connections, whatever.

Presenting the opportunity

I think most people in the industry are fed up with bad bulk messages over email/LinkedIn. Ideally, the hiring manager should introduce themselves, or for more senior roles, more senior people should reach out (all the way up to the CTO). If a recruiter is reaching out, it’s super important that they include a quick note on what’s interesting about the team and why it’s a good fit.

Finding the right candidates

Recruiting is some crazy type of active learning problem with an observation bias: you only see how well the people you hire are doing. In particular, there was a lot of discussion a while back when Google claimed there was no correlation between job performance and test scores or GPA. I think there totally are really strong correlations on a macro scale, but if you are already filtering out people based on those criteria, obviously you will reduce the strength of the correlation, or even reverse it. Not that I claim to have found any magic criteria. I do, however, think the two most successful traits I’ve observed are (at the risk of sounding cheesy):

1. Programming fluency (10,000 hour rule or whatever) – you need to be able to visualize large codebases, and understand how things fit together. I strongly believe that data engineers need to understand the full stack, from idea, to machine learning algorithm, to code running in production. I’ve seen other companies take a “throw it over the fence” attitude, with one team brainstorming algorithms and another team in another city implementing them. I think that’s flawed, because it breaks the tight learning cycle. In particular, I’m hesitant to hire candidates who are strong on the theoretical side but have little experience writing code. That’s why I really avoid the “data science” label – most people within this group are generally lacking on the core programming side. I don’t think this necessarily means candidates have to have a solid understanding of the CAP theorem and the Linux page cache. The most important thing is that they have written a lot of code, can work with nontrivial code bases, and can write clean, maintainable code. There is nothing magic to this – but a person who has only written Matlab scripts will probably have a harder time adjusting.
2. Understand the big picture – go from a vision to a set of tasks, to a bunch of code being written. People have to be able to go from an idea (“analyze this data set and build an algorithm that uses it as a signal for the recommender system”) to code, without having to hand hold them throughout every single step. People who need to be given small tasks rather than the underlying problem will never understand why we’re working on things, and will inevitably end up doing the wrong thing.
Attracting the right candidates
Even if you find awesome candidates, your interview process, or your lack of selling, might destroy your prospects of hiring the right people. Here are some things I’ve learned:
• Smart people are actually really impressed by a good interview process. If some smart ML genius walks the candidate through a super hard problem, the candidate will feel like they can learn from that interviewer. Conversely, giving interview problems that are not well structured, without any follow-up questions or discussion notes, will give a really bad impression. The best interview problems have a (b), (c) and (d) version that you can pull up in case the candidate nails (a) directly.
• Understand the level of selling you need to do. If you get a super senior person in, spend the bare minimum of time establishing that this person actually lives up to their reputation. Then, spend the rest of the time explaining the role and why you are so excited about it.
• Explaining the role and why it’s a good fit is everything. But obviously not superficial stuff. These are engineers, and you need to impress them with the stuff they will learn. Talk about what your company is working on, what challenges you have, and what technologies you are using.
• It’s often much easier to start by listening to the candidate. Let them ask questions and talk more than you. Try to understand what they are looking for. Then, explain how this position meets or does not meet those criteria.
• Everyone has an angle. Someone coming from finance is probably more excited to hear about your product vision. Someone from a small failed startup is probably more excited to hear about your benefits.
• Let the candidate meet the team they will be working with. So many companies fail at this. I remember going through a series of ten Google interviews many years ago, and then the recruiter wouldn’t tell me what team I would work on. I immediately turned down the offer. On the other side of the spectrum, I think the best thing you can do is to bring the candidate out to a team lunch to meet as many people as possible.
• Make them understand you take this role seriously. Set up a quick chat with the CTO or some senior people.
Hiring the right candidate
• If it’s a senior person, and everyone is super impressed, hire the person. If it’s a senior person, and people are on the fence, you shouldn’t hire the person.
• If it’s a junior person, it’s sometimes hard to know. One thing I’ve learned is that excitement really counts. If, for instance, the candidate kicks ass during the ML interview, and clearly has solid programming fluency, but doesn’t necessarily know much about scalability and version control, then I really think it boils down to how excited this person is.
• At the end of the day, if you realize this candidate is brilliant, but not necessarily the right fit for your role, find something else. It’s a sign of a great organization that you can always find a place for smart people.
Building the right team
• Candidates will be on a spectrum between “theoretical” (more ML stuff) or “applied” (more scalability, backend, data processing). But just for the purpose of the argument, let’s assume people are on each side of the spectrum. For a high performing machine learning team, I think a ratio of 1:2 is good. But everyone should be able to work on every part of the system, if needed.
• If you’re a small company, you will end up hiring lots of people who are similar to yourself. I don’t necessarily mean people of the same gender or origin, but people who will all go out to the same bar and get the same beer. This might be good for cohesion early on, but as soon as you grow beyond the early stage, you might end up with a destructive monoculture that is super hard to break out of.
• On the same note: if you’re a big company and your team still skews a certain way, then it means you have prejudice problems you need to work on. Seriously. For instance, with female software engineers at 20.2% of the workforce, you can derive that 95% of all 50-person teams should contain at least five women, and 70% of all teams should contain at least eight women. It’s just plain math.

For more reading on the subject, I just came across this interesting blog post by Cheng-Tao Chu: Why building a data science team is deceptively hard.

The power of ensembles

From my presentation at MLConf, one of the points I think is worth stressing again is how extremely well combining different algorithms works.

In this case, we train machine learning algorithms on different data sets (playlists, play counts, sessions) with different objectives (least squares, max likelihood). Then we combine all the models using gradient boosted decision trees, trained on a smaller but higher-quality data set. Finally, we validate on a third data set, in this case looking at recall against a ground truth data set of related artists.

Ensembles were common knowledge throughout the Netflix Prize, and the winner ended up having many hundreds of models in the ensemble. But measured in pure RMSE, the results weren’t mind-blowing – the best individual models weren’t far from the best ensemble.

I think in a real-world setting, chances are you have lots of models working on different data sets, optimizing for different things. The bias of each model will be much larger, so the gains from combining all the models are even larger.

Another difference in our case is we basically don’t know what to optimize for. There’s no single golden metric such as RMSE. Instead, we train things on different data sets. Eg individual models may be trained on playlists data, and the ensemble on thumbs. We then have a bunch of offline metrics as well as A/B test metrics we use to determine success.

Some people have suggested that these large ensembles are not practical and have tried to find a single good algorithm. I think this is a bad idea. Ensembles scale really well because you can parallelize all the work, and with a good framework it’s easy to add or remove models. Data changes all the time, and a good ensemble with a bunch of models makes recommendations robust and less sensitive to shifts in the underlying data.

Our ensemble looks quite different from (and much bigger than) the picture above. We are thinking a lot about adding diversified signals from various machine learning algorithms applied to different data sets. These signals can be anything from collaborative filtering to dummy things, content-based features, or editorial data. Pretty much anything that can be converted into a numerical signal can be used.

MLConf 2014

Just spent a day at MLConf, where I was talking about how we do music recommendations. There was a whole range of great speakers (actually almost 2/3 were women, which was pretty cool in itself).

Here are my slides:

Justin Basilico from Netflix talked about how they deliver a personalized start page using lots of ranking.

Claudia Perlich also had a great presentation about ad targeting. A few things were really interesting: (a) using transfer learning to learn from web site visits and apply it to ad targeting, (b) causality tests to figure out if a campaign actually made an impact, and (c) how they intentionally introduce bias in their classifiers by skewing subsampling. I think she also mentioned throwing in artificial negative data, which is something we do too.

Josh Wills had a hilarious slide about “ML as a tool” companies. His analogy was waiting two hours in a line for a roller coaster ride, and then someone jumps in front of you, rides the roller coaster, and gives you a video of the ride. His point was that ML is the fun 1%, and plumbing is the boring 99%. If you want to start a company, focus on the boring stuff no one else wants to deal with.

Music recommendations using cover images (part 1)

Scrolling through the Discover page on Spotify the other day, it occurred to me that the album cover is in fact a fairly strong visual proxy for what kind of content you can expect from it. I started wondering whether the album cover could be used for recommendations. For many obvious reasons this is kind of a ridiculous idea, but still interesting enough that I just had to explore it a bit. So I embarked on a journey to see how far I could get in a few hours.

The first thing I did was scrape all the album covers. We have a few million of them (I can’t give you the exact number, or I would have to kill you). A full set of 64×64 px images is roughly 10 GB, so not an insane amount.

1024 random cover images in 16×16 px

My first attempt at defining “similarity” was simply to resize all images to 16×16, convert them to grayscale, subtract the mean, and normalize by the variance. Each cover image is then essentially a 256-vector in Euclidean space. Load those vectors into Annoy and Bob’s your uncle! Nice!
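A rough sketch of that pipeline, with random arrays standing in for real cover images and brute-force Euclidean distance standing in for Annoy:

```python
import numpy as np

def cover_vector(pixels_16x16):
    """Flatten a 16x16 grayscale image into a normalized 256-vector."""
    v = pixels_16x16.astype(np.float64).flatten()  # 256 dimensions
    v -= v.mean()                                  # subtract the mean
    std = v.std()
    return v / std if std > 0 else v               # normalize

rng = np.random.default_rng(0)
covers = rng.integers(0, 256, size=(1000, 16, 16))  # fake 16x16 covers
vectors = np.array([cover_vector(c) for c in covers])

def most_similar(query_idx, k=5):
    """Brute-force k nearest neighbors in Euclidean space."""
    d = np.linalg.norm(vectors - vectors[query_idx], axis=1)
    return np.argsort(d)[1:k + 1]  # skip the query cover itself
```

With real data you’d feed the same vectors into an AnnoyIndex instead of computing all pairwise distances, but the geometry is identical.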

These recommendations turned out to be pretty horrible, to say the least.

Classification score on predicting thumbs, measured by Area Under Curve (AUC). Random algorithm = 50%, higher numbers are better.

In the chart above, “covers” is the image-based method. “rnn” is recurrent neural networks, “koren” is this paper and “vector_exp” is a model I’ve described before.
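For what it’s worth, the AUC metric in these charts has a simple interpretation: it’s the probability that a randomly chosen thumbs-up gets a higher score than a randomly chosen thumbs-down. A minimal numpy sketch of computing it via ranks (ignoring tied scores):

```python
import numpy as np

def auc(labels, scores):
    """AUC as a rank statistic: P(random positive scores above
    random negative). labels are 0/1, scores are model outputs."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = lowest score
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Mann-Whitney U statistic, normalized to [0, 1]
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A random scorer hovers around 0.5, which is why 50% is the baseline in the charts.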

This image similarity algorithm gave some fun results, like finding out that the most similar album to the left one was the right one:

In general, the only type of music I could reliably see this working on was minimal techno. Pretty much all minimal techno albums have a completely identical layout: a white (or light) circle on a black background:

Most similar covers for Kassem Mosse – Workshop 12: http://open.spotify.com/album/0NQvm5y6CLtjtbyJNZltFg

This wasn’t enough to resolve the question, so I kept searching for an answer. I stumbled across pyleargist by Olivier Grisel, and to my delight it was trivial to install. Essentially, pyleargist is a wrapper around a C library that takes any image and generates an image descriptor: a vector with 960 elements. Evaluating it using the same metric actually yields some fairly decent results:

Classification score on predicting thumbs, measured by Area Under Curve (AUC). Random algorithm = 50%, higher numbers are better.

The results are not that horrible (although measured with a very biased and quite unreliable metric). At least it’s definitely better than pure random.

Most similar album covers to Daft Punk’s – Random Access Memories

At this point I decided the next step on this journey of overengineering would be one (or possibly both) of the following:

1. Learning a mapping from image space to collaborative filtering space. That way we learn which features in the picture are relevant for the music. This is a similar idea to this paper.
2. Venture into the realm of deep learning. I went to the Twitters for some assistance and got a ton of responses back.
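As a toy illustration of idea #1: the simplest possible mapping from image-descriptor space to collaborative filtering space is a linear least-squares fit (everything below is synthetic – the dimensions and the linear model are stand-ins, and a real system would learn something far less naive):

```python
import numpy as np

rng = np.random.default_rng(1)
descriptors = rng.normal(size=(2000, 960))  # fake 960-d GIST descriptors
cf_vectors = rng.normal(size=(2000, 40))    # fake 40-d CF vectors

# Learn W minimizing ||descriptors @ W - cf_vectors||^2
W, residuals, rank, sv = np.linalg.lstsq(descriptors, cf_vectors, rcond=None)

# Every cover (even one with no listening data) now gets a CF-space vector.
projected = descriptors @ W
```

The appeal is cold start: a brand-new album has a cover but no play data, yet the learned map still places it somewhere in CF space where nearest-neighbor recommendations work.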

This is an ongoing project (and an ongoing source of frustrations) so I’ll defer the updates to a second part. Stay tuned!

Edit: I was linking to the wrong paper by Sander Dieleman!

Luigi success

So Luigi, our open-sourced workflow engine in Python, recently passed 1,000 stars on GitHub, then shortly after passed mrjob as (I think) the most popular Python package for doing Hadoop stuff. This is exciting!

A fun anecdote from last week: we accidentally deleted roughly 10 TB of data on HDFS – the output of thousands of jobs. This could have been a disaster, but luckily most of the data was intermediate, and everything we do is powered by Luigi, meaning it’s encoded as one huge dependency graph in Python. Some of it is Hadoop jobs, some of it inserts data into Cassandra, some of it trains machine learning models, and much more. The Hadoop jobs are a happy mixture of inline Python jobs and jobs using Scalding.

So anyway, Luigi happily picked up that a bunch of data was missing, traversed the dependency graph backwards, and scheduled everything it needed to re-run. A few hours (and a heavily loaded cluster) later, everything was recreated.
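To make the mechanism concrete, here is a toy pure-Python version of that scheduling behavior – not Luigi’s actual API, just the core idea that a task runs only if its output is missing, dependencies first:

```python
def tasks_to_run(target, requires, complete):
    """Return the tasks to run (dependencies first) so that `target`'s
    output exists.
    requires: dict mapping task -> list of upstream tasks
    complete: set of tasks whose output already exists on disk/HDFS"""
    order = []

    def visit(task):
        if task in complete or task in order:
            return  # output exists or already scheduled
        for dep in requires.get(task, []):
            visit(dep)  # walk the dependency graph backwards
        order.append(task)

    visit(target)
    return order

# Example: the cleaning output was deleted, but the raw data survived.
requires = {"model": ["clean"], "clean": ["raw"]}
complete = {"raw"}
```

Here `tasks_to_run("model", requires, complete)` schedules `clean` and then `model`, while `raw` is left alone because its output still exists – which is exactly why the accidental deletion was recoverable.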

Welcome Echo Nest!

In case you missed it, we just acquired a company called The Echo Nest in Boston. These people have been obsessed with understanding music for the past 8 years, ever since the company was founded by Brian Whitman and Tristan Jehan out of the MIT Media Lab.

We think this is a great fit in a lot of ways. In particular, they have focused on things that are very complementary to what we do. While we have spent a lot of time on things like collaborative filtering, A/B testing, learning from user feedback, and scalability, they have spent time on audio analysis, cultural understanding through web scraping, playlisting, and building a kick-ass API.

For some info about what Echo Nest has been up to, check out Paul Lamere’s Music Machinery blog as well as Brian Whitman’s blog.

We’re super excited to start incorporating Echo Nest’s technology into our own product, and we already have a couple of things in the pipeline that we might launch shortly. Stay tuned!