Optimizing things: everything is a proxy for a proxy for a proxy

Say you build a machine learning model, like a movie recommender system. You need to optimize for something. You have 1-5 stars as ratings so let’s optimize for mean squared error. Great.

Then let’s say you build a new model. It has even lower mean squared error. You deploy it, roll it out to users, and the metrics are tanking. Crap! Ok so maybe mean squared error isn’t the right thing to optimize for.

The way you solve this, of course, is you start A/B testing your changes. But what metric to choose? People often ask me why we use one or another metric. We typically look at numbers like Daily active users, Day 2 retention, etc. But what if optimizing too hard for one hurts the other? What if you’re driving day 2 retention but screwing up month 2 retention?

What I like to remind myself is that everything is a proxy metric. We really want to maximize shareholder value or something similar (let’s not get into a debate about this; for the purpose of this blog post I’m going to assume that’s our goal).

The problem is, you can’t take all your hyperparameters and calculate the gradient

\frac{\partial}{\partial \Theta} \mbox{shareholder value}(\Theta)


That’s silly for many reasons, but let’s break it down. First of all, the functional relationship is highly stochastic, and depends on all kinds of external factors. Second of all, there’s no way we can even evaluate multiple values of this function, and it won’t be possible until we invent time machines. Third of all, there’s no way we can extract gradient information at all.

So what are we going to do? We’re going to invent a new function that we think is highly correlated with shareholder value. We could even define multiple functions, just to cross-check different ones. But we will never be able to fully establish the correlation, for the reasons I just mentioned.

So that’s usually why metrics like daily active users are great. If you are trying to grow your company, it’s reasonable to assume that ultimately user growth will lead to success.

But in general, what properties should such a function have? Ideally as many of these as possible:

  1. Should be highly correlated with shareholder value. For a company focusing on growth, the number of active users is probably a good one.
  2. Should be possible to measure separate independent outcomes, using A/B tests, blind tests, or something else. For instance, the number of signups is tricky to A/B test, since you can’t split users into groups before they’ve signed up.
  3. Should be fast to measure. We don’t want to launch an A/B test and have to wait many months to get an answer.
  4. Should have a high signal-to-noise ratio, so you extract as much information as possible from it. If you’re running an A/B test you want to reach statistical significance quickly.
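To make point 4 concrete, here’s a minimal sketch of the standard two-proportion z-test (the conversion numbers are made up): the noisier the metric, the bigger the lift or the sample you need before the z-score clears the usual 1.96 significance bar.

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: how many standard errors apart are the
    conversion rates of group A and group B?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)                    # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))      # standard error of the difference
    return (p_b - p_a) / se

# 10.0% vs 13.0% conversion on 1,000 users each: just barely significant
z = ab_test_z(100, 1000, 130, 1000)
```

With ten times the noise (or a tenth of the lift), the same test would need far more users to reach the same z.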

One thing I’ve learned the hard way is that it’s often useful to pick a more biased metric if that means you get results faster or get more of them. For instance, we can roll out an A/B test with two different recommendation algorithms. We probably won’t see an impact on high-level metrics such as retention, so we can pick a feature-specific metric instead.

But we can go even further. If we’re A/B testing, we still probably have to run the test for two weeks to get any meaningful numbers. At the end of the day we obtain (roughly) 1 bit of information at best after two weeks: which test group won.

If we want to iterate quickly sometimes it makes a lot more sense to just take the output of the recommendation algorithms and let people go through a blind test. This is slightly more biased because we’re not using real users, but it’s usually a lot quicker, and it also lets us extract a lot more information. Not only do we learn which algorithm is better, we often end up with lots of (noisy) anecdotal information, such as “algorithm A sucks for popular content” or “algorithm B is less diverse”.

At the lowest level, if you have a recommender algorithm, the model’s objective (eg. mean squared error) is another great proxy. It turns out it’s extremely easy to try multiple parameters and learn from it – all you need to do is retrain the model. It totally dominates points 2, 3 and 4, but doesn’t do a great job on point 1.

TL;DR any metric you’re using is just a proxy. Pick the right one for your task.

Luigi conquering the world

I keep forgetting to buy a costume for Halloween every year, so this year I prepared and got myself a Luigi costume a month in advance. Only to realize I was going to be out of town the whole weekend. If anyone wants a Luigi costume, let me know!

(I’m not as big as the guy in the picture)

Anyway, that’s not the Luigi this blog post is about. This is about the Python workflow manager that I’ve open sourced. If you’re putting together batch jobs into complex pipelines, you should check it out.

There’s been a couple of companies in NYC using it for a while, eg. Foursquare, Bitly, and Mortar Data. The latter company even provides a hosted Luigi solution.

Lately, there’s been a surge of contributors (at the time of writing there are 78 contributors, which is pretty cool!). People at Houzz have sent tons of pull requests, and they are not the only ones.

Two blog posts just came out, both describing their use of Luigi (among other things)

In addition, I found this interesting deck from Stripe:

There have also been a couple of meetups, with more coming up. There was a dedicated Luigi meetup on July 31 at the Spotify NYC office, sponsored by Run Ads, who use Luigi a lot. Here are my slides about Luigi’s future, heavily delayed:

I’m also talking about Luigi at the NYC Data Science meetup on Dec 19 (also at Spotify’s NYC office). Feel free to drop by and ask me some tough questions!


Annoying blog post

I spent a couple of hours this weekend going through some pull requests and issues to Annoy, which is an open source C++/Python library for Approximate Nearest Neighbor search.

I set up Travis-CI integration and spent some time on one of the issues that multiple people had reported. At the end of the day, it turns out the issue was actually caused by a bug in GCC 4.8. Some crazy compiler optimization introduced between 4.6 and 4.8 caused this loop to be removed:

if (indices.size() <= (size_t)_K) {
  for (size_t i = 0; i < indices.size(); i++)
    m->children[i] = indices[i];
  return item;
}

Replacing it with std::copy turned out to do the trick:

if (indices.size() <= (size_t)_K) {
  std::copy(indices.begin(), indices.end(), m->children);
  return item;
}

It’s still bizarre, but I probably deserved it, given how Annoy is abusing C++. The m->children array is declared as being only 2 elements long, but I deliberately overflow the array because I allocate extra space after it. I think this might cause GCC to unroll the loop to run twice.

I always feel a bit more comfortable when it turns out that the compiler is introducing bugs rather than my code. Made me think of the Jeff Dean joke: Jeff Dean builds his code before committing it, but only to check for compiler and linker bugs.

Anyway, after fixing this in three separate places, it seems like it’s finally working. Dirk Eddelbuettel is working on an R implementation of Annoy which is fun to see.

I haven’t spent much time with Annoy in a year or two and looking around it seems like there are some new competitors on the block. Panns is one of them, another one is the LSHForest pull request for scikit-learn. I haven’t looked at them thoroughly, but they are both Python-only and claim some advantages over Annoy. None of them implement mmap as a method to load indexes, which imho is Annoy’s killer feature.

There’s a performance benchmark featuring Annoy, LSHForest, and FLANN, written by the author of LSHForest. Annoy performs horribly in the benchmark, getting its ass severely kicked by the other two. After re-running the benchmark myself, I think what happened is that the bug I mentioned above was present for Annoy and that’s why it performed so badly. Re-running the benchmark (thanks for making it easily reproducible!) yields very different results.

It’s extremely hard to compare all trade-offs between index building time, index size, query performance, and accuracy. So please don’t take this super seriously. The only things I changed in the benchmark were (1) I added Panns, for good measure, and (2) I reduced the number of trees for Annoy (and Panns) to 10 instead of using n_features. Without reducing the number of trees, Annoy gets pretty much 100% accuracy for all data sets, but takes several minutes to build each index. So to emphasize the approximate aspect of ANN, I decided to sacrifice some accuracy to gain performance.

Pardon the lousy graphics, but here’s the result in all its glory:

In my layman terms:

  • Annoy and Panns outperform LSHF and FLANN significantly on accuracy.
  • The index building process is fast for LSHF and FLANN. Annoy takes a lot more time, and Panns is 10x slower than Annoy.
  • FLANN is faster than Annoy for queries. Annoy is 10x faster than LSHF. Panns is super duper slow.

And with my severely biased conclusions:

  • If you want to use mmap for fast index loading, use Annoy
  • If you want to minimize file size at any cost, use Panns
  • If you want fast query times at any cost, use FLANN
  • If you want a pure Python solution, use LSHF
  • For anything else, use Annoy. Or am I going too far promoting my own projects now…?

Btw, I would love it if someone could help me reimplement the algo used by Panns in Annoy, since it seems pretty good.

For another comparison, check out Radim Řehůřek’s Performance Shootout of Nearest Neighbours.

All metrics below:

                Build index (s)   Avg query (ms)   Avg accuracy
n_samples: 1000, n_features: 100
  LSHF              0.02              5.46             0.59
  Annoy             0.15              0.27             0.98
  FLANN             0.00              0.17             0.60
  Panns             3.27             66.48             0.93
n_samples: 1000, n_features: 500
  LSHF              0.10              7.11             0.61
  Annoy             0.39              0.75             0.98
  FLANN             0.01              0.24             0.62
  Panns             9.94            140.60             0.96
n_samples: 10000, n_features: 100
  LSHF              0.25              8.01             0.61
  Annoy             3.17              0.45             0.98
  FLANN             0.02              0.20             0.62
  Panns            55.34             71.12             0.96
n_samples: 10000, n_features: 500
  LSHF              1.29              9.50             0.15
  Annoy            10.46              1.14             0.50
  FLANN             0.07              0.24             0.13
  Panns           154.58            139.83             0.54
n_samples: 10000, n_features: 1000
  LSHF              2.70             13.74             0.16
  Annoy            18.28              2.32             0.49
  FLANN             0.11              0.32             0.12
  Panns           278.21            257.45             0.49

The Filter Bubble is Silly and you Can’t Guess What Happened Next

I’m at RecSys 2014, meeting a lot of people and hanging out at talks. Some of the discussions here were about the filter bubble, which prompted me to formalize my own thoughts.

I firmly believe that it’s the role of a system to respect the user’s intent. Any sensible system will optimize for the user’s long-term happiness by providing info that s/he finds useful. This holds true as long as the system isn’t (a) stupid, recommending the wrong content, or (b) pushing its own agenda, which may or may not be hidden.

When I go to Google and search for “banana” it tries to predict my intent. I expect to get results about bananas back. It’s like I go to an Indian restaurant and order something – I expect to get Indian food back.

Now, the chef could be bad, and I might not get what most people consider Indian food. But that’s just a “bug”. The chef knows that he should respect my intent.

The other option is that I get some weird food because the chef has a hidden agenda. For instance, maybe s/he puts crack in my food so that I come back tomorrow. Or maybe, s/he gives me a hamburger, tells me this is Indian food, and blatantly lies that there’s no other Indian restaurants in United States.

Luckily, we live in a free market, and as long as there’s some competition, I believe that the truth will prevail. I’ll eventually figure out that there are other Indian restaurants and I will discover that they are much better. This also means that there’s really no incentive for the chef to serve something that’s wildly different from what I’m asking for.

My point is, I think any system’s role is to respect the intent of the user, and serve whatever you ask for. It’s the user’s role to decide what s/he wants to consume, and in a free market I think there will be those options out there.

Anyway, this story took a funny twist. As I was debating this over Twitter, some Twitter bot stepped in and started offering advice on Indian restaurants. Obviously unfiltered and neutral.

Detecting corporate fraud using Benford’s law

Note: This is a silly application. Don’t take anything seriously.

Benford’s law describes a phenomenon where numbers in any data series will exhibit patterns in their first digit. For instance, if you took a list of the 1,000 longest rivers of Mongolia, or the average daily calorie consumption of mammals, or the wealth distribution of German soccer players, you will on average see that these numbers start with “1” about 30% of the time. I won’t attempt to prove this, but essentially it’s a result of scale invariance. It doesn’t apply to all numerical series, like IQ or shoe size, but this pattern turns out to pop up in a lot of places.

Since the theory predicts that the first digit follows a certain distribution, you can use it to find “strange” data series that seem to disobey what we would expect statistically. The Wikipedia article mentions using Benford’s law to detect accounting fraud, and Greece was busted by researchers noting that the Greek macroeconomic data had an abnormally large deviation from what Benford’s law would predict. There’s another couple of papers and an interesting blog post applying Benford’s law to industry sectors.

For fun, I downloaded about 5,000 annual reports (10-K) for most publicly traded companies in the US, to see if there are big outliers.

Benford’s law predicts that the probability of any first digit, 1-9, is

Q(d) = \left(\log (d+1) - \log d \right) / \log 10.
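In code, the whole distribution is a one-liner; Q(1) comes out to about 0.301, the ~30% mentioned above:

```python
import math

# Benford first-digit probabilities: Q(d) = log10(d + 1) - log10(d)
Q = {d: math.log10(d + 1) - math.log10(d) for d in range(1, 10)}
```

The nine probabilities telescope to log10(10) = 1, as a distribution should.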

For every annual report, I calculate the empirical distribution P(d) = n_d / \sum n_i, where n_d is just the number of occurrences of a dollar amount starting with digit d. To correct for reports with few values, I smooth the measured digit distribution a bit by adding 100\cdot Q(d) “fake” counts to each n_d (adding counts proportional to P itself would leave the distribution unchanged).

To measure the difference between expected and actual distributions, I use the KL-divergence which boils down to

D_{P \| Q} = \sum_i P(i) \log \left( P(i) / Q(i) \right)
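Putting the smoothing and the divergence together, here’s a minimal sketch of the scoring (my reading of the steps above; it assumes integer dollar amounts, whereas the real pipeline parses 10-K tables):

```python
import math
from collections import Counter

# the Benford prior Q(d) from above
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """First non-zero digit of a positive integer dollar amount."""
    while x >= 10:
        x //= 10
    return x

def benford_kl(amounts, smoothing=100.0):
    """KL-divergence between a report's smoothed first-digit distribution P
    and the Benford prior Q. Smoothing pulls reports with few values toward
    the prior, so they don't get flagged on noise alone."""
    counts = Counter(first_digit(a) for a in amounts if a > 0)
    n = sum(counts.values()) + smoothing
    P = {d: (counts[d] + smoothing * BENFORD[d]) / n for d in range(1, 10)}
    return sum(P[d] * math.log(P[d] / BENFORD[d]) for d in range(1, 10))
```

A report whose digits roughly follow Benford scores near zero; a report where every amount starts with 9 scores high.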


I downloaded the annual reports from SEC and extracted all figures from all tables containing dollar amounts. Since some amounts may occur many times and skew the digit distribution, I only looked at the unique amounts in each report. I then extracted the first non-zero digit of each amount.

The distributions of digits for the top five outlier entries illustrate Benford’s law in practice:

On closer inspection, some of these seem legit. For instance, the #1 spot on the list, Mid-America Apartment Communities, Inc. has a long list of units across the country, and the average price per unit happens to cluster around $800.

Below is a list containing the 100 companies with the largest KL-divergence (most “fishy”). None of the companies stand out as having an outrageous distribution, and even the top companies on the list are very unlikely to have committed fraud. The prior probability of accounting fraud is extremely low, so we would be committing the prosecutor’s fallacy by singling out any of these numbers as fraudulent. Anyway, I’ll follow up with a new blog post in five years to see if any of the companies below were actually caught:

0.1311 Mid-America Apartment Communities, Inc.
0.0578 Power Integrations Inc.
0.0497 United Natural Foods,Inc.
0.0474 Lexicon Pharmaceuticals, Inc
0.0461 Pacific Office Properties Trust, Inc.
0.0414 Xilinx
0.0406 Host Hotels & Resorts Inc.
0.0391 World Acceptance Corp
0.0390 Immunomedics, Inc.
0.0388 Marriott International Inc.
0.0387 CVS Caremark Corporation
0.0382 Paychex, Inc.
0.0382 Luna Innovations Incorporated
0.0381 Capstead Mortgage Corporation
0.0370 Verso Paper Corp.
0.0370 Fastenal Co.
0.0364 Insperity, Inc.
0.0359 Diamond Hill Investment Group Inc.
0.0354 National Security Group Inc.
0.0345 GameStop Corp.
0.0342 Compass Minerals International Inc.
0.0340 SIRIUS XM Radio Inc.
0.0339 BP Prudhoe Bay Royalty Trust
0.0326 Investors Bancorp Inc.
0.0323 Kohlberg Capital Corporation
0.0319 Equity One
0.0319 Kona Grill Inc.
0.0313 Alliance Financial Corporation
0.0310 Zale Corporation
0.0310 Anadarko Petroleum Corporation
0.0308 Sigma-Aldrich Corp.
0.0304 Global Cash Access Holdings, Inc.
0.0300 Corcept Therapeutics
0.0294 Enbridge Energy Management LLC
0.0293 BJ’s Restaurants Inc.
0.0293 Air Transport Services Group, Inc.
0.0292 Fairchild Semiconductor International Inc.
0.0292 Universal Electronics Inc.
0.0291 Espey Manufacturing & Electronics Corp.
0.0290 Inland Real Estate Corporation
0.0286 W. R. Berkley Corporation
0.0285 Albemarle Corp.
0.0282 Koss Corp.
0.0281 Leap Wireless International Inc.
0.0279 Encore Wire Corp.
0.0276 UQM Technologies, Inc.
0.0276 DuPont Fabros Technology Inc.
0.0276 Applied Materials Inc.
0.0275 Destination Maternity Corporation
0.0272 Pepsico, Inc.
0.0271 CorVel Corporation
0.0270 Nathan’s Famous Inc.
0.0269 Sport Chalet, Inc.
0.0269 Key Technology Inc.
0.0268 Overhill Farms Inc.
0.0268 Digi International Inc.
0.0267 Materion Corporation
0.0265 DreamWorks Animation SKG Inc.
0.0265 NIC Inc.
0.0264 ANSYS Inc.
0.0258 Volterra Semiconductor Corporation
0.0258 Verenium Corporation
0.0258 KeyCorp
0.0255 Rockwell Collins Inc.
0.0254 Meritage Homes Corporation
0.0253 Perrigo Co.
0.0249 Zhone Technologies Inc
0.0249 McGrath RentCorp
0.0249 A.M. Castle & Co.
0.0248 Delta Natural Gas Co. Inc.
0.0247 Pervasive Software Inc.
0.0247 Senomyx
0.0247 ManTech International Corp.
0.0246 Ross Stores Inc.
0.0245 Bancorp Of New Jersey, Inc.
0.0245 Werner Enterprises
0.0244 Dillards Inc.
0.0244 Sparton Corp.
0.0243 Rudolph Technologies Inc.
0.0243 CyberOptics Corp.
0.0240 Hallador Energy Company
0.0238 DARA BioSciences, Inc
0.0238 Chico’s FAS Inc.
0.0237 Delcath Systems Inc.
0.0236 Pure Cycle Corp.
0.0235 Cytori Therapeutics
0.0235 Vonage Holdings Corporation
0.0235 Spectranetics Corporation
0.0235 Regal-Beloit Corporation
0.0234 ScanSource, Inc.
0.0234 Weyco Group Inc
0.0232 Ambassadors Group Inc.
0.0232 Rent-A-Center Inc.
0.0232 Accenture plc
0.0231 Idenix Pharmaceuticals
0.0231 KAR Auction Services, Inc.
0.0230 Progressive
0.0230 BCSB Bankcorp Inc.
0.0229 PCTEL, Inc.
0.0229 Cincinnati Financial Corp.

Again, a bunch of disclaimers: this is just a silly application, don’t take it seriously, elevator inspection certificate available in the building manager’s office, etc.

Running Theano on EC2

Inspired by Sander Dieleman’s internship at Spotify, I’ve been playing around with deep learning using Theano. Theano is this Python package that lets you define symbolic expressions (cool), does automatic differentiation (really cool), and compiles it down into bytecode to run on a CPU/GPU (super cool). It’s built by Yoshua Bengio’s deep learning team up in Montreal.

This isn’t going to be a long blog post – I just wanted to share two pro tips:

  1. I was messing around for hours trying to get Theano running on the GPU instances in EC2. Turns out Andreas Jansson, a coworker at Spotify, has already built a ready-to-use AMI. When you start an EC2 instance, search for the gpu_theano AMI. (AMI’s are Amazon’s virtual images that you boot your system from). The gpu_theano AMI runs Ubuntu 14.04 and comes with a bunch of stuff pre-installed. Andreas also has this tool to spin it up from the command line, but I couldn’t get it working (somehow the instances it created weren’t accessible over SSH) so I ended up just booting machines from the AWS Management Console.
  2. The list price for the g2.2xlarge instances (the ones with GPU’s) is $0.65/h. If you end up running something for a week then that’s just above $100. The spot instance price, however, is (currently) only $0.0641/h – less than 10%. The downside with spot instances is that you’re using excess capacity of EC2, so there’s a small chance your machine will be taken down at any point. But so far supply generally seems to outstrip demand. The price looks fairly stable, and you can always checkpoint data to S3 to persist it.
My deep learning model is about 50x faster on a g2.2xlarge (which has 1,536 GPU cores) compared to a c3.4xlarge (which has 16 CPU cores) so the speedup is pretty substantial.



In defense of false positives (why you can’t fail with A/B tests)

Many years ago, I used to think that A/B tests were foolproof and all you need to do is compare the metrics for the two groups. The group with the highest conversion rate wins, right?

Then, for a long period, I ran a lot of tests. I started using confidence intervals, and learned about all the pitfalls of A/B testing. What to think about when running many A/B tests, why you shouldn’t check your metrics every day, why you shouldn’t optimize for a local maximum, and so on. I started becoming paranoid.

There’s about a million blog posts out there saying how everyone’s doing A/B testing wrong and how you should do it instead. It’s like there’s some secret society of people and the only way to join this club is to sacrifice a limb. You clearly have no idea what you’re doing… why are you even thinking about A/B testing? Go back and hack on your cousin’s webshop, or else pick up this 1,500 page book about Bayesian statistics and come back in two years.

The other side of this is about half a gazillion people who argue that A/B testing is inherently flawed. What I used to say was: Don’t throw out the baby with the bathwater. As long as you’re aware of the pitfalls, it’s a great tool. I thought for many years that running A/B test without a thorough understanding of all its shortcomings was dangerous.

But then I changed my mind. Here’s the thing: even if you’re doing A/B testing completely wrong, you are probably benefiting from it. Even if you don’t care about confidence intervals or multiple comparison corrections, or if you are basically too impatient to wait, you are probably still doing the right thing. The reason is that user metrics optimization is not a drug trial.

What do I mean by this? There are just a few things that govern the success of an A/B test:

  1. The impact of a true positive. Assuming you end up deploying the right thing, what’s the business impact?
  2. The cost of a false positive. Assuming you end up deploying the wrong thing, what’s the business impact?
  3. The prior probability of success. Before you start running the test, what’s the probability of success? In the long run, what’s the success rate of testing?

For a drug trial, the impact of a true positive is huge. You found a cure for baldness! But the cost of a false positive is even bigger: it turns out your drug doesn’t work, and it’s also causing hallucinations. Finally, if you’re a drug company, you probably evaluated 100 different drugs before finding one that seems to work, meaning the success rate of any specific drug is minimal.

This is why drug trials are subject to such intense scrutiny by government agencies. It’s also why most published research findings are false.

But you’re not a drug company, nor are you trying to find the Higgs boson. You’re basically evaluating whether a bigger “sign up” button leads to more conversions. In fact, most of your tests are driven by strong hypotheses with a large prior belief. You have a clever idea of how to impact users and historically few A/B tests show negative results.

The cost of deploying the wrong thing (false positives) is also low. You might end up with the wrong color button or some extra code that adds small tech debt. But not more than that. After all, a feature can’t be horrible if metrics aren’t tanking.

The other thing people argue a lot about is which success metric matters. In my experience, it rarely matters. I’ve very rarely seen statistically significant impacts going in two directions (one metric going up, the other going down) as long as you pick metrics in a sensible way (eg. avoid ratio metrics). But what I have seen is insignificant tests. Lots of them. So if you have to pick a metric, the most important thing is to pick the one with the largest signal-to-noise ratio. Just don’t cherry-pick a metric after the test is run.

Conclusion: Don’t listen to all the haters. Do more A/B testing.

Recurrent Neural Networks for Collaborative Filtering

I’ve been spending quite some time lately playing around with RNN’s for collaborative filtering. RNN’s are models that predict a sequence of something. The beauty is that this something can be anything really – as long as you can design an output gate with a proper loss function, you can model essentially anything.

In the case of collaborative filtering, we will predict the next item given the previous items. More specifically, we will predict the next artist, album, or track, given the history of streams. Without loss of generality, let’s assume we want to predict tracks only.

Note that we’re not trying to predict ratings or any explicit information – just what track the user chose to play.

The data

We use playlist data or session data, because it has an inherent sequence to it. Removing consecutive duplicates improves performance a lot, since otherwise the network just learns to predict the same item as it just predicted.

In our case we use a few billion playlists/sessions and in total about ten to a hundred billion “words”.
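The consecutive-duplicate removal is one line with itertools.groupby (a sketch; the real data prep obviously involves a lot more):

```python
from itertools import groupby

def dedupe_consecutive(session):
    """Collapse consecutive repeats: [a, a, b, a, a] -> [a, b, a], so the
    network can't win by always predicting "same track again"."""
    return [track for track, _ in groupby(session)]
```

Note that non-consecutive repeats survive, which is what we want: listening to a track again later in the session is real signal.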

The model

Recurrent neural networks have a simple model that tries to predict the next item given all previous ones. After predicting the item, the network gets to “know” what item it was, and incorporates this.

More formally, let’s assume we have time steps 0 \ldots t-1. The model has a “hidden” internal state h_0 \ldots h_{t-1}. These are generally vectors of some dimension k. At every time step, two things are going on:

  • Predict the output given the hidden state. We need to model P(y_i | h_i) for this.
  • Observe the output y_i and feed it back into the next hidden state h_{i+1}. In the most general form, h_{i+1} = f(a(h_i) + b(y_i)). In practice, f is generally some nonlinear function like sigmoid or tanh, whereas a and b are usually simple linear transforms. It depends a bit on the structure of the output.

Now, all we need to do is write down the total likelihood and optimize for it!

Wait a minute?

Sorry about the extremely superficial introduction without much detail. Here’s a more specific example:

Let’s say we want to predict a sequence of daily stock returns. In that case, y_i is a vector of stock returns – maybe containing three values with the daily return for Apple, Google, and Microsoft. To get from the hidden state h_i to y_i let’s just use a simple matrix multiplication: y_i = W h_i

We can assume P(y_i | h_i) is a normal distribution centered on the prediction, because then the log-likelihood is just the (negative) L2 loss: -(y_i - W h_i)^2

We can specify that h_{i+1} = \tanh(Uy_i + Vh_i) and that h_0 = 0 (remember it’s still a vector). If we want to be fancier we could add bias terms and stuff, but let’s ignore that for the purpose of this example. Our model is now completely specified, and our unknown parameters are U, V, and W (3k^2 of them if everything is k-dimensional).
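Here’s a numpy sketch of the forward pass of this model (the shapes are my own choice; I let the hidden state and the return vector have different dimensions):

```python
import numpy as np

def forward(U, V, W, ys):
    """Forward pass of the stock model: h_0 = 0, prediction W h_i,
    h_{i+1} = tanh(U y_i + V h_i). Returns the hidden states and the
    sum of squared errors (the negative log-likelihood up to constants)."""
    h = np.zeros(V.shape[0])
    hs, sse = [], 0.0
    for y in ys:
        hs.append(h)
        sse += np.sum((y - W @ h) ** 2)        # how far off was the prediction?
        h = np.tanh(U @ y + V @ h)             # fold the observed returns back in
    return hs, sse

# toy run: k = 4 hidden units, 3 stocks, five days of tiny returns
rng = np.random.default_rng(0)
k, d = 4, 3
U, V, W = rng.normal(size=(k, d)), rng.normal(size=(k, k)), rng.normal(size=(d, k))
ys = rng.normal(scale=0.01, size=(5, d))
hs, sse = forward(U, V, W, ys)
```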

What are we optimizing?

We want to find UV, and W that maximizes the log-likelihood over all examples:

 \log L = \sum\limits_{\mbox{all examples}} \left( \sum\limits_{i=0}^{t-1} -(y_i - W h_i)^2\right)



The way to maximize the log-likelihood is through back-propagation. This is a well-known method and there are so many resources online that I’ll be a bit superficial about the details.

Anyway, we need to do two passes through each sequence. First, forward propagation: for i=0\ldots t-1, calculate all hidden states h_i. Nothing magic going on here, we’re just applying our update rule.

Now it’s time for backprop. This is essentially just the chain rule taken to the extreme. Remember that the total log-likelihood is the sum of all the individual log-probabilities of observing each output:

L = \sum\limits_{i=0}^{t-1} \log P(y_i | h_i)


We define the derivatives  \delta_i as the partial derivatives of the log-likelihood with respect to the hidden state:

\delta_i = \frac{\partial \log L}{\partial h_i}


Since each hidden state influences only the current and later outputs, \delta_i depends only on the terms j = i\ldots t-1:

\delta_i = \sum\limits_{j=i}^{t-1}\frac{\partial \log P(y_j | h_j)}{\partial h_i}


We can now rewrite \delta_i as a function of \delta_{i+1} and use some chain rule magic:

  • \frac{\partial \log L}{\partial h_i} = \delta_i = \frac{\partial}{\partial h_i} \log P(y_i | h_i) + \frac{\partial h_{i+1}}{\partial h_{i}} \delta_{i+1}

We can evaluate this backwards from t-1 \ldots 0. Both P(y_i | h_i) and \frac{\partial h_{i+1}}{\partial h_{i}} are specified by our model, so we just need to plug in the expressions.

For the stock price example, we have the unknown parameters U, V, W where we can derive the gradients like this:

  • \frac{\partial \log L}{\partial U} = \sum_i \frac{\partial \log L}{\partial h_{i+1}}\frac{\partial h_{i+1}}{\partial U} = \sum_i \delta_{i+1} \frac{\partial}{\partial U} \tanh(Uy_i + Vh_i) = \sum_i \delta_{i+1} \tanh'(Uy_i + Vh_i) y_i^T

This looks intimidating but it’s really just a lot of chain rule applications and fairly straightforward math. You get similar gradients for V and W. Now that we have the gradients, we can optimize using stochastic gradient descent over lots of examples.
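Here’s a self-contained numpy sketch of the whole recipe for the stock example: a forward pass, then the backward recursion for the \delta_i and the gradient accumulation. This is my own implementation of the steps above, and it returns gradients of the squared-error loss (the negative log-likelihood), so you would descend, not ascend:

```python
import numpy as np

def rnn_loss_and_grads(U, V, W, ys):
    """Backprop-through-time for the stock model: h_{i+1} = tanh(U y_i + V h_i),
    prediction W h_i, squared-error loss. A sketch, not production code.
    Returns the total loss and gradients w.r.t. U, V, W."""
    k = V.shape[0]
    hs, loss = [np.zeros(k)], 0.0
    for y in ys:                                   # forward pass
        loss += np.sum((y - W @ hs[-1]) ** 2)
        hs.append(np.tanh(U @ y + V @ hs[-1]))
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    delta = np.zeros(k)                            # delta_i = dLoss/dh_i, built backwards
    for i in reversed(range(len(ys))):
        y, h, h_next = ys[i], hs[i], hs[i + 1]
        back = (1 - h_next ** 2) * delta           # delta_{i+1} pushed through the tanh
        dU += np.outer(back, y)                    # h_{i+1} = tanh(U y_i + V h_i)
        dV += np.outer(back, h)
        err = y - W @ h
        dW += -2.0 * np.outer(err, h)
        delta = -2.0 * (W.T @ err) + V.T @ back    # local term + all future terms
    return loss, dU, dV, dW

# toy run with small weights so the recursion stays tame
rng = np.random.default_rng(1)
k, d = 3, 2
U, V, W = rng.normal(size=(k, d)), 0.1 * rng.normal(size=(k, k)), rng.normal(size=(d, k))
ys = rng.normal(size=(4, d))
loss, dU, dV, dW = rnn_loss_and_grads(U, V, W, ys)
```

A finite-difference check (nudge one entry of U, re-run, compare the loss change against dU) is the easiest way to convince yourself the chain rule came out right.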

How does this relate to Hidden Markov Models?

The nice thing about RNN’s is that the relation between h_{i+1} and h_i is exact rather than being some probabilistic relationship. This means that the h‘s are not parameters in themselves, so we don’t have to solve for them at all. This is usually the slow part of HMM’s, since figuring out the hidden values requires some slow iterative process like the Baum-Welch algorithm. For RNN’s, we only need two passes through each sequence, rather than iterating lots of times until the hidden states converge.

The other thing is that RNN’s need some kind of nonlinearity, or else the magnitude of the hidden states will explode. This nonlinearity is usually taken to be sigmoid or tanh. I guess in theory HMM’s could also use nonlinearities, but I’ve never heard of this.

Predicting other things

Let’s focus on the collaborative filtering example. Given a sequence of watched movies, or tracks that the user has listened to, predict what the next one is going to be. Now y_i is not a scalar, but one out of many items. We need some kind of distribution P(y_i | h_i). The one I’ve seen used is the softmax distribution over all possible outputs. This means we have to learn a vector a_j for each item. The probability P(y_i | h_i) is now proportional to \exp(h_i^Ta_j):

P(y_i = j | h_i) = \frac{\exp(h_i^Ta_j)}{\sum_k \exp(h_i^Ta_k)}


Notice that the summation part in the denominator is over all items – something that is pretty slow to compute. I’ll get back to that.

We also need something linking the output back to the next hidden state. In fact, we will learn another set of vectors b_j – one for each item. With slight abuse of notation, here is how it looks:

Since we are learning a bunch of vectors, we don’t need the matrices U and V. Our model now becomes:

h_{i+1} = \tanh(W h_i + b_i)


with some slight abuse of notation since the b‘s are shared for each item y_i. Again, we want to maximize the total log-likelihood

\sum\limits_{\mbox{all examples}} \sum\limits_{i=0}^{t-1} \log P(y_i | h_i)


We now end up with a ton of parameters because we have the unknowns a_j‘s and b_j‘s for each item.

Let’s just pause here and reflect a bit: so far this is essentially a model that works for any sequence of items. There’s been some research on how to use this for natural language processing. In particular, check out Tomas Mikolov’s work on RNN’s (this is the same guy that invented word2vec, so it’s pretty cool).

The gnarly details

If you have 10 different items, you can evaluate P(y_i | h_i) easily, but not if you have five million items. But do not despair! There are a lot of ways you can attack this:

  • Take the right output and sample some random items from the entire “vocabulary”. Train the model to classify which one is the right output. This is sort of like a police lineup: one is the right suspect, the rest are just a random sample.
  • Hierarchical softmax: Put all items in a binary tree, and break it up into roughly \log m binary classification problems (where m is the size of the vocabulary).

Instead of messing around with Huffman trees and the other structures recommended in the literature, I ended up implementing a much simpler version of hierarchical softmax. Internally, all items are described as integers, so I build the tree implicitly. The root node is a binary classifier for the last bit of the item, the next level classifies the second-to-last bit, and so on.

The idea is that you can calculate P(y_i | h_i) as the product of all the probabilities for each individual node on the path from the root to the leaf.

Instead of learning an a_j for every item, we just have to learn one vector for every internal node in the tree – 2^{\left \lceil{\log_2 m}\right \rceil} - 1 vectors in total, roughly the same as the number of items. It doesn’t really matter that much how we build the tree – we don’t need to enforce that similar items are close to each other in the tree (however that would probably improve performance a bit).
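Here is a rough sketch of that implicit bit-tree. The layout is my own assumption (node 1 is the root, the children of node n are 2n and 2n+1), not the exact implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hier_softmax_prob(h, item, nodes, depth):
    """P(item | h) as a product of `depth` binary decisions, one per bit
    of the integer item id. `nodes` holds one vector per internal node."""
    p, node = 1.0, 1
    for level in range(depth - 1, -1, -1):
        bit = (item >> level) & 1          # the bit this node classifies
        q = sigmoid(nodes[node] @ h)       # P(bit = 1) at this node
        p *= q if bit else 1.0 - q
        node = 2 * node + bit              # descend to the chosen child
    return p

rng = np.random.default_rng(2)
depth, k = 3, 4                            # 2^3 = 8 items, hidden dim 4
nodes = rng.normal(0.0, 0.1, size=(2 ** depth, k))
h = rng.normal(0.0, 0.1, size=k)
total = sum(hier_softmax_prob(h, j, nodes, depth) for j in range(2 ** depth))
```

Because each node’s two branch probabilities sum to one, the probabilities over all leaves sum to one automatically – no expensive normalizing denominator.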

But in general, this problem pops up in a lot of places, and here are some other crazy ideas I’ve thought about:

  • If generating random “negative” samples, one way to mine better examples would be to sample some items from the y_i‘s themselves. Since those values are probably pretty similar, that would force the model to discriminate between “harder” cases.
  • Assume the a_j‘s have some kind of distribution like a Normal distribution, and calculate an expectation. This is just a bunch of integrals. We’ve had some success using this method in other cases.
  • Don’t use softmax, but instead use an L2 loss on m binary classification problems. All targets would be 0 except the right one, which is 1. You can put more weight on the right one to address the class imbalance. The cool thing is that with L2 loss everything becomes linear, and you can compute the sum over all items in constant time. This is essentially how Yehuda Koren’s 2008 paper on implicit collaborative filtering works.
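The first idea above – the police lineup – can be sketched like this (a toy version, with names of my own choosing):

```python
import numpy as np

def lineup_prob(h, A, target, negatives):
    """Softmax over the true item plus a handful of sampled negatives,
    instead of the full vocabulary. A toy sketch of negative sampling."""
    candidates = [target] + list(negatives)
    scores = np.array([A[j] @ h for j in candidates])
    scores -= scores.max()
    e = np.exp(scores)
    return e[0] / e.sum()       # P(the true item wins the lineup)

rng = np.random.default_rng(3)
A = rng.normal(0.0, 0.1, size=(1000, 8))   # pretend vocabulary of 1000 items
h = rng.normal(0.0, 0.1, size=8)
p = lineup_prob(h, A, target=42, negatives=rng.integers(0, 1000, size=5))
```

The cost per example is now proportional to the lineup size, not the vocabulary size.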

Implementing it

I ended up building everything in C++ because it’s fast and I’m pretty comfortable with it. It reads about 10k words per second on a single thread. A multi-threaded version I’ve built can handle 10x that amount and we can parse 10B “words” in about a day.


A beautiful thing about this model is that there are basically no hyperparameters. A larger number of factors is better – we typically use 40–200. With dropout (see below), overfitting is not a concern. It takes a little trial and error to get the step sizes right, though.


As with most latent factor models, you need to initialize your parameters to random noise. Typically, small Gaussian noise like \mathcal{N}(0, 0.1^2) works well.


I tried both sigmoid and tanh. Tanh makes more sense to me, because it’s symmetric around 0, so you don’t have to think too much about the bias term. Looking at some offline benchmarks, it seemed like tanh was slightly better than sigmoid.


I also added dropout to the hidden values, since it seemed to improve the model’s predictions. After each h_i is calculated, I set half of its entries to zero. This also seems to help with exploding gradients. What happens is that the W matrix essentially learns how to recombine the features.

Gradient clipping

Gradient clipping actually wasn’t needed for sigmoid, but for tanh I had to add it. Basically, during backprop I cap the magnitude of the gradient at 1. This also helps equalize the impact of different examples, since otherwise longer sequences get bigger gradients.
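Both tricks are a few lines each; here is a sketch. The 50% rate and the cap of 1 come from the text, but the function names are mine:

```python
import numpy as np

rng = np.random.default_rng(4)

def dropout_half(h):
    """Zero out each hidden unit with probability 0.5 (training only)."""
    return h * (rng.random(h.shape) < 0.5)

def clip_gradient(g, max_norm=1.0):
    """Cap the gradient's magnitude at max_norm during backprop."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

g = clip_gradient(np.full(10, 3.0))   # a big gradient gets rescaled to norm 1
```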


I use Adagrad on the item vectors, but a plain learning rate on the shared parameters W and the bias terms. Adagrad is fairly simple: in addition to each vector a_j and b_j, you just store a single scalar with the sum of the squared magnitudes of the gradients, and use that to normalize the gradient.

For a vector x with gradient d, Adagrad can be written as:

x^{(n+1)} = x^{(n)} + \eta \frac{d^{(n)}}{\sqrt{\sum\limits_{i=1\ldots n} \left( d^{(i)} \right)^2}}


\eta is a hyperparameter that should be set to about half the final magnitude these vectors will have. I’ve usually had success just setting \eta = 1.
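Written as code, the per-vector update might look like this (a sketch; the `+` in the step matches the formula above because we are maximizing the log-likelihood):

```python
import numpy as np

class AdagradVector:
    """Adagrad for a single vector, as described: one scalar accumulating
    the squared gradient magnitudes, with eta defaulting to 1."""
    def __init__(self, eta=1.0):
        self.eta = eta
        self.g2 = 0.0                     # running sum of squared gradient norms

    def step(self, x, d):
        self.g2 += d @ d
        return x + self.eta * d / np.sqrt(self.g2)

opt = AdagradVector()
x = opt.step(np.zeros(2), np.array([3.0, 4.0]))   # first step is d / |d|
```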


Offline results show that the RNN is one of the best performing algorithms for collaborative filtering and A/B tests confirm this.

More resources

I can recommend A tutorial on training recurrent neural networks as another starting point to read more about recurrent neural networks.

(Edit: fixed some minor errors in the math and a wrong reference to Baum-Welch)

Where do locals go in NYC?

One obvious thing to anyone living in NYC is how tourists cluster in certain areas. I was curious about the larger patterns around this, so I spent some time looking at data. The thing I wanted to understand is: what areas are dominated by tourists? Or conversely, what areas are dominated by locals?

After some looking around, I found this Foursquare data dump and analyzed about 200,000 check-ins in NYC. Time to crunch some data…

First of all, I split up check-ins into those done by (a) people living within 10,000 feet of the check-in and (b) people living further away. As the next step, I broke up the check-ins by the 2,166 census areas of New York and calculated the ratio of locals for each. I color-coded each census area from green (100% locals) to purple (0% locals). Here is the result, for the five boroughs:
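The per-area computation boils down to a groupby; here is a toy pandas sketch with made-up column names (not the actual analysis code):

```python
import pandas as pd

# Hypothetical check-in table: the census area of the venue, and whether
# the user lives within 10,000 feet of it.
checkins = pd.DataFrame({
    "census_area": ["A", "A", "A", "B", "B"],
    "is_local":    [True, True, False, False, False],
})

# Ratio of local check-ins per census area - this is the number mapped
# onto the green (1.0) to purple (0.0) color scale.
locals_ratio = checkins.groupby("census_area")["is_local"].mean()
```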

Some obvious ones stand out, like the airports: JFK and LaGuardia are completely dominated by “non-locals”. Interestingly, Red Hook in Brooklyn is also dominated by non-locals (maybe because of IKEA?), as are Prospect Park and some other areas in downtown Brooklyn. In the Bronx, the area surrounding Yankee Stadium is also nearly 100% non-locals.

Maybe not surprisingly, there seems to be some truth to the joke that real New Yorkers do not go above 14th Street. Zooming into lower Manhattan, you can clearly see a sharp dividing line cutting across Manhattan:

Another thing that stands out is how North Williamsburg is completely dominated by non-locals, maybe because of the huge influx of people on weekends going out partying.

I then split up check-ins into (a) people who live in the five boroughs and (b) tourists, and color-coded from blue (0% tourists) to red (100% tourists):

Airports are slightly biased towards tourists. Interestingly Red Hook now becomes a New York affair.

Zooming in on Lower Manhattan again:

Another thing that stands out is how North Williamsburg is dominated by New Yorkers, meaning that even though most people are not local, they are still from the city.

In contrast, most of Lower Manhattan (East Village, West Village, etc.) isn’t just dominated by locals – it also has a very low ratio of tourists to New Yorkers.

The areas most dominated by tourists are, maybe not surprisingly:

  • Central Park
  • Times Square and parts of Midtown
  • Financial district
  • Liberty Island and Ellis Island

I did this in Python, mostly using geopandas, Matplotlib, and a bunch of open source data sets. It was a fun weekend project, although it ended up taking way too much time. And since I live on the Lower East Side myself, I’m probably pretty biased…

How to build up a data team (everything I ever learned about recruiting)

During my time at Spotify, I’ve reviewed thousands of resumes and interviewed hundreds of people. Lots of them were rejected, but lots also got offers. Finally, I’ve also had my share of offers rejected by candidates.

Recruiting is one of those things where the Dunning-Kruger effect is the most pronounced: the more you do it, the more you realize how bad you are at it. Every time I look back a year, I realize 10 things I did wrong. Extrapolating this, I know in another year I’ll realize all the stupid mistakes I’m doing right now. Anyway, that being said, here are some things I learned from recruiting.

Getting the word out

Depending on where you work, people might have no clue about your company. Why would they work for something they have never heard of? Or alternatively – something they know of, but don’t necessarily associate with cutting-edge tech? There are a million companies out there doing cool stuff, so make sure people know your company stands out. Blog, talk at meetups, open source stuff, go to conferences. I honestly don’t know what works – I don’t have any numbers. But you need to hedge your bets by attacking from all angles at the same time.

I think developers have a hard time justifying this just because success is not easily quantifiable – this is a branding exercise, and it’s super hard to find out if you’re doing the right thing. But over time if you do this right, you will get anecdotal feedback from candidates coming in saying they saw your presentation or read this cool story on Hacker News, or what not.

Finding the people

I don’t think there’s anything magic about this – just go through external recruiters, internal recruiters, job postings, connections, whatever.

Presenting the opportunity

I think most people in the industry are fed up with bad bulk messages over email/LinkedIn. Ideally, the hiring manager should introduce themselves, or for more senior roles, more senior people should reach out (all the way up to the CTO). If a recruiter is reaching out, it’s super important that they include a quick note on what’s interesting about the team and why it’s a good fit.

Finding the right candidates

Recruiting is some crazy type of active learning problem with an observation bias: you only see how well the people you hire are doing. In particular, there was a lot of discussion a while back when Google claimed there was no correlation between job performance and test scores or GPA. I think there totally are really strong correlations on a macro scale, but if you are already filtering out people based on those criteria, you will obviously reduce the strength of the correlation, or even reverse it. Not that I claim to have found any magic criteria. I do, however, think the two most successful traits I’ve observed are (at the risk of sounding cheesy):

  1. Programming fluency (the 10,000-hour rule or whatever) – you need to be able to visualize large codebases and understand how things fit together. I strongly believe that data engineers need to understand the full stack, from idea, to machine learning algorithm, to code running in production. I’ve seen other companies with a “throw it over the fence” attitude, where one team brainstorms algorithms and another team in another city implements them. I think that’s flawed, because it prevents a tight learning cycle. In particular, I’m hesitant to hire candidates who are strong on the theoretical side but have little experience writing code. That’s why I really avoid the “data science” label – most people within this group are generally lacking on the core programming side. This doesn’t necessarily mean candidates have to have a solid understanding of the CAP theorem and the Linux page cache. The most important thing is that they have written a lot of code, can work with nontrivial code bases, and can write clean, maintainable code. There is nothing magic to this – but a person who has only written Matlab scripts will probably have a harder time adjusting.
  2. Understanding the big picture – going from a vision, to a set of tasks, to a bunch of code being written. People have to be able to go from an idea (“analyze this data set and build an algorithm that uses it as a signal for the recommender system”) to code, without being hand-held through every single step. People who need to be given small tasks rather than the underlying problem will never understand why we’re working on things, and will inevitably end up doing the wrong thing.

Attracting the right candidates

Even if you find awesome candidates, your interview process or your lack of selling might destroy your prospects of hiring the right people. Here are some things I’ve learned:
  • Smart people are actually really impressed by a good interview process. If some smart ML genius walks them through a super hard problem, the candidate will feel like they can learn from that interviewer. Conversely, giving interview problems that are not well structured, without any follow-up questions or discussion notes, will give a really bad impression. The best interview problems have (b), (c), and (d) versions that you can pull up in case the candidate nails (a) directly.
  • Understand the level of selling you need to do. If you get a super senior person in, spend the bare minimum of time establishing that this person actually lives up to their reputation. Then spend the rest of the time explaining the role and why you are so excited about it.
  • Explaining the role and why it’s a good fit is everything. But obviously not superficial stuff – these are engineers, and you need to impress them with the stuff they will learn. Talk about what your company is working on, what challenges you have, and what technologies you are using.
  • It’s often much easier to start by listening to the candidate. Let them ask questions and talk more than you. Try to understand what they are looking for. Then, explain how this position meets or does not meet those criteria.
  • Everyone has an angle. Someone coming from finance is probably more excited to hear about your product vision. Someone from a small failed startup is probably more excited to hear about your benefits.
  • Make the candidate meet the team they will be working with. So many companies fail at this. I remember going through a series of ten Google interviews many years ago, and then the recruiter wouldn’t tell me what team I would work on. I immediately turned down the offer. On the other side of the spectrum, the best thing you can do is bring the candidate out to a team lunch to meet as many people as possible.
  • Make them understand you take this role seriously. Set up a quick chat with the CTO or some senior people.

Hiring the right candidate

  • If it’s a senior person, and everyone is super impressed, hire the person. If it’s a senior person, and people are on the fence, you shouldn’t hire the person.
  • If it’s a junior person, it’s sometimes hard to know. One thing I’ve learned is that excitement really counts. For instance, if a candidate kicks ass during the ML interview and clearly has solid programming fluency, but doesn’t necessarily know much about scalability and version control, then I really think it boils down to how excited this person is.
  • At the end of the day, if you realize this candidate is brilliant, but not necessarily the right fit for your role, find something else. It’s a sign of a great organization that you can always find a place for smart people.

Building the right team

  • Candidates will be on a spectrum between “theoretical” (more ML stuff) and “applied” (more scalability, backend, data processing). But just for the purpose of the argument, let’s assume people are at one end or the other. For a high-performing machine learning team, I think a ratio of 1:2 is good. But everyone should be able to work on every part of the system, if needed.
  • If you’re a small company, you will end up hiring lots of people who are similar to yourself. I don’t necessarily mean people of the same gender or origin, but people who will all go out to the same bar and get the same beer. This might be good for cohesion early on, but as soon as you grow beyond the early stage, you might end up with a destructive monoculture that is super hard to break out of.
  • On the same note: if you’re a big company and your team still skews a certain way, it means you have prejudice problems you need to work on. Seriously. For instance, with 20.2% of software engineers being women, you can derive that 95% of all 50-person teams should contain at least five women, and 70% of all teams should contain at least eight. It’s just plain math.
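The “plain math” in that last bullet is just a binomial tail probability; here is a quick sketch to check the numbers:

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

# 50-person team, with 20.2% of software engineers being women:
p_at_least_5 = binom_tail(50, 0.202, 5)   # comes out well above 95%
p_at_least_8 = binom_tail(50, 0.202, 8)   # comes out above 70%
```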

For more reading on the subject, I just came across this interesting blog post by Cheng-Tao Chu: Why building a data science team is deceptively hard.