The Filter Bubble is Silly and you Can’t Guess What Happened Next

I’m at RecSys 2014, meeting a lot of people and hanging out at talks. Some of the discussions here were about the filter bubble, which prompted me to formalize my own thoughts.

I firmly believe that it’s the role of a system to respect the user’s intent. Any sensible system will optimize for the user’s long-term happiness by providing info back to the user that s/he finds useful. This holds true as long as the system isn’t (a) stupid, recommending the wrong content, or (b) trying to push its own agenda, which may or may not be hidden.

When I go to Google and search for “banana”, it tries to predict my intent. I expect to get results about bananas back. It’s like going to an Indian restaurant and ordering something – I expect to get Indian food back.

Now, the chef could be bad, and I might not get what most people consider Indian food. But that’s just a “bug”. The chef knows that he should respect my intent.

The other option is that I get some weird food because the chef has a hidden agenda. For instance, maybe s/he puts crack in my food so that I come back tomorrow. Or maybe s/he gives me a hamburger, tells me this is Indian food, and blatantly lies that there are no other Indian restaurants in the United States.

Luckily, we live in a free market, and as long as there’s some competition, I believe that the truth will prevail. I’ll eventually figure out that there are other Indian restaurants and discover that they are much better. This also means that there’s really no incentive for the chef to serve something that’s wildly different from what I’m asking for.

My point is, I think any system’s role is to respect the intent of the user and serve whatever the user asks for. It’s the user’s role to decide what s/he wants to consume, and in a free market I think those options will be out there.

Anyway, this story took a funny twist. As I was debating this over Twitter, some Twitter bot stepped in and started offering advice on Indian restaurants. Obviously unfiltered and neutral.

Detecting corporate fraud using Benford’s law

Note: This is a silly application. Don’t take anything seriously.

Benford’s law describes a phenomenon where the numbers in many data series exhibit a particular pattern in their first digit. For instance, if you took a list of the 1,000 longest rivers of Mongolia, or the average daily calorie consumption of mammals, or the wealth distribution of German soccer players, you will on average see that these numbers start with “1” about 30% of the time. I won’t attempt to prove this, but essentially it’s a result of scale invariance. It doesn’t apply to all numerical series, like IQ or shoe size, but this pattern turns out to pop up in a lot of places.

Since the theory predicts that the first digit follows a certain distribution, you can use it to find “strange” data sets that seem to disobey what we would expect statistically. The Wikipedia article mentions using Benford’s law to detect accounting fraud, and Greece was busted by researchers noting that Greek macroeconomic data had an abnormally large deviation from what Benford’s law would predict. There are another couple of papers and an interesting blog post applying Benford’s law to industry sectors.

For fun, I downloaded about 5,000 annual reports (10-K) for most publicly traded companies in the US, to see if there are big outliers.

Benford’s law predicts that the probability of any first digit, 1-9, is

Q(d) = \left(\log (d+1) - \log d \right) / \log 10.

For every annual report, I calculate the empirical distribution P(d) = n_d / \sum n_i, where n_d is just the number of occurrences of a dollar amount starting with digit d. To correct for reports with few values, I smooth the measured digit distribution a bit by adding 100\cdot P(d) “fake” counts to each n_d.

To measure the difference between the expected and actual distributions, I use the KL divergence, which boils down to

D(P \| Q) = \sum_i P(i) \log \left( P(i) / Q(i) \right)


I downloaded the annual reports from the SEC and extracted all figures from all tables containing dollar amounts. Since some amounts may occur many times and skew the digit distribution, I only looked at the unique amounts that occurred in each report. I then extracted the first non-zero digit of each amount.
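
Here’s a rough sketch of that computation in Python – note that smoothing toward the Benford prior is my reading of the correction described above, not necessarily the exact original:

```python
import numpy as np

def kl_from_benford(first_digits, smoothing=100):
    """KL divergence between a report's first-digit distribution and Benford's law.
    first_digits: iterable of first non-zero digits (1-9) extracted from one 10-K."""
    Q = np.log10(1 + 1 / np.arange(1, 10))         # Q(d) = log(d+1) - log(d), base 10
    counts = np.bincount(list(first_digits), minlength=10)[1:].astype(float)
    # Smooth toward the Benford prior so reports with few values don't dominate
    P = (counts + smoothing * Q) / (counts.sum() + smoothing)
    return float(np.sum(P * np.log(P / Q)))
```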

The distributions of digits for the top five outlier entries illustrate Benford’s law in practice:

On closer inspection, some of these seem legit. For instance, the #1 spot on the list, Mid-America Apartment Communities, Inc., has a long list of units across the country, and the average price per unit happens to cluster around $800.

Below is a list of the 100 companies with the largest KL-divergence (i.e. the most “fishy”). None of the companies stand out as having an outrageous distribution, and even the top companies on the list are very unlikely to have committed fraud. The prior probability of accounting fraud is extremely low, so we would be committing the prosecutor’s fallacy by singling out any of these companies as fraudulent. Anyway, I’ll follow up with a new blog post in five years to see if any of the companies below were actually caught:

0.1311 Mid-America Apartment Communities, Inc.
0.0578 Power Integrations Inc.
0.0497 United Natural Foods,Inc.
0.0474 Lexicon Pharmaceuticals, Inc
0.0461 Pacific Office Properties Trust, Inc.
0.0414 Xilinx
0.0406 Host Hotels & Resorts Inc.
0.0391 World Acceptance Corp
0.0390 Immunomedics, Inc.
0.0388 Marriott International Inc.
0.0387 CVS Caremark Corporation
0.0382 Paychex, Inc.
0.0382 Luna Innovations Incorporated
0.0381 Capstead Mortgage Corporation
0.0370 Verso Paper Corp.
0.0370 Fastenal Co.
0.0364 Insperity, Inc.
0.0359 Diamond Hill Investment Group Inc.
0.0354 National Security Group Inc.
0.0345 GameStop Corp.
0.0342 Compass Minerals International Inc.
0.0340 SIRIUS XM Radio Inc.
0.0339 BP Prudhoe Bay Royalty Trust
0.0326 Investors Bancorp Inc.
0.0323 Kohlberg Capital Corporation
0.0319 Equity One
0.0319 Kona Grill Inc.
0.0313 Alliance Financial Corporation
0.0310 Zale Corporation
0.0310 Anadarko Petroleum Corporation
0.0308 Sigma-Aldrich Corp.
0.0304 Global Cash Access Holdings, Inc.
0.0300 Corcept Therapeutics
0.0294 Enbridge Energy Management LLC
0.0293 BJ’s Restaurants Inc.
0.0293 Air Transport Services Group, Inc.
0.0292 Fairchild Semiconductor International Inc.
0.0292 Universal Electronics Inc.
0.0291 Espey Manufacturing & Electronics Corp.
0.0290 Inland Real Estate Corporation
0.0286 W. R. Berkley Corporation
0.0285 Albemarle Corp.
0.0282 Koss Corp.
0.0281 Leap Wireless International Inc.
0.0279 Encore Wire Corp.
0.0276 UQM Technologies, Inc.
0.0276 DuPont Fabros Technology Inc.
0.0276 Applied Materials Inc.
0.0275 Destination Maternity Corporation
0.0272 Pepsico, Inc.
0.0271 CorVel Corporation
0.0270 Nathan’s Famous Inc.
0.0269 Sport Chalet, Inc.
0.0269 Key Technology Inc.
0.0268 Overhill Farms Inc.
0.0268 Digi International Inc.
0.0267 Materion Corporation
0.0265 DreamWorks Animation SKG Inc.
0.0265 NIC Inc.
0.0264 ANSYS Inc.
0.0258 Volterra Semiconductor Corporation
0.0258 Verenium Corporation
0.0258 KeyCorp
0.0255 Rockwell Collins Inc.
0.0254 Meritage Homes Corporation
0.0253 Perrigo Co.
0.0249 Zhone Technologies Inc
0.0249 McGrath RentCorp
0.0249 A.M. Castle & Co.
0.0248 Delta Natural Gas Co. Inc.
0.0247 Pervasive Software Inc.
0.0247 Senomyx
0.0247 ManTech International Corp.
0.0246 Ross Stores Inc.
0.0245 Bancorp Of New Jersey, Inc.
0.0245 Werner Enterprises
0.0244 Dillards Inc.
0.0244 Sparton Corp.
0.0243 Rudolph Technologies Inc.
0.0243 CyberOptics Corp.
0.0240 Hallador Energy Company
0.0238 DARA BioSciences, Inc
0.0238 Chico’s FAS Inc.
0.0237 Delcath Systems Inc.
0.0236 Pure Cycle Corp.
0.0235 Cytori Therapeutics
0.0235 Vonage Holdings Corporation
0.0235 Spectranetics Corporation
0.0235 Regal-Beloit Corporation
0.0234 ScanSource, Inc.
0.0234 Weyco Group Inc
0.0232 Ambassadors Group Inc.
0.0232 Rent-A-Center Inc.
0.0232 Accenture plc
0.0231 Idenix Pharmaceuticals
0.0231 KAR Auction Services, Inc.
0.0230 Progressive
0.0230 BCSB Bankcorp Inc.
0.0229 PCTEL, Inc.
0.0229 Cincinnati Financial Corp.

Again, a bunch of disclaimers: this is just a silly application, don’t take it seriously, elevator inspection certificate available in the building manager’s office, etc.

Running Theano on EC2

Inspired by Sander Dieleman’s internship at Spotify, I’ve been playing around with deep learning using Theano. Theano is this Python package that lets you define symbolic expressions (cool), does automatic differentiation (really cool), and compiles them down to code that runs on a CPU/GPU (super cool). It’s built by Yoshua Bengio’s deep learning team up in Montreal.

This isn’t going to be a long blog post – I just wanted to share two pro tips:

  1. I was messing around for hours trying to get Theano running on the GPU instances in EC2. Turns out Andreas Jansson, a coworker at Spotify, has already built a ready-to-use AMI. When you start an EC2 instance, search for the gpu_theano AMI. (AMI’s are Amazon’s virtual machine images that you boot your system from.) The gpu_theano AMI runs Ubuntu 14.04 and comes with a bunch of stuff pre-installed. Andreas also has this tool to spin it up from the command line, but I couldn’t get it working (somehow the instances it created weren’t accessible over SSH), so I ended up just booting machines from the AWS Management Console.
  2. The list price for the g2.2xlarge instances (the ones with GPU’s) is $0.65/h. If you end up running something for a week, that’s just above $100. The spot instance price, however, is (currently) only $0.0641/h – less than 10% of the list price. The downside with spot instances is that you’re using excess capacity of EC2, so there’s a small likelihood your machine will be taken down at any point. But so far supply generally seems to outstrip demand. The price looks fairly stable, and you can always checkpoint data to S3 to persist it.
My deep learning model is about 50x faster on a g2.2xlarge (which has 1,536 GPU cores) compared to a c3.4xlarge (which has 16 CPU cores) so the speedup is pretty substantial.
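
To give a flavor of the symbolic expression + autodiff workflow, here’s a trivial Theano snippet (not tied to the AMI, just the basic API):

```python
import theano
import theano.tensor as T

x = T.dvector("x")
y = T.sum(x ** 2)                   # a symbolic expression
gy = T.grad(y, x)                   # automatic differentiation
f = theano.function([x], [y, gy])   # compiled for whatever device Theano is configured to use

print(f([1.0, 2.0, 3.0]))           # [array(14.0), array([2., 4., 6.])]
```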


In defense of false positives (why you can’t fail with A/B tests)

Many years ago, I used to think that A/B tests were foolproof and all you need to do is compare the metrics for the two groups. The group with the highest conversion rate wins, right?

Then, for a long period, I ran a lot of tests. I started using confidence intervals, and learned about all the pitfalls of A/B testing. What to think about when running many A/B tests, why you shouldn’t check your metrics every day, why you shouldn’t optimize for a local maximum, and so on. I started becoming paranoid.

There’s about a million blog posts out there saying how everyone’s doing A/B testing wrong and how you should do it instead. It’s like there’s some secret society of people and the only way to join this club is to sacrifice a limb. You clearly have no idea what you’re doing… why are you even thinking about A/B testing? Go back and hack on your cousin’s webshop, or else pick up this 1,500 page book about Bayesian statistics and come back in two years.

The other side of this is about half a gazillion people who argue that A/B testing is inherently flawed. What I used to say was: don’t throw out the baby with the bathwater. As long as you’re aware of the pitfalls, it’s a great tool. I thought for many years that running an A/B test without a thorough understanding of all its shortcomings was dangerous.

But then I changed my mind. Here’s the thing: even if you’re doing A/B testing completely wrong, you are probably benefitting from it. Even if you don’t care about confidence intervals or multiple comparison corrections, or if you are basically too impatient to wait, you are probably still doing the right thing. The reason is that user metrics optimization is not a drug trial.

What do I mean by this? There are just a few things that govern the success of an A/B test:

  1. The impact of a true positive. Assuming you end up deploying the right thing, what’s the business impact?
  2. The cost of a false positive. Assuming you end up deploying the wrong thing, what’s the business impact?
  3. The prior probability of success. Before you start running the test, what’s the probability of success? In the long run, what’s the success rate of testing?

For a drug trial, the impact of a true positive is huge. You found a cure for baldness! But the cost of a false positive is even bigger: it turns out your drug doesn’t work, and it’s also causing hallucinations. Finally, if you’re a drug company, you probably evaluated 100 different drugs before finding one that seems to work, meaning the success rate of any specific drug is minimal.

This is why drug trials are subject to such intense scrutiny by government agencies. It’s also why most published research findings are false.

But you’re not a drug company, nor are you trying to find the Higgs boson. You’re basically evaluating whether a bigger “sign up” button leads to more conversions. In fact, most of your tests are driven by strong hypotheses with a large prior belief. You have a clever idea of how to impact users and historically few A/B tests show negative results.

The cost of deploying the wrong thing (false positives) is also low. You might end up with the wrong color button or some extra code that adds small tech debt. But not more than that. After all, a feature can’t be horrible if metrics aren’t tanking.
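
To make this concrete, here’s a toy expected-value calculation – all the numbers are completely made up, just to illustrate how the three factors interact:

```python
def expected_value(p_good, impact_true_positive, cost_false_positive, p_ship_if_bad):
    """Crude expected value of shipping whatever 'wins' a sloppy test."""
    return p_good * impact_true_positive - (1 - p_good) * p_ship_if_bad * cost_false_positive

# Drug trial: tiny prior, huge downside -> sloppy testing is catastrophic
print(expected_value(p_good=0.01, impact_true_positive=1e9,
                     cost_false_positive=1e10, p_ship_if_bad=0.2))   # large negative

# Bigger "sign up" button: decent prior, tiny downside -> even sloppy testing is net positive
print(expected_value(p_good=0.3, impact_true_positive=1e5,
                     cost_false_positive=1e3, p_ship_if_bad=0.2))    # positive
```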

The other thing people argue a lot about is which success metric matters. In my experience, it almost never matters. I’ve very rarely seen statistically significant impacts going in two directions (one metric going up, another going down) as long as you pick metrics in a sensible way (e.g. avoid ratio metrics). But what I have seen is insignificant tests. Lots of them. So if you have to pick a metric, the most important thing is to pick the one with the largest signal-to-noise ratio. Just don’t cherry-pick metrics after the test has run.

Conclusion: Don’t listen to all the haters. Do more A/B testing.

Recurrent Neural Networks for Collaborative Filtering

I’ve been spending quite some time lately playing around with RNN’s for collaborative filtering. RNN’s are models that predict a sequence of something. The beauty is that this something can be anything really – as long as you can design an output gate with a proper loss function, you can model essentially anything.

In the case of collaborative filtering, we will predict the next item given the previous items. More specifically, we will predict the next artist, album, or track, given the history of streams. Without loss of generality, let’s assume we want to predict tracks only.

Note that we’re not trying to predict ratings or any explicit information – just what track the user chose to play.

The data

We use playlist data or session data, because it has an inherent sequence to it. Removing consecutive duplicates improves performance a lot, since otherwise the network just learns to predict the same item as it just predicted.

In our case we use a few billion playlist/sessions and in total about ten to a hundred billion “words”.

The model

Recurrent neural networks have a simple model that tries to predict the next item given all previous ones. After predicting the item, the network gets to “know” what item it was, and incorporates this.

More formally, let’s assume we have time steps 0 \ldots t-1. The model has a “hidden” internal state h_0 \ldots h_{t-1}. These are generally vectors of some dimension k. At every time step, two things happen:

  • Predict the output given the hidden state. We need to model a P(y_i | h_i) for this.
  • Observe the output y_i and feed it back into the next hidden state h_{i+1}. In the most general form, h_{i+1} = f(a(h_i) + b(y_i)). In practice, f is generally some nonlinear function like sigmoid or tanh, whereas a and b are usually simple linear transforms. It depends a bit on the structure of the output.

Now, all we need to do is write down the total likelihood and optimize for it!

Wait a minute?

Sorry about the extremely superficial introduction without much detail. Here’s a more specific example:

Let’s say we want to predict a sequence of daily stock returns. In that case, y_i is a vector of stock returns – maybe containing three values with the daily return for Apple, Google, and Microsoft. To get from the hidden state h_i to y_i let’s just use a simple matrix multiplication: y_i = W h_i

We can assume P(y_i | h_i) is a normal distribution centered at W h_i, because then the log-likelihood is just the (negative) L2 loss: -(y_i - W h_i)^2

We can specify that h_{i+1} = \tanh(Uy_i + Vh_i) and that h_0 = 0 (remember, it’s still a vector). If we want to be fancier we could add bias terms and stuff, but let’s ignore that for the purpose of this example. Our model is now completely specified, and the unknown parameters are U, V, and W.

What are we optimizing?

We want to find the U, V, and W that maximize the log-likelihood over all examples:

 \log L = \sum\limits_{\mbox{all examples}} \left( \sum\limits_{i=0}^{t-1} -(y_i - W h_i)^2\right)
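
Here’s a minimal numpy sketch of this toy model – the dimensions and the “returns” are made up purely for illustration:

```python
import numpy as np

k, d = 8, 3                                   # hidden dimension, number of stocks
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (k, d))                # output -> hidden
V = rng.normal(0, 0.1, (k, k))                # hidden -> hidden
W = rng.normal(0, 0.1, (d, k))                # hidden -> output

def forward(ys):
    """Run h_{i+1} = tanh(U y_i + V h_i) with h_0 = 0, accumulating the
    (negative) L2 loss between observed y_i and predicted W h_i."""
    h = np.zeros(k)
    hs, log_lik = [h], 0.0
    for y in ys:
        log_lik += -np.sum((y - W @ h) ** 2)  # log P(y_i | h_i) up to a constant
        h = np.tanh(U @ y + V @ h)            # feed the observed output back in
        hs.append(h)
    return hs, log_lik

ys = rng.normal(0, 0.01, (20, d))             # a fake sequence of daily returns
hs, ll = forward(ys)
```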


Backprop

The way to maximize the log-likelihood is through back-propagation. This is a well-known method, and there are so many resources online that I’ll be a bit superficial about the details.

Anyway, we need to do two passes through each sequence. First, forward propagation: for i=0\ldots t-1, calculate all hidden states h_i. Nothing magic going on here – we’re just applying our update rule.

Now it’s time for backprop. This is essentially just the chain rule taken to the extreme. Remember that the total log-likelihood is the sum of all the individual log-probabilities of observing each output:

L = \sum\limits_{i=0}^{t-1} \log P(y_i | h_i)


We define the derivatives  \delta_i as the partial derivatives of the log-likelihood with respect to the hidden state:

\delta_i = \frac{\partial \log L}{\partial h_i}


Since each hidden state only influences later hidden states, the \delta‘s are  just a function of all future \delta_j, j = i\ldots t-1.

\delta_i = \sum\limits_{j=i}^{t-1}\frac{\partial \log P(y_j | h_j)}{\partial h_i}


We can now rewrite \delta_i as a function of \delta_{i+1} and use some chain rule magic:

  • \frac{\partial \log L}{\partial h_i} = \delta_i = \frac{\partial}{\partial h_i} \log P(y_i | h_i) + \frac{\partial h_{i+1}}{\partial h_{i}} \delta_{i+1}

We can evaluate this backwards from t-1 \ldots 0. Both P(y_i | h_i) and \frac{\partial h_{i+1}}{\partial h_{i}} are specified by our model, so we just need to plug in the expressions.

For the stock price example, we have the unknown parameters U, V, and W, and we can derive the gradients like this:

  • \frac{\partial \log L}{\partial U} = \frac{\partial \log L}{\partial h_{i+1}}\frac{\partial h_{i+1}}{\partial U} = \delta_{i+1} \frac{\partial}{\partial U} \tanh(Uy_i + Vh_i) = \delta_{i+1} \tanh'(Uy_i + Vh_i) y_i^T

This looks intimidating, but it’s really just a lot of chain rule applications and fairly straightforward math. You get similar gradients for V and W. Now that we have the gradients, we can optimize using stochastic gradient descent over lots of examples.
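
Continuing the sketch above, back-propagation through time for this toy model looks roughly like this (parameters passed in explicitly so the function stands on its own):

```python
import numpy as np

def backward(ys, hs, U, V, W):
    """Accumulate gradients of the log-likelihood w.r.t. U, V, W using the
    delta recursion from the text (hs[i] is h_i from the forward pass)."""
    dU, dV, dW = np.zeros_like(U), np.zeros_like(V), np.zeros_like(W)
    delta = np.zeros(V.shape[0])                # dlogL/dh_{i+1}; zero past the last step
    for i in reversed(range(len(ys))):
        y, h = ys[i], hs[i]
        pre = U @ y + V @ h                     # pre-activation of h_{i+1}
        back = (1 - np.tanh(pre) ** 2) * delta  # delta_{i+1} pushed back through the tanh
        dU += np.outer(back, y)
        dV += np.outer(back, h)
        err = 2 * (y - W @ h)                   # d log P(y_i | h_i) / d(W h_i)
        dW += np.outer(err, h)
        # delta_i = d log P(y_i | h_i)/dh_i + (dh_{i+1}/dh_i)^T delta_{i+1}
        delta = W.T @ err + V.T @ back
    return dU, dV, dW
```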

How does this relate to Hidden Markov Models?

The nice thing about RNN’s is that the relation between h_{t+1} and h_t is exact rather than being some probabilistic relationship. This means that the h‘s are not parameters in themselves, so we don’t have to solve for them at all. This is usually the slow part of HMM’s since figuring out the hidden values takes some slow iterative process like the Baum-Welch algorithm. For RNN’s, we only need two passes through each sequence rather than iterating lots of times until the hidden states converge.

The other thing is that RNN’s need some kind of nonlinearity, or else the magnitude of the hidden states will explode. This nonlinearity is usually taken to be sigmoid or tanh. I guess in theory HMM’s could also use nonlinearities, but I’ve never heard of this.

Predicting other things

Let’s focus on the collaborative filtering example. Given a sequence of watched movies, or tracks that the user has listened to, predict what the next one is going to be. Now y_i is not a scalar, but one out of many items. We need some kind of distribution P(y_i | h_i). The one I’ve seen being used is the softmax distribution over all possible outputs. This means we have to learn a vector a_j for each item. The probability P(y_i = j | h_i) is now proportional to \exp(h_i^Ta_j):

P(y_i = j | h_i) = \frac{\exp(h_i^Ta_j)}{\sum_k \exp(h_i^Ta_k)}


Notice that the summation part in the denominator is over all items – something that is pretty slow to compute. I’ll get back to that.

We also need something linking the output back to the next hidden state. In fact, we will learn another set of vectors b_j – one for each item. With slight abuse of notation, here is how it looks:

Since we are learning a bunch of vectors, we don’t need the matrices U and V. Our model now becomes:

h_{i+1} = \tanh(W h_i + b_i)


with some slight abuse of notation since the b‘s are shared for each item y_i. Again, we want to maximize the total log-likelihood

\sum\limits_{\mbox{all examples}} \sum\limits_{i=0}^{t-1} \log P(y_i | h_i)


We now end up with a ton of parameters, because we have an unknown a_j and b_j for each item.

Let’s just pause here and reflect a bit: so far this is essentially a model that works for any sequence of items. There’s been some research on how to use this for natural language processing. In particular, check out Tomas Mikolov’s work on RNN’s (this is the same guy who invented word2vec, so it’s pretty cool).

The gnarly details

If you have 10 different items, you can evaluate P(y_i | h_i) easily, but not if you have five million items. But do not despair! There are a lot of ways you can attack this:

  • Take the right output and sample some random items from the entire “vocabulary”. Train the model to classify which one is the right output. This is sort of like a police lineup: one is the right suspect, the rest are just a random sample.
  • Hierarchical softmax: Put all items in a binary tree, and break it up into roughly \log m binary classification problems (where m is the size of the vocabulary).

Instead of messing around with Huffman trees and the other things recommended in the literature, I ended up implementing a much simpler version of hierarchical softmax. Internally, all items are described as integers, so I build the tree implicitly. The root node is a binary classifier for the last bit of the item, the next level classifies the second-to-last bit, and so on.

The idea is that you can calculate P(y_i | h_i) as the product of all the probabilities for each individual node on the path from the root to the leaf.

Instead of learning an a_j for every item, we just have to learn one for every node in the tree, in total 2^{\left \lceil{\log_2 m}\right \rceil -1} vectors. It doesn’t really matter that much how we build the tree – we don’t need to enforce that similar items are close to each other in the tree (although that would probably improve performance a bit).
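
Here’s roughly what that implicit bit-tree looks like in code – a sketch of the idea, not the actual implementation, where node_vectors is just an assumed (n_nodes × k) array of per-node classifier weights:

```python
import numpy as np

def hierarchical_logprob(h, item, node_vectors, n_bits):
    """log P(item | h) as a product of binary decisions along the implicit tree:
    the root classifies the item's last bit, its children the second-to-last, etc."""
    logp, node = 0.0, 0
    for bit_pos in range(n_bits):               # n_bits = ceil(log2(vocabulary size))
        bit = (item >> bit_pos) & 1
        p_one = 1.0 / (1.0 + np.exp(-node_vectors[node] @ h))
        logp += np.log(p_one if bit == 1 else 1.0 - p_one)
        node = 2 * node + 1 + bit               # descend to the child for this bit
    return logp
```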

But in general, this problem pops up in a lot of places, and here are some other crazy ideas I’ve thought about:

  • If generating random “negative” samples, one way to mine better negative examples would be to sample some items from the y_i‘s themselves. Since those values are probably pretty similar, that would force the model to discriminate between “harder” cases.
  • Assume the a_j‘s have some kind of distribution like a Normal distribution, and calculate an expectation. This is just a bunch of integrals. We’ve had some success using this method in other cases.
  • Don’t use softmax, but instead use an L2 loss on m binary classification problems. All entries would be 0 except the right one, which is 1. You can put more weight on the right one to address the class imbalance. The cool thing is that with L2 loss everything becomes linear, and you can compute the sum over all items in constant time. This is essentially how Yehuda Koren’s 2008 paper on implicit collaborative filtering works.

Implementing it

I ended up building everything in C++ because it’s fast and I’m pretty comfortable with it. It reads about 10k words per second on a single thread. A multi-threaded version I’ve built can handle 10x that amount and we can parse 10B “words” in about a day.

Hyperparameters

A beautiful thing about this model is that there are basically no hyperparameters. A larger number of factors is better – we typically use 40-200. With dropout (see below), overfitting is not a concern. It takes a little trial and error to get the step sizes right, though.

Initialization

As with most latent factor models, you need to initialize your parameters to random noise. Typically small Gaussian noise like \mathcal{N}(0, 0.1^2) works well.

Nonlinearity

I tried both sigmoid and tanh. Tanh makes more sense to me, because it’s symmetric around 0, so you don’t have to think too much about the bias term. Looking at some offline benchmarks, it seemed like tanh was slightly better than sigmoid.

Dropout

I also added dropout to the hidden values, since it seemed to improve the predictions of the model. After each h_i is calculated, I set half of its elements to zero. This also seems to help with exploding gradients. What happens is that the W matrix essentially learns how to recombine the features.

Gradient clipping

Gradient clipping actually wasn’t needed for sigmoid, but for tanh I had to add it. Basically, during backprop I cap the magnitude of the gradient at 1. This also helps equalize the impact of different examples, since otherwise longer sequences get bigger gradients.

Adagrad

I use Adagrad on the item vectors, but a simple learning rate on the shared parameters W and the bias terms. Adagrad is fairly simple: in addition to each vector a_j and b_j, you just store a single scalar with the sum of the squared magnitudes of the gradients. You then use that to normalize the gradient.

For a vector x with gradient d, Adagrad can be written as:

x^{(n+1)} = x^{(n)} + \eta \frac{d^{(n)}}{\sqrt{\sum\limits_{i=1\ldots n} \left( d^{(i)} \right)^2}}


\eta is a hyperparameter that should be set to about half the final magnitude these vectors will have. I have usually had success just setting \eta = 1.
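
In code, one Adagrad step for a single item vector is just a few lines (gradient ascent here, since we’re maximizing the log-likelihood):

```python
import numpy as np

def adagrad_step(x, grad, grad_sq_sum, eta=1.0):
    """Update vector x in place; grad_sq_sum is the running scalar sum of
    squared gradient magnitudes stored alongside the vector."""
    grad_sq_sum += float(np.sum(grad ** 2))
    x += eta * grad / np.sqrt(grad_sq_sum)
    return grad_sq_sum
```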

Results

Offline results show that the RNN is one of the best performing algorithms for collaborative filtering and A/B tests confirm this.

More resources

I can recommend A tutorial on training recurrent neural networks as another starting point to read more about recurrent neural networks.

(Edit: fixed some minor errors in the math and a wrong reference to Baum-Welch)

Where do locals go in NYC?

One obvious thing to anyone living in NYC is how tourists cluster in certain areas. I was curious about the larger patterns around this, so I spent some time looking at data. The thing I wanted to understand is: what areas are dominated by tourists? Or conversely, what areas are dominated by locals?

After some looking around, I found this Foursquare data dump and analyzed about 200,000 check-ins in NYC. Time to crunch some data…

First of all, I split up the check-ins into those done by (a) people living within 10,000 feet of the check-in, and (b) people living further away. As the next step, I broke up the check-ins by the 2,166 census areas of New York and calculated the ratio of locals. I color-coded each census area from green (100% locals) to purple (0% locals). Here is the result for the five boroughs:

Some obvious ones stand out, like the airports: JFK and LaGuardia are completely dominated by “non-locals”. Interestingly, Red Hook in Brooklyn is also dominated by non-locals (maybe because of IKEA?), as well as Prospect Park and some other areas in downtown Brooklyn. In the Bronx, the area surrounding Yankee Stadium is also nearly 100% non-locals.

Maybe not surprisingly, there seems to be some truth to the joke that real New Yorkers do not go above 14th Street. Zooming into lower Manhattan, you can clearly see a sharp dividing line cutting across Manhattan:

Another thing that stands out is how North Williamsburg is completely dominated by non-locals, maybe because of the huge influx of people on weekends going out partying.

I then split up into (a) people who live in the five boroughs (b) tourists, and color-coded into blue (0% tourists) and red (100% tourists):

Airports are slightly biased towards tourists. Interestingly Red Hook now becomes a New York affair.

Zooming in on Lower Manhattan again:

Another thing that stands out is how North Williamsburg is dominated by New Yorkers, meaning that even though most people are not local, they are still from the city.

In contrast, most of Lower Manhattan (East Village, West Village, etc.) isn’t just dominated by locals – it also has a very low ratio of tourists to New Yorkers.

The areas most dominated by tourists are, maybe not surprisingly:

  • Central Park
  • Times Square and parts of Midtown
  • Financial district
  • Liberty Island and Ellis Island

I did this in Python, mostly using geopandas, Matplotlib, and a bunch of open source data sets. It was a fun weekend project, though it ended up taking way too much time. And since I live on the Lower East Side myself, I’m probably pretty biased…
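
If you want to try something similar, here’s a rough sketch of the approach – all the file names and column names are made up, and the real input was the Foursquare dump mentioned above:

```python
import geopandas as gpd
import pandas as pd

# Hypothetical inputs: one row per check-in with venue and home coordinates,
# plus NYC census tract polygons.
checkins = pd.read_csv("checkins.csv")  # venue_lat, venue_lon, home_lat, home_lon
tracts = gpd.read_file("nyc_census_tracts.geojson")

venues = gpd.GeoDataFrame(
    checkins,
    geometry=gpd.points_from_xy(checkins.venue_lon, checkins.venue_lat),
    crs="EPSG:4326",
)
homes = gpd.GeoSeries(
    gpd.points_from_xy(checkins.home_lon, checkins.home_lat), crs="EPSG:4326"
)

# Distance in feet between home and check-in (EPSG:2263 is the NY state plane, in feet)
dist_ft = venues.to_crs(epsg=2263).distance(homes.to_crs(epsg=2263))
venues["is_local"] = dist_ft < 10_000

# Assign each check-in to a census tract and compute the ratio of locals per tract
joined = gpd.sjoin(venues, tracts, how="inner", predicate="within")
tracts["locals_ratio"] = joined.groupby("index_right")["is_local"].mean()
tracts.plot(column="locals_ratio", cmap="PRGn", legend=True)  # green-to-purple map
```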

How to build up a data team (everything I ever learned about recruiting)

During my time at Spotify, I’ve reviewed thousands of resumes and interviewed hundreds of people. Lots of them were rejected but lots of them also got offers. Finally, I’ve also had my share of offers rejected by the candidate.

Recruiting is one of those things where the Dunning-Kruger effect is the most pronounced: the more you do it, the more you realize how bad you are at it. Every time I look back a year, I realize 10 things I did wrong. Extrapolating this, I know in another year I’ll realize all the stupid mistakes I’m doing right now. Anyway, that being said, here are some things I learned from recruiting.

Getting the word out

Depending on where you work, people might have no clue about your company. Why would they work for something they have never heard of? Or alternatively, something they know of but don’t necessarily associate with cutting-edge tech? There are a million companies out there doing cool stuff, so make sure people know how your company stands out. Blog, talk at meetups, open source stuff, go to conferences. I honestly don’t know what works – I don’t have any numbers. But you need to hedge your bets by attacking from all angles at the same time.

I think developers have a hard time justifying this just because success is not easily quantifiable – this is a branding exercise, and it’s super hard to find out if you’re doing the right thing. But over time, if you do this right, you will get anecdotal feedback from candidates coming in saying they saw your presentation or read this cool story on Hacker News, or whatnot.

Finding the people

I don’t think there’s anything magic about this – just go through external recruiters, internal recruiters, job postings, connections, whatever.

Presenting the opportunity

I think most people in the industry are fed up with bad bulk messages over email/LinkedIn. Ideally, the hiring manager should introduce themselves, or for more senior roles, have more senior people reach out (all the way up to the CTO). If a recruiter is reaching out, it’s super important to make sure the recruiter includes a quick note on what’s interesting about the team and why it’s a good fit.

Finding the right candidates

Recruiting is some crazy type of active learning problem with an observation bias where you only see how well the people you hire are doing. In particular, there was a lot of discussion a while back when Google claimed there was no correlation between job performance and test scores or GPA. I think there totally are really strong correlations on a macro scale, but if you are already filtering out people based on those criteria, you will obviously reduce the strength of the correlation, or even reverse it. Not that I claim to have found any magic criteria. I do, however, think the two most successful traits I’ve observed are (at the risk of sounding cheesy):

  1. Programming fluency (10,000 hour rule or whatever) – you need to be able to visualize large codebases and understand how things fit together. I strongly believe that data engineers need to understand the full stack, from idea, to machine learning algorithm, to code running in production. I’ve seen other companies with a “throw it over the fence” attitude, with one team brainstorming algorithms and another team in another city implementing them. I think that’s flawed, because it breaks the tight learning cycle. In particular, I’m hesitant to hire candidates who are strong on the theoretical side but have little experience writing code. That’s why I really avoid the “data science” label – most people within this group are generally lacking on the core programming side. I don’t think this necessarily means candidates have to have a solid understanding of the CAP theorem and the Linux page cache. The most important thing is that they have written a lot of code, can work with nontrivial code bases, and can write clean, maintainable code. There is nothing magic to this – but a person who has only written Matlab scripts will probably have a harder time adjusting.
  2. Understand the big picture – go from a vision, to a set of tasks, to a bunch of code being written. People have to be able to go from an idea (“analyze this data set and build an algorithm that uses it as a signal for the recommender system”) to code, without needing to be hand-held through every single step. People who need to be given small tasks rather than the underlying problem will never understand why we’re working on things, and will inevitably end up doing the wrong thing.

Attracting the right candidates

Even if you find awesome candidates, your interview process, or your lack of selling, might destroy your prospects of hiring the right people. Here are some things I’ve learned:
  • Smart people are actually really impressed by a good interview process. If some smart ML genius walks the candidate through a super hard problem, the candidate will feel like they can learn from that interviewer. Conversely, giving interview problems that are not well structured, without any follow-up questions or discussion notes, will give a really bad impression. The best type of interview problem has a (b), (c), and (d) version that you can pull out in case the candidate nails (a) directly.
  • Understand the level of selling you need to do. If you get a super senior person in, spend the bare minimum to establish that this person actually lives up to their reputation. Then spend the rest of the time explaining the role and why you are so excited about it.
  • Explaining the role and why it’s a good fit is everything. But obviously not superficial stuff – these are engineers, and you need to impress them with the things they will learn. Talk about what your company is working on, what challenges you have, and what technologies you are using.
  • It’s often much easier to start by listening to the candidate. Let them ask questions and talk more than you. Try to understand what they are looking for. Then, explain how this position meets or does not meet those criteria.
  • Everyone has an angle. Someone coming from finance is probably more excited to hear about your product vision. Someone from a small failed startup is probably more excited to hear about your benefits.
  • Have the candidate meet the team they will be working with. So many companies fail at this. I remember going through a series of ten Google interviews many years ago, and then the recruiter wouldn’t tell me what team I would work on. I immediately turned down the offer. On the other side of the spectrum, I think the best thing you can do is to bring the candidate out to a team lunch to meet as many people as possible.
  • Make them understand you take this role seriously. Set up a quick chat with the CTO or some senior people.

Hiring the right candidate

  • If it’s a senior person, and everyone is super impressed, hire the person. If it’s a senior person, and people are on the fence, you shouldn’t hire the person.
  • If it’s a junior person, it’s sometimes hard to know. One thing I’ve learned is that excitement really counts. If a candidate kicks ass during the ML interview and clearly has solid programming fluency, but doesn’t necessarily know much about scalability or version control, then I really think it boils down to how excited this person is.
  • At the end of the day, if you realize this candidate is brilliant, but not necessarily the right fit for your role, find something else. It’s a sign of a great organization that you can always find a place for smart people.

Building the right team

  • Candidates will be on a spectrum between “theoretical” (more ML stuff) and “applied” (more scalability, backend, data processing). Just for the purpose of the argument, let’s assume everyone is at one end of the spectrum. For a high-performing machine learning team, I think a ratio of 1:2 is good. But everyone should be able to work on every part of the system if needed.
  • If you’re a small company, you will end up hiring lots of people who are similar to yourself. I don’t necessarily mean people of the same gender or origin, but people who will all go out to the same bar and get the same beer. This might be good for cohesion early on, but as soon as you grow beyond the early stage, you might end up with a destructive monoculture that is super hard to break out of.
  • On the same note: if you’re a big company and your team still skews a certain way, then it means you have prejudice problems you need to work on. Seriously. For instance, with the share of female software engineers being 20.2%, you can derive that 95% of all 50-person teams should contain at least five women, and 70% of all teams should contain at least eight women. It’s just plain math (see the quick check below).
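
A quick way to do this kind of back-of-the-envelope calculation is a simple binomial model, which just assumes you hire at the industry base rate:

```python
from scipy.stats import binom

p_women, team_size = 0.202, 50
for k in (5, 8):
    # P(a randomly assembled 50-person team has at least k women)
    print(k, 1 - binom.cdf(k - 1, team_size, p_women))
```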

For more reading on the subject, I just came across this interesting blog post by Cheng-Tao Chu: Why building a data science team is deceptively hard.

The power of ensembles

From my presentation at MLConf, one of the points I think is worth stressing again is how extremely well combining different algorithms works.

In this case, we’re training machine learning algorithms on different data sets (playlists, play counts, sessions) and different objectives (least squares, max likelihood). Then we combine all the models using gradient boosted decision trees, trained on a smaller but higher-quality data set. Finally, we validate on a third data set, in this case looking at recall for a ground truth data set of related artists.
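
As a minimal sketch of the structure (not our actual pipeline – the base model scores and labels here are synthetic stand-ins):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each row is a (user, item) candidate; each column is the score from one base
# model (playlist model, play-count model, session model, ...).
rng = np.random.default_rng(0)
base_scores = rng.normal(size=(10_000, 3))
labels = (base_scores @ [0.5, 0.3, 0.2] + rng.normal(scale=0.5, size=10_000)) > 0

# The ensemble: gradient boosted trees trained on a smaller, higher-quality set
ensemble = GradientBoostingClassifier(n_estimators=100, max_depth=3)
ensemble.fit(base_scores[:2_000], labels[:2_000])
blended = ensemble.predict_proba(base_scores)[:, 1]   # final combined score
```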

Ensembles were common knowledge throughout the Netflix Prize, and the winner ended up having many hundreds of models in the ensemble. But measured in pure RMSE, the results weren’t mind-blowing – the best single models weren’t far from the best ensemble.

I think in a real-world setting, chances are you have lots of models working on different data sets, optimizing for different things. The bias of each model will be much larger, so the gains from combining all the models are even larger.

Another difference in our case is that we basically don’t know what to optimize for. There’s no single golden metric such as RMSE. Instead, we train things on different data sets – e.g. individual models may be trained on playlist data, and the ensemble on thumbs. We then have a bunch of offline metrics, as well as A/B test metrics, that we use to determine success.

Some people have suggested that these large ensembles are not practical and have tried to find a single good algorithm. I think this is a bad idea. Ensembles scale really well because you can parallelize all the work, and with a good framework it’s easy to add or remove models. Data changes all the time, and a good ensemble with a bunch of models makes recommendations robust and less sensitive to shifts in the underlying data.

Our ensemble looks quite different from (and much bigger than) the picture above. We are thinking a lot about adding diversified signals from various machine learning algorithms applied to different data sets. These signals can be anything from collaborative filtering to content-based or editorial data, or even dummy things. Pretty much anything that can be converted into a numerical signal can be used.

MLConf 2014

Just spent a day at MLConf where I was talking about how we do music recommendations. There was a whole range of great speakers (actually almost 2/3 women which was pretty cool in itself).

Here are my slides:

Justin Basilico from Netflix talked about how they deliver a personalized start page using lots of ranking.

Claudia Perlich also had a great presentation about ad targeting. A few things were really interesting: (a) using transfer learning to learn from website visits and apply it to ad targeting, (b) causality tests to figure out if a campaign actually made an impact, and (c) how they intentionally introduce bias in their classifiers by skewing the subsampling. I think she also mentioned throwing in artificial negative data, which is something we do too.

Josh Wills had a hilarious slide about “ML as a tool” companies. His analogy was waiting two hours in line for a roller coaster ride, and then someone jumps in front of you, rides the roller coaster, and gives you a video of the ride. His point was that ML is the fun 1%, and plumbing is the boring 99%. If you want to start a company, focus on the boring stuff no one else wants to deal with.

Music recommendations using cover images (part 1)

Scrolling through the Discover page on Spotify the other day, it occurred to me that the album cover is in fact a fairly strong visual proxy for what kind of content you can expect. I started wondering if the album cover could in fact be used for recommendations. For many obvious reasons this is a kind of ridiculous idea, but it was still interesting enough that I just had to explore it a bit. So I embarked on a journey to see how far I could get in a few hours.

The first thing I did was to scrape all the album covers. We have a few million of them (I don’t think I could give you the exact number, or I would have to kill you). A full set of 64×64 px images is roughly 10 GB, so not an insane amount.

1024 random cover images in 16×16 px

My first attempt at defining “similarity” was simply to resize all images to 16×16, convert them to grayscale, subtract the mean, and normalize by the variance. Each cover image is then essentially a 256-dimensional vector in Euclidean space. Load those vectors into Annoy and Bob’s your uncle! Nice!
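
Something along these lines, where cover_paths is a placeholder for your list of image files:

```python
import numpy as np
from PIL import Image
from annoy import AnnoyIndex

def cover_vector(path):
    """16x16 grayscale cover, mean-subtracted and variance-normalized -> 256-vector."""
    img = Image.open(path).convert("L").resize((16, 16))
    v = np.asarray(img, dtype=np.float64).flatten()
    return (v - v.mean()) / (v.std() + 1e-9)

index = AnnoyIndex(256, "euclidean")
for i, path in enumerate(cover_paths):        # cover_paths: your list of cover image files
    index.add_item(i, cover_vector(path))
index.build(10)                               # 10 trees
print(index.get_nns_by_item(0, 10))           # the 10 most similar covers to item 0
```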

These recommendations turn out to be pretty horrible, at best.

Classification score on predicting thumbs, measured by Area Under Curve (AUC). Random algorithm = 50%, higher numbers are better.

In the chart above, “covers” is the image-based method. “rnn” is recurrent neural networks, “koren” is this paper and “vector_exp” is a model I’ve described before.

This image similarity algorithm gave some fun results, like finding out that the most similar album to the left one was the right one:

In general, the only type of music I could reliably see this working on was minimal techno. Pretty much all minimal techno albums have a completely identical layout: a white (or light) circle on a black background:

Most similar covers for Kassem Mosse – Workshop 12: http://open.spotify.com/album/0NQvm5y6CLtjtbyJNZltFg

This wasn’t enough to resolve the question, so I kept searching for the answer. I stumbled across pyleargist by Olivier Grisel, and to my delight it was trivial to install. Essentially, pyleargist is a wrapper around a C library that takes any image and generates a GIST image descriptor, which is a vector with 960 elements. Evaluating it using the same metric actually yields some fairly decent results:

Classification score on predicting thumbs, measured by Area Under Curve (AUC). Random algorithm = 50%, higher numbers are better.

The results are already not that horrible (although with a very biased and quite unreliable metric). At least it’s definitely better than pure random.

Most similar album covers to Daft Punk’s Random Access Memories

At this point I decided the next step on this journey of overengineering would be one, or possibly both, of:

  1. Learning a mapping from image space to collaborative filtering space. That way we learn which features in the picture are relevant for the music. This is a similar idea to this paper.
  2. Venture into the realm of deep learning. I went to the Twitters for some assistance and got a ton of responses back.

This is an ongoing project (and an ongoing source of frustrations) so I’ll defer the updates to a second part. Stay tuned!

Edit: I was linking to the wrong paper by Sander Dieleman!