Erik Bernhardsson    About

Calculating cosine similarities using dimensionality reduction

This was posted on the Twitter Engineering blog a few days ago: Dimension Independent Similarity Computation (DISCO)

I just glanced at the paper, and there’s some cool stuff going on from a theoretical perspective. What I’m curious about is why they didn’t decide to use dimensionality reduction to solve such a big problem. The benefit of this approach is that it scales much better (linear in input data size) and produces much better results. The drawback is that it’s much harder to implement.

Dimensionality reduction is a lot messier to implement, but basically it works like this: You take your matrix and factor it into matrices and so that . Denoting each row as (user vectors) and (item vectors), respectively, the idea is that . Furthermore, cosine between user can be approximated by their low-dimensionality counterpart pretty well.

This is great for two reasons. First of all, you generally use only a handful of dimensions so all you have to deal with now is super trivial calculations. Taking the cosine of two users or items is where f is a small number denoting the number of dimensions. You can also calculate user-item score by just taking dot products, also in .

Second of all, forcing everything down onto a few dimensions is a great way to reduce noise. An intuitive way to see this is that in the original matrix, if user A had a lot of items in common with user B and C, but B and C didn’t have any items (or very few) in common, we would draw the conclusion that . Working in a reduced dimensionality we would probably still assign a pretty high value of similarity between B and C.

Now, you just reduced the problem to a much lower dimensionality and you still need to apply hashing techniques to find similar pairs. But if we can bring it down to something like 10 or 50 dimensions this is much easier to implement. One way to do it is to cut through the space using random hyperplanes and hash by that.

Dimensionality reduction (aka matrix factorization) is no easy task in itself. Mahout provides some tools to do this, but they don’t scale super well to the scale that Spotify or Twitter has. Instead, you are probably stuck having to build something from scratch. Be warned my friend, but at least I’d recommend these two papers:

Google News Personalization: Scalable Online Collaborative Filtering – describes how to scale PLSA (a factorization method) to large data sets

Collaborative Filtering for Implicit Feedback Datasets – describes another factorization algorithm that converges very fast in practice

Erik Bernhardsson

... is the CTO at Better, which is a startup changing how mortgages are done. I write a lot of code, some of which ends up being open sourced, such as Luigi and Annoy. I also co-organize NYC Machine Learning meetup. You can follow me on Twitter or see some more facts about me.