Calculating cosine similarities using dimensionality reduction

This was posted on the Twitter Engineering blog a few days ago: Dimension Independent Similarity Computation (DISCO)

I just glanced at the paper, and there’s some cool stuff going on from a theoretical perspective. What I’m curious about is why they didn’t decide to use dimensionality reduction to solve such a big problem. The benefit of this approach is that it scales much better (linear in input data size) and produces much better results. The drawback is that it’s much harder to implement.

Read more…

A neat little trick with time decay

Something that pops up pretty frequently is to implement time decay, especially where you have recursive chains of jobs. For instance, say you want to keep track of a popularity score. You calculate today’s output by reading yesterday’s output, discounting it by $$ exp(-lambda Delta T) $$ and then adding some hit count for today. Typically you choose $$ lambda $$ so that $$ exp(-lambda Delta T) = 0.95 $$ for a day or something like that. We do this to generate popularity scores for every track at Spotify.

Read more…

Luigi: complex pipelines of tasks in Python

I’m shamelessly promoting my first major open source project. Luigi is a Python module that helps you build complex pipelines of batch jobs, handle dependency resolution, and create visualizations to help manage multiple workflows. It also comes with Hadoop support built in (because that’s where really where its strength becomes clear).

Read more…