Annoy

Annoy is a simple package to find approximate nearest neighbors (ANN) that I just put on GitHub. I'm not trying to compete with existing packages, but Annoy has a couple of features that make it pretty useful. Most importantly, it uses very little memory and can put everything in a contiguous blob that you can mmap from disk. This way multiple processes can share the same index.
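
To make the mmap idea concrete, here is a minimal sketch using only the Python standard library. It is not Annoy's actual file format or API: the flat layout, the tiny `DIM`, and the helper names `write_vectors`, `open_vectors` and `read_vector` are all assumptions made up for illustration. It only shows the general pattern of writing vectors into one contiguous file and reading them back through a read-only memory map.

```python
import mmap
import os
import struct
import tempfile

DIM = 4          # hypothetical dimensionality, kept small for illustration
FLOAT_SIZE = 4   # bytes per 32-bit float


def write_vectors(path, vectors):
    """Write equal-length vectors back to back as little-endian 32-bit floats."""
    with open(path, "wb") as f:
        for vec in vectors:
            assert len(vec) == DIM
            f.write(struct.pack(f"<{DIM}f", *vec))


def open_vectors(path):
    """Map the whole file read-only instead of reading it into private memory."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def read_vector(blob, i):
    """Decode vector i directly from the mapped bytes at its fixed offset."""
    return struct.unpack_from(f"<{DIM}f", blob, i * DIM * FLOAT_SIZE)


path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_vectors(path, [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]])

blob = open_vectors(path)
second = read_vector(blob, 1)
```

Every process that maps the same file read-only this way is served from the same pages in the operating system's page cache, so the data is held in physical memory once rather than once per process.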

We use it at Spotify to put a couple of million tracks in 40-dimensional space and then query for the most similar tracks. Using 4-byte floats, 5M tracks * 40 dimensions * 4 bytes is already 800MB for the raw vectors alone, so sharing memory across multiple processes makes sense.
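
The arithmetic is easy to check. The sketch below counts only the raw vector data, not any index overhead, and the figure of 8 worker processes is a hypothetical number chosen purely to show what private copies would cost.

```python
def raw_vector_bytes(n_items, dim, bytes_per_float=4):
    """Size of the raw vector data only, ignoring any index structure."""
    return n_items * dim * bytes_per_float


one_copy = raw_vector_bytes(5_000_000, 40)

# Hypothetical: 8 worker processes that each load a private copy.
n_workers = 8
private_copies = one_copy * n_workers

print(f"one shared copy:   {one_copy / 1e6:.0f} MB")
print(f"{n_workers} private copies:  {private_copies / 1e6:.0f} MB")
```

With a shared mapping the cost stays at one copy no matter how many processes query the index.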

I hate when people talk about the “multicore revolution”, but it kind of makes sense here. An interesting side effect of it is that memory capacity doesn't seem to grow as fast as the number of cores. Static file-based mmapped indexes are actually really useful for a wide range of features. We also use Tokyo Cabinet a lot at Spotify whenever the data is at least a few hundred megabytes, since at that size it makes sense to share it across processes.