Annoy 1.10 released, with Hamming distance and Windows support
I’ve been a bit bad at posting things with a regular cadence lately, partly because I’m trying to adjust to having a toddler, partly because the hunt for clicks has caused such a high bar for me that I feel like I have to post something Pulitzer-worthy. But things are always cooking, so let’s break this pattern with a quick notice on something I’ve been working on!
Annoy 1.10.0 is out
Annoy is a library I built at Spotify that helps your search for approximate nearest neighbors in high-dimensional spaces. This is super useful if you use vector models, which Spotify uses a lot. Every track/album/artist/playlist/user ends up being a vector in some high dimensional space (typically 40D, sometimes more). The problem is that searching in that space is a nontrivial art (if I recall correctly, it’s expected but not proven to be to be NP-complete).
Annoy solves this issue by relaxing the search to be approximate. You can usually get 90% or 99% recall with only 1% of the runtime of an exhaustive search. This is great in many applications like recommendations where the cost of a false negative isn’t the end of the world.
Annoy 1.10.0 features mind-altering things like Hamming distance added by Martin Aumüller. Hamming distance is great when your vectors can be represented in binary form (every coordinate is either 0 or 1). This means that vectors can be stored very efficiently as 64-bit integers and distance can be computed using primitives like __builtin_popcountll which I think is a single CPU cycle on modern machines. The tree-building method right now only consider axis aligned splits (effectively making it a k-d tree) but I’m hoping to experiment with a few other heuristics at some point in the future.
The other main thing that 1.10.0 adds is Windows support with a proper CI pipeline, contributed by Timothy Riley. Annoy has had some semi-broken Windows support for a very long time, but several people have reported that it doesn’t work. Since I haven’t had access to any Windows machines, it’s been tricky for me to debug. The Windows build only works on Python 3.6 (but quite frankly: I’m a big proponent of Py3 — and my sympathy for people on Py2 is very limited).
- I got an email saying Annoy powers a database for condensed matter physics. See corresponding paper 1 and paper 2. Always fun when things end up in unexpected fields.
- There’s a long list of companies using Annoy (including Spotify). Instacart is a new entry to that list. They use it to recommend groceries.
- A paper about approximate nearest neighbor benchmarks was recently accepted at SISAP and the authors were nice enough to include me as a co-author. This relates to a similar open source project I have: ann-benchmarks. There’s a lot going on right now with that project that will be its own blog post in the future, but one thing worth mentioning so far is I’ve built a number of benchmark datasets for approximate nearest neighbors that I encourage you to use if you’re interested!
- What’s up next for Annoy? There’s a work in progress pull request for threaded index building which should speed things up a lot.
- I’m speaking about Annoy at the EGG2017 conference in NYC on Nov 30. Feel free to drop by and say hi! It will cover basically a slightly updated version of my serious of blog posts from before: part 1, part 2, and part 3. Expect a number of dad jokes about dimensionality and slides like the one below.