Erik Bernhardsson    About

Annoy 1.10 released, with Hamming distance and Windows support

I’ve been a bit bad at posting things with a regular cadence lately, partly because I’m trying to adjust to having a toddler, partly because the hunt for clicks has caused such a high bar for me that I feel like I have to post something Pulitzer-worthy. But things are always cooking, so let’s break this pattern with a quick notice on something I’ve been working on!

Annoy 1.10.0 is out


Annoy is a library I built at Spotify that helps your search for approximate nearest neighbors in high-dimensional spaces. This is super useful if you use vector models, which Spotify uses a lot. Every track/album/artist/playlist/user ends up being a vector in some high dimensional space (typically 40D, sometimes more). The problem is that searching in that space is a nontrivial art (if I recall correctly, it’s expected but not proven to be to be NP-complete).

Annoy solves this issue by relaxing the search to be approximate. You can usually get 90% or 99% recall with only 1% of the runtime of an exhaustive search. This is great in many applications like recommendations where the cost of a false negative isn’t the end of the world.

Annoy 1.10.0 features mind-altering things like Hamming distance added by Martin Aumüller. Hamming distance is great when your vectors can be represented in binary form (every coordinate is either 0 or 1). This means that vectors can be stored very efficiently as 64-bit integers and distance can be computed using primitives like __builtin_popcountll which I think is a single CPU cycle on modern machines. The tree-building method right now only consider axis aligned splits (effectively making it a k-d tree) but I’m hoping to experiment with a few other heuristics at some point in the future.

The other main thing that 1.10.0 adds is Windows support with a proper CI pipeline, contributed by Timothy Riley. Annoy has had some semi-broken Windows support for a very long time, but several people have reported that it doesn’t work. Since I haven’t had access to any Windows machines, it’s been tricky for me to debug. The Windows build only works on Python 3.6 (but quite frankly: I’m a big proponent of Py3 – and my sympathy for people on Py2 is very limited).

What else?

  • I got an email saying Annoy powers a database for condensed matter physics. See corresponding paper 1 and paper 2. Always fun when things end up in unexpected fields.
  • There’s a long list of companies using Annoy (including Spotify). Instacart is a new entry to that list. They use it to recommend groceries.
  • A paper about approximate nearest neighbor benchmarks was recently accepted at SISAP and the authors were nice enough to include me as a co-author. This relates to a similar open source project I have: ann-benchmarks. There’s a lot going on right now with that project that will be its own blog post in the future, but one thing worth mentioning so far is I’ve built a number of benchmark datasets for approximate nearest neighbors that I encourage you to use if you’re interested!
  • What’s up next for Annoy? There’s a work in progress pull request for threaded index building which should speed things up a lot.
  • I’m speaking about Annoy at the EGG2017 conference in NYC on Nov 30. Feel free to drop by and say hi! It will cover basically a slightly updated version of my serious of blog posts from before: part 1, part 2, and part 3. Expect a number of dad jokes about dimensionality and slides like the one below.


Want to get blog posts over email?

Enter your email address and get weekly emails with new articles!

Erik Bernhardsson

... is the CTO at Better, which is a startup changing how mortgages are done. I write a lot of code, some of which ends up being open sourced, such as Luigi and Annoy. I also co-organize NYC Machine Learning meetup. You can follow me on Twitter or see some more facts about me.