Erik Bernhardsson    About

Annoy 1.10 released, with Hamming distance and Windows support

I’ve been a bit bad at posting things with a regular cadence lately, partly because I’m trying to adjust to having a toddler, partly because the hunt for clicks has caused such a high bar for me that I feel like I have to post something Pulitzer-worthy. But things are always cooking, so let’s break this pattern with a quick notice on something I’ve been working on!

Annoy 1.10.0 is out

pic

Annoy is a library I built at Spotify that helps your search for approximate nearest neighbors in high-dimensional spaces. This is super useful if you use vector models, which Spotify uses a lot. Every track/album/artist/playlist/user ends up being a vector in some high dimensional space (typically 40D, sometimes more). The problem is that searching in that space is a nontrivial art (if I recall correctly, it’s expected but not proven to be to be NP-complete).

Annoy solves this issue by relaxing the search to be approximate. You can usually get 90% or 99% recall with only 1% of the runtime of an exhaustive search. This is great in many applications like recommendations where the cost of a false negative isn’t the end of the world.

Annoy 1.10.0 features mind-altering things like Hamming distance added by Martin Aumüller. Hamming distance is great when your vectors can be represented in binary form (every coordinate is either 0 or 1). This means that vectors can be stored very efficiently as 64-bit integers and distance can be computed using primitives like __builtin_popcountll which I think is a single CPU cycle on modern machines. The tree-building method right now only consider axis aligned splits (effectively making it a k-d tree) but I’m hoping to experiment with a few other heuristics at some point in the future.

The other main thing that 1.10.0 adds is Windows support with a proper CI pipeline, contributed by Timothy Riley. Annoy has had some semi-broken Windows support for a very long time, but several people have reported that it doesn’t work. Since I haven’t had access to any Windows machines, it’s been tricky for me to debug. The Windows build only works on Python 3.6 (but quite frankly: I’m a big proponent of Py3 — and my sympathy for people on Py2 is very limited).

What else?

  • I got an email saying Annoy powers a database for condensed matter physics. See corresponding paper 1 and paper 2. Always fun when things end up in unexpected fields.
  • There’s a long list of companies using Annoy (including Spotify). Instacart is a new entry to that list. They use it to recommend groceries.
  • A paper about approximate nearest neighbor benchmarks was recently accepted at SISAP and the authors were nice enough to include me as a co-author. This relates to a similar open source project I have: ann-benchmarks. There’s a lot going on right now with that project that will be its own blog post in the future, but one thing worth mentioning so far is I’ve built a number of benchmark datasets for approximate nearest neighbors that I encourage you to use if you’re interested!
  • What’s up next for Annoy? There’s a work in progress pull request for threaded index building which should speed things up a lot.
  • I’m speaking about Annoy at the EGG2017 conference in NYC on Nov 30. Feel free to drop by and say hi! It will cover basically a slightly updated version of my serious of blog posts from before: part 1, part 2, and part 3. Expect a number of dad jokes about dimensionality and slides like the one below.

pic

Erik Bernhardsson

... is the CTO at Better, which is a startup changing how mortgages are done. I write a lot of code, some of which ends up being open sourced, such as Luigi and Annoy. I also co-organize NYC Machine Learning meetup. You can follow me on Twitter or see some more facts about me.