Benchmarking nearest neighbor libraries in Python

Radim Rehurek has put together an excellent summary of approximate nearest neighbor libraries in Python. This is exciting, because one of the libraries he's covering, annoy, was built by me.

After introducing the problem, he goes through the list of contestants and sticks with five remaining ones. Finally, the benchmarks pits annoy against FLANN. Although FLANN seems to have roughly 4x better performance, somewhat surprisingly, Radim concludes annoy is the “winner”. Yay!

image1000 nearest neighbors, performance vs accuracy, stolen from Radim's blog

I'm not surprised that FLANN is a bit faster. Annoy was mainly built having other goals in mind, primarily being able to use a file based memory range and mmap this quickly. I also think the current algorithm (random hyperplanes) works better with at most a hundred dimensions or so. That being said, at some point I think there's a lot of optimizations to do to Annoy some day when the urge to write messy C++ comes back.

Radim does give some fair criticism of the state of these libraries (including annoy). It can be a pain to install any of them, and in annoy's case there seems to be some problems with certain architectures where it basically returns bogus data. Given that this is such a fundamental problem, it's a little depressing how hard it is to use. Hoping this can change soon with more and more competing libraries out there.