Annoy – now without Boost dependencies and with Python 3 Support


Annoy is a C++/Python package I built for fast approximate nearest neighbor search in high dimensional spaces. Spotify uses it a lot to find similar items. First, matrix factorization gives a low dimensional representation of each item (artist/album/track/user) so that every item is a k-dimensional vector, where k is typically 40-100. This is then loaded into an Annoy index for a number of things: fast similar items, personal music recommendations, etc.

Annoy stands for Approximate Nearest Neighbors something something and was originally open sourced back in 2013, although it wasn’t entirely well-supported until last year when I fixed a couple of crucial bugs. Subsequently, Dirk Eddelbuettel released RcppAnnoy, an R version of Annoy.

The key feature of Annoy is that it supports file-based indexes that can be mmapped very quickly. This makes it very easy to share indexes across multiple processes, load/save indexes, etc.
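To give an idea of what this looks like in practice, here's a minimal sketch of building, saving and loading an index. The constructor signature has changed a bit between versions (newer ones take an explicit metric argument), so treat the details as illustrative rather than canonical:

```python
# Minimal sketch: build an Annoy index, save it to disk, then mmap it back.
# Assumes a recent version where the constructor takes a metric argument.
from annoy import AnnoyIndex
import random

f = 40  # dimensionality of the item vectors (e.g. from matrix factorization)
index = AnnoyIndex(f, 'angular')
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)          # 10 trees; more trees = better recall, bigger index
index.save('items.ann')  # the file on disk is what gets mmapped

# In another process, loading is basically just an mmap, so it's near-instant
# and the pages are shared between processes:
index2 = AnnoyIndex(f, 'angular')
index2.load('items.ann')
print(index2.get_nns_by_item(0, 10))  # 10 approximate nearest neighbors of item 0
```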

I built the original version of Annoy using Boost Python but a bunch of people have complained that it’s pretty hard to install. Additionally, Boost Python doesn’t support Python 3.

Last weekend I decided to fix it. I have something to confess. I’ve been meaning to address the Boost dependency for a long time, but never found the time to do it. Finally I just put up an ad on oDesk and outsourced the whole project. I found a great developer who built it all in a few hours.

It might seem ironic to outsource open source projects since I don’t get paid for it. But I spend time working on open source projects because it gives me things back in many ways – networking, recognition, some sort of fuzzy altruistic feeling of having contributed. I don’t mind spending a few bucks on it, the same way I don’t mind spending time on it.

The result: Annoy no longer depends on Boost and now has Python 3 support. Grab the new version from GitHub/PyPI and please let me know if you run into any issues!

Ping the world

I just pinged a few million random IP addresses from my apartment in NYC. Here’s the result:

[Map: ping latencies seen from NYC]

Some notes:

  • What’s going on with Sweden? Too much torrenting?
  • Ireland is likewise super slow, but not Northern Ireland
  • Eastern Ukraine is also super slow, maybe not surprising given current events.
  • Toronto seems screwed too, as well as part of NH and western PA.
  • Russia has fast internet.

The world:

[Map: ping latencies, world view]

Here are some high resolution plots, including one centered over the Eastern hemisphere: world.(png|pdf), east.(png|pdf), west.(png|pdf)

Some more notes on methodology

  • The source code is here.
  • It’s based on a 50 nearest neighbor average with the top 10% outliers removed.
  • Almost all random pings time out, so this is skewed towards the minority (<10%) of hosts that actually respond.
  • Some gaps in the map are filled in too much from their neighbors, e.g. North Korea.
  • When computing nearest neighbors it’s much easier if you convert everything to 3D vectors first. I used Annoy in 3D for the nearest neighbor lookups (a sketch of the conversion follows after this list). Annoy is a Python module (written by me) that does fast approximate nearest neighbors using random projections.
  • Basemap is kind of a pain in the ass to install and mess around with, but gives nice plots.
  • I was pinging using subprocess from Python and threads. Really want to give Go a shot on this as a way to learn it.
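As promised in the notes above, here's a rough sketch (not the actual source linked earlier) of the lat/lon-to-3D trick and the smoothed latency lookup. The 50-neighbor average and the 10% trimming follow the numbers in the notes; the host data and function names are made up, and it assumes an Annoy version with the Euclidean metric:

```python
# Rough sketch (not the linked source): convert lat/lon to 3D unit vectors so
# that straight-line (chord) distance is monotonic in great-circle distance,
# then use Annoy in 3D to average the latency of nearby hosts.
import math
from annoy import AnnoyIndex

def to_3d(lat_deg, lon_deg):
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return [math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat)]

# hosts: (lat, lon, latency_ms) tuples; made-up example data
hosts = [(59.33, 18.07, 120.0), (40.71, -74.01, 15.0), (53.35, -6.26, 300.0)]

index = AnnoyIndex(3, 'euclidean')
for i, (lat, lon, _) in enumerate(hosts):
    index.add_item(i, to_3d(lat, lon))
index.build(10)

def smoothed_latency(lat, lon, k=50, trim=0.1):
    """Average latency of the k nearest hosts, dropping the slowest 10% as outliers."""
    ids = index.get_nns_by_vector(to_3d(lat, lon), k)
    latencies = sorted(hosts[i][2] for i in ids)
    keep = latencies[:max(1, int(len(latencies) * (1 - trim)))]
    return sum(keep) / len(keep)

print(smoothed_latency(48.85, 2.35))  # e.g. a grid point near Paris
```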

Black Box Machine Learning in the Cloud


There are a bunch of companies working on machine learning as a service. Some are old players like Google, but now Amazon and Microsoft are in the game too.

Then there’s a ton of startups: PredictionIO ($2.7M funding), BigML ($1.6M funding), Clarifai, etc, etc. Here’s a nice map from Bloomberg showing some of the landscape.

As much as I love ML, I’m not super bullish on these companies. I wrote a pretty cynical tweet about it the other day.


Instead of dwelling on the negative, let’s go through the ways I think a machine learning API can actually be useful (OK, full disclosure: I don’t think there are very many).

Does it solve an immediate business problem?

In ML dreamtown, an engineer realizes one day: “Hey, I have all these feature vectors, and these target values, and a loss function, I just wish someone could approximate a function for me”.

In reality, ML problems are often super messy, and it can be pretty challenging to get from data into a regression/classification problem (or anything else). Model fitting isn’t the issue; getting to model fitting is the hard part. Here are a bunch of real-world scenarios I’ve worked on over the last few years:

  1. Ad targeting based on music consumption. This is a very ill-defined problem and we need to figure out what we really want to solve.
  2. Predict churn. We can build a complex model that takes user features and predicts whether they are going to churn. The resulting model is generally not that useful though – it doesn’t give us any actionable insight.
  3. Predict ad clickthrough rate. Yes – we can take historical data and train a binary classifier, but it suffers from a lot of issues (such as observation bias, feedback loops, etc).

Does it focus on a particular niche?

It’s a lot more likely you solve an immediate business need if you focus on a specific niche. Natural language processing for sales people? I don’t know.

Focusing on a particular niche makes it easier to build something that works off the shelf. A general-purpose outlier detector is not as useful as a model that detects insurance fraud.

Does it build on proprietary data sets? 

If you have amassed enormous troves of security footage, or aerial photography, or financial data, or whatever, then you can train a model that no one else can train. You can then sell this model for lots of money, because the cost of building up this data set is a huge barrier for anyone else.

Is there secret sauce?

Remember that if you’re trying to build something secret, you are up against about 10,000 machine learning researchers in academia who spend all their time trying to come up with new methods. It’s true that lots of machine learning has a bit of a gap between academia and industry. But that gap is mostly made up of things that are hard and messy. That’s not a defensible asset in the long run.

Convolutional neural networks for instance. It’s still pretty messy to get Theano or Torch working – I know because I spent a lot of time reading papers and documentation to get a simple image classifier working. But the complexity of this is going to go down very quickly. In a year’s time there will be open source libraries with pre-trained models for image classification that are on par with anything you can get through an API (probably better).

Does it solve infrastructural issues?

Scaling applications is still a hard problem. Similarly, the use of GPUs in deep learning creates an artificial barrier for many companies that don’t want to deal with Amazon instances etc. – there is some value in abstracting this away from users.

The question is which companies have problems that require large-scale machine learning but don’t also have problems that require scalability in general.

Do you have smart people?

I actually think the biggest upside in many of these companies is the possibility of an acqui-hire. It’s no secret that machine learning engineers are in high demand. So maybe the best strategy is to attack some super hard problem, ignore whether it’s actually useful, and hire as many smart people as possible.

 

 

Norvig’s claim that programming competitions correlate negatively with being good on the job

I saw a bunch of tweets over the weekend about Peter Norvig claiming there’s a negative correlation between being good at programming competitions and being good at the job. There were some decent Hacker News comments on it.

Norvig’s statement is obviously not true if we’re drawing samples from the general population – most people can’t code. It doesn’t necessarily even have to do with time allocation as this commenter alluded to:

Being a champion at something requires excruciating narrow focus on something for unusually long time. If you are getting GPA of 4.0 or play Rachmaninoff’s Piano Concerto No 3 or deadlift 400 pounds or in top 1000 chess players – you probably have to work on it for hours a day for years while ignoring everything else (of course unless you are one of those 1 in million polymath).

Here’s the real reason: Google is already selecting for the top 1% of programmers using some criteria, leading to selection bias. Even if the two values are positively correlated, you might have a selection criterion that leads to a negative correlation among the people you hire.

But let’s start with the ideal case. Let’s say there’s a slight positive correlation between “being good at programming competitions” and “what really matters”. Let’s assume Google hires perfectly. Let’s assume everyone is on a multivariate Gaussian:

[Scatter plot: recruiting_perfect]

 

For all the people that were hired, I calculate the correlation between “Programming competition skills” and “What really matters”. The correlation for hired people is almost 0.2, and it’s still positive!

However let’s say Google for some reason puts too much weight on programming competitions during the interviews. We now get a negative correlation!

[Scatter plot: recruiting_bad]

Does this mean it’s bad to hire people who are good at programming competitions? No, it just means that we probably overweighted it during the hiring process. If we lower the weight a bit, we get a positive correlation again:

[Scatter plot: recruiting_medium]

But in general, does it mean we should never look at programming competition skills? Actually, the reality is a lot more complicated. Instead of observing what really matters, you observe some crappy proxy for it. And when all metrics are noisy, you should put some nonzero positive weight on any metric that correlates positively with your target. Just not too much!

[Scatter plot: recruiting_realistic]

Sorry for spamming you with scatter plots, but it’s in the name of statistics! My point here is that you can tweak these variables and end up seeing correlations with pretty much any value. So when you have these complex selection biases you need to be super careful about how to interpret the data. It’s a great reminder that studies like Project Oxygen always need to be taken with a bucket of sea salt.
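If you want to play with the numbers yourself, here's a minimal simulation of the effect. The 0.2 population correlation, the 1% hire rate and the weights are made-up numbers for illustration, and it leaves out the noisy-proxy part:

```python
# Minimal simulation of selection bias in hiring: two positively correlated
# skills, a hiring score that weights them, and the correlation measured only
# among the people who clear the bar. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
pop_corr = 0.2  # slight positive correlation in the general population
cov = [[1.0, pop_corr], [pop_corr, 1.0]]
comp_skill, what_matters = rng.multivariate_normal([0.0, 0.0], cov, size=n).T

def corr_among_hires(weight_on_competitions, hire_fraction=0.01):
    """Correlation between the two skills among the top hire_fraction by score."""
    w = weight_on_competitions
    score = w * comp_skill + (1 - w) * what_matters
    hired = score >= np.quantile(score, 1 - hire_fraction)
    return np.corrcoef(comp_skill[hired], what_matters[hired])[0, 1]

for w in (0.0, 0.3, 0.7, 1.0):
    print(f"weight on competitions = {w:.1f}: "
          f"correlation among hires = {corr_among_hires(w):+.2f}")
```

Depending on the weight, the correlation among hires can come out positive, near zero, or negative, even though the population correlation is +0.2 throughout.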

Are there other examples of selection biases leading to spurious correlations? Let me know!

 

Pinterest open sources Pinball


Pinterest just open sourced Pinball, which seems like an interesting Luigi alternative. There are two blog posts: Pinball: Building workflow management (from 2014) and Open-sourcing Pinball (from this week). The author has a comment in the comments thread on Hacker News:

Luigi was not available in public, when Pinball starts. So not sure the pros and cons between Pinball and Luigi.

When we build pinball, we aim to build a scalable and flexible workflow manager to satisfy the the following requirements (I just name a few here).

  1. easy system upgrade – when we fix bug or adding new features, there should be no interruption for current running workflow and jobs.
  2. easy add/test workflow – end user can easily add new jobs and workflows into pinball system, without affecting other running jobs and workflows.
  3. extensibility – a workflow manager should be easy to extended. As the company and business grows, there will be a lot new requirements and features needed. And also we love your contributions as well.
  4. flexible workflow scheduling policy, easy failure handling.
  5. We provide rich UI for you to easily manage your workflows – auto retry failed job, – you can retry failed job, can skip some job, can select a subset of jobs of a workflow to run (all from UI) – you can easily access all the running history of your job, and also get the stderr, stdout logs of your jobs – you can also explore the topology of your workflow, and also support easy search.
  6. Pinball is very generic can support different kind platform, you can use different hadoop clusters,e.g., quoble cluster, emr cluster. You can write different kind of jobs, e.g., hadoop streaming, cascading, hive, pig, spark, python …

There are a lot interesting things built in Pinball, and you probably want to have a try!

Sounds pretty similar to Luigi! My initial impression is that

  • The architecture is a bit more advanced than Luigi’s and has some features that Luigi lacks. From what I can tell, it comes with task storage out of the box (whereas Luigi’s task history DB is still not entirely integrated), distributed execution, and a triggering mechanism. These are all areas where Luigi still needs some love.
  • The workflow API seems very convoluted. I don’t really understand how the code works and there’s a lot of boilerplate.

Fun to have something to compare to. Not that I want to rationalize Luigi’s missing features, but in general I would argue that the importance of good API design is underrated compared to good architecture. I still believe the key thing for a workflow manager is to reduce boilerplate and configuration at any point. It’s slightly harder to create an easy-to-use API than to think hard about architecture and check all the boxes for every feature.
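To make the boilerplate point concrete, here's roughly what a minimal Luigi task looks like (a hypothetical example, not taken from either codebase): parameters, outputs and the actual work are all declared right on the task class.

```python
# A hypothetical, minimal Luigi task to illustrate what "low boilerplate"
# means to me: parameters, outputs and the work itself all in one place.
import luigi

class WordCount(luigi.Task):
    path = luigi.Parameter()  # input file to count words in

    def output(self):
        return luigi.LocalTarget(self.path + '.count')

    def run(self):
        with open(self.path) as f:
            n = sum(len(line.split()) for line in f)
        with self.output().open('w') as out:
            out.write('%d\n' % n)

if __name__ == '__main__':
    # e.g. python wordcount.py WordCount --path some_file.txt --local-scheduler
    luigi.run()
```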


Hopefully we’ll see more of these in the future. Obviously being Luigi’s author, I think Luigi is an awesome tool. But I think it’s 10% of what it could be, and diversity in this space is great for innovation. There’s a lot of them now: Oozie, Azkaban, Drake, Pinball, etc. Some people apparently use Jenkins for workflow management. A wildcard I encountered the other day is Ketrew. I wish I knew enough OCaml to understand what’s going on!

The relationship between commit size and commit message size


Wow, I guess it was more than a year ago that I tweeted about this relationship. Crazy how time flies by. Anyway, here’s my rationale:

  • When I update one line of code I feel like I have to put in a long explanation about its side effects, why it’s fully backwards compatible, and why it fixes some issue #xyz.
  • When I refactor 500 lines of code, I get too lazy to write anything sensible, so I just put “refactoring FooBarController”. Note: don’t try this at home!

I decided to plot the relationship for Luigi:

[Interactive scatter plot: commit size vs. commit message length]

The plot is clickable! Check it out! It’s an old school image map which is pretty pathetic, since no one has used it since 1997, but it was just so much easier for this task. Hover over any point to see the commit message and click on it to jump to the commit on Github.

As you can see, there’s essentially no relationship between the two values. Not as spectacular as I was hoping for, but still kind of weird/interesting.

Code is here if you’re curious!
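If you just want a rough idea of how to pull the two numbers out of a repo yourself, here's a sketch (not the code linked above; it uses the subject line length as a proxy for message size):

```python
# Rough sketch: correlation between commit size (lines added + deleted) and
# commit message length, pulled straight from `git log`. Run inside a repo.
import subprocess
from statistics import correlation  # Python 3.10+

log = subprocess.run(
    ["git", "log", "--numstat", "--pretty=format:COMMIT\t%s"],
    capture_output=True, text=True, check=True,
).stdout

sizes, msg_lens = [], []
for line in log.splitlines():
    if line.startswith("COMMIT\t"):
        sizes.append(0)
        msg_lens.append(len(line.split("\t", 1)[1]))
    elif line.strip() and sizes:
        parts = line.split("\t", 2)  # numstat lines: added, deleted, path
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            sizes[-1] += int(parts[0]) + int(parts[1])

print("commits:", len(sizes))
print("Pearson correlation:", correlation(sizes, msg_lens))
```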

My favorite management failures

For most people straight out of school, work life is a bit of a culture shock. For me it was an awesome experience, but a lot of the constraints were different and I had to learn to optimize for different things. It wasn’t necessarily the technology that I struggled with. The hardest part was how to manage my own projects and my time, as well as how to grow and make an impact as an engineer. I’ve listed some of my biggest mistakes, which are also mistakes I see other (mostly junior) engineers make.

 

Having the wrong scope

How do you know what’s the right amount of work to spend on a project? I had horrible intuition about this coming out of school. One thing I think is helpful is to think of the relationship between time spent and impact. For a given project, it looks something like this:

[Plot: impact vs. time spent on a project]

It usually ends up being a concave function.

How do you pick a point on this curve? If you only have one task then it’s usually pretty easy because you have some constraint on total time or total impact. In school usually you work on some task until it hits a certain y value (problem is solved) or until it hits a certain x value (time to turn in what you have).

The problem is in real life you actually need to pick not just one point on one curve but a point on each of many curves. Actually, an infinite number of curves. And you need to pick these points subject to the constraint that you get the maximum value per unit of time invested.

[Plot: impact vs. time spent, one curve per task]

 

This is a much harder problem! It means the amount of time we spend on task A is actually determined not just by how hard task A is but by how hard an infinite number of other tasks are.

Let’s get mathematical here: for this concave optimization problem you can show that the marginal impact of each task should be identical. (I really want to write a book some day called The Mathematics of Project Management)
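To spell that out, here's a sketch under the assumption that each task's impact f_i(t_i) is a concave, increasing function of the time t_i spent on it, and that the total time budget T is the only constraint:

```latex
% Allocate a fixed time budget T across tasks to maximize total impact:
\max_{t_1,\dots,t_n} \sum_i f_i(t_i)
  \quad \text{subject to} \quad \sum_i t_i = T
% Lagrangian: L = \sum_i f_i(t_i) - \lambda \left( \sum_i t_i - T \right)
% Since each f_i is concave, the first-order conditions characterize the optimum:
\frac{\partial L}{\partial t_i} = f_i'(t_i) - \lambda = 0
  \quad \Longrightarrow \quad f_i'(t_i) = \lambda \ \text{for every task } i
```

In other words, the optimal allocation equalizes the slopes of the impact curves across tasks.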

[Plot: impact vs. time spent, equal marginal impact across tasks]

This means: recognize when the marginal impact of spending more time on a project starts to get low and you get more marginal impact elsewhere. Or just think: is this already good enough to deliver user value? Then take a break and look at the whole portfolio of possible tasks: ignoring what I’ve done so far, what’s the highest-impact next thing I can do?

Focusing only on the things you are supposed to focus on

This might sound weird. What are you supposed to do at work? Most of the time you should probably do what your team/manager told you to do. But guess what? Your team/manager is not an all-seeing all-knowing oracle. Sometimes you might actually have a better idea of what to do.

Your sole purpose of working somewhere is to deliver value for the company. Completing a task from the backlog is a great proxy for that. But it’s still a proxy and as such has no intrinsic value. In many cases there might be even higher leverage things that no one will tell you to do. For instance, look around you. Is the team struggling with some old framework? Can you help someone get unblocked?

I like people to come in every morning and ask themselves: what is the highest-impact thing I can do for the company today? And then do that. If you think about it, a task backlog is a completely artificial construct, needed only because we don’t have perfect information.

This gets especially important if you are interested in management roles. The higher up you get, the less people are going to tell you what to do.

Silly obligatory visualization:

[Diagram: things to do]

Focusing only on low-leverage activities

There’s only so much leverage you can get as an individual contributor, even if you’re a 10x engineer. Look around you for things with a force multiplier built in. Usually that means applying something to the entire team. Are you using the wrong language or tool for the job? Spend a few days investigating something else, introduce it to the team, and watch the whole team move twice as fast.


I used to work with Neville Li at Spotify who was a genius at finding these opportunities. He also never did what you told him to. Instead, he would spend days reading blogs and trying new technologies. Every few months he would find something that made the whole team twice as productive. Then he would organize a workshop, introduce it to the team, and move on.

Not realizing technology isn’t just a job

This is probably my most cynical note, or optimistic, depending on how you look at it.

The truth is, software engineering isn’t just a normal job. It’s a lifestyle. It’s also a field that keeps changing from year to year. If you want to be successful, you need to stay up to date. If you want to be above average, you need to do things like:

  • Working on side projects
  • Reading tech blogs
  • Following influencers on Twitter
  • Going to meetups
  • Reading papers
  • Etc

Being a software engineer is a fantastic career in many ways. With lots of freedom comes a lot of responsibility. If you want to stay fresh, you need to invest a fair amount of your spare time.

Not drawing diagrams on glass walls

This is a no-brainer. Everyone knows that solid software engineers draw everything on glass walls. And they also write everything flipped horizontally because it’s cooler.


Summary

I love technology. Go write some kick ass code now.

 

Leaving Spotify

February 6 was my last day at Spotify. In total I spent more than six years at Spotify and it was an amazing experience.

I joined Spotify in Stockholm in 2008, mainly because a bunch of friends from programming competitions had joined already. Their goal to change music consumption seemed ridiculous at that point, but six years later I think it’s safe to say they actually succeeded.

Back in the early days, my job was to do almost anything related to data. I think the range of tasks that I was responsible for has now grown into 100+ people at Spotify. My day-to-day tasks were all over the map: Hadoop maintenance, PowerPoint presentations, label reporting, A/B tests, ad delivery optimization, ad delivery forecasts, music recommendations, and much more (for most of that time we were actually three people though, not just me).

It was an amazing learning experience to see a company grow this way. I think a company goes through different challenges at every stage, both technically and organizationally (honestly a lot more of the latter compared to the former).

Pushing the button, launching Spotify to the world (late 2008)

Figuring out the cable situation (2009)

I’ve been craving to go back and go through the same journey again, so I’ve joined a small startup in NYC as the head of engineering. I will share more details soon. Hopefully this time will be an opportunity to apply all those things I learned at Spotify.

Oskar Stål, the CTO of Spotify and a great mentor, would always tell me that I have to decide between machine learning and the “CTO ladder” at some point. I made a conscious decision right now to focus more on management and building teams. I think this might be the topic of some future blog post, but not now.

What’s going to happen to my open source projects such as Luigi and Annoy? Nothing should change, except that I will have a lot less time to spend on them.

Stay tuned for more updates!