3D in D3

I have spent some time lately with D3. It’s a lot of fun to build interactive graphs. See for instance this demo (will provide a longer writeup soon).

D3 doesn’t have support for 3D but you can do projections into 2D pretty easily. It’s just old school computer graphics. I ended up adding an animated background to this blog based on an experiment. The math is simple.

First, there’s the rotation. Given a bunch of 3D coordinates, how do you rotate them in 3D space? The cleanest way is to define angles \alpha, \beta, \gamma and use them for rotation in the yz-plane, xz-plane, and xy-plane, respectively. Each of them define a rotation matrix. For the xy-plane, we get the rotation matrix R_{xy} (see code):

R_{xy} = \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ 0 & 0 & 1 \end{pmatrix}

We get three of these matrices in total: R_{yz}, R_{xz}, R_{xy}.

The rotation of any vector \mathbf{v} can now be described as R_{yz} R_{xz} R_{xy}\mathbf{v}. The nice thing is we can precompute the product of these matrices R (see code). Math porn:

R =   \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\alpha) & -\sin(\alpha) \\ 0 & \sin(\alpha) & \cos(\alpha) \end{pmatrix}   \begin{pmatrix} \cos(\beta) & 0 & \sin(\beta) \\ 0 & 1 & 0 \\ -\sin(\beta) & 0 & \cos(\beta) \end{pmatrix}   \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ 0 & 0 & 1 \end{pmatrix}

 

Now going forward we can use the matrix R to rotate any vector \mathbf{v} (see code).

The other thing you want to do is to make distant objects look further away. Thinking through proportionalities you can derive a pretty simple equation (see code). x' = x / (z/d + 1), and same for y. The constant d is just a hack to scale down the z values.

Not sure whether the 3D animation is cool or just annoying, but I’ll prob keep it for a bit – enjoy!

The hardest challenge about becoming a manager

Note: this post is full of pseudo-psychology and highly speculative content. Like most fun stuff!

I became a manager back in 2009. Being a developer is fun. You have this very tangible way to measure yourself. Did I deploy something today? How much code did I write today? Did I solve some really cool machine learning problem on paper?

But as 1:1’s and emails and architecture discussions started filling up my day I often walked home with this gnawing feeling of having accomplished nothing. I saw my team build and deploy some really cool stuff, but I had this sort of guilt as if I was pretty useless.

To feel better about myself, I started coding more. But I noticed when I started coding it was like smoking crack. I couldn’t stop doing it. I would come in at 9am thinking about some fun problem, then get completely sucked into it. I would find myself at 9pm drinking Red Bull deep in some highly optimized C++ latent factor model. That felt great until I realized I had missed a bunch of 1:1’s and had 43 unread emails in my inbox.

tumblr_n44tmpqcn31sv7laoo1_500

I think what happens in your brain is you create all these proxies for accomplishments that take years to retrain. I would be incredibly costly if every action was judged in terms of its benefits amortized over your entire life time. Instead, we humans have an ability to invent pretty arbitrary proxies, such as getting high scores on exams, or in my case, write shitloads of code.

Proxies are great because they let you make decisions much quicker. If I have decided that it’s good for me to write code, then I will start doing it, and eventually feel great about it. After a few years of doing something very consciously (programming is good because I can build this cool game and show it to my friends) you build up this great rewards system (kind of like Pavlov’s dogs?) that makes you feel good about it in itself (programming is cool because I feel good when I do it).

The problem is when your ultimate goal changes and your old proxies are still in effect. You rational side might tell you: hey, look at you, you’re team is really happy, they are learning new stuff, and delivering lots of stuff. But still, you have this really weird feeling that you are not getting anything useful done.

This took me literally years to retrain. I remember at some point I saw someone in my team that did something unexpectedly impressive and I got really excited. I got excited because I realized this person had grown tremendously since joining, and presumably some small fraction of it was due to me. With enough exposure finally this starts to be the new proxy for delivering value. Something the irrational side immediately detects and makes you feel a sense of accomplishment about.

Anyway… once an addict, always an addict. I still have relapses and fall back into programming sometimes. In general I’ve noticed it’s extremely hard to balance and try to do 50-50 development and management. Basically one of the sides take over until it’s 90-10. Either you start coding and you fall back to your old crack habits. Or you manage to break out of it, and you go into this opposite mode where you just don’t have the energy to go into the zone.

I don’t know what my conclusion is. Programmers are the most irrational people I know and I think they are really driven by more of a irrational, System 1 kind of thinking, the pattern matching brain. That’s why I think so many super smart people get stuck in habits (I really want to just solve super cool graph algorithm problems!). The most powerful way to make progress is to put your rational System 2 on and constantly remind the other side what’s really making impact and what’s really long term beneficial. It takes a few year, but with enough persistence, your rational self can really train the primitive pattern matching brain to feel a sense of accomplishment in doing almost anything.

Sorry for offending anyone with my drug references!

The lane next to you is more likely to be slower than yours

Saw this link on Hacker News the other day: The Highway Lane Next to Yours Isn’t Really Moving Any Faster

The article describes a phenomenon unique to traffic where cars spread out when they go fast and get more compact when they go slow. That’s supposedly the explanation.

There’s a much simpler explanation that works for any queue. Let’s consider a supermarket checkout with two lines. One of them has a slow worker and will take 10 minutes. The other one has a fast worker and will take 5 minutes. You don’t know which one is which so you pick one at random.

With p=1/2 you will pick the slow one, of course. But let’s say you go to this supermarket every day for a year. Here’s the interesting thing: on average you will spend 2/3 time in the slow queue. So if you sample any point in time where you are standing in line uniformly, with p=2/3 the other line will be faster.

 

 

Better precision and faster index building in Annoy

Sometimes you have these awesome insights. A few days ago I got an idea for how to improve index building in Annoy.

For anyone who isn’t acquainted with Annoy – it’s a C++ library with Python bindings that provides fast high-dimensional nearest neighbor search.

Annoy recursively builds up a tree given a set of points. The algorithm so far was: at every level, pick a random hyperplane out of all possible hyperplanes that intersect the convex hull given by the point set. The hyperplane defines a way to split the set of points into two subsets. Recursively apply the same algorithm on each subset until there’s only a small set of points left.

A much smarter way is this: sample two points from the set of points, compute the hyperplane equidistant to those points, and use this hyperplane to split the point set.

(I just described what happens for Euclidean distance. Angular is almost the same, just slightly simpler).

Implementing this turns makes index building 4x faster for Euclidean distance. But more importantly, the search quality is substantially better, both for angular and Euclidean distance. The difference is particularly large for high dimensional spaces.

I put together a test that measures precision for nearest neighbor search on the GloVe pretrained vectors using some hardcoded values for various parameters (10 trees, 10 nearest neighbors). See below:

annoy 64 angular

annoy 64 euclidean

 

This is pretty cool given that the commit is actually more red than green – the new algorithm is a lot simpler and I could remove a lot of old stuff that was no longer needed.

The intuitive reason why this works so well is: consider what happens if you have 200 dimensions, but your data is really “mostly” located on a lower dimensional space of say 20 dimensions. Then Annoy will find splits that are more aligned with the distribution of your data. I suspect these cases are pretty common in high dimensional spaces.

I also fixed another randomness issue that looked pretty severe (although I think in practice it didn’t cause any issues) and added a unit tests that runs the f=25 test shown above in the graphs.

There is a fresh 1.3.1 version out on PyPI and Github – get it while it’s hot!

Annoy – now without Boost dependencies and with Python 3 Support

ann

Annoy is a C++/Python package I built for fast approximate nearest neighbor search in high dimensional spaces. Spotify uses it a lot to find similar items. First, matrix factorization gives a low dimensional representation of each item (artist/album/track/user) so that every item is a k-dimensional vector, where k is typically 40-100. This is then loaded into an Annoy index for a number of things: fast similar items, personal music recommendations, etc.

Annoy stands for Approximate Nearest Neighbors something something and was originally open sourced back in 2013, although it wasn’t entirely well-supported until last year when I fixed a couple of crucial bugs. Subsequently, Dirk Eddelbuettel released RCppAnnoy, an R version of Annoy.

The key feature of Annoy is that it supports file-based indexes that can be mmapped very quickly. This makes it very easy to share indexes across multiple processes, load/save indexes, etc.

I built the original version of Annoy using Boost Python but a bunch of people have complained that it’s pretty hard to install. Additionally, Boost Python doesn’t support Python 3.

Last weekend I decided to fix it. I have something to confess. I’ve been meaning to address the Boost dependency for a long time, but never found the time to do it. Finally I just put up an ad on Odesk and outsourced the whole project. I found a great developer who built it all in a few hours.

It might seem ironic to outsource open source projects since I don’t get payed for it. But I spend time working on open source projects because it gives me things back in many ways – networking, recognition, some sort of fuzzy altruistic feeling of having contributed. I don’t mind spending a few bucks on it, same way as I don’t mind spending time on it.

The results is, Annoy doesn’t depend on Boost, and now has Python 3 support. Grab the new version from Github/PyPI and Please let me know if you run into any issues!

Ping the world

I just pinged a few million random IP addresses from my apartment in NYC. Here’s the result:

nyc

Some notes:

  • What’s going on with Sweden? Too much torrenting?
  • Ireland is likewise super slow, but not Northern Ireland
  • Eastern Ukraine is also super slow, maybe not surprising given current events.
  • Toronto seems screwed too, as well as part of NH and western PA.
  • Russia has fast internet.

The world:

world

Here are some high resolution plots, including one centered over the Eastern hemisphere: world.(png|pdf), east.(png|pdf), west.(png|pdf)

Some more notes on methodology

  • The source code is here.
  • It’s based on a 50 nearest neighbor average with the top 10% outliers removed.
  • Almost all random pings time out, so this is skewed towards the few (<10%) of hosts that actually respond
  • Some gaps of the map are filled out too much from neighbors. Eg. North Korea.
  • When computing nearest neighbors it’s much easier if you convert everything to 3D vectors first. I used Annoy in 3D for nearest neighbors. Annoy is a Python module (written by me) that does fast approximate nearest neighbors using random projections.
  • Basemap is kind of a pain in the ass to install and mess around with, but gives nice plots.
  • I was pinging using subprocess from Python and threads. Really want to give Go a shot on this as a way to learn it.

Black Box Machine Learning in the Cloud

black-cloud-4g9h

There’s a bunch of companies working on machine learning as a service. Some old companies like Google, but now also Amazon and Microsoft.

Then there’s a ton of startups: PredictionIO ($2.7M funding), BigML ($1.6M funding), Clarifai, etc, etc. Here’s a nice map from Bloomberg showing some of the landscape.

As much as I love ML, I’m not super bullish on these companies. I wrote a pretty cynical tweet the other day

Screen Shot 2015-04-14 at 10.42.31 PM

Instead of the negative let’s go through the ways I think a machine learning API can actually be useful (ok full disclosure: I don’t think it’s very many)

Does it solve an immediate business problem?

In ML dreamtown, an engineer realizes one day: “Hey, I have all these feature vectors, and these target values, and a loss function, I just wish someone could approximate a function for me”.

In reality, ML problems are often super messy, and it can be pretty challenging to get from data into a regression/classification problem (or anything else). Model fitting isn’t the issue, getting to model fitting is the hard part. Here is a bunch of real-world scenarios I worked with over the last years:

  1. Ad targeting based on music consumption. This is a very ill-defined problem and we need to figure out what we really want to solve.
  2. Predict churn. We can build this complex model that takes user features and predict whether they are going to churn out. The resulting model is generally not that useful though – it doesn’t give us any actionable insight.
  3. Predict ad clickthrough rate. Yes – we can take historical data and train a binary classifier, but it suffers from a lot of issues (such as observation bias, feedback loops, etc).

Does it focus on a particular niche?

It’s a lot more likely you solve an immediate business need if you focus on a specific niche. Natural language processing for sales people? I don’t know.

Focusing on a particular niche makes it easier to build something that works off the shelf. A general purpose outlier detection is not as useful as a model to detect insurance fraud.

Does it build on proprietary data sets? 

If you have amassed enormous troves of security footage, or aerial photography, or financial data, or whatever, then you can train a model that no one else can train. You can then sell this model for lots of money, because the cost of building up this data set is a huge barrier for anyone else.

Is there secret sauce?

Remember that trying to build something secret you are up against about 10,000 machine learning researchers in academia who spend all their time trying to come up with new methods. It’s true that lots of machine learning has a bit of a gap between academia an industry. But this is just things that are hard and messy. That’s not a defensible asset in the long run.

Convolutional neural networks for instance. It’s still pretty messy to get Theano or Torch working – I know because I spent a lot of time reading papers and documentation to get a simple image classifier working. But the complexity of this is going to go down very quickly. In a year’s time there will be open source libraries with pre-trained models for image classification that are on par with anything you can get through an API (probably better).

Does it solve infrastructural issues?

Scaling applications is still a hard problem. Similarly the use of GPU’s in deep learning creates an artificial barrier for many companies who don’t want to deal with Amazon instances etc – there is some value in abstracting this away from users.

The question is what companies have problems that require large scale machine learning that don’t have problems that require scalability.

Do you have smart people?

I actually think the biggest upside in many of these companies is the possibility of acqui-hire. It’s no secret that machine learning engineers are high in demand. So maybe the best strategy is try to attack some super hard problem, ignore whether it’s actually useful, and hire as many smart people as possible.

 

 

Norvig’s claim that programming competitions correlate negatively with being good on the job

I saw a bunch of tweets over the weekend about Peter Norvig claiming there’s a negative correlation between being good at programming competitions and being good at the job. There were some decent Hacker News comments on it.

Norvig’s statement is obviously not true if we’re drawing samples from the general population – most people can’t code. It doesn’t necessarily even have to do with time allocation as this commenter alluded to:

Being a champion at something requires excruciating narrow focus on something for unusually long time. If you are getting GPA of 4.0 or play Rachmaninoff’s Piano Concerto No 3 or deadlift 400 pounds or in top 1000 chess players – you probably have to work on it for hours a day for years while ignoring everything else (of course unless you are one of those 1 in million polymath).

Here’s the real reason: Google is already selecting for the top 1% programmers using some criteria, leading to selection bias. Even if the two values are positively correlated, you might have a selection criterion that leads to a negative correlation.

But let’s start with the ideal case. Let’s say there’s a slight positive correlation between “being good at programming competitions” and “what really matters”. Let’s assume Google hires perfectly. Let’s assume everyone is on a multivariate Gaussian:

recruiting_perfect

 

For all the people that were hired, I calculate the correlation between “Programming competition skills” and “What really matter”. The correlation for hired people is almost 0.2 and it’s still positive!

However let’s say Google for some reason puts too much weight on programming competitions during the interviews. We now get a negative correlation!

recruiting_bad

Does this mean it’s bad to hire people who are good at programming competition? No, it just means that we probably overweighted it during the hiring process. If we lower the weight a bit we get something a positive correlation again:

recruiting_medium

But in general does it mean we should never look at programming competition skills? Actually the reality is a lot more complicated. Instead of observing what really matters, you observe some crappy proxy for it. And when all metrics is noisy, you should put some nonzero positive weight on any metric that correlate positively with your target. Just not too much!

recruiting_realistic

Sorry for spamming you with scatter plots, but it’s in the name of statistics! My point here is that you can tweak these variables and end up seeing correlations with pretty much any value. So when you have these complex selection biases you need to be super careful about how to interpret the data. It’s a great reminder that studies like Project Oxygen always need to be taken with a bucket of sea salt.

Are there other examples of selection biases leading to spurious correlations? Let me know!

 

Pinterest open sources Pinball

41Pz5ClQ46L._SY300_

Pinterest just open sourced Pinball which seems like an interesting Luigi alternative. There’s two blog posts: Pinball: Building workflow management (from 2014) and Open-sourcing Pinball (from this week). The author has a comment in the comments thread on Hacker News:

Luigi was not available in public, when Pinball starts. So not sure the pros and cons between Pinball and Luigi.

When we build pinball, we aim to build a scalable and flexible workflow manager to satisfy the the following requirements (I just name a few here).

  1. easy system upgrade – when we fix bug or adding new features, there should be no interruption for current running workflow and jobs.
  2. easy add/test workflow – end user can easily add new jobs and workflows into pinball system, without affecting other running jobs and workflows.
  3. extensibility – a workflow manager should be easy to extended. As the company and business grows, there will be a lot new requirements and features needed. And also we love your contributions as well.
  4. flexible workflow scheduling policy, easy failure handling.
  5. We provide rich UI for you to easily manage your workflows – auto retry failed job, – you can retry failed job, can skip some job, can select a subset of jobs of a workflow to run (all from UI) – you can easily access all the running history of your job, and also get the stderr, stdout logs of your jobs – you can also explore the topology of your workflow, and also support easy search.
  6. Pinball is very generic can support different kind platform, you can use different hadoop clusters,e.g., quoble cluster, emr cluster. You can write different kind of jobs, e.g., hadoop streaming, cascading, hive, pig, spark, python …

There are a lot interesting things built in Pinball, and you probably want to have a try!

Sounds pretty similar to Luigi! My initial impression is that

  • The architecture is a bit more advanced than Luigi and has some features that Luigi lacks. From what I can tell, it comes with task storage out of the box (whereas Luigi’s task history DB is still not entirely integrated), distributed execution, and a triggering mechanism. These are all areas where Luigi still needs some love
  • The workflow API seems very convoluted. I don’t really understand how the code works and there’s a lot of boiler plate.

Fun to have something to compare to. Not that I want to rationalize Luigi’s missing features, but in general I would argue that the importance of good API design is underrated compared to good architecture. I still believe the key thing for a workflow manager is to reduce boiler plate and configuration at any point. It’s slightly harder to create an easy to use API than to think hard about architecture and check all the boxes for every feature.

powering-interactive-data-analysis-at-pinterest-by-amazon-redshift-3-638

Hopefully we’ll see more of these in the future. Obviously being Luigi’s author, I think Luigi is an awesome tool. But I think it’s 10% of what it could be, and diversity in this space is great for innovation. There’s a lot of them now: Oozie, Azkaban, Drake, Pinball, etc. Some people apparently use Jenkins for workflow management. A wildcard I encountered the other day is Ketrew. I wish I knew enough OCaml to understand what’s going on!