## coin2dice

Here’s a problem that I used to give to candidates. I stopped using it seriously a long time ago since I don’t believe in puzzles, but I think it’s kind of fun.

1. Let’s say you have a function that simulates a random coin flip. It returns “H” or “T”. This is the only random generator available. How can write a new function that simulates a random dice roll (1…6)?
2. Is there any method that guarantees that the second function returns in finite time?
3. Let’s say you want to do this $n$ times where $n \to \infty$. What’s the most efficient way to do it? Efficient in terms of using the fewest amount of coin flips.

The first part is old, I think. The second and third part are follow up questions that I came up with.

I’ll give you some time to think about it!

Don’t peak!

Did you figure it out?

Solutions

1. There’s a multitude of ways to do this. The easiest way to do it is probably to flip the coin three times and map the outcomes like this: (HHH, 0), (HHT, 1), (HTH, 2), (HTT, 3), (THH, 4), (THT, 5), (TTH, 6), (TTT, 7). If you end up getting HHH or TTT, you flip again.
2. Impossible! If you flip the coin $n$ times then there’s exactly $2^n$ outcomes. But we can’t partition this space evenly into 6 buckets since $3\nmid 2^n$
3. This one is trickier. Think about it in terms of the amount of information you extract. Every coin flip extracts 1 bit, but every dice roll consumes $\log_2 6\approx 2.585$ bits. This is a lower bound – you need at least that many coin flips per dice roll.

Is there a way to achieve that lower bound? Turns out there is: basically encode a long series of coin flip as a binary representation of a number between 0 and 1, then convert it to base 6. This idea resembles Arithmetic coding. Sample code:

Hope you enjoyed geeking out with a math problem this time!

## Benchmark of Approximate Nearest Neighbor libraries

Annoy is a library written by me that supports fast approximate nearest neighbor queries. Say you have a high (1-1000) dimensional space with points in it, and you want to find the nearest neighbors to some point. Annoy gives you a way to do this very quickly. It could be points on a map, but also word vectors in a latent semantic representation or latent item vectors in collaborative filtering.

I’ve made a few optimizations to Annoy lately and I was curious to see how it stacks up against other libraries out there, so I wrote a benchmark suite: ann-benchmarks. It supports any library with a Python interface and I added a bunch of them. It even has Travis integration!

The results so far for Annoy are pretty great. The only method that consistently beats Annoy is SW-graph from nmslib which is about 2-3x faster at the same precision. But Annoy beats both FLANN and KGraph at high precisions (>95%). At lower precisions (<95%) and cosine distance, Annoy is not quite as fast as FLANN and KGraph.

A surprising result was that Panns, Nearpy, and LSHForest all are very low performance. They are roughly 1,000 times slower than the other ones, and even worse, Panns & LSHForest don’t even produce high precision scores. I created an issue in scikit-learn about LSHForest’s performance. Of course, it might be that I did something wrong in the benchmark.

Annoy was actually never built with performance in mind. The killer feature was always to be able to load/unload indexes quickly using mmap – which no other package supports – but it’s fun to see that it’s actually very competitive on performance too. One of the things that made a difference was that I recently changed the interface for Annoy slightly (don’t worry, it’s backwards compatible). There is now a query-time tradeoff knob which lets you vary how many nodes are inspected.

The other factor not covered by the graph above is how hard it is to install most of these libraries. Annoy should compile and run almost anywhere (Linux, Win, OS X) very easily, but the other libraries can be challenging to install. For instance, both kgraph and nmslib depend on GCC-4.8, so they require custom installations. There are scripts to install all libraries in the repo tested with Ubuntu 12.04 and 14.04. For other platforms – good luck!

I might do a talk in September at the NYC Machine Learning meetup about this, we’ll see! Until then, I found this really great presentation by Stefan Savev (with corresponding web site). He claims his own implementation is a bit faster than Annoy! It’s in Scala so I haven’t tested it yet.

## More Luigi alternatives

The workflow engine battle has intensified with some more interesting entries lately! Here are a couple I encountered in the last few days. I love that at least two of them are direct references to Luigi!

Airflow (Blog Post) (GitHub)

Airflow from Airbnb is probably the most interesting one. I’ve only glanced at it, but here are some very superficial notes

• Seems mostly targeted to run raw UNIX commands using a pretty simple syntax with Jinja templates
• Tasks don’t have support for parameters but it seems like you can build tasks dynamically by just putting them in a function
• It seems to be built around daily jobs, meaning dates and backfill by dates is a foundational concept of the package (whereas in Luigi dates are just one out of many different parameters)
• There is a database of task history which is great
• The visualization seems great
• It also supports farming out jobs to other machines. This is something Luigi definitely needs
• It also comes with built-in support for HDFS, S3, MySQL and Postgres, similar to Luigi
• There’s a built-in triggering mechanism (also something Luigi needs)

Mario (Blog post(GitHub)

Mario seems to be a “Luigi in Scala”. It seems extremely simplistic, so it’s probably more conceptual at this point than meant to be full-fledged.

The choice of Scala is interesting. First of all, I think it’s fair to say that the JVM has taken over the data stack. A lot of Luigi’s core concept are really functional in nature and a language like Scala might be a better choice. A cool thing is that the RPC between Luigi clients and the scheduler is actually just a simple REST interface and Luigi’s scheduler could in fact support clients running other languages. Mario doesn’t do this but it’s something I’ve been meaning to explore for a long time.

Ruigi (GitHub)

Ruigi is “Luigi in R”. It follows the same set of conventions but seems to be pretty simple in that it runs everything locally.

Makeflow (Web site)

Seems to be some academic project mostly for publishing papers. What’s up with not using GitHub in 2015? And also having a “download” section with tarballs!

The benefit of Makeflow seems to be support for a bunch of batch systems commonly used in HPC in academia. The dependencies are specified using their own DSL with some simple Makefile-like notation.

Conclusions

So what’s the state of workflow engines at the moment? Allow me to say something provocative just for the sake of it: they all kind of suck. Including Luigi. There is basically so much trial and error when building a workflow engine and everytime I encounter one I just see a bunch of bad design decisions. Same when I look at Luigi. Luigi was the result of a bunch of many iterations and it avoids roughly 100 pitfalls we encountered in earlier attempts, but there are still some parts of the design that can be addressed.

I don’t mean to bash my own open source project here. Open sourcing Luigi has helped a lot of people building awesome stuff and I think it’s better than anything else out there.

I hope with the explosion of workflow engines lately, we will see a convergence of ideas to a second generation of much better, much more scalable, and much easier ones. My dream is that someone combines every god damned workflow engine in the world and write a new one, preferably in some JVM based language like Scala. I sincerely have no idea how that would look like, but I would love to do that some day as a Luigi 2.0!

## 3D in D3

I have spent some time lately with D3. It’s a lot of fun to build interactive graphs. See for instance this demo (will provide a longer writeup soon).

D3 doesn’t have support for 3D but you can do projections into 2D pretty easily. It’s just old school computer graphics. I ended up adding an animated background to this blog based on an experiment. The math is simple.

First, there’s the rotation. Given a bunch of 3D coordinates, how do you rotate them in 3D space? The cleanest way is to define angles $\alpha, \beta, \gamma$ and use them for rotation in the yz-plane, xz-plane, and xy-plane, respectively. Each of them define a rotation matrix. For the xy-plane, we get the rotation matrix $R_{xy}$ (see code):

$R_{xy} = \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ 0 & 0 & 1 \end{pmatrix}$

We get three of these matrices in total: $R_{yz}, R_{xz}, R_{xy}$.

The rotation of any vector $\mathbf{v}$ can now be described as $R_{yz} R_{xz} R_{xy}\mathbf{v}$. The nice thing is we can precompute the product of these matrices $R$ (see code). Math porn:

$R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\alpha) & -\sin(\alpha) \\ 0 & \sin(\alpha) & \cos(\alpha) \end{pmatrix} \begin{pmatrix} \cos(\beta) & 0 & \sin(\beta) \\ 0 & 1 & 0 \\ -\sin(\beta) & 0 & \cos(\beta) \end{pmatrix} \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ 0 & 0 & 1 \end{pmatrix}$

Now going forward we can use the matrix $R$ to rotate any vector $\mathbf{v}$ (see code).

The other thing you want to do is to make distant objects look further away. Thinking through proportionalities you can derive a pretty simple equation (see code). $x' = x / (z/d + 1)$, and same for $y$. The constant $d$ is just a hack to scale down the $z$ values.

Not sure whether the 3D animation is cool or just annoying, but I’ll prob keep it for a bit – enjoy!

## The hardest challenge about becoming a manager

Note: this post is full of pseudo-psychology and highly speculative content. Like most fun stuff!

I became a manager back in 2009. Being a developer is fun. You have this very tangible way to measure yourself. Did I deploy something today? How much code did I write today? Did I solve some really cool machine learning problem on paper?

But as 1:1’s and emails and architecture discussions started filling up my day I often walked home with this gnawing feeling of having accomplished nothing. I saw my team build and deploy some really cool stuff, but I had this sort of guilt as if I was pretty useless.

To feel better about myself, I started coding more. But I noticed when I started coding it was like smoking crack. I couldn’t stop doing it. I would come in at 9am thinking about some fun problem, then get completely sucked into it. I would find myself at 9pm drinking Red Bull deep in some highly optimized C++ latent factor model. That felt great until I realized I had missed a bunch of 1:1’s and had 43 unread emails in my inbox.

I think what happens in your brain is you create all these proxies for accomplishments that take years to retrain. I would be incredibly costly if every action was judged in terms of its benefits amortized over your entire life time. Instead, we humans have an ability to invent pretty arbitrary proxies, such as getting high scores on exams, or in my case, write shitloads of code.

Proxies are great because they let you make decisions much quicker. If I have decided that it’s good for me to write code, then I will start doing it, and eventually feel great about it. After a few years of doing something very consciously (programming is good because I can build this cool game and show it to my friends) you build up this great rewards system (kind of like Pavlov’s dogs?) that makes you feel good about it in itself (programming is cool because I feel good when I do it).

The problem is when your ultimate goal changes and your old proxies are still in effect. You rational side might tell you: hey, look at you, you’re team is really happy, they are learning new stuff, and delivering lots of stuff. But still, you have this really weird feeling that you are not getting anything useful done.

This took me literally years to retrain. I remember at some point I saw someone in my team that did something unexpectedly impressive and I got really excited. I got excited because I realized this person had grown tremendously since joining, and presumably some small fraction of it was due to me. With enough exposure finally this starts to be the new proxy for delivering value. Something the irrational side immediately detects and makes you feel a sense of accomplishment about.

Anyway… once an addict, always an addict. I still have relapses and fall back into programming sometimes. In general I’ve noticed it’s extremely hard to balance and try to do 50-50 development and management. Basically one of the sides take over until it’s 90-10. Either you start coding and you fall back to your old crack habits. Or you manage to break out of it, and you go into this opposite mode where you just don’t have the energy to go into the zone.

I don’t know what my conclusion is. Programmers are the most irrational people I know and I think they are really driven by more of a irrational, System 1 kind of thinking, the pattern matching brain. That’s why I think so many super smart people get stuck in habits (I really want to just solve super cool graph algorithm problems!). The most powerful way to make progress is to put your rational System 2 on and constantly remind the other side what’s really making impact and what’s really long term beneficial. It takes a few year, but with enough persistence, your rational self can really train the primitive pattern matching brain to feel a sense of accomplishment in doing almost anything.

Sorry for offending anyone with my drug references!

## The lane next to you is more likely to be slower than yours

Saw this link on Hacker News the other day: The Highway Lane Next to Yours Isn’t Really Moving Any Faster

The article describes a phenomenon unique to traffic where cars spread out when they go fast and get more compact when they go slow. That’s supposedly the explanation.

There’s a much simpler explanation that works for any queue. Let’s consider a supermarket checkout with two lines. One of them has a slow worker and will take 10 minutes. The other one has a fast worker and will take 5 minutes. You don’t know which one is which so you pick one at random.

With $p=1/2$ you will pick the slow one, of course. But let’s say you go to this supermarket every day for a year. Here’s the interesting thing: on average you will spend $2/3$ time in the slow queue. So if you sample any point in time where you are standing in line uniformly, with $p=2/3$ the other line will be faster.

## Better precision and faster index building in Annoy

Sometimes you have these awesome insights. A few days ago I got an idea for how to improve index building in Annoy.

For anyone who isn’t acquainted with Annoy – it’s a C++ library with Python bindings that provides fast high-dimensional nearest neighbor search.

Annoy recursively builds up a tree given a set of points. The algorithm so far was: at every level, pick a random hyperplane out of all possible hyperplanes that intersect the convex hull given by the point set. The hyperplane defines a way to split the set of points into two subsets. Recursively apply the same algorithm on each subset until there’s only a small set of points left.

A much smarter way is this: sample two points from the set of points, compute the hyperplane equidistant to those points, and use this hyperplane to split the point set.

(I just described what happens for Euclidean distance. Angular is almost the same, just slightly simpler).

Implementing this turns makes index building 4x faster for Euclidean distance. But more importantly, the search quality is substantially better, both for angular and Euclidean distance. The difference is particularly large for high dimensional spaces.

I put together a test that measures precision for nearest neighbor search on the GloVe pretrained vectors using some hardcoded values for various parameters (10 trees, 10 nearest neighbors). See below:

This is pretty cool given that the commit is actually more red than green – the new algorithm is a lot simpler and I could remove a lot of old stuff that was no longer needed.

The intuitive reason why this works so well is: consider what happens if you have 200 dimensions, but your data is really “mostly” located on a lower dimensional space of say 20 dimensions. Then Annoy will find splits that are more aligned with the distribution of your data. I suspect these cases are pretty common in high dimensional spaces.

I also fixed another randomness issue that looked pretty severe (although I think in practice it didn’t cause any issues) and added a unit tests that runs the f=25 test shown above in the graphs.

There is a fresh 1.3.1 version out on PyPI and Github – get it while it’s hot!

## Annoy – now without Boost dependencies and with Python 3 Support

Annoy is a C++/Python package I built for fast approximate nearest neighbor search in high dimensional spaces. Spotify uses it a lot to find similar items. First, matrix factorization gives a low dimensional representation of each item (artist/album/track/user) so that every item is a k-dimensional vector, where k is typically 40-100. This is then loaded into an Annoy index for a number of things: fast similar items, personal music recommendations, etc.

Annoy stands for Approximate Nearest Neighbors something something and was originally open sourced back in 2013, although it wasn’t entirely well-supported until last year when I fixed a couple of crucial bugs. Subsequently, Dirk Eddelbuettel released RCppAnnoy, an R version of Annoy.

The key feature of Annoy is that it supports file-based indexes that can be mmapped very quickly. This makes it very easy to share indexes across multiple processes, load/save indexes, etc.

I built the original version of Annoy using Boost Python but a bunch of people have complained that it’s pretty hard to install. Additionally, Boost Python doesn’t support Python 3.

Last weekend I decided to fix it. I have something to confess. I’ve been meaning to address the Boost dependency for a long time, but never found the time to do it. Finally I just put up an ad on Odesk and outsourced the whole project. I found a great developer who built it all in a few hours.

It might seem ironic to outsource open source projects since I don’t get payed for it. But I spend time working on open source projects because it gives me things back in many ways – networking, recognition, some sort of fuzzy altruistic feeling of having contributed. I don’t mind spending a few bucks on it, same way as I don’t mind spending time on it.

The results is, Annoy doesn’t depend on Boost, and now has Python 3 support. Grab the new version from Github/PyPI and Please let me know if you run into any issues!

## Ping the world

I just pinged a few million random IP addresses from my apartment in NYC. Here’s the result:

Some notes:

• What’s going on with Sweden? Too much torrenting?
• Ireland is likewise super slow, but not Northern Ireland
• Eastern Ukraine is also super slow, maybe not surprising given current events.
• Toronto seems screwed too, as well as part of NH and western PA.
• Russia has fast internet.

The world:

Here are some high resolution plots, including one centered over the Eastern hemisphere: world.(png|pdf), east.(png|pdf), west.(png|pdf)

Some more notes on methodology

• The source code is here.
• It’s based on a 50 nearest neighbor average with the top 10% outliers removed.
• Almost all random pings time out, so this is skewed towards the few (<10%) of hosts that actually respond
• Some gaps of the map are filled out too much from neighbors. Eg. North Korea.
• When computing nearest neighbors it’s much easier if you convert everything to 3D vectors first. I used Annoy in 3D for nearest neighbors. Annoy is a Python module (written by me) that does fast approximate nearest neighbors using random projections.
• Basemap is kind of a pain in the ass to install and mess around with, but gives nice plots.
• I was pinging using subprocess from Python and threads. Really want to give Go a shot on this as a way to learn it.

## Black Box Machine Learning in the Cloud

There’s a bunch of companies working on machine learning as a service. Some old companies like Google, but now also Amazon and Microsoft.

Then there’s a ton of startups: PredictionIO ($2.7M funding), BigML ($1.6M funding), Clarifai, etc, etc. Here’s a nice map from Bloomberg showing some of the landscape.

As much as I love ML, I’m not super bullish on these companies. I wrote a pretty cynical tweet the other day

Instead of the negative let’s go through the ways I think a machine learning API can actually be useful (ok full disclosure: I don’t think it’s very many)

Does it solve an immediate business problem?

In ML dreamtown, an engineer realizes one day: “Hey, I have all these feature vectors, and these target values, and a loss function, I just wish someone could approximate a function for me”.

In reality, ML problems are often super messy, and it can be pretty challenging to get from data into a regression/classification problem (or anything else). Model fitting isn’t the issue, getting to model fitting is the hard part. Here is a bunch of real-world scenarios I worked with over the last years:

1. Ad targeting based on music consumption. This is a very ill-defined problem and we need to figure out what we really want to solve.
2. Predict churn. We can build this complex model that takes user features and predict whether they are going to churn out. The resulting model is generally not that useful though – it doesn’t give us any actionable insight.
3. Predict ad clickthrough rate. Yes – we can take historical data and train a binary classifier, but it suffers from a lot of issues (such as observation bias, feedback loops, etc).

Does it focus on a particular niche?

It’s a lot more likely you solve an immediate business need if you focus on a specific niche. Natural language processing for sales people? I don’t know.

Focusing on a particular niche makes it easier to build something that works off the shelf. A general purpose outlier detection is not as useful as a model to detect insurance fraud.

Does it build on proprietary data sets?

If you have amassed enormous troves of security footage, or aerial photography, or financial data, or whatever, then you can train a model that no one else can train. You can then sell this model for lots of money, because the cost of building up this data set is a huge barrier for anyone else.

Is there secret sauce?

Remember that trying to build something secret you are up against about 10,000 machine learning researchers in academia who spend all their time trying to come up with new methods. It’s true that lots of machine learning has a bit of a gap between academia an industry. But this is just things that are hard and messy. That’s not a defensible asset in the long run.

Convolutional neural networks for instance. It’s still pretty messy to get Theano or Torch working – I know because I spent a lot of time reading papers and documentation to get a simple image classifier working. But the complexity of this is going to go down very quickly. In a year’s time there will be open source libraries with pre-trained models for image classification that are on par with anything you can get through an API (probably better).

Does it solve infrastructural issues?

Scaling applications is still a hard problem. Similarly the use of GPU’s in deep learning creates an artificial barrier for many companies who don’t want to deal with Amazon instances etc – there is some value in abstracting this away from users.

The question is what companies have problems that require large scale machine learning that don’t have problems that require scalability.

Do you have smart people?

I actually think the biggest upside in many of these companies is the possibility of acqui-hire. It’s no secret that machine learning engineers are high in demand. So maybe the best strategy is try to attack some super hard problem, ignore whether it’s actually useful, and hire as many smart people as possible.