Nearest neighbors and vector models – part 2 – algorithms and data structures

This is a blog post rewritten from a presentation at NYC Machine Learning on Sep 17. It covers Annoy, a library I have built that helps you do nearest neighbor queries in high dimensional spaces. In the first part, I went through some examples of why vector models are useful. In this second part, I will explain the data structures and algorithms that Annoy uses to do approximate nearest neighbor queries.

Let’s start by going back to our point set. The goal is to find nearest neighbors in this space. Again, I am showing a 2 dimensional point set because computer screens are 2D, but in reality most vector models have much higher dimensionality.


Our goal is to build a data structure that lets us find the nearest points to any query point in sublinear time.

We are going to build a tree that lets us do queries in \mathcal{O}(\log n). This is how Annoy works. In fact, it’s a binary tree where each node is a random split. Let’s start by splitting the space once:


Annoy does this by picking two points at random and then splitting by the hyperplane equidistant from those two points. In the plot, the two points are connected by the gray line, and the hyperplane is the thick black line.
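
In sketch form, the split rule looks something like this (a toy Python version just for illustration – the real implementation is the C++ in annoylib.h):

```python
import numpy as np

def make_split(points, rng):
    """Pick two random points; split on the hyperplane equidistant from them.
    The normal is p2 - p1 and the plane passes through their midpoint."""
    i, j = rng.choice(len(points), size=2, replace=False)
    p1, p2 = points[i], points[j]
    normal = p2 - p1
    offset = normal.dot((p1 + p2) / 2)
    return normal, offset

def side(normal, offset, x):
    """Which side of the hyperplane is x on?"""
    return normal.dot(x) - offset > 0
```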

Let’s keep splitting each subspace recursively!


A very tiny binary tree is starting to take shape:

We keep splitting again:


… and so on. We keep doing this until there are at most K items left in each node. At that point it looks something like this (for K=10):


With the corresponding binary tree:


Nice! We end up with a binary tree that partitions the space. The nice thing is that points that are close to each other in the space are more likely to be close to each other in the tree. In other words, if two points are close to each other in the space, it’s unlikely that any hyperplane will cut them apart.
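
The construction step then looks roughly like this (again a toy sketch, reusing make_split from above; real Annoy does more careful handling of degenerate splits):

```python
def build_tree(points, indices, K, rng):
    """Split recursively until at most K items remain in each node."""
    if len(indices) <= K:
        return {"leaf": indices}
    normal, offset = make_split(points[indices], rng)
    mask = points[indices].dot(normal) - offset > 0
    if mask.all() or not mask.any():  # degenerate split: just stop here
        return {"leaf": indices}
    return {"normal": normal, "offset": offset,
            "left": build_tree(points, indices[~mask], K, rng),
            "right": build_tree(points, indices[mask], K, rng)}

# usage: points is an (n, d) numpy array
# tree = build_tree(points, np.arange(len(points)), 10, np.random.default_rng())
```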

To search for any point in this space, we can traverse the binary tree from the root. Every intermediate node (the small squares in the tree above) defines a hyperplane, so we can figure out which side of the hyperplane the query point falls on, and that determines whether we go down to the left or the right child node. Searching for a point can be done in logarithmic time since that is the height of the tree.

Let’s search for the point denoted by the red X in the plot below:



The path down the binary tree looks like this:


We end up with 7 nearest neighbors. Very cool, but this is not great for at least two reasons:

  1. What if we want more than 7 neighbors?
  2. Some of the nearest neighbors are actually outside of this leaf polygon

Trick 1 – use a priority queue

The trick we’re going to use is to go down on both sides of a split if we are “close enough” (which I will quantify in a second). So instead of just going down one path of the binary tree, we will go down a few more:


With the corresponding binary tree:

We can configure a threshold for how far we are willing to go into the “wrong” side of the split. If the threshold is 0, then we will always go down the “correct” side of the split. However, if we set the threshold to 0.5, we get the search path above.

The trick here is that we can actually use a priority queue to explore nodes ordered by their distance into the “wrong” side. The nice part is that we can then search with increasingly larger thresholds, starting from 0.
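
In sketch form, the priority of a node is the accumulated distance into the “wrong” side needed to reach it. I wrote this to take a list of roots, which will come in handy for the next trick (note: Annoy’s actual priority formula is slightly different, but the idea is the same):

```python
import heapq
import itertools

_tiebreak = itertools.count()  # heapq needs a total order; dicts have none

def get_candidates(roots, x, limit):
    """Best-first search: always expand the node reachable with the
    smallest accumulated "wrong side" margin."""
    heap = [(0.0, next(_tiebreak), root) for root in roots]
    heapq.heapify(heap)
    candidates = []
    while heap and len(candidates) < limit:
        badness, _, node = heapq.heappop(heap)
        if "leaf" in node:
            candidates.extend(node["leaf"])
            continue
        margin = node["normal"].dot(x) - node["offset"]
        near, far = (node["right"], node["left"]) if margin > 0 else (node["left"], node["right"])
        heapq.heappush(heap, (badness, next(_tiebreak), near))               # correct side: free
        heapq.heappush(heap, (badness + abs(margin), next(_tiebreak), far))  # wrong side: pay the margin
    return candidates
```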

Trick 2 – build a forest of trees

The second trick we are going to use is to construct many trees, aka a forest. Each tree is constructed using a random set of splits. We are going to search down all those trees at the same time:


We can search all trees at the same time using one single priority queue. This has the additional benefit that the search will focus on the trees that are “best” for each query – the ones whose splits are the furthest away from the query point.

Every tree contains all points so when we search many trees we will find some points in multiple trees. If we look at the union of the leaf nodes we get a pretty good neighborhood:


At this point we have narrowed it down to a small set of points. Notice that so far we have not computed the distance to a single point. The next step is to compute all distances and rank the points:


We then sort all candidates by distance and return the top K nearest neighbors. Nice! And that is how the search algorithm works in Annoy.

Except one thing. In this case it turns out we actually did miss a couple of points outside:


But the A in Annoy stands for approximate and missing a few points is acceptable. Annoy actually has a knob you can tweak (search_k) that lets you trade off performance (time) for accuracy (quality).

The whole idea behind approximate algorithms is that sacrificing a little bit of accuracy can give you enormous performance gains (orders of magnitude). For instance we could return a decent solution where we really only computed the distance for 1% of the points – this is a 100x improvement over exhaustive search.

More trees always help. By adding more trees, you give Annoy more chances to find favorable splits. You generally want to bump it up as high as you can go without running out of memory.

Summary: Annoy’s algorithm

Preprocessing time:

  1. Build up a bunch of binary trees. For each tree, split all points recursively by random hyperplanes.

Query time:

  1. Insert the root of each tree into the priority queue
  2. Until we have search_k candidates, search all the trees using the priority queue
  3. Remove duplicate candidates
  4. Compute distances to candidates
  5. Sort candidates by distance
  6. Return the top ones
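
Pulling the sketches above together, the whole query looks roughly like this (a toy version, reusing get_candidates from the priority queue sketch earlier):

```python
import numpy as np

def query(forest, points, x, k, search_k):
    candidates = get_candidates(forest, x, search_k)        # steps 1-2
    candidates = list(set(candidates))                      # step 3
    dists = np.linalg.norm(points[candidates] - x, axis=1)  # step 4
    order = np.argsort(dists)                               # step 5
    return [candidates[i] for i in order[:k]]               # step 6
```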

Feel free to check out _make_tree and _get_all_nns in annoylib.h

That’s it for this post! More from the presentation is coming shortly. Btw, take a look at the slides, and check out the code used to generate all the graphs in this post.

Nearest neighbor methods and vector models – part 1

This is a blog post rewritten from a presentation at NYC Machine Learning last week. It covers Annoy, a library I have built that helps you do (approximate) nearest neighbor queries in high dimensional spaces. I will be splitting it into several parts. This first part talks about vector models, how to measure similarity, and why nearest neighbor queries are useful.

Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.



This turns out to be quite useful for a bunch of different applications. Before we get started on exactly how nearest neighbor methods work, let’s talk a bit about vector models.

Vector models and why nearest neighbors are useful

Vector models are increasingly popular in various applications. They have been used in natural language processing for a long time using things like LDA and PLSA (and even earlier using TF-IDF in raw space). Recently there has been a new generation of models: word2vec, RNN’s, etc.

In collaborative filtering, vector models have been among the most popular methods going back to the Netflix Prize – the winning entry featured a huge ensemble in which vector models made up a large part.

The basic idea is to represent objects in a space where proximity means two items are similar. If we’re using something like word2vec it could look something like this:

vector model


In this case similarity between words is determined by the angle between them. apple and banana are close to each other, whereas boat is further.

(As a side note: much has been written about word2vec’s ability to do word analogies in vector space. This is a powerful demonstration of the structure of these vector spaces, but the idea of using vector spaces is old and similarity is arguably much more useful).

In the most basic form, data is already represented as vectors. For an example of this, let’s look at one of the most canonical data sets in machine learning – the MNIST handwritten digits dataset.

Building an image search engine for handwritten digits

The MNIST dataset features 60,000 images of size 28×28, each a handwritten digit in grayscale. One of the most basic ways we can play around with this data set is to smash each 28×28 array into a 784-dimensional vector. There is absolutely no machine learning involved in doing this, but we will come back and introduce cool stuff like neural networks and word2vec later.

Let’s define a distance function in this space. Let’s say the distance between two digits is the sum of squared pixel differences. This is basically the squared Euclidean distance (i.e. the good old Pythagorean theorem):

pixel distance

This is nice because we can compute the distance between arbitrary digits in the dataset:

digit distance
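
In code, it could look something like this (the `images` variable is a made-up stand-in for the loaded dataset):

```python
import numpy as np

# hypothetical setup: `images` is a (60000, 28, 28) array of grayscale pixels
vectors = images.reshape(len(images), -1).astype(float)  # -> (60000, 784)

def pixel_distance(a, b):
    """Sum of squared pixel differences = squared Euclidean distance."""
    return ((a - b) ** 2).sum()

# brute force: the ten most similar images to some seed digit
dists = ((vectors - vectors[0]) ** 2).sum(axis=1)
nearest = np.argsort(dists)[1:11]  # skip index 0 – that's the seed itself
```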

This now lets us search for neighbors in this 784-dimensional space. Check out some samples below – the leftmost digit is the seed digit and to the right of it are the ten most similar images using the pixel distance.

MNIST neighbors

You can see that it sort of works. The digits are visually quite similar, although it’s obvious to a human that some of the nearest neighbors are the wrong digit.

This was pretty nice and easy, but it is also an approach that doesn’t scale very well. What about larger images? What about color images? And how do we determine similarity not just in terms of visual overlap, but in terms of what a human would actually think of as similar? This simple definition of “distance” leaves a lot of room for improvement.

Dimensionality reduction

A powerful method that works across a wide range of domains is to take complex, high dimensional items and project them down to a compact vector representation:

  1. Do a dimensionality reduction from a large dimensional space to a small dimensional space (10-1000 dimensions)
  2. Use similarity in this space instead

Dimensionality reduction is an extremely powerful technique because it lets us take almost any object and translate it to a small, convenient vector representation in a space. This space is generally referred to as latent because we don’t necessarily have any prior notion of what the axes are. What we care about is that objects that are similar end up being close to each other. What do we mean by similarity? In a lot of cases we can actually discover that from our data.

So let’s talk about one approach for dimensionality reduction on images: deep convolutional neural networks. I had a side project about a year ago to classify food. It’s a pretty silly application but the eventual goal was to see if you could predict calorie content from pictures, and a side goal was to learn how to use convolutional neural networks. I never ended up using this for anything and wasted way too much money renting GPU instances on AWS, but it was fun.

To train the model, I downloaded 6M pics from Yelp and Foursquare and trained a network quite similar to the one described in this paper using Theano.


The final layer in this model is a 1244-way multi-classification output using softmax, so we’re training this in a supervised way. The classes are words that occurred in the description text, e.g. “spicy ramen” for the one above. However, the nice thing is we have a “bottleneck” layer just before the final layer – a 128-dimensional vector that gives us exactly what we want.


Using the neural network as an embedding function, and cosine similarity as the metric (basically Euclidean distance, but with the vectors normalized first), we get some quite cool nearest neighbors:

nearest food pics

These similars look pretty reasonable! The top left picture is similar to a bunch of other fries. The second row shows a bunch of different white bowls with Asian food – more impressively, they are all at different scales and angles, and pixel-by-pixel similarity is quite low. The last row shows a bunch of desserts with similar patterns of chocolate sprinkled over them. We’re dealing with a space that can express object features quite well.
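
For concreteness, here is what that metric looks like (for unit vectors, ranking by Euclidean distance or by cosine gives the same order, since \|u - v\|^2 = 2 - 2\cos(u, v)):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

def cosine_distance(u, v):
    """Euclidean distance after normalizing both vectors first."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.linalg.norm(u - v)
```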

So how do we find similar items? I’m not going to describe dimensionality reduction in great detail – there are a million different ways you can read about. What I have spent more time thinking about is how to search for neighbors in vector spaces. In fact, finding the neighbors above takes only a few milliseconds per picture, because Annoy is very fast. This is why dimensionality reduction is so extremely useful: at the same time that it’s discovering high level structure in data, it also computes a compact representation of items. This representation makes it easy to compute similarity and search for nearest neighbors.

Vector methods in collaborative filtering

Reducing dimensionality isn’t just useful in computer vision, of course. As mentioned, it’s incredibly useful in natural language processing. At Spotify, we use vector models extensively for collaborative filtering. The idea is to project artists, users, tracks, and other objects into a low dimensional space where similarity can be computed easily and recommendations can be made. This is in fact what powers almost all of the Spotify recommendations – in particular Discover Weekly, which launched recently.

2D embedding of artists

CF cosines

I have already put together several presentations about this so if you’re interested, you should check out some of them:

Exhaustive search as a baseline

So how do we find similar items? Before we go into detail about how Annoy works, it’s worth looking at the baseline of doing a brute force exhaustive search. This means iterating over all possible items and computing the distance for each one of them to our query point.

word2vec actually comes with a tool to do exhaustive search. Let’s see how it compares! Using the GoogleNews-vectors-negative300.bin dataset and querying for “chinese river”, it takes about 2 minutes 34 seconds to output this:

  • Qiantang_River
  • Yangtse
  • Yangtze_River
  • lake
  • rivers
  • creek
  • Mekong_river
  • Xiangjiang_River
  • Beas_river
  • Minjiang_River

I wrote a similar tool that uses Annoy (available on Github here). The first time you run it, it will precompute a bunch of stuff and can take a lot of time. The second time, it will load (mmap) an Annoy index directly from disk into memory. Relying on the magic of the page cache, this will be very fast. Let’s take it for a spin and search for “chinese river”:

  • Yangtse
  • Yangtze_River
  • rivers
  • creek
  • Mekong_river
  • Huangpu_River
  • Ganges
  • Thu_Bon
  • Yangtze
  • Yangtze_river

Amazingly, this ran in 470 milliseconds, probably with some of that being overhead for loading the Python interpreter, etc. This is roughly 300x faster than the exhaustive search provided by word2vec.

Now – some of you probably noticed that the results are marginally different. That’s because the A in Annoy stands for approximate. We are deliberately trading off some accuracy in return for a huge speed improvement. It turns out you can actually control this knob explicitly. Telling Annoy we want to search through 100k nodes (we will get back to what that means later), we get this result in about 2 seconds:

  • Qiantang_River
  • Yangtse
  • Yangtze_River
  • lake
  • rivers
  • creek
  • Mekong_river
  • Xiangjiang_River
  • Beas_river
  • Minjiang_River

It turns out this is exactly the same result as the exhaustive search – and still about 50x faster.
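
For reference, using Annoy from Python looks roughly like this (`vectors` here is a stand-in for the loaded word2vec vectors):

```python
from annoy import AnnoyIndex

f = 300                               # the GoogleNews word2vec vectors are 300-d
index = AnnoyIndex(f, 'angular')      # 'angular' is a cosine-style distance
for i, vector in enumerate(vectors):  # `vectors`: assumed loaded elsewhere
    index.add_item(i, vector)
index.build(10)                       # build 10 trees
index.save('word2vec.ann')

# any later process can mmap the index from disk almost instantly
index = AnnoyIndex(f, 'angular')
index.load('word2vec.ann')
neighbors = index.get_nns_by_item(0, 10, search_k=100000)
```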

Other uses of nearest neighbors

Finally, just as a fun example of another use: nearest neighbors are useful when you’re dealing with physical spaces too. In an earlier blog post, I showed this world map of how long it takes to ping IP addresses from my apartment in NYC:

This is a simple application of k-NN (k-nearest neighbors) regression that I’ve written about earlier on this blog. There is no dimensionality reduction involved here – we just deal with 3D coordinates (lat/long projected onto the unit sphere).
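
The projection is just the standard spherical-to-Cartesian conversion; a quick sketch:

```python
import numpy as np

def to_unit_sphere(lat_deg, lon_deg):
    """Map latitude/longitude (in degrees) to a point on the 3D unit sphere,
    so that plain Euclidean distance respects the globe (no longitude wraparound)."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])
```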

In the next post in this series, I will go into depth on how Annoy works. Stay tuned!


I was playing around with D3 last night and built a silly visualization of antipodes and how our intuitive understanding of the world sometimes doesn’t make sense. Check out the visualization!

Basically, the idea is that if you fly from Beijing to Buenos Aires, you can have a layover at any point on the Earth’s surface and it won’t make the trip longer – the two cities are nearly antipodal, so every point on the globe lies close to some shortest route between them.


Software Engineers and Automation

Every once in a while when talking to smart people the topic of automation comes up. Technology has made lots of occupations redundant, so what’s next?

Switchboard operator, a long time ago

What about software engineers? Every year, technology replaces parts of what they do. Surely everything must eventually be replaced? I just ran into another one of these arguments: Software Engineers will be obsolete by 2060.

This might be a Lump of Labor Fallacy. Think about how much around us is currently powered by software and how much could be powered by software. The opportunity to apply software is probably 100x larger than what’s currently being used. So why aren’t we using software 100x more? Because software engineers are expensive. 

It’s easy to see this if you look back ten years. Say you wanted to build a web shop ten years ago. This was before the cloud, before APIs, before good web frameworks, etc. Building a web shop was probably 100x more expensive back then. As a result, there were a lot fewer web shops. Of course, it’s harder to know what latent demand will be unlocked in the next ten years, but there are always new things coming out that you didn’t realize you needed.

Somewhat counterintuitively, for many goods the latent demand is so big that when the price drops, total demand increases. This is called Jevons Paradox, after an economist who noticed in the 1800s that increased efficiency of coal use led to an increase in the consumption of coal.

Vin Diesel as a stock broker in the movie “Boiler Room”

The key here is whether technology replaces a job or increases the efficiency of a job. Technology did not increase the output of switchboard operators, so they were replaced. Similarly, technology is not going to make truck drivers 100x as efficient, so they will be replaced by self-driving trucks at some point. But technology actually has the opportunity to increase the output of software engineers by another few orders of magnitude. This will unlock a lot of latent demand, and we will need more software engineers, not fewer.

The other key is of course whether demand is bounded. So if you want to identify which occupations will be automated, I would look for (a) limited latent demand and (b) little technical leverage.

Is this rationalization? Maybe!

Also for a good quick read, check out Race Against the Machine by Erik Brynjolfsson and Andrew McAfee.


Here’s a problem that I used to give to candidates. I stopped using it seriously a long time ago since I don’t believe in puzzles, but I think it’s kind of fun.

  1. Let’s say you have a function that simulates a random coin flip. It returns “H” or “T”. This is the only random generator available. How can you write a new function that simulates a random dice roll (1…6)?
  2. Is there any method that guarantees that the second function returns in finite time?
  3. Let’s say you want to do this n times, where n \to \infty. What’s the most efficient way to do it? Efficient in terms of using the fewest coin flips.

The first part is old, I think. The second and third part are follow up questions that I came up with.

I’ll give you some time to think about it!










Don’t peek!
















Did you figure it out?












  1. There’s a multitude of ways to do this. The easiest is probably to flip the coin three times and map the eight outcomes like this: (HHH, 0), (HHT, 1), (HTH, 2), (HTT, 3), (THH, 4), (THT, 5), (TTH, 6), (TTT, 7). If you end up with HHH or TTT, you flip all three again – the six remaining outcomes are equally likely, which gives you a fair dice roll.
  2. Impossible! If you flip the coin n times, there are exactly 2^n outcomes. But we can’t partition this space evenly into 6 buckets, since 3 \nmid 2^n.
  3. This one is trickier. Think about it in terms of the amount of information you extract. Every coin flip extracts 1 bit, but every dice roll consumes \log_2 6\approx 2.585 bits. This is a lower bound – you need at least that many coin flips per dice roll.

Is there a way to achieve that lower bound? Turns out there is: basically, encode a long series of coin flips as the binary representation of a number between 0 and 1, then convert it to base 6. This idea resembles arithmetic coding. Sample code:
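
Here is a sketch of the idea (my reconstruction, not necessarily the original sample: keep an integer x that is always uniform on [0, m), feed coin flips in as binary digits, peel off base-6 digits, and recycle the leftovers – this gets close to the \log_2 6 bound):

```python
import random

def flip():
    """The only available source of randomness: one fair coin flip (0 or 1)."""
    return random.randint(0, 1)

def make_dice_roller():
    # Invariant: x is uniform on [0, m). Coin flips extend x with binary
    # digits; rolls peel off base-6 digits; leftover randomness is reused.
    x, m = 0, 1
    def roll():
        nonlocal x, m
        while True:
            while m < 6:                 # not enough randomness for a roll yet
                x, m = 2 * x + flip(), 2 * m
            q = m // 6
            if x < 6 * q:                # x uniform on [0, 6q): split it up
                x, m, digit = x // 6, q, x % 6
                return digit + 1         # a dice roll in 1..6
            x, m = x - 6 * q, m - 6 * q  # the leftover is still uniform; keep it
    return roll

roll = make_dice_roller()
print([roll() for _ in range(20)])
```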

Hope you enjoyed geeking out with a math problem this time!

Benchmark of Approximate Nearest Neighbor libraries

Annoy is a library written by me that supports fast approximate nearest neighbor queries. Say you have a high (1-1000) dimensional space with points in it, and you want to find the nearest neighbors to some point. Annoy gives you a way to do this very quickly. It could be points on a map, but also word vectors in a latent semantic representation or latent item vectors in collaborative filtering.

I’ve made a few optimizations to Annoy lately and I was curious to see how it stacks up against other libraries out there, so I wrote a benchmark suite: ann-benchmarks. It supports any library with a Python interface and I added a bunch of them. It even has Travis integration!

Cosine 100D results


Results for 128D Euclidean


The results so far for Annoy are pretty great. The only method that consistently beats Annoy is SW-graph from nmslib which is about 2-3x faster at the same precision. But Annoy beats both FLANN and KGraph at high precisions (>95%). At lower precisions (<95%) and cosine distance, Annoy is not quite as fast as FLANN and KGraph.

A surprising result was that Panns, Nearpy, and LSHForest all perform very poorly. They are roughly 1,000 times slower than the other libraries, and even worse, Panns and LSHForest don’t even produce high precision scores. I created an issue in scikit-learn about LSHForest’s performance. Of course, it might be that I did something wrong in the benchmark.

Annoy was actually never built with performance in mind. The killer feature was always to be able to load/unload indexes quickly using mmap – which no other package supports – but it’s fun to see that it’s actually very competitive on performance too. One of the things that made a difference was that I recently changed the interface for Annoy slightly (don’t worry, it’s backwards compatible). There is now a query-time tradeoff knob which lets you vary how many nodes are inspected.

The other factor not covered by the graph above is how hard it is to install most of these libraries. Annoy should compile and run almost anywhere (Linux, Win, OS X) very easily, but the other libraries can be challenging to install. For instance, both kgraph and nmslib depend on GCC-4.8, so they require custom installations. There are scripts to install all libraries in the repo tested with Ubuntu 12.04 and 14.04. For other platforms – good luck!

I might do a talk in September at the NYC Machine Learning meetup about this, we’ll see! Until then, I found this really great presentation by Stefan Savev (with corresponding web site). He claims his own implementation is a bit faster than Annoy! It’s in Scala so I haven’t tested it yet.

Slides from Stefan Savev’s presentation


More Luigi alternatives

The workflow engine battle has intensified with some more interesting entries lately! Here are a couple I encountered in the last few days. I love that at least two of them are direct references to Luigi!

Airflow (Blog Post) (GitHub)

Airflow from Airbnb is probably the most interesting one. I’ve only glanced at it, but here are some very superficial notes:

  • Seems mostly targeted to run raw UNIX commands using a pretty simple syntax with Jinja templates
  • Tasks don’t have support for parameters but it seems like you can build tasks dynamically by just putting them in a function
  • It seems to be built around daily jobs, meaning dates and backfill by dates is a foundational concept of the package (whereas in Luigi dates are just one out of many different parameters)
  • There is a database of task history which is great
  • The visualization seems great
  • It also supports farming out jobs to other machines. This is something Luigi definitely needs
  • It also comes with built-in support for HDFS, S3, MySQL and Postgres, similar to Luigi
  • There’s a built-in triggering mechanism (also something Luigi needs)
Screen shot of Airflow (from the blog post)

Mario (Blog post) (GitHub)

Mario seems to be a “Luigi in Scala”. It seems extremely simplistic, so it’s probably more conceptual at this point than meant to be full-fledged.

The choice of Scala is interesting. First of all, I think it’s fair to say that the JVM has taken over the data stack. A lot of Luigi’s core concepts are really functional in nature, and a language like Scala might be a better fit. A cool thing is that the RPC between Luigi clients and the scheduler is actually just a simple REST interface, so Luigi’s scheduler could in fact support clients written in other languages. Mario doesn’t do this, but it’s something I’ve been meaning to explore for a long time.

Ruigi (GitHub)

Ruigi is “Luigi in R”. It follows the same set of conventions but seems to be pretty simple in that it runs everything locally.

Makeflow (Web site)

Seems to be some academic project mostly for publishing papers. What’s up with not using GitHub in 2015? And also having a “download” section with tarballs!

The benefit of Makeflow seems to be support for a bunch of batch systems commonly used in HPC in academia. The dependencies are specified using their own DSL with some simple Makefile-like notation.


So what’s the state of workflow engines at the moment? Allow me to say something provocative just for the sake of it: they all kind of suck. Including Luigi. There is basically so much trial and error in building a workflow engine, and every time I encounter one I just see a bunch of bad design decisions. Same when I look at Luigi. Luigi was the result of many iterations and it avoids roughly 100 pitfalls we encountered in earlier attempts, but there are still parts of the design that could be addressed.

I don’t mean to bash my own open source project here. Open sourcing Luigi has helped a lot of people build awesome stuff, and I think it’s better than anything else out there.

I hope that with the explosion of workflow engines lately, we will see a convergence of ideas into a second generation of much better, much more scalable, and much easier-to-use engines. My dream is that someone combines every god damned workflow engine in the world and writes a new one, preferably in some JVM based language like Scala. I sincerely have no idea what that would look like, but I would love to do that some day as a Luigi 2.0!

Me and one of my cats (who is not too happy). Long story!

3D in D3

I have spent some time lately with D3. It’s a lot of fun to build interactive graphs. See for instance this demo (I will provide a longer writeup soon).

D3 doesn’t have support for 3D but you can do projections into 2D pretty easily. It’s just old school computer graphics. I ended up adding an animated background to this blog based on an experiment. The math is simple.

First, there’s the rotation. Given a bunch of 3D coordinates, how do you rotate them in 3D space? The cleanest way is to define angles \alpha, \beta, \gamma and use them for rotations in the yz-plane, xz-plane, and xy-plane, respectively. Each of them defines a rotation matrix. For the xy-plane, we get the rotation matrix R_{xy} (see code):

R_{xy} = \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ 0 & 0 & 1 \end{pmatrix}

We get three of these matrices in total: R_{yz}, R_{xz}, R_{xy}.

The rotation of any vector \mathbf{v} can now be described as R_{yz} R_{xz} R_{xy}\mathbf{v}. The nice thing is we can precompute the product of these matrices R (see code). Math porn:

R =   \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\alpha) & -\sin(\alpha) \\ 0 & \sin(\alpha) & \cos(\alpha) \end{pmatrix}   \begin{pmatrix} \cos(\beta) & 0 & \sin(\beta) \\ 0 & 1 & 0 \\ -\sin(\beta) & 0 & \cos(\beta) \end{pmatrix}   \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ 0 & 0 & 1 \end{pmatrix}


Now going forward we can use the matrix R to rotate any vector \mathbf{v} (see code).

The other thing you want to do is add perspective, so that distant objects look smaller. Thinking through the proportionalities, you can derive a pretty simple equation (see code): x' = x / (z/d + 1), and the same for y. The constant d is just a hack to scale down the z values.
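
My actual code is JavaScript/D3, but here is the same math as a small numpy sketch (the value of d below is an arbitrary pick):

```python
import numpy as np

def rotation_matrix(alpha, beta, gamma):
    """R = R_yz R_xz R_xy, precomputed once and reused for every point."""
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    r_yz = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    r_xz = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    r_xy = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return r_yz @ r_xz @ r_xy

def project(v, d=2.0):
    """Perspective: x' = x / (z/d + 1), same for y. d just scales down z."""
    x, y, z = v
    return x / (z / d + 1), y / (z / d + 1)

R = rotation_matrix(0.1, 0.2, 0.3)
print(project(R @ np.array([1.0, 0.0, 0.0])))
```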

Not sure whether the 3D animation is cool or just annoying, but I’ll prob keep it for a bit – enjoy!

The hardest challenge about becoming a manager

Note: this post is full of pseudo-psychology and highly speculative content. Like most fun stuff!

I became a manager back in 2009. Being a developer is fun. You have this very tangible way to measure yourself. Did I deploy something today? How much code did I write today? Did I solve some really cool machine learning problem on paper?

But as 1:1’s and emails and architecture discussions started filling up my day I often walked home with this gnawing feeling of having accomplished nothing. I saw my team build and deploy some really cool stuff, but I had this sort of guilt as if I was pretty useless.

To feel better about myself, I started coding more. But I noticed when I started coding it was like smoking crack. I couldn’t stop doing it. I would come in at 9am thinking about some fun problem, then get completely sucked into it. I would find myself at 9pm drinking Red Bull deep in some highly optimized C++ latent factor model. That felt great until I realized I had missed a bunch of 1:1’s and had 43 unread emails in my inbox.


I think what happens in your brain is that you create all these proxies for accomplishment, and they take years to retrain. It would be incredibly costly if every action was judged in terms of its benefits amortized over your entire lifetime. Instead, we humans have an ability to invent pretty arbitrary proxies, such as getting high scores on exams, or in my case, writing shitloads of code.

Proxies are great because they let you make decisions much more quickly. If I have decided that it’s good for me to write code, then I will start doing it, and eventually feel great about it. After a few years of doing something very consciously (programming is good because I can build this cool game and show it to my friends) you build up this great reward system (kind of like Pavlov’s dogs?) that makes you feel good about the thing in itself (programming is cool because I feel good when I do it).

The problem is when your ultimate goal changes and your old proxies are still in effect. Your rational side might tell you: hey, look at you, your team is really happy, they are learning new stuff and delivering lots of things. But still, you have this really weird feeling that you are not getting anything useful done.

This took me literally years to retrain. I remember at some point seeing someone in my team do something unexpectedly impressive, and I got really excited. I got excited because I realized this person had grown tremendously since joining, and presumably some small fraction of that was due to me. With enough exposure, this finally starts to become the new proxy for delivering value – something the irrational side immediately detects and makes you feel a sense of accomplishment about.

Anyway… once an addict, always an addict. I still have relapses and fall back into programming sometimes. In general I’ve noticed it’s extremely hard to balance development and management 50-50. Basically, one of the sides takes over until it’s 90-10. Either you start coding and fall back into your old crack habits, or you manage to break out of it and go into the opposite mode, where you just don’t have the energy to get into the zone.

I don’t know what my conclusion is. Programmers are the most irrational people I know, and I think they are really driven by an irrational, System 1 kind of thinking – the pattern matching brain. That’s why I think so many super smart people get stuck in habits (I really want to just solve super cool graph algorithm problems!). The most powerful way to make progress is to switch your rational System 2 on and constantly remind the other side what’s really making an impact and what’s really beneficial long term. It takes a few years, but with enough persistence, your rational self can train the primitive pattern matching brain to feel a sense of accomplishment in doing almost anything.

Sorry for offending anyone with my drug references!