Installing TensorFlow on AWS

Curious about Google’s newly released TensorFlow? I don’t have a beefy GPU machine, so I spent some time getting it to run on EC2. The steps to reproduce it are pretty brutal and I wouldn’t recommend going through them unless you want to waste five hours of your life.

Instead, I recommend just getting the AMI that I built (ami-cf5028a5). Choose g2.2xlarge and you should have a box with TensorFlow running in a minute or two! Note that it’s only available in us-east-1 (Virginia) so far.

If you haven’t used AWS, here’s a tutorial on how to set up an instance from an AMI. I usually use spot instances since they are much cheaper, but they carry some risk of getting killed unexpectedly (interestingly, that seems rarer now – I wonder if it’s because the Bitcoin price is so much lower).
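If you prefer doing this from code rather than the console, something like the sketch below should work. This assumes boto3; the key pair name is a placeholder you would swap for your own.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # the AMI only exists in us-east-1
response = ec2.run_instances(
    ImageId="ami-cf5028a5",
    InstanceType="g2.2xlarge",
    KeyName="my-key-pair",  # placeholder: use your own key pair
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```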

There are some known issues with TensorFlow on AWS. In particular I wasn’t able to get better performance from g2.8xlarge compared to g2.2xlarge, which sucks, since one of the cool features of TensorFlow is that it should distribute work across GPUs. See this thread for some more info. Looking forward to seeing these issues get resolved.

What is TensorFlow?

It seems like there’s a lot of misunderstanding about TensorFlow. It’s not some crazy flow-based graphical tool for doing neural nets. It’s kind of boring really. It’s just a marginally better version of Theano with much faster compilation times and the capability to distribute work over multiple GPUs/machines. Theano completely blew my mind when I first discovered it. Its approach was super innovative, but it’s pretty rough around the edges, and I think in open source the pioneers die with arrows in their backs.

I expect TensorFlow (or maybe CGT or something else) to grow more popular. But in practice I don’t think people will use any of those straight up for machine learning – higher level libraries like Keras will be the preferred way to do most deep learning tasks.

Looking for smart people

I haven’t mentioned what I’m currently up to. Earlier this year I left Spotify to join a small startup called Better. We’re going after one of the biggest industries in the world that also turns out to be completely broken. The mortgage industry might not be the #1 industry you pictured yourself in, but it’s an enormous opportunity to fix a series of real consumer problems and join a company that I predict will be huge.


We’re 6 engineers at the moment, mostly focused on backend stuff, but a bit of frontend and machine learning stuff as well. We have also raised a pretty substantial amount of money. At this point we’re just a few weeks from launching, so I will definitely keep you posted. If you are interested in hearing more, drop me an email at

MCMC for marketing data

The other day I was looking at marketing spend broken down by channel and wanted to compute some simple uncertainty estimates. I have data like this:

Channel      Total spend   Transactions
Channel A        2292.04              9
Channel B        1276.85              2
Channel C         139.59              3
Channel D         954.98              5

Of course, it’s easy to compute the cost per transaction, but how do you produce uncertainty estimates? It turns out to be somewhat nontrivial. I don’t even think it’s possible to do a t-test, which is kind of interesting in itself.

Let’s make some assumptions about the model:

  1. The cost per transaction is unknown, with some prior (I just picked uniform)
  2. The expected number of transactions is the total budget divided by the (unknown) cost per transaction
  3. The actual observed number of transactions is Poisson-distributed, with the expected number of transactions as its mean

I had always wanted to try pymc, and now I had an excuse. See the gist below:
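A minimal sketch of the model described above, written in PyMC3-style syntax (which may differ from the API used in the original gist), looks something like this:

```python
import pymc3 as pm

channels = {
    "A": (2292.04, 9),
    "B": (1276.85, 2),
    "C": (139.59, 3),
    "D": (954.98, 5),
}

with pm.Model():
    for name, (spend, transactions) in channels.items():
        # Unknown cost per transaction, with a wide (arbitrary) uniform prior
        cost = pm.Uniform("cost_%s" % name, lower=1, upper=5000)
        # Expected number of transactions given the total spend
        expected = spend / cost
        # The observed count is Poisson-distributed around that expectation
        pm.Poisson("obs_%s" % name, mu=expected, observed=transactions)
    trace = pm.sample(10000)

print(pm.summary(trace))
```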

The result, in the form of an animated GIF (unfortunately, animated GIFs were never widely accepted as a homework format back in school):

You even get a useless graph for free!


Of course, we could have computed this exactly, but I know myself and I’m very unlikely to get the expressions right without some serious effort. The conjugate prior of the Poisson is the Gamma distribution, and we would have to account for the parameterization of the cost per conversion as the budget divided by the number of conversions, which introduces another factor. How fun is that? I don’t have access to any windows to write on, so unfortunately not so fun.

From A Beautiful Mind
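For the record, here is roughly what that exact computation alludes to: for a channel with total spend B and k observed transactions, a flat prior on the cost per transaction c gives the posterior

p(c \mid k) \propto \left(\frac{B}{c}\right)^k e^{-B/c}

which, viewed as a function of the rate \lambda = B/c, is a Gamma kernel – and changing variables from \lambda back to c brings in a Jacobian term, which is the extra factor mentioned above.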

Anyway – this might not have been the most compelling example of using PyMC, but I do quite like the idea of it, especially applied to conversion analyses, since they translate directly into generative models. I will definitely use it for some further funnel analysis – in particular when the number of data points is very small and the model is very complex.

Interview with a Data Scientist: Erik Bernhardsson

I was featured in Peadar Coyle’s interview series interviewing various “data scientists” – which is kind of arguable since (a) all the other people in that series are much cooler than me and (b) I’m not really a data scientist. Anyway, reposting the full interview:

As part of my interviews with Data Scientists I recently caught up with Erik Bernhardsson, who is famous in the world of ‘Big Data’ for his open source contributions, his leadership of teams at Spotify, and his talks at various conferences.

1. What project that you have worked on do you wish you could go back to and do better?

Like… everything I ever built. But I think that’s part of the learning experience. Especially working with real users, you never know what’s going to happen. There’s no clear problem formulation, no clear loss function, lots of various data sets to use. Of course you’re going to waste too much time doing something that turns out to be nothing. But research is that way. Learning stuff is what matters, and kind of by definition you have to do stupid shit before you’ve learned it. Sorry for a super unclear answer :)

The main thing I did wrong for many years was that I built all this cool stuff but never really made it into prototypes that other people could play around with. So I learned something very useful about communication and promoting your ideas.

2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?

Write a ton of code. Don’t watch TV :)

I really think showcasing cool stuff on Github and helping out other projects is a great way to learn and also to demonstrate market validation of your code.

Seriously, I think everyone can kick ass at almost anything as long as you spend a ridiculous amount of time on it. As long as you’re motivated by something, use that motivation to focus on it for 80% of your waking hours.

I think people generally get motivated by coming up with various proxies for success. So be very careful about choosing the right proxies. I think people in academia often validate themselves in terms of things people in the industry don’t care about and things that don’t necessarily correlate with a successful career. It’s easy to fall into a rabbit hole and become extremely good at, say, deep learning (or anything), but at a company that means you’re just some expert who will have a hard time having impact beyond your field. Looking back on my own situation, I should have spent a lot more time figuring out how to get other people excited about my ideas instead of perfecting ML algorithms (maybe similar to the last question).

3. What do you wish you knew earlier about being a data scientist?

I don’t consider myself a data scientist so not sure :)

There are a lot of definitions floating around about what a data scientist does. I had had this theory for a long time and just ran into a blog post the other day that summarizes my own impression pretty well. There are two camps: one is the “business insights” side, the other is the “production ML engineer” side. I managed teams at Spotify on both sides, and they are very different.

If you want to understand the business and generate actionable insights, then in my experience you need pretty much no knowledge of statistics and machine learning. It seems like people think that with ML you can generate super interesting insights about a business, but in my experience that’s very rare. Sometimes we had people come in to write a master’s thesis about churn prediction: you can get a really high AUC, but it’s almost impossible to use that model for anything. So it really just boils down to doing lots of highly informed A/B tests. And above all, having deep empathy for user behavior. What I mean is you really need to understand how your users think in order to generate hypotheses to test.

For the other camp, in my experience understanding backend development is super important. I’ve seen companies where there’s an “ML research team” and an “implementation team” with a “throw it over the fence” attitude, but it doesn’t work. Iteration cycles get 100x longer and incentives just get misaligned. So I think anyone who wants to build cool ML algorithms should also learn backend and data engineering.

4. How do you respond when you hear the phrase ‘big data’?

Love it. Seriously, there’s this weird anti-trend of people bashing big data. I throw up every time I see another tweet like “You can get a machine with 1TB of RAM for $xyz. You don’t have big data”. I almost definitely had big data at Spotify. We trained models with 10B parameters on 10TB data sets all the time. There are a lot of those problems in the industry, for sure. Unfortunately, sampling doesn’t always work.

The other thing I think those people get wrong is the production aspect of it. Things like Hadoop force your computation into fungible units, which means you don’t have to worry about computers breaking down. It might be 10x slower than if you had specialized hardware, but that’s fine because you can have 100 teams running 10,000 daily jobs and things rarely crash – especially if you use Luigi :) But I’m sure there’s a fair number of snake oil Hadoop consultants who convince innocent teams they need it.

The other part of “big data” is that it’s at the far right of the hype cycle. Have you been to a Hadoop conference? It’s full of people in oversized suits talking about compliance now. At some point we’ll see deep learning or flux architecture or whatever going down the same route.

5. What is the most exciting thing about your field?

Boring answer, but I do think the progress in deep learning has been extremely exciting. It seems like every week there are cool new applications.

I think even more useful is how tools and platforms are maturing. A few years ago every company wrote their own dashboards, A/B test infrastructure, log synchronization, workflow management, etc. It’s great that there’s more open source projects and that more useful tools are emerging.

6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?

Ideally you can iterate on it with real users and see what the impact is. If not, you need to introduce some proxy metrics. That’s a whole art form in itself.

It’s good enough when the opportunity cost outweighs the benefit :) That is, when the marginal return of time invested is lower than for something else. I think it’s good to keep a backlog of hundreds of ideas so that you can prioritize based on expected ROI at any time. I don’t know if that’s a helpful answer, but prioritization is probably the hardest problem to solve, and it really just boils down to having some rules of thumb.

How Erik describes himself: I like to work with smart people and deliver great software. After 5+ years at Spotify, I just left for an exciting new startup in NYC where I am leading the engineering team.

At Spotify, I built up and led the team responsible for music recommendations and machine learning. We designed and built many large-scale machine learning algorithms used to power the recommendation features: the radio feature, the “Discover” page, “Related Artists”, and much more. I also authored Luigi, a workflow manager in Python with 3,000+ stars on Github, used by Foursquare, Quora, Stripe, Asana, etc.

When I was younger I participated in lots of programming competitions. My team was Nordic champion in programming five times (2003-2010), and I have an IOI gold medal (2003).

Nearest neighbors and vector models – epilogue – curse of dimensionality

This is another post based on my talk at NYC Machine Learning. The previous two parts covered most of the interesting material, but there are still a few topics left to discuss. To go back and read the meaty stuff, check out the first two parts.

You should also check out the slides and the video if you’re interested. Anyway, let’s talk about the curse of dimensionality today.

curse of dimensionality
This pic was obviously worth spending 20 minutes on


The curse of dimensionality refers to a set of phenomena that show up when you deal with items in high dimensional spaces – in particular, what happens to distances and neighborhoods – that make finding nearest neighbors tricky.

Consider a map of the world. Most countries have only a handful of neighboring countries. New York is pretty close to Philadelphia but far from Beijing – distances vary a lot.

What happens when we go to higher dimensions is that everything starts being close to everything else. All cities end up at almost the same distance from each other, and all countries share borders with all other countries (Trump would have a lot of walls to build). This is highly unintuitive (as is anything with more than 3 dimensions), but let’s try to quantify it.

Let’s look at how distances behave as we go to higher dimensions. We’ll sample 10,000 points from a normal distribution, pick a random point from the same distribution, and compute the distance to the closest and the furthest points:


(For the code, check it out on Github.)
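The linked code isn’t reproduced here, but a quick numpy sketch of the same experiment looks roughly like this:

```python
import numpy as np

def min_max_ratio(dim, n=10000):
    # Sample n points from a standard normal in `dim` dimensions, pick a
    # random query point, and compare the closest and furthest distances.
    points = np.random.normal(size=(n, dim))
    query = np.random.normal(size=dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for dim in [2, 4, 8, 16, 64, 256, 1024]:
    print(dim, min_max_ratio(dim))
```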

As the number of dimensions increases, the distances to the closest and the furthest points become almost the same. Wikipedia’s article is actually quite enlightening and states that as we go to higher dimensions, we have the relationship:

\lim_{d \to \infty} E\left(\frac{\text{dist}_{\max}(d) - \text{dist}_{\min}(d)}{\text{dist}_{\min}(d)}\right) = 0


For example, if we are in a high-dimensional version of New York, then the nearest city might be 1.000 miles away and the furthest city 1.001 miles away. The ratio above is then 0.001. Let’s get back to this ratio shortly.

This weird behavior makes nearest neighbor search in high dimensional spaces tricky. In fact, it is not known how to answer exact nearest neighbor queries in high dimensions substantially faster than a brute force scan over all points.

Saving the day

The above relation applies to distributions with little structure – in particular, the example I generated was just data points drawn from a normal distribution. In real-world data sets we usually have a lot of structure. Consider all the cities and towns of the world. This is a set of points in 3D space, but all cities lie on a 2D sphere, so the point set actually behaves more like a 2D one.

The same thing happens with word embeddings or any other set of vectors. Even if we’re dealing with thousands of dimensions, the underlying structure is really much lower dimensional. This is exactly why dimensionality reduction works so well!

Let’s actually compute the quantity mentioned above and map out some real datasets:

knn distances

(Again, for the code, check it out on Github.)

The graph above shows something pretty interesting: some of these “real” high dimensional data sets have a ratio comparable to that of a normal distribution in far fewer dimensions.

Look at the Freebase vectors (which you can download here) for instance – they are 1000D, but in terms of this ratio they look like a 16D normal distribution. The same behavior holds for a whole set of different vectors. The 784-dimensional MNIST digits data set behaves like an 8D normal distribution, and the 128-dimensional embeddings of food pictures do the same.

I think this is why approximate nearest neighbor methods work so well up to thousands of dimensions. The key thing is the algorithms need to learn the distribution from the data. I am generally bearish on LSH for this reason.

Enough about approximate nearest neighbors for a while! Hope you liked this series of posts!

Nearest neighbors and vector models – part 2 – algorithms and data structures

This is a blog post rewritten from a presentation at NYC Machine Learning on Sep 17. It covers a library called Annoy that I have built, which helps you do nearest neighbor queries in high dimensional spaces. In the first part, I went through some examples of why vector models are useful. In this second part I will explain the data structures and algorithms that Annoy uses to do approximate nearest neighbor queries.

Let’s start by going back to our point set. The goal is to find nearest neighbors in this space. Again, I am showing a 2 dimensional point set because computer screens are 2D, but in reality most vector models have much higher dimensionality.


Our goal is to build a data structure that lets us find the nearest points to any query point in sublinear time.

We are going to build a tree that lets us do queries in \mathcal{O}(\log n). This is how Annoy works. In fact, it’s a binary tree where each node is a random split. Let’s start by splitting the space once:


Annoy does this by picking two points randomly and then splitting by the hyperplane equidistant from those two points. The two points are indicated by the gray line and the hyperplane is the thick black line.
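As a rough illustration of that split (Annoy’s actual implementation is in C++ and handles several metrics), a numpy sketch might look like this:

```python
import numpy as np

def random_split(points, rng=np.random):
    # Pick two points at random; the splitting hyperplane is equidistant from
    # them: its normal is p1 - p2 and it passes through their midpoint.
    i, j = rng.choice(len(points), size=2, replace=False)
    p1, p2 = points[i], points[j]
    normal = p1 - p2
    offset = np.dot(normal, (p1 + p2) / 2.0)
    side = np.dot(points, normal) - offset  # > 0: p1's side, <= 0: p2's side
    return points[side > 0], points[side <= 0], (normal, offset)
```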

Let’s keep splitting each subspace recursively!


A very tiny binary tree is starting to take shape:

We keep splitting again:


… and so on. We keep doing this until there’s at most K items left in each node. At that point it looks something like this (for K=10):


With the corresponding binary tree:


Nice! We end up with a binary tree that partitions the space. The nice thing is that points that are close to each other in the space are more likely to be close to each other in the tree. In other words, if two points are close to each other in the space, it’s unlikely that any hyperplane will cut them apart.

To search for any point in this space, we can traverse the binary tree from the root. Every intermediate node (the small squares in the tree above) defines a hyperplane, so we can figure out which side of the hyperplane the query point falls on, and that determines whether we go down to the left or the right child node. Searching for a point can be done in logarithmic time since that is the height of the tree.
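In Python-ish pseudocode (with a hypothetical Node class, not Annoy’s actual internals), the descent looks like this:

```python
import numpy as np

class Node:
    # Internal nodes store a splitting hyperplane; leaves store point indices.
    def __init__(self, normal=None, offset=None, left=None, right=None, items=None):
        self.normal, self.offset = normal, offset
        self.left, self.right = left, right
        self.items = items

def find_leaf_items(node, query):
    # Walk from the root, picking the side of each hyperplane the query is on.
    while node.items is None:
        if np.dot(node.normal, query) - node.offset > 0:
            node = node.left
        else:
            node = node.right
    return node.items
```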

Let’s search for the point denoted by the red X in the plot below:



The path down the binary tree looks like this:


We end up with 7 nearest neighbors. Very cool, but this is not great for at least two reasons

  1. What if we want more than 7 neighbors?
  2. Some of the nearest neighbors are actually outside of this leaf polygon

Trick 1 – use a priority queue

The trick we’re going to use is to go down on both sides of a split if we are “close enough” (which I will quantify in a second). So instead of just going down one path of the binary tree, we will go down a few more:


With the corresponding binary tree:

We can configure the threshold for how far we are willing to go into the “wrong” side of the split. If the threshold is 0, then we will always go down the “correct” side of the split. However, if we set the threshold to 0.5, we get the search path above.

The trick here is that we can actually use a priority queue to explore nodes sorted by the maximum distance into the “wrong” side. The nice part is that we can then search increasingly large thresholds, starting from 0.
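Here is a rough sketch of that idea (a simplification, not Annoy’s actual implementation), reusing the hypothetical Node class from above: a max-priority queue keyed on the smallest margin crossed along the path, so the most promising subtrees are expanded first and the “wrong” sides of splits are visited only as the budget allows.

```python
import heapq
import itertools

import numpy as np

def search_one_tree(root, query, search_k):
    counter = itertools.count()               # tie-breaker so nodes never get compared
    queue = [(-np.inf, next(counter), root)]  # negate priorities: heapq is a min-heap
    candidates = []
    while queue and len(candidates) < search_k:
        neg_priority, _, node = heapq.heappop(queue)
        if node.items is not None:            # leaf: collect its points
            candidates.extend(node.items)
            continue
        priority = -neg_priority
        margin = np.dot(node.normal, query) - node.offset
        near, far = (node.left, node.right) if margin > 0 else (node.right, node.left)
        # A node's priority is the smallest signed margin along its path: the near
        # child is keyed on +|margin|, the far child on -|margin| (both capped by
        # the parent's priority), so far sides are explored last.
        heapq.heappush(queue, (-min(priority, abs(margin)), next(counter), near))
        heapq.heappush(queue, (-min(priority, -abs(margin)), next(counter), far))
    return candidates
```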

Trick 2 – build a forest of trees

The second trick we are going to use is to construct many trees, aka a forest. Each tree is constructed using a random set of splits. We are going to search down all those trees at the same time:


We can search all trees at the same time using one single priority queue. This has the additional benefit that the search will focus on the trees that are “best” for each query – the ones whose splits are the furthest away from the query point.

Every tree contains all points so when we search many trees we will find some points in multiple trees. If we look at the union of the leaf nodes we get a pretty good neighborhood:


At this point we have narrowed it down to a small set of points. Notice that so far we have not even computed the distance to a single point. The next step is to compute all the distances and rank the points:


We then sort all nodes by distance and return the top K nearest neighbors. Nice! And that is how the search algorithm works in Annoy.

Except one thing. In this case it turns out we actually did miss a couple of points outside:


But the A in Annoy stands for approximate and missing a few points is acceptable. Annoy actually has a knob you can tweak (search_k) that lets you trade off performance (time) for accuracy (quality).

The whole idea behind approximate algorithms is that sacrificing a little bit of accuracy can give you enormous performance gains (orders of magnitude). For instance we could return a decent solution where we really only computed the distance for 1% of the points – this is a 100x improvement over exhaustive search.

More trees always help. By adding more trees, you give Annoy more chances to find favorable splits. You generally want to bump it up as high as you can go without running out of memory.

Summary: Annoy’s algorithm

Preprocessing time:

  1. Build up a bunch of binary trees. For each tree, split all points recursively by random hyperplanes.

Query time:

  1. Insert the root of each tree into the priority queue
  2. Until we have search_k candidates, search all the trees using the priority queue
  3. Remove duplicate candidates
  4. Compute distances to candidates
  5. Sort candidates by distance
  6. Return the top ones

Feel free to check out _make_tree and _get_all_nns in annoylib.h
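In terms of the public Python API, the whole flow maps onto something like this (index build, save, mmap from disk, and a query with the search_k knob; recent versions of Annoy take the metric as an explicit argument):

```python
import random

from annoy import AnnoyIndex

f = 40                                   # vector dimensionality
index = AnnoyIndex(f, 'angular')
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)                          # 10 trees: more trees, better accuracy
index.save('points.ann')

# Later, possibly in another process: mmap the index straight from disk
index2 = AnnoyIndex(f, 'angular')
index2.load('points.ann')
print(index2.get_nns_by_item(0, 10, search_k=10000))  # 10 neighbors of item 0
```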

That’s it for this post! More from the presentation is coming shortly. Btw, take a look at the slides, and check out the code used to generate all the graphs in this post.

Nearest neighbor methods and vector models – part 1

This is a blog post rewritten from a presentation at NYC Machine Learning last week. It covers a library called Annoy that I have built, which helps you do (approximate) nearest neighbor queries in high dimensional spaces. I will be splitting it into several parts. This first part talks about vector models, how to measure similarity, and why nearest neighbor queries are useful.

Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.



This turns out to be quite useful for a bunch of different applications. Before we get started on exactly how nearest neighbor methods work, let’s talk a bit about vector models.

Vector models and why nearest neighbors are useful

Vector models are increasingly popular in various applications. They have been used in natural language processing for a long time, using things like LDA and PLSA (and even earlier using TF-IDF in raw space). Recently there has been a new generation of models: word2vec, RNNs, etc.

In collaborative filtering, vector models have been among the most popular methods going back to the Netflix Prize – the winning entry featured a huge ensemble in which vector models made up a large part.

The basic idea is to represent objects in a space where proximity means two items are similar. If we’re using something like word2vec it could look something like this:

vector model


In this case similarity between words is determined by the angle between them. apple and banana are close to each other, whereas boat is further.

(As a side note: much has been written about word2vec’s ability to do word analogies in vector space. This is a powerful demonstration of the structure of these vector spaces, but the idea of using vector spaces is old and similarity is arguably much more useful).

In the most basic form, data is already represented as vectors. For an example of this, let’s look at one of the most canonical data sets in machine learning – the MNIST handwritten digits dataset.

Building an image search engine for handwritten digits

The MNIST dataset features 60,000 images of size 28×28, each containing a handwritten digit in grayscale. One of the most basic ways we can play around with this data set is to smash each 28×28 array into a 784-dimensional vector. There is absolutely no machine learning involved in doing this, but we will come back and introduce cool stuff like neural networks and word2vec later.

Let’s define a distance function in this space. Let’s say the distance between two digits is the sum of the squared pixel differences. This is basically the squared Euclidean distance (i.e. the good old Pythagorean theorem):

pixel distance

This is nice because we can compute the distance of arbitrary digits in the dataset:

digit distance
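As a quick numpy sketch, that distance is just:

```python
import numpy as np

def pixel_distance(a, b):
    # Squared Euclidean distance between two 28x28 grayscale digits,
    # flattened into 784-dimensional vectors.
    a = a.reshape(-1).astype(float)
    b = b.reshape(-1).astype(float)
    return np.sum((a - b) ** 2)
```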

This now lets us search for neighbors in this 784-dimensional space. Check out some samples below – the leftmost digit is the seed digit and to the right of it are the ten most similar images using the pixel distance.

MNIST neighbors

You can see that it sort of works. The digits are visually quite similar, although it’s obvious to a human that some of the nearest neighbors are the wrong digit.

This was pretty nice and easy, but it’s also an approach that doesn’t scale very well. What about larger images? What about color images? And how do we determine similarity not just in terms of visual overlap, but in terms of what a human would think of as similar? This simple definition of “distance” leaves a lot of room for improvement.

Dimensionality reduction

A powerful method that works across a wide range of domains is to take high dimensional, complex items and project them down to a compact vector representation:

  1. Do a dimensionality reduction from a high dimensional space to a low dimensional space (10-1000 dimensions)
  2. Use similarity in this space instead

Dimensionality reduction is an extremely powerful technique because it lets us take almost any object and translate it into a small, convenient vector representation in a space. This space is generally referred to as latent because we don’t necessarily have any prior notion of what the axes are. What we care about is that objects that are similar end up close to each other. What do we mean by similarity? In a lot of cases we can actually discover that from our data.
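Before getting to neural networks, here is the simplest possible instance of this recipe – PCA on scikit-learn’s small digits set, as a stand-in for the fancier model and data discussed below, just to illustrate the two steps:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                             # (1797, 64) pixel vectors
X_latent = PCA(n_components=16).fit_transform(X)   # step 1: reduce dimensionality

# Step 2: use similarity in the latent space instead
query = X_latent[0]
dists = np.linalg.norm(X_latent - query, axis=1)
print(np.argsort(dists)[:10])                      # the 10 most similar digits
```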

So let’s talk about one approach to dimensionality reduction for images: deep convolutional neural networks. I had a side project about a year ago to classify food. It’s a pretty silly application, but the eventual goal was to see if you could predict calorie content from pictures, and a side goal was to learn how to use convolutional neural networks. I never ended up using this for anything and wasted way too much money renting GPU instances on AWS, but it was fun.

To train the model, I downloaded 6M pics from Yelp and Foursquare and trained a network quite similar to the one described in this paper using Theano.


The final layer in this model is a 1244-way multi-class softmax output, so we’re training this in a supervised way. The classes are words that occurred in the description text, e.g. “spicy ramen” for the one above. The nice thing, however, is that we have a “bottleneck” layer just before the final layer – a 128-dimensional vector that gives us exactly what we want.


Using the neural network as an embedding function and cosine similarity as a metric (this is basically Euclidean distance after normalizing the vectors), we get some quite cool nearest neighbors:

nearest food pics

These similars look pretty reasonable! The top left picture is similar to a bunch of other fries. The second row shows a bunch of different white bowls with Asian food – more impressively, they are all at different scales and angles, and the pixel-by-pixel similarity is quite low. The last row shows a bunch of desserts with similar patterns of chocolate sprinkled over them. We’re dealing with a space that can express object features quite well.
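As for the metric mentioned above, cosine similarity is just the dot product after normalizing both vectors to unit length:

```python
import numpy as np

def cosine_similarity(u, v):
    # Normalize both vectors to unit length, then take the dot product.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.dot(u / np.linalg.norm(u), v / np.linalg.norm(v))
```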

So how do we find similar items? I’m not going to describe dimensionality reduction in great detail – there are a million different methods you can read about. What I have spent more time thinking about is how to search for neighbors in vector spaces. In fact, finding the neighbors above takes only a few milliseconds per picture, because Annoy is very fast. This is why dimensionality reduction is so extremely useful. At the same time that it’s discovering high level structure in the data, it also computes a compact representation of the items. This representation makes it easy to compute similarity and search for nearest neighbors.

Vector methods in collaborative filtering

Reducing dimensionality isn’t just useful in computer vision, of course. As mentioned, it’s incredibly useful in natural language processing. At Spotify, we use vector models extensively for collaborative filtering. The idea is to project artists, users, tracks, and other objects into a low dimensional space where similarity can be computed easily and recommendations can be made. This is in fact what powers almost all of the Spotify recommendations – in particular Discover Weekly that was launched recently.

2D embedding of artists

CF cosines

I have already put together several presentations about this so if you’re interested, you should check out some of them:

Exhaustive search as a baseline

So how do we find similar items? Before we go into detail about how Annoy works, it’s worth looking at the baseline of doing a brute force exhaustive search. This means iterating over all possible items and computing the distance for each one of them to our query point.
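In numpy, the exhaustive baseline is only a few lines – a sketch assuming the vectors are rows of a matrix and we rank by cosine similarity:

```python
import numpy as np

def exhaustive_search(vectors, query, n=10):
    # Normalize everything so a dot product equals cosine similarity,
    # then score every single vector against the query.
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = normed.dot(query / np.linalg.norm(query))
    return np.argsort(-scores)[:n]  # indices of the n most similar vectors
```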

word2vec actually comes with a tool to do exhaustive search. Let’s see how it compares! Using the GoogleNews-vectors-negative300.bin dataset and querying for “chinese river”, it takes about 2 minutes 34 seconds to output this:

  • Qiantang_River
  • Yangtse
  • Yangtze_River
  • lake
  • rivers
  • creek
  • Mekong_river
  • Xiangjiang_River
  • Beas_river
  • Minjiang_River

I wrote a similar tool that uses Annoy (available on Github here). The first time you run it, it will precompute a bunch of stuff and can take a long time to run. However, the second time it runs, it will load (mmap) an Annoy index directly from disk into memory. Relying on the magic of the page cache, this will be very fast. Let’s take it for a spin and search for “chinese river”:

  • Yangtse
  • Yangtze_River
  • rivers
  • creek
  • Mekong_river
  • Huangpu_River
  • Ganges
  • Thu_Bon
  • Yangtze
  • Yangtze_river

Amazingly, this ran in 470 milliseconds, probably with some of that being overhead for loading the Python interpreter, etc. This is roughly 300x faster than the exhaustive search provided by word2vec.

Now – some of you probably noticed that the results are marginally different. That’s because the A in Annoy stands for approximate. We are deliberately trading off some accuracy in return for a huge speed improvement. It turns out you can actually control this knob explicitly. Telling Annoy we want to search through 100k nodes (will get back to that later) we get this result in about 2 seconds:

  • Qiantang_River
  • Yangtse
  • Yangtze_River
  • lake
  • rivers
  • creek
  • Mekong_river
  • Xiangjiang_River
  • Beas_river
  • Minjiang_River

It turns out this is exactly the same result as the exhaustive search – and still about 50x faster.

Other uses of nearest neighbors

Finally, just as a fun example of another use: nearest neighbors are useful when you’re dealing with physical spaces too. In an earlier blog post, I showed this world map of how long it takes to ping IP addresses from my apartment in NYC:

This is a simple application of k-NN (k-nearest neighbors) regression that I’ve written earlier about on this blog. There is no dimensionality reduction involved here – we just deal with 3D coordinates (lat/long projected to the unit sphere).
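The projection mentioned is the usual one – a small sketch:

```python
import numpy as np

def latlng_to_xyz(lat_deg, lng_deg):
    # Project latitude/longitude onto the unit sphere so that straight-line
    # 3D distance is a reasonable proxy for distance on the globe.
    lat, lng = np.radians(lat_deg), np.radians(lng_deg)
    return np.array([np.cos(lat) * np.cos(lng),
                     np.cos(lat) * np.sin(lng),
                     np.sin(lat)])
```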

In the next series, I will go in depth about how Annoy works. Stay tuned!


I was playing around with D3 last night and built a silly visualization of antipodes and how our intuitive understanding of the world sometimes doesn’t make sense. Check out the visualization at!

Basically, the idea is that since Beijing and Buenos Aires are roughly antipodal, if you fly from one to the other you can have a layover at any point on the Earth’s surface and it won’t make the trip longer.


Software Engineers and Automation

Every once in a while when talking to smart people the topic of automation comes up. Technology has made lots of occupations redundant, so what’s next?

Switchboard Operator
Switchboard operator, a long time ago

What about software engineers? Every year technology replaces parts of what they do. Surely eventually everything must be replaced? I just ran into another one of these arguments: Software Engineers will be obsolete by 2060.

This might be a case of the lump of labor fallacy. Think about how much around us is currently powered by software and how much could be powered by software. The opportunity to apply software is probably 100x larger than what’s currently being used. So why aren’t we using software 100x more? Because software engineers are expensive.

It’s easy to see this if you look back ten years. Say you wanted to build a web shop ten years ago. This was before the cloud, before APIs, before good web frameworks, etc. Building a web shop was probably 100x more expensive back then. As a result, there were a lot fewer web shops. Of course, it’s harder to know what latent demand will be unlocked in the next ten years, but there are always new things coming out that you didn’t realize you needed.

Somewhat counterintuitively, for many goods the latent demand is so big that when the price drops, total demand actually increases. This is called the Jevons paradox, after an economist who noticed in the 1800s that increased efficiency of coal use led to an increase in total coal consumption.

Vin Diesel as a stock broker in the movie "Boiler Room"
Vin Diesel as a stock broker in the movie “Boiler Room”

The key here is whether technology replaces a job or increases the efficiency of a job. Technology did not increase the output of switchboard operators, so they were replaced. Similarly, technology is not going to make truck drivers 100x as efficient, so they will be replaced by self-driving trucks at some point. But technology actually has the opportunity to increase the output of software engineers by another few orders of magnitude. This will unlock a lot of latent demand, and we will need more software engineers, not fewer.

The other key is of course whether demand is bounded. So if you want to identify which occupations will be automated, I would look for (a) limited latent demand and (b) little technical leverage.

Is this rationalization? Maybe!

Also for a good quick read, check out Race Against the Machine by Erik Brynjolfsson and Andrew McAfee.