My issue with GPU-accelerated deep learning

I’ve been spending several hundred bucks renting GPU instances on AWS over the last year. The speedup from a GPU is awesome and hard to deny. GPUs have taken over the field. Maybe following in the footsteps of Bitcoin mining, there’s some research on using FPGAs (I know very little about this).

I don’t think it’s a coincidence that GPUs built for graphics turn out to be great for image classification using convolutional neural networks. When you are dealing with pixel data packed into 2D arrays, it’s possible to parallelize all operations very efficiently.

My issue is that the complexity of each minibatch is \mathcal{O}(n), where n is the number of parameters. The larger the models you are dealing with, the bigger this issue becomes.

Word2vec uses a clever technique called hierarchical softmax to achieve \mathcal{O}(\log n) (more details here). I have no idea how to implement this on a GPU and I suspect it’s impossible. Here’s where the CPU shows its strength – traversing a logarithmic data structure takes a lot of branching and can’t be expressed as a batch operation.
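To make the branching point concrete, here’s a toy sketch (nothing like word2vec’s actual implementation) where the classes are the leaves of a binary tree stored as a heap, and scoring one class only touches the handful of nodes on its root-to-leaf path:

    import numpy as np

    n_classes, dim = 1 << 16, 128                        # ~65k classes, toy dimensionality
    inner = np.random.randn(n_classes - 1, dim) * 0.01   # one weight vector per internal tree node

    def class_log_prob(hidden, class_idx):
        # Walk from the leaf for class_idx up to the root; every step is a branch
        # decision, so the cost is ~log2(n_classes) sigmoids instead of n_classes.
        log_p, node = 0.0, class_idx + (n_classes - 1)   # leaf position in a heap layout
        while node > 0:
            parent = (node - 1) // 2
            went_right = node == 2 * parent + 2
            p_right = 1.0 / (1.0 + np.exp(-inner[parent].dot(hidden)))
            log_p += np.log(p_right if went_right else 1.0 - p_right)
            node = parent
        return log_p

    print(class_log_prob(np.random.randn(dim), class_idx=12345))

The while loop is exactly the kind of data-dependent branching that a CPU eats for breakfast and a GPU can’t batch.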

Logarithmic data structures happen to be a field I’m pretty excited about, particularly for vector models and multi-class prediction problems. I’m the author of Annoy, which is a library for high-dimensional nearest neighbor queries, so it’s something I’ve spent some time thinking about.

For collaborative filtering and natural language processing, GPU architectures are highly constraining. I suspect that once you hit a billion parameters or so, more specialized networks that use logarithmic data structures will outperform for NLP and CF. The speedup from the brute force GPU approach will be offset by the smarter data structures that a CPU can handle. I haven’t seen any research on this, but it seems to me like a huge opportunity. In particular, I would love to see hybrid architectures that can use a GPU for the “dense” networks and a CPU for the “sparse” networks.

Some more font links

My blog post about fonts generated lots of traffic – it landed on Hacker News, took down my site while I was sleeping, and then obviously vanished from HN before I woke up. But it also got retweeted by a ton of people.

This clearly constitutes another proof of how effective animated gifs are. There’s some stuff out there on the internet that I think is about 10x cooler than my blog post:

Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow – blog post from a few weeks ago, doing something similar but modeling the strokes of Chinese characters as vector paths.


Learning a Manifold of Fonts – something I found the other day


Recurrent neural network handwriting generation demo – very cool RNN approach, similar to the Chinese character experiment above.


Some more Chinese characters generated using a neural network – in this case it’s a DCGAN (deep convolutional generative adversarial network) which is probably a better architecture than what I was using.


Avería – several people sent me this link about generating an “average” font


Analyzing 50k fonts using deep neural networks

For some reason I decided one night I wanted to get a bunch of fonts. A lot of them. An hour later I had a bunch of scrapy scripts pulling down fonts and a few days later I had more than 50k fonts on my computer.

I then decided to convert them to bitmaps. It turns out this is a bit trickier than it might seem. You need to crop in such a way that each character of a font is vertically aligned, and scale everything to fit the bitmap. I started with 512 * 512 bitmaps of every character. For every font you find the max y and min y of the bounding box, and do the same thing for each individual letter. After some more number juggling I was able to scale all characters down to 64 * 64.
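Here’s a rough sketch of what that rasterization step could look like with PIL – not the actual pipeline, and the canvas size and font path handling are just placeholders:

    import string
    import numpy as np
    from PIL import Image, ImageDraw, ImageFont

    chars = string.ascii_uppercase + string.ascii_lowercase + string.digits  # the 62 characters

    def render_font(ttf_path, canvas=512, out=64):
        # Render every character big, find the font-wide vertical extent so all
        # characters share the same baseline and scale, then crop and shrink to 64x64.
        font = ImageFont.truetype(ttf_path, size=canvas // 2)
        raw = []
        for c in chars:
            img = Image.new('L', (canvas, canvas), 0)
            ImageDraw.Draw(img).text((canvas // 4, canvas // 8), c, font=font, fill=255)
            raw.append(np.asarray(img, dtype=np.uint8))

        ys, _ = np.nonzero(np.max(raw, axis=0))          # font-wide min/max y
        y_min, y_max = ys.min(), ys.max()
        bitmaps = []
        for img in raw:
            xs = np.nonzero(img.max(axis=0))[0]          # per-character horizontal extent
            crop = img[y_min:y_max + 1, xs.min():xs.max() + 1] if len(xs) else img
            bitmaps.append(np.asarray(Image.fromarray(crop).resize((out, out))) / 255.0)
        return np.array(bitmaps)                         # shape (62, 64, 64)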

The result is a tensor of size 56443 * 62 * 64 * 64. Exercise for the reader: where does the number 62 come from? I stored it as a tiny little (13GB) HDF5 file that you can download here: fonts.hdf5.
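If you want to poke at the file, h5py reads it lazily. The dataset name below is an assumption – check f.keys() after downloading:

    import h5py

    with h5py.File('fonts.hdf5', 'r') as f:
        print(list(f.keys()))       # see what's actually in the file
        fonts = f['fonts']          # assumed dataset name; expected shape (56443, 62, 64, 64)
        first = fonts[0]            # h5py is lazy, so this only pulls one font into memory
        print(first.shape)          # (62, 64, 64)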

If you take the average of all fonts, here’s what you get:

[image: the average of all fonts]

Hopefully by now it should be clear where the number 62 came from. The median is a lot less blurry than the average:

[image: the median of all fonts]

Both mean and median are well-formed and legible! However, individual fonts are all over the place:

[image: characters sampled from individual fonts]

I guess I practically begged for it, stealing fonts from various sketchy places all over the web. In particular, most of the fonts don’t even have lower case versions of the letters. A minority of fonts are missing certain characters and will just output rectangles instead. And look at the ridiculous Power Ranger figure for the lower case “c”!

Training a neural network

Now, let’s train a neural network that generates characters! Specifically, what I wanted to do was create a “font vector”: a vector in latent space that “defines” a certain font. That way we embed all fonts in a space where similar fonts have similar vectors.

I built a simple neural network using Lasagne/Theano — check out the code here. It took an insane amount of time to converge, probably because there is so much data and there are so many parameters. After weeks of running, the model converged to something that looks decent.

Some notes on the model

  • 4 fully connected hidden layers of width 1024.
  • The final layer has 4096 units (64 * 64) with a sigmoid nonlinearity so that the output is between 0 (white) and 1 (black).
  • L1 loss between predictions and targets. This works much better than L2, which generates very “gray” images — you can see this qualitatively in the pictures above.
  • Pretty strong L2 regularization of all parameters.
  • Leaky rectified linear units (alpha=0.01) as the nonlinearity on each hidden layer.
  • The input layer is 102D — a 40D font vector concatenated with a 62D one-hot vector indicating the character.
  • Learning rate is 1.0, which is shockingly high — but it seemed to work well. It is decreased by 3x whenever no improvement on the 10% test set is achieved in an epoch.
  • Minibatch size is 512 — seemed like larger minibatches gave faster convergence for some weird reason.
  • No dropout, didn’t seem to help. I did add some moderate Gaussian noise (of sigma 0.03) to the font vector and qualitatively it seemed to help a bit.
  • Very simple data augmentation by blurring the input randomly with sigma sampled from [0, 1]. My theory was that this would help with fitting characters that have thin lines. (A rough sketch of the architecture follows below.)
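Putting those notes together, here’s a rough Lasagne sketch of the architecture – not the actual deep-fonts code, and the regularization weight is a made-up number:

    import theano.tensor as T
    import lasagne
    from lasagne.layers import InputLayer, DenseLayer, get_output, get_all_params
    from lasagne.nonlinearities import LeakyRectify, sigmoid

    # 102D input: 40D font vector concatenated with a 62D one-hot character vector
    l = InputLayer(shape=(None, 40 + 62))
    for _ in range(4):
        l = DenseLayer(l, num_units=1024, nonlinearity=LeakyRectify(0.01))
    l_out = DenseLayer(l, num_units=64 * 64, nonlinearity=sigmoid)  # pixels in [0, 1]

    input_var, target_var = T.matrix('input'), T.matrix('target')
    prediction = get_output(l_out, input_var)
    params = get_all_params(l_out, trainable=True)
    # L1 loss on the pixels plus L2 regularization of the parameters (1e-4 is made up)
    loss = abs(prediction - target_var).mean() + 1e-4 * sum(T.sum(p ** 2) for p in params)
    updates = lasagne.updates.sgd(loss, params, learning_rate=1.0)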

All of the code is available in the erikbern/deep-fonts repo on Github.

After convergence, we end up with a nice 40D embedding of all 50k fonts. It looks like it ends up being roughly multivariate normal — here’s the distribution of each of the 40 dimensions:

[image: distributions of the 40 embedding dimensions]

 

Playing around with the model

To start with, let’s recreate real font characters with characters generated from the network. Let’s plot the real character together with the model outputs. For each pair below, the real character is on the left, the model output on the right.

[image: real characters (left) vs. model output (right)]

These are all characters drawn from the test set, so the network hasn’t seen any of them during training. All we’re telling the network is (a) what font it is and (b) what character it is. The model has seen other characters of the same font during training, so it has to generalize from those training examples to the unseen test characters.

The network does a decent job on most of the characters, but gives up on some of the more difficult ones. For instance, characters with thin black lines are very hard for the model to predict, since if it renders the line just a few pixels to the side, that’s twice the loss of just rendering whitespace.

We can also interpolate between different fonts in continuous space. Since every font is a vector, we can create arbitrary font vectors and generate output from them. Let’s sample four fonts, put them in the corners of a square, and interpolate between them!

[image: interpolating between four fonts placed at the corners of a square]
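The grid is essentially just bilinear interpolation between the four corner vectors – something like this sketch, where font_vectors stands in for the learned embedding:

    import numpy as np

    font_vectors = np.random.randn(50000, 40)   # stand-in for the learned 40D embedding

    corners = font_vectors[np.random.choice(len(font_vectors), 4, replace=False)]
    steps = np.linspace(0, 1, 10)
    grid = []
    for i in steps:
        row = []
        for j in steps:
            # Bilinear blend of the four corner font vectors
            v = ((1 - i) * (1 - j) * corners[0] + (1 - i) * j * corners[1] +
                 i * (1 - j) * corners[2] + i * j * corners[3])
            row.append(v)   # feed v (plus a one-hot character) through the trained network to get a bitmap
        grid.append(row)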

Certain characters have multiple forms that we can interpolate between, e.g. lowercase g:

[image: interpolating between two forms of lowercase g]

We can also pick a font vector and generate new fonts from random perturbations:

[image: random perturbations of a font vector]

(btw internet god — please forgive me for wasting bandwidth on all the animated gifs in this blog post!)

We can also generate completely new fonts. If we model the distribution of font vectors as a multivariate normal, we can sample random vectors from it and look at the fonts they generate. I’m interpolating between a few of those vectors in the picture below:

[image: interpolating between randomly sampled font vectors]
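Concretely, that’s roughly: fit a multivariate normal to the learned font vectors and draw new vectors from it (again with a stand-in for the real embedding):

    import numpy as np

    font_vectors = np.random.randn(50000, 40)   # stand-in for the learned 40D embedding

    mean = font_vectors.mean(axis=0)
    cov = np.cov(font_vectors, rowvar=False)
    new_font_vectors = np.random.multivariate_normal(mean, cov, size=5)   # 5 brand new "font vectors"
    # Each of these can be fed through the trained network (with a one-hot character)
    # to render characters for a font that never existed.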

An interesting thing here is that the model has learned that many fonts use upper case characters for the lower case range — the network interpolates between Q and q seamlessly. Here’s an example of the network interpolating very slowly between two fonts where this is the main difference:

[image: interpolating between two fonts that differ mainly in upper/lower case]

Another cool thing we can do since we have all fonts in a continuous space is to run t-SNE on them and embed all fonts into the 2D plane. Here’s a small excerpt of such an embedding:

[image: excerpt of a t-SNE embedding of the fonts]
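One way to produce this is scikit-learn’s t-SNE – a sketch, with a stand-in embedding and subsampling since t-SNE on all 50k vectors is slow:

    import numpy as np
    from sklearn.manifold import TSNE

    font_vectors = np.random.randn(50000, 40)   # stand-in for the learned 40D embedding

    # Subsample, since t-SNE on all 50k vectors takes a long time
    subset = font_vectors[np.random.choice(len(font_vectors), 5000, replace=False)]
    xy = TSNE(n_components=2, perplexity=30).fit_transform(subset)   # one (x, y) point per font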

Final remarks

There are many other fun things you can do. It’s clear that there’s some room for improvement here. In particular, if I had more time, I would definitely explore generative adversarial models, which seem better at generating pictures. A few other things should be relatively easy to implement, such as batch normalization and parametric leaky rectifications. And finally, the network architecture itself could probably benefit from using deconvolutions instead of fully connected layers.

Feel free to download the data and play around with it if you’re interested!

I believe in the 10x engineer, but…

  • The easiest way to be a 10x engineer is to make 10 other engineers 2x more efficient. Someone can be a 10x engineer if they do nothing for 364 days and then convince the team to switch to a 2x more productive programming language.
  • A motivated 10x engineer in one team could be a demotivated 0.5x engineer in another team (and vice versa).
  • An average 1x engineer could easily become a 5x engineer if surrounded by 10x engineers. Engagement and work ethic are contagious.
  • The cynical reason why 10x engineers aren’t paid 10x more salary is that there is no way for the new employer to know. There is no “10x badge”.
  • …but also, a 10x engineer can go to a new company and become a 1x engineer because of bad focus / bad engagement / tech stack mismatch.
  • So unfortunately there’s less economic rationality for companies to pay 10x salaries to 10x engineers (contrary to what Google or Netflix say).
  • There’s no such thing as a 10x engineer spending time on something that never ends up delivering business value. If something doesn’t deliver business value, it’s 0x.
  • If you build something that the average engineer would not have been able to build no matter how much time they had, that can make you 100x or 1000x, or ∞x. Quoting Scott Alexander: “There is no number of ordinary eight-year-olds who, when organized into a team, will become smart enough to beat a grandmaster in chess.”
  • Most of the 10x factor is most likely explained by team and company factors (process, tech stack, etc) and applies to everyone in the team/company. Intra-team variation is thus much smaller than 10x (even controlling for the fact that companies tend to attract people of equal caliber). Nature vs nurture…
  • I’ve never met the legendary “10x jerk”. Anecdotally the outperforming engineers are generally nice and humble.
  • Don’t get hung up on the exact numbers here; they’re just for illustration purposes. E.g., someone introduced a bug in the trading system of Knight Capital that made them lose $465M in 30 minutes. Did that make them a -1,000,000x engineer? (And btw, it had more to do with company culture.) The numbers aren’t meant to be taken literally.
I got a unique photo opportunity of this small group of 10x engineers until they suddenly vanished. All I managed to hear was “Merkle trees” and “Kappa architecture”. What are the meanings of those expressions? We will never know.

Books I read in 2015

Early last year when I left Spotify I decided to do more reading. I was planning to read at least one book per week and in particular I wanted to brush up on management, economics, and technology. 2015 was also a year of exclusively non-fiction, which is a pretty drastic shift, since I grew up reading fiction compulsively for 20 years.

My goal for 2015 failed – I ended up reading about 40 books last year. Here is a small selection of the best ones. Not all of them were published in 2015.



Ben Horowitz – The Hard Thing about Hard Things

Peter Thiel – Zero to One

Both books are highly entertaining. Ben Horowitz has a very narrative style and takes the reader through the ups and downs of Opsware. Peter Thiel tries to provoke and goes on long rants about how to position yourself. He stays true to his character – a nihilist libertarian who knows something about the future that mere mortals do not. I like to think of the book as a set of mental models or factors that are all very important, but maybe not as crucial as Thiel wants them to be. Ben, in contrast, stays close to earth and discusses his approach to management. It’s more like an HBR case study, except a lot more interesting and fun.



Philip Tetlock and Dan Gardner – Superforecasting

Dan Gardner – Future Babble

I have a soft spot for cognitive biases, predictions, and making things quantitative. Both build heavily on the classic Expert Political Judgment by Philip Tetlock himself. Future Babble is by Dan Gardner, whereas Superforecasting is by the two of them together.

The books are about predictions but apply to any decision making with limited information. The fascinating TL;DR is that most people can become better decision makers by incorporating multiple factors and viewpoints, constantly calibrating their decisions, and staying away from single “big idea” explanations. If I had to pick one, read Superforecasting.

Sheryl Sandberg – Lean In

You could read this book from several perspectives, each useful. The intended perspective is if you’re a woman and want to make it in business. I’m not a woman, but there are at least two other perspectives on the book that both make it a worthwhile read. Most of the book is really a great guide on how to make it in business in America. Being from another culture (growing up with a humble Scandinavian mindset), I found that it generalized quite well beyond the original target audience. The other reason to read it as a man is that it describes the double standards and the uphill struggles of women in business. Everyone is biased; the only difference is whether they deny it or admit it. If you’re managing people, I think it’s your duty to be aware of your biases and how other people’s biases affect women. Whichever perspective you choose, this book is an easy read and worth spending a few hours on.

Andy Grove – High Output Management

Stanley McChrystal – Team of Teams

These books aren’t very similar, but in one way they represent opposite sides of how to manage efficiently. Stanley McChrystal spends a lot of time discussing the Taylor approach to management and how it breaks down in wars, especially fighting insurgents. Say what you want about the military, but it represents the ultimate management under uncertainty. Interestingly, Superforecasting borrows a chapter out of this book, spending a lot of time on Taylor and modern warfare, starting with von Moltke.

I think of management as a series of steps building on each other, ranging from maximum certainty (we know what we want to build and how to get there) to total uncertainty (we don’t know what to build and have no idea how to find out). On that ladder, High Output Management is a great introduction to the lower levels, and Team of Teams a series of essays on what breaks down at the higher levels. Neither book is exhaustive, but both are definitely worth reading if you are a manager or interested in management.

Jessica Livingston – Founders at Work

Jessica Livingston (of YCombinator) interviews a set of founders. It’s a bit dated (2007) but still a great set of stories about how a bunch of companies got built.

David Ogilvy – Ogilvy on advertising

This book roughly comes off as Peter Thiel talking about advertising. Highly opinionated, with lots of concrete advice on how to do advertising well. The book is severely dated – it’s from 1983, but really talks more about the Mad Men era of the 1960s. I still find it somewhat relevant, but more importantly it’s an easy, fun read.

More MCMC – Analyzing a small dataset with 1-5 ratings

I’ve been obsessed with how to iterate quickly based on small-scale feedback lately. One awesome website I encountered is Usability Hub, which lets you run 5-second tests. Users see your site for 5 seconds and you can ask them free-form questions afterwards. The nice thing is you don’t even have to build the site – just upload a static png/jpg and collect data.

We are redesigning our website, so I ran a bunch of experiments where I asked users how trustworthy they think the website looks, on a scale from 1 to 5. So let’s say you do that for several variants. How do you estimate the uncertainty of the average score?

You could compute the mean and the variance and use those as estimates. But let’s pause for a second. We know this distribution is not a normal distribution, because it’s constrained to integers between 1 and 5.

Instead, let’s use a multinomial distribution over the five possible ratings. Furthermore, let’s say the prior is a Dirichlet distribution. Now let’s compute the weighted average rating using the posterior of that distribution. Much cooler!

I also discovered PyMC3 and Seaborn, which turn out to be two pretty cool tools. Relevant code:
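Here’s a minimal sketch of that model in PyMC3 – not the original gist, and the rating counts below are made up for illustration:

    import numpy as np
    import theano.tensor as tt
    import pymc3 as pm

    counts = np.array([4, 6, 12, 18, 10])   # made-up counts of 1-5 star ratings

    with pm.Model():
        probs = pm.Dirichlet('probs', a=np.ones(5))                      # prior over rating probabilities
        pm.Multinomial('obs', n=counts.sum(), p=probs, observed=counts)  # likelihood of the observed counts
        pm.Deterministic('mean_rating', tt.dot(probs, np.arange(1, 6)))  # expected rating
        trace = pm.sample(2000)

    print(trace['mean_rating'].mean(), trace['mean_rating'].std())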

Output:

[image: posterior distribution of the mean rating]

Beautiful stuff! But how does it compare to the normal approximation? I’m glad you asked! Here are both on the same plot:

[image: MCMC posterior vs. normal approximation]

You can see that there is a substantial difference. This is caused by two things: (a) our sample is not drawn from a normal distribution and (b) the sample size is small.

For large sample sizes, the average of samples from a non-normal distribution converges to a normal distribution (this is the central limit theorem), but our sample size is very small (only 50 ratings in each set).

Dealing with these small datasets reminds me of the discussion between Karl Pearson and William Sealy Gosset (aka Student). Gosset, working for the Guinness brewery in Dublin, developed a lot of modern statistics working with beer samples, in particular with small batch sizes. When Gosset talked to Pearson about this, Pearson remarked that “only naughty brewers deal in small samples!” The t-test (Gosset’s) is a great example of something coming out of the necessity of working with small sample sizes. For larger samples, normal approximations work out very well.

(Side note: I found a discussion by Andrew Gelman suggesting modeling this as a softmax instead – another option worth trying if you’re interested.)

There is no magic trick

(Warning: super speculative, feel free to ignore)

As Yogi Berra said, “It’s tough to make predictions, especially about the future”. Unfortunately predicting is hard, and unsurprisingly people look for the Magic Trick™ that can resolve all the uncertainty. Whether it’s recruiting, investing, system design, finding your soulmate, or anything else, there’s always an alleged shortcut.

In the famous book Expert Political Judgment, a huge number of forecasts about the future are tracked over a long time. The conclusion is: people suck at forecasting. The only characteristic that seems somewhat predictive is what the author calls being a hedgehog vs. being a fox. Hedgehogs (bad) are people who have one mental model they apply to everything. Foxes (good) apply a huge number of different models and combine them to arrive at a conclusion. (Confusingly, hedgehogs do not hedge their bets.)

This is a quite profound conclusion that goes beyond prediction. In fact I see it in almost any hard decision I have to make.

Let’s think about recruiting, for instance. So many people claim to have found the ultimate interview question. It ranges from “how old were you when you started coding?” to “what are your open source contributions” to “please spend ten hours on this take home assignment”.

After probably 500 tech interviews I’ve realized one thing: there is no trick. Empirically the correlation between who I thought would be good and who actually turned out to be good is very small. The overconfidence effect definitely is a real thing and I’ve become more skeptical about my abilities. The one thing I’ve learned is: try to collect as many independent metrics as you can. The other day I actually came across an old paper saying something similar.

The same thing applies to investing. You might follow Peter Thiel’s advice and never invest in companies where they wear suits. Or you might have an extremely strong conviction that self-driving cars will take over, so you go out and short GM’s stock. But remember you’re up against professional portfolio managers who stare at their screens for 14 hours per day. Did they miss something you are seeing? No. What you are seeing is only a small fraction of what goes into their valuation of the company.


What makes it even worse is that both investing and recruiting are activities that take place in a market. You are fighting with n other actors to find mispricings and arbitrage opportunities. Just like buying stocks based on a single model is bad, recruiting based on a single model will give you bad candidates. What happens is basically adverse selection, and it will cause you to overpay for underperformance.

See below for a very silly market model where two companies, X (blue) and Y (red), bid on employees, but X knows something that Y doesn’t. I model that by assuming each employee’s value breaks down into a set of random factors, and X knows a few more of those factors than Y does. Click the graph to see the code.

It’s better to be further to the left (lower cost) and higher up (more value) for a company.

It’s a bit hard to see from the model, but what happens is that the total surplus (value – cost) for company X is some positive number, while for Y it’s approximately zero. This holds true any time company X knows just a bit more about employees than company Y does. Warren Buffett once said: “If you’ve been playing poker for half an hour and you still don’t know who the patsy is, you’re the patsy.”
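For what it’s worth, here’s a rough sketch of the kind of simulation I mean – not the original code behind the graph, and all the numbers are made up:

    import numpy as np

    np.random.seed(0)
    n_employees, n_factors, factors_y_sees = 1000, 10, 6   # made-up numbers

    # Each employee's true value is the sum of independent random factors
    factors = np.random.normal(size=(n_employees, n_factors))
    value = factors.sum(axis=1)

    # X sees all the factors, Y only a subset; each bids its estimate of the value
    bid_x = factors.sum(axis=1)                       # equals the true value
    bid_y = factors[:, :factors_y_sees].sum(axis=1)

    # The higher bidder "hires" the employee and pays roughly the loser's bid
    x_wins = bid_x > bid_y
    surplus_x = (value[x_wins] - bid_y[x_wins]).sum()    # positive: X pockets the difference
    surplus_y = (value[~x_wins] - bid_x[~x_wins]).sum()  # ~zero: Y pays full value when it wins
    print('X surplus: %.1f   Y surplus: %.1f' % (surplus_x, surplus_y))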

Speaking of models, one of the most useful insights from machine learning is how much value you get from combining many models. This has been the central dogma in the machine learning community for a long time, whether it’s Kaggle, or the Netflix Prize, or industry applications. All models are wrong, but some are useful – combining a bunch of those models will always outperform.

I think the best meta-model for how to think about any complex system, whether it’s recruiting or investing or anything else, is something like boosting. You start with nothing, then find the best model that explains what you see. Then you increase the weight of the misclassified examples and fit another model (a weak learner, in boosting lingo). And so on. Eventually you have built up a set of simple models that you can combine for a final prediction. (As a side note, I think the reason humans can do this so well is that they can use priors very efficiently.)
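The mechanical version of that loop is basically AdaBoost – for example, with scikit-learn and a synthetic dataset just to make the sketch runnable:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    # Each depth-1 tree is a weak learner; boosting reweights the examples the
    # previous trees got wrong and combines all of them for the final prediction.
    model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)
    model.fit(X, y)
    print(model.score(X, y))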

Peter Thiel’s advice is a set of models that are wrong, but still marginally useful, so why not include them? Does a company have network effect? Sure, marginally helpful. Does a candidate have an MBA? Etc. These all sound like weak learners to me. Using just one of them is pretty bad. Add up 100 of them and you have a pretty good prediction.

 

Installing TensorFlow on AWS

Curious about Google’s newly released TensorFlow? I don’t have a beefy GPU machine, so I spent some time getting it to run on EC2. The steps to reproduce it are pretty brutal and I wouldn’t recommend going through them unless you want to waste five hours of your life.

Instead, I recommend just getting the AMI that I built (ami-cf5028a5). Choose g2.2xlarge and you should have a box with TensorFlow running in a minute or two! Note that it’s only available in us-east-1 (Virginia) so far.
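If you prefer code over the console, something like this with boto3 should do it (the key pair name is a placeholder for your own):

    import boto3

    ec2 = boto3.resource('ec2', region_name='us-east-1')   # the AMI only exists in us-east-1
    instances = ec2.create_instances(
        ImageId='ami-cf5028a5',
        InstanceType='g2.2xlarge',
        MinCount=1, MaxCount=1,
        KeyName='my-key-pair',      # placeholder: use your own key pair
    )
    print(instances[0].id)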

If you haven’t used AWS, here’s a tutorial on how to set up an instance from an AMI. I usually use spot instances since they are much cheaper, but they have some risk of getting killed unexpectedly (interestingly, it seems rarer now; I wonder if it’s because the Bitcoin price is so much lower).

There are some known issues with TensorFlow on AWS. In particular, I wasn’t able to get better performance from g2.8xlarge compared to g2.2xlarge, which sucks, since one of the cool features of TensorFlow is that it should distribute work across GPUs. See this thread for some more info. Looking forward to seeing these issues get resolved.

What is TensorFlow?

It seems like there’s a lot of misunderstanding about TensorFlow. It’s not some crazy flow-based graphical tool for neural nets. It’s kind of boring, really. It’s just a marginally better version of Theano, with much faster compilation times and the capability to distribute work over multiple GPUs/machines. Theano completely blew my mind when I first discovered it. Its approach was super innovative, but it’s pretty rough around the edges and I think in open source the pioneers die with arrows in their backs.
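To make the “marginally better Theano” point concrete, here’s about the smallest possible TensorFlow program: you build a symbolic graph, then run it in a session, just like Theano’s build-then-execute model:

    import tensorflow as tf

    a = tf.placeholder(tf.float32)   # symbolic inputs, like Theano variables
    b = tf.placeholder(tf.float32)
    c = a * b                        # builds a graph node; nothing is computed yet

    with tf.Session() as sess:
        print(sess.run(c, feed_dict={a: 3.0, b: 7.0}))   # 21.0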

I expect TensorFlow (or maybe CGT or something else) to grow more popular. But in practice I don’t think people will use any of those straight up for machine learning – higher level libraries like Keras will be the preferred way to do most deep learning tasks.

Looking for smart people

I haven’t mentioned what I’m currently up to. Earlier this year I left Spotify to join a small startup called Better. We’re going after one of the biggest industries in the world that also turns out to be completely broken. The mortgage industry might not be the #1 industry you pictured yourself in, but it’s an enormous opportunity to fix a series of real consumer problems and join a company that I predict will be huge.


We’re 6 engineers at the moment, mostly focused on backend stuff, with a bit of frontend and machine learning as well. We have also raised a pretty substantial amount of money. At this point we’re just a few weeks from launching, so I will definitely keep you posted. If you are interested in hearing more, drop me an email at erik@better.com

MCMC for marketing data

The other day I was looking at marketing spend broken down by channel and wanted to compute some simple uncertainty estimates. I have data like this:

Channel      Total spend    Transactions
Channel A    2292.04        9
Channel B    1276.85        2
Channel C     139.59        3
Channel D     954.98        5

Of course, it’s easy to compute the cost per transaction, but how do you produce uncertainty estimates? Turns out to be somewhat nontrivial. I don’t even think it’s possible to do a t-test, which is kind of interesting in itself.

Let’s make some assumptions about the model:

  1. The cost per transaction is an unknown with some prior (I just picked uniform)
  2. The expected number of transactions is the total budget divided by the (unknown) cost per transaction
  3. The actual observed number of transactions is a Poisson of the expected number of transactions

I had always wanted to try using pymc, and now I had an excuse. See the gist below:
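A minimal sketch of that model in pymc looks something like this – not the actual gist, and the upper bound on the uniform prior is an arbitrary choice:

    import numpy as np
    import pymc

    # Spend and observed transactions per channel, from the table above
    channels = {'A': (2292.04, 9), 'B': (1276.85, 2), 'C': (139.59, 3), 'D': (954.98, 5)}

    for name, (spend, observed) in channels.items():
        # 1. Uniform prior on the unknown cost per transaction (upper bound is arbitrary)
        cost = pymc.Uniform('cost', lower=1.0, upper=5000.0)

        # 2. Expected number of transactions = budget / cost per transaction
        @pymc.deterministic
        def expected(cost=cost, spend=spend):
            return spend / cost

        # 3. The observed number of transactions is Poisson with that expectation
        transactions = pymc.Poisson('transactions', mu=expected, value=observed, observed=True)

        mcmc = pymc.MCMC([cost, expected, transactions])
        mcmc.sample(iter=20000, burn=5000)
        trace = mcmc.trace('cost')[:]
        print('Channel %s: cost per transaction ~ %.0f (90%% interval %.0f-%.0f)' % (
            name, np.median(trace), np.percentile(trace, 5), np.percentile(trace, 95)))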

The result, in the form of an animated GIF (unfortunately, animated gifs were never widely accepted as a homework format back in school):

You even get a useless graph for free!

[image: the model graph generated by pymc]

Of course, we could have computed this exactly, but I know myself and I’m very unlikely to get the expressions right without some serious effort. The conjugate prior of a Poisson is a Gamma distribution, and we would have to account for the parameterization of the cost per conversion as the budget divided by the number of conversions, which adds another factor. How fun is that? I don’t have access to any windows to write on, so unfortunately not so fun.

[image: writing equations on a window, from A Beautiful Mind]

Anyway – this particular example might not have been the most useful example of using PyMC, but I do quite like the idea of it. Especially applied to conversion analyses, since it translates directly into a generative model. I will definitely use it for some further funnel analysis – in particular when the number of data points is very small and the model is very complex.