Let's consider a toy model where you're hiring for two things and that those are equally valuable. It's not very important what those are, so let's just call them “thing A” and “thing B” for now.

This is a blog post originally featured on the Better engineering blog. If you want to link to this article or share it, please go to the original post URL! Separately, I'm sorry it's been so long with no posts on this blog.

When I started building up a tech team for Better, I made a very conscious decision to pay at the high end to get people. I thought this made more sense: they cost a bit more money to hire, but output usually more than compensates for it.

A modern tech stack typically involves at least a frontend and backend but relatively quickly also grows to include a data platform. This typically grows out of the need for ad-hoc analysis and reporting but possibly evolves into a whole oil refinery of cronjobs, dashboards, bulk data copying, and much more.

It started with a tweet:
New years resolution: every plot I make during 2018 will contain uncertainty estimates
— Erik Bernhardsson (@fulhack) January 7, 2018 Why? Because I've been sitting in 100,000,000 meetings where people endlessly debate whether the monthly number of widgets is going up or down, or whether widget method X is more productive than widget method Y.

I have done roughly 2,000 interviews in my life. When I started recruiting, I had so much confidence in my ability to assess people. Let me just throw a couple of algorithm questions at a candidate and then I'll tell you if they are good or not!

I've been reading up on operations research lately, including queueing theory. It started out as a way to understand the very complex mortgage process (I work at a mortgage startup) but it's turned into my little hammer and now I see nails everywhere.

There are often close relationships between top level business metrics. For instance, it's well known that retention has a super strong impact on the valuation of a subscription business. Or that the % of occupied seats is super important for an airline.

I just spent a few days in Italy, on the Ligurian coast. Even though we were on the west side of Italy, the Mediterranean sea was to the east, because the house was situated on a long bay.

I've written before about the importance of iterating quickly but I didn't necessarily talk about some concrete things you can do. When I've built up the tech team at Better, I've intentionally optimized for fast iteration speed above almost everything else.

How hard can it be to compute conversion rate? Take the total number of users that converted and divide them with the total number of users. Done. Except… it's a lot more complicated when you have any sort of significant time lag.

I was reading yet another blog post titled “Why our team moved from <language X> to <language Y>” (I forgot which one) and I started wondering if you can generalize it a bit. Is it possible to generate a N * N contingency table of moving from language X to language Y?

Here's a fun analysis that I did of the pitch (aka. frequency) of various languages. Certain languages are simply pronounced with lower or higher pitch. Whether this is a feature of the language or more a cultural thing is a good question, but there are some substantial differences between languages.

Pareto efficiency is a useful concept I like to think about. It often comes up when you compare items on multiple dimensions. Say you want to buy a new TV. To simplify it let's assume you only care about two factors: price and quality.

Why does it suck to wait for things? In a previous post I analyzed a NYC subway dataset and found that at some point, quite early, it's worth just giving up.
This isn't a proof that the subway doesn't run on time – in fact it might actually proves that the subway runs really well.

As you may know, one of my (very geeky) interests is Approximate nearest neigbor methods, and I'm the author of a Python package called Annoy.
I've also built a benchmark suite called ann-benchmarks to compare different packages.

(I accidentally published an unfinished draft of this post a few days ago – sorry about that).
There's a lot of sources preaching the benefits of dollar cost averaging, or the practice of investing a fixed amount of money regularly.

Apparently MTA (the company running the NYC subway) has a real-time API. My fascination for the subway takes autistic proportions and so obviously I had to analyze some of the data. The documentation is somewhat terrible, but here's some relevant code for how to use the API:

I've been spending several hundred bucks renting GPU instances on AWS over the last year. The speedup from a GPU is awesome and hard to deny. GPUs have taken over the field. Maybe following the footsteps of Bitcoin mining there's some research on using FPGA (I know very little about this).

This is another post based on my talk at NYC Machine Learning. The previous two parts covered most of the interesting parts, but there are still some topics left to be discussed. To go back and read the meaty stuff, check out

This is a blog post rewritten from a presentation at NYC Machine Learning on Sep 17. It covers a library called Annoy that I have built that helps you do nearest neighbor queries in high dimensional spaces.

This is a blog post rewritten from a presentation at NYC Machine Learning last week. It covers a library called Annoy that I have built that helps you do (approximate) nearest neighbor queries in high dimensional spaces.

Here's a problem that I used to give to candidates. I stopped using it seriously a long time ago since I don't believe in puzzles, but I think it's kind of fun.
Let's say you have a function that simulates a random coin flip.

I have spent some time lately with D3. It's a lot of fun to build interactive graphs. See for instance this demo (will provide a longer writeup soon).
D3 doesn't have support for 3D but you can do projections into 2D pretty easily.

Saw this link on Hacker News the other day: The Highway Lane Next to Yours Isn’t Really Moving Any Faster
The article describes a phenomenon unique to traffic where cars spread out when they go fast and get more compact when they go slow.

I saw a bunch of tweets over the weekend about Peter Norvig claiming there's a negative correlation between being good at programming competitions and being good at the job. There were some decent Hacker News comments on it.

This is the last post about deep learning for chess/go/whatever. But this really cool paper by Christopher Clark and Amos Storkey was forwarded to me by Michael Eickenberg. It's about using convolutional neural networks to play Go.

My previous blog post about deep learning for chess blew up and made it to Hacker News and a couple of other places. One pretty amazing thing was that the Github repo got 150 stars overnight.

I've been meaning to learn Theano for a while and I've also wanted to build a chess AI at some point. So why not combine the two? That's what I thought, and I ended up spending way too much time on it.

Say you build a machine learning model, like a movie recommender system. You need to optimize for something. You have 1-5 stars as ratings so let's optimize for mean squared error. Great.
Then let's say you build a new model.

Note: This is a silly application. Don't take anything seriously.
Benford's law describes a phenomenon where numbers in any data series will exhibit patterns in their first digit. For instance, if you took a list of the 1,000 longest rivers of Mongolia, or the average daily calorie consumption of mammals, or the wealth distribution of German soccer players, you will on average see that these numbers start with “1” about 30% of the time.

I’ve been spending quite some time lately playing around with RNN’s for collaborative filtering. RNN’s are models that predict a sequence of something. The beauty is that this something can be anything really – as long as you can design an output gate with a proper loss function, you can model essentially anything.

I wanted to share some more insight into the algorithms we use at Spotify. One matrix factorization algorithm we have used for a while assumes that we have user vectors $$ bf{a}_u $$ and item vectors $$ bf{b}_i $$ .

One thing I encountered today was a trick using bagging as a way to go beyond a point estimate and get an approximation for the full distribution. This can then be used to penalize predictions with larger uncertainty, which helps reducing false positives.

Sometimes you have to maximize some function $$ f(w_1, w_2, ldots, w_n) $$ where $$ w_1 + w_2 + ldots + w_n = 1 $$ and $$ 0 le w_i le 1 $$ . Usually, $$ f $$ is concave and differentiable, so there's one unique global maximum and you can solve it by applying gradient ascent.

The simple way to get featured on big data blog these days seem to be
Build something that does 1 thing super well but nothing else Benchmark it against Hadoop Publish stats showing that it's 100x faster than Hadoop $$$ Spark claims their 100x faster than Hadoop and there's a lot of stats showing Redshift is 10x faster than Hadoop.

This was posted on the Twitter Engineering blog a few days ago: Dimension Independent Similarity Computation (DISCO)
I just glanced at the paper, and there's some cool stuff going on from a theoretical perspective. What I'm curious about is why they didn't decide to use dimensionality reduction to solve such a big problem.

Something that pops up pretty frequently is to implement time decay, especially where you have recursive chains of jobs. For instance, say you want to keep track of a popularity score. You calculate today's output by reading yesterday's output, discounting it by $$ exp(-lambda Delta T) $$ and then adding some hit count for today.