<h1><a href="https://erikbern.com/2023/12/13/simple-sabotage-for-software.html">Simple sabotage for software</a> (2023-12-13)</h1>
<p>The OSS (the CIA's wartime predecessor) produced a fantastic book at the height of World War 2 called <a href="https://www.cia.gov/static/5c875f3ec660e092cf893f60b4a288df/SimpleSabotage.pdf">Simple Sabotage</a>. It laid out various ways for infiltrators to ruin the productivity of a company. Some of the advice is timeless, for instance the section on “General Interference with Organizations and Production”:</p>
<ol>
<li>Insist on doing everything through “channels”. Never permit short-cuts to be taken in order to expedite decisions.</li>
<li>Make “speeches”. Talk as frequently as possible and at length. Illustrate your “points” by long anecdotes and accounts of personal experience. Never hesitate to make a few “patriotic” comments.</li>
<li>When possible, refer all matters to committees for “further study and consideration”. Attempt to make committees as large as possible — never less than five.</li>
<li>Bring up irrelevant issues as frequently as possible.</li>
<li>Haggle over precise wordings of communications, minutes, resolutions.</li>
<li>Refer back to matters decided upon at the last meeting and attempt to re-open the question of the advisability of that decision.</li>
<li>Advocate “caution”. Be “reasonable” and urge your fellow conferees to be “reasonable” and avoid haste which might result in embarrassments or difficulties later on.</li>
<li>Be worried about the propriety of any decision — raise the question of whether such action as is contemplated lies within the jurisdiction of the group or whether it might conflict with the policy of some higher echelon.</li>
</ol>
<p>I guess I've always been fascinated by how well this has stood the test of time. I even got this particular section framed and hung up at our office:</p>
<p><img src="https://erikbern.com/assets/simple_sabotage.jpeg" alt="simple sabotage"></p>
<h1 id="your-mission">Your mission</h1>
<p>Let's say you were employed as a CTO behind the front lines and you wanted to destroy productivity for as long as you can without getting caught. You can of course make a series of <em>obviously</em> bad decisions, but you'd get fired quickly. The real goal here is to sap the company of its productivity slowly, while maintaining a façade of plausibility and normalcy. What are some things you can do?</p>
<h2 id="technology">Technology</h2>
<ul>
<li>When joining, require a 6-18 month rewrite of core systems. Blame the previous CTO.</li>
<li>Encourage everyone to use their own languages and frameworks.</li>
<li>Split systems along arbitrary boundaries: maximize the number of systems involved in any feature.</li>
<li>Encourage a complex dev setup: running a service mesh with a dozen services at a minimum.</li>
<li>Make sure the production environment differs from developer environments in as many ways as possible.</li>
<li>Deploy as infrequently as possible. Urge extreme caution about deployments. Leverage any production issue as a reason to “pull the brakes”.</li>
<li>Introduce very complex processes for code change and common workflows. Blame it on “security” or “compliance”.</li>
<li>Make sure every task is tracked in a task tracker and has been reviewed, prioritized, and signed off by a group of at least five people.</li>
<li>Disallow anything outside the scope of the original task, such as code cleanup or other drive-by improvements.</li>
<li>Build in-house versions of almost anything that's <em>not</em> a core competency. Justify it by “avoiding vendor lock-in”.</li>
<li>Insist on adding abstraction layers on top of everything. Use vendors that are themselves abstractions and then add extra layers of abstractions.</li>
<li>Encourage technical decisions based on wildly optimistic expectations of scale. Plan for at least 3 orders of magnitude more load than you have.</li>
<li>Encourage communal ownership of systems. Make sure no one feels responsible for maintenance.</li>
<li>Insist on centralizing almost everything as a “platform” owned by the “platform team”. Understaff the platform team and prevent other teams from building anything that the platform might “own”.</li>
<li>Make the platform team iterate on APIs frequently and mandate that other teams refactor their code to the latest version as frequently as possible.</li>
<li>Hire “architects” and require even small changes to have an “architecture review”.</li>
<li>Require even small changes to have a “security review”.</li>
</ul>
<h2 id="product">Product</h2>
<ul>
<li>Dismiss useful metrics on academic grounds (e.g. “biased” or “lagging indicator”).</li>
<li>Pick vanity metrics with little or no correlation with business value and a high amount of noise.</li>
<li>Insist that everything be done as a “big bet”, and that it be completely finished before it is deployed.</li>
<li>Consider every feature a “must-have” and critical part of “version zero”. Do not budge.</li>
<li>Develop incredibly detailed “strategic” plans.</li>
<li>Pivot frequently.</li>
<li>Dismiss obvious improvements as “local optimization”.</li>
<li>Use the latest trends to tie up resources. Kickstart a vacuous “AI strategy” that seems plausible on the surface. Spend heavily on vendors and consultants for it.</li>
<li>Encourage product managers to spend most of their time on “strategy” and “planning”.</li>
<li>Make it hard/impossible for engineers and product managers to use the product internally.</li>
<li>Dismiss users as “stupid” internally.</li>
</ul>
<h2 id="leadership">Leadership</h2>
<ul>
<li>Link compensation to title, and title to team size, in order to incentivize bloat.</li>
<li>Make big talk about strategies, features, or technical complexity.</li>
<li>Make expensive acquisitions to enter new product areas. Refer to “synergies”. Shut down the acquired product.</li>
<li>Use lots of dotted lines in the reporting structure.</li>
<li>As much as possible, have people report to managers in other teams, locations, or functions. Make sure managers are ill-equipped to supervise their reports.</li>
<li>Frequently reassign underperformers to other teams.</li>
<li>Put high performers on highly speculative R&D projects with unclear deliverables.</li>
<li>Always require a meeting for any decision, no matter how trivial.</li>
<li>Insist that every “stakeholder” needs to be present in the meeting.</li>
</ul>
<h2 id="hiring">Hiring</h2>
<ul>
<li>Create a hiring process that seems objective but is in reality subjective.</li>
<li>Reject the best people based on “poor culture fit” or other vague criteria.</li>
<li>Hire the weakest candidates based on “potential” or “attitude” or other vague criteria.</li>
<li>Recruit very expensive senior leaders with large headcount promises.</li>
<li>Use inflated titles and made-up roles to attract opportunists.</li>
<li>Hire highly specialized “experts”, then create contrived projects to prevent them from quitting.</li>
<li>Use specialization as a justification to hire other, complementary people.</li>
</ul>
<h2 id="project-management">Project management</h2>
<ul>
<li>Require very detailed estimates for any project.</li>
<li>Encourage projects that span as many teams as possible, ideally in different locations.</li>
<li>Add new requirements that depend on work done by other teams.</li>
<li>Frequently make use of expensive agencies. Make the scope ambiguous and hand unfinished prototypes over to the in-house team to finish.</li>
<li>Build complex “self-service” systems for stakeholders in other teams.</li>
</ul>
<figure>
<img src="https://erikbern.com/assets/sabotage.jpeg" alt="still from the Beastie Boys music video Sabotage"/>
<figcaption><small><em>This is from the 1994 music video <a href="https://www.youtube.com/watch?v=z5rRZdiu1UE">Sabotage</a> by Beastie Boys. The lyrics are mostly about technology leadership and developer productivity.</em></small></figcaption>
</figure>
<h1 id="the-outcome">The outcome</h1>
<p>It's a hard job to pull off! But if you can parachute behind enemy lines and land a job as a CTO, you can make this happen.</p>
<p>For the non-saboteur: this is obviously a story about how to get the most out of your team. Productivity in general is death by a thousand cuts, and no single item above will ruin it on its own. But productivity adds up on a logarithmic scale, meaning all these things compound multiplicatively: do 100 things that are each a 5% tax on productivity, and you have slowed everything down by a factor of 131! The only way to keep engineers happy is to say no to 100 minor cuts that each sound plausible but are <a href="https://www.youtube.com/watch?v=6XsOO8j8NK0">specious</a>.</p>
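<p>The compounding arithmetic is easy to check with a two-line sketch:</p>
<pre><code class="language-python"># One hundred independent 5% slowdowns multiply rather than add.
slowdown = 1.05 ** 100
print(f"total slowdown: {slowdown:.1f}x")  # total slowdown: 131.5x
</code></pre>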
<h1><a href="https://erikbern.com/2022/12/07/what-ive-been-working-on-modal.html">What I have been working on: Modal</a> (2022-12-07)</h1>
<p><em>Long story short:</em> I'm working on a super cool tool called <a href="https://modal.com">Modal</a>. Please check it out — it lets you run things in the cloud without having to think about infrastructure. Scaling out, scheduling, containerization, using GPUs, setting up webhooks, and all kinds of other stuff. It's primarily meant for data teams. We aren't <em>quite</em> live, but you can sign up for our waitlist.</p>
<h2 id="context">Context</h2>
<p>Most of my career has been in data. I spent seven years working at Spotify, doing everything from large-scale numerical methods to making charts for board decks. I then spent six years as a CTO, although I managed the data team directly for a long time and would occasionally write some data code.</p>
<p>Data<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> strikes me as a discipline that deserves a bit more love. It's grown from obscurity to become some meaningful % of software engineering, but the state of the art in terms of tools and workflows is still emerging.</p>
<h2 id="data-as-its-own-discipline">Data as its own discipline</h2>
<p>I spent a lot of time as a CTO, and if I had to condense how to make engineers productive into one thing, it would be something like: make the feedback loops fast.</p>
<p>What sort of “loops” am I talking about? In frontend development it's saving the code in the editor on one screen and seeing the updates in the browser on another. In backend development it's running unit tests (or sometimes, just compiling).</p>
<p>Data is sort of weird because you have to run things on production data to have these sorts of feedback loops. Whether you're running SQL or doing ML, it's often pointless to do it on non-production data. This violates a holy wall for a lot of software engineers: the strict separation of local and prod. <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup></p>
<p>There are other differences too! Data teams often need to change infrastructure a lot more often (sometimes <em>every</em> new cron job needs a Terraform update), have very “bursty” needs for compute power, and need a much wider range of hardware (GPUs! High-memory jobs! Etc.). Not to mention the “liberal” (ahem) standards for packaging cutting-edge ML code <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup>.</p>
<p>There's a weird sort of backend-normative view of what data teams should do, but I think it's very misguided. The UN should impose sanctions on anyone complaining that “data teams need to adopt software engineering practices”. The needs are different! Let's let the right workflows emerge from what makes teams the most productive, and let's let data workflows stand on their own feet.<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<h2 id="lets-make-data-teams-more-productive">Let's make data teams more productive!</h2>
<p>I started looking at this a few years ago, looked at the <a href="https://mattturck.com/data2021/">absurdly crowded market landscape</a>, and spent a lot of time talking to <a href="https://twitter.com/sarahcat21">very smart people</a>. I kind of wanted to rebuild everything! Which is an incredibly aspirational and dumb idea. Looking at a lot of different parts of the stack, I started getting interested in the lowest part of it, which I've been thinking of as the “runtime”.</p>
<p>A lot of the issues with data productivity, I think, come back to the runtime:</p>
<ul>
<li>Infrastructure is often a “chore” in the end. After you get something running locally, you now have to do a ton of complex testing/configuration to ship it.</li>
<li>The feedback loops when working with infra are super slow, because every iteration requires deploying code to K8s or similar.</li>
<li>Things break in production in weird ways because the environment is different.</li>
<li>Every company that reaches a certain stage tends to build its own data platform. This seems… wasteful?</li>
<li>It turns out it's hard to build abstractions on top of Kubernetes without it leaking through in 100 ways.</li>
</ul>
<p>Some very large fraction of this has effectively been solved in the last few years by shoving all the transformations into SQL. I'm a <a href="/2018/08/30/i-dont-want-to-learn-your-garbage-query-language.html">big fan of SQL</a>. Better tooling made it 10x easier to write complex pipelines in SQL, and so its adoption increased by a lot. But there are still a lot of things where you need <em>code</em>. What can we do to make data teams 10x more productive when they write code?</p>
<h2 id="lets-build-a-new-runtime">Let's build a new runtime</h2>
<p>I wanted to build something that takes code on a user's computer and launches it in the cloud within a second. Skimming over all the intermediate steps, I built Modal.</p>
<p>Let's write some code that computes the square of 42 and prints it to standard out:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">import</span> modal
stub <span style="color:#f92672">=</span> modal<span style="color:#f92672">.</span>Stub()
<span style="color:#a6e22e">@stub.function</span>
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">square</span>(x):
<span style="color:#66d9ef">print</span>(<span style="color:#e6db74"></span><span style="color:#e6db74">"</span><span style="color:#e6db74">This code is running on a remote worker!</span><span style="color:#e6db74">"</span>)
<span style="color:#66d9ef">return</span> x<span style="color:#f92672">*</span><span style="color:#f92672">*</span><span style="color:#ae81ff">2</span>
<span style="color:#66d9ef">if</span> __name__ <span style="color:#f92672">==</span> <span style="color:#e6db74"></span><span style="color:#e6db74">"</span><span style="color:#e6db74">__main__</span><span style="color:#e6db74">"</span>:
<span style="color:#66d9ef">with</span> stub<span style="color:#f92672">.</span>run():
<span style="color:#66d9ef">print</span>(<span style="color:#e6db74"></span><span style="color:#e6db74">"</span><span style="color:#e6db74">the square is</span><span style="color:#e6db74">"</span>, square<span style="color:#f92672">.</span>call(<span style="color:#ae81ff">42</span>))
</code></pre></div><p>The foundational building block in Modal is a decorator that takes any Python function and moves its execution to the cloud. This might seem like a very trivial thing, but it turns out you can use this as a very powerful primitive to build a lot of cool stuff.</p>
<p>Let's run this code:</p>
<center>
<video controls style="width: 100%;">
<source src="/assets/modal_get_started.mov" type="video/mp4">
<track kind="captions" />
</video>
</center>
<p>Two things worth noting here:</p>
<ul>
<li>This launches the code into the cloud in ~1s.</li>
<li>The <code>square</code> function runs in the cloud, but when it prints, we see it on the local stdout.</li>
</ul>
<p>Things like this let us take the cloud inside the innermost feedback loop. If we <em>edit</em> the code and re-run the app, the new code just runs, instead of the loop of build container → push container → trigger job → download logs, which can take a few minutes if you're lucky, or a few hours if you're unlucky (especially if version control and CI/CD are in that feedback loop).</p>
<p>The other benefit is that you only have a single environment. We can define arbitrary environments <em>in code</em> and have Modal build them for us, in the cloud:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#a6e22e">@stub.function</span>(image<span style="color:#f92672">=</span>modal<span style="color:#f92672">.</span>Image<span style="color:#f92672">.</span>debian_slim()<span style="color:#f92672">.</span>pip_install([<span style="color:#e6db74"></span><span style="color:#e6db74">"</span><span style="color:#e6db74">numpy</span><span style="color:#e6db74">"</span>]))
</code></pre></div><p>This says: run the function <code>square</code> inside a container image that has <code>numpy</code> installed in it. When we run this, if the image doesn't exist, it will be built in the cloud for us. We build it super fast — the above example in a couple of seconds, since we built our own container builder and have fast machines in the cloud with super fast internet.</p>
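<p>To make that concrete, here's a minimal sketch (my own illustration, reusing the toy <code>square</code> function from above) of attaching an environment to a function:</p>
<pre><code class="language-python">import modal

stub = modal.Stub()

# The decorator arguments define the environment the function runs in.
@stub.function(image=modal.Image.debian_slim().pip_install(["numpy"]))
def square(x):
    import numpy as np  # importable remotely because the image installs numpy
    return int(np.square(x))

if __name__ == "__main__":
    with stub.run():
        print(square.call(42))  # 1764
</code></pre>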
<p>What about using a GPU?</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#a6e22e">@stub.function</span>(gpu<span style="color:#f92672">=</span>modal<span style="color:#f92672">.</span>gpu<span style="color:#f92672">.</span>A100())
</code></pre></div><p>Or a cronjob running in the cloud?</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#a6e22e">@stub.function</span>(schedule<span style="color:#f92672">=</span>modal<span style="color:#f92672">.</span>Period(hours<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>))
</code></pre></div><p>Any function in Modal can also be used to map over it, fanning it out to lots of containers running in the cloud (<a href="https://modal.com/docs/guide/ex/batch_inference_using_huggingface">see code</a>):</p>
<center>
<video controls style="width: 100%;">
<source src="/assets/modal_batch.mov" type="video/mp4">
<track kind="captions" />
</video>
</center>
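<p>As a rough sketch of the fan-out pattern (reusing the <code>square</code> function from the first example, and assuming <code>map</code> behaves like the built-in <code>map</code> over an iterable; see the linked batch inference example for the real thing):</p>
<pre><code class="language-python">import modal

stub = modal.Stub()

@stub.function
def square(x):
    return x**2

if __name__ == "__main__":
    with stub.run():
        # Each input can run in its own container; results stream back in order.
        results = list(square.map(range(100)))
        print(sum(results))  # 328350
</code></pre>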
<p>There are plenty of other things too. And all of this without making you write a single line of YAML in the process (in fact, there is no configuration in Modal — everything is in code).</p>
<p>And since we take care of all the infrastructure, it's trivial to get started. Once you have an account <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup>, just run:</p>
<ol>
<li><code>pip install modal-client</code></li>
<li><code>modal token new</code></li>
</ol>
<p>This initializes an API token and you're ready to run Modal. We run everything in our infrastructure, so there's nothing to set up other than that.</p>
<h2 id="what-are-some-things-you-can-build-using-modal">What are some things you can build using Modal?</h2>
<p>We've been focusing a lot on machine learning recently, in particular model inference — <a href="https://github.com/modal-labs/modal-examples/blob/main/06_gpu/stable_diffusion_cli.py">Stable Diffusion</a> is obviously the coolest thing right now, but we also support a wide range of other things: Using OpenAI's <a href="https://modal.com/docs/guide/whisper-transcriber">Whisper model for transcription</a>, <a href="https://modal.com/docs/guide/ex/dreambooth_app">Dreambooth</a>, <a href="https://modal.com/docs/guide/ex/webcam">object detection</a> (with a webcam demo!). It's possible to deploy an ML model on Modal in a few minutes, and the actual deployment step usually takes a few <em>seconds</em>.</p>
<p>But Modal is really a general-purpose compute layer you can use for a lot of stuff. You can use us to <a href="https://modal.com/docs/guide/ex/duckdb_nyc_taxi">query datasets using DuckDB</a>, for <a href="https://modal.com/docs/guide/ex/screenshot">web scraping</a>, and for many other things: data pipelines, cron jobs, large-scale simulations, and more. And it's serverless<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup>, so you only pay for actual usage.</p>
<h2 id="how-does-it-work">How does it work?</h2>
<p>Taking code on a user's computer and launching it into custom containers in the cloud in less than a second isn't trivial. Along the way, we ended up building a lot of custom stuff: in particular a container runner, an image builder, and our own filesystem. Most of this is written in Rust, for performance and safety.</p>
<p>We decided to not build this on top of tools like Docker/Kubernetes because we want infrastructure to be <em>fast</em>. I met a lot of VCs and other people while I was still just working alone on a prototype and most of them told me I was nuts when I started talking about building custom file systems and container engines. But we built it and it's <a href="https://twitter.com/bernhardsson/status/1590780694951006208">working beautifully</a>! Modal has no problem building a 100GB container, and then booting up 100 of those containers — you can do the whole thing in a few seconds. This is what it's built for. <sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup></p>
<p>There's a lot of other complexity: the work of scheduling tasks onto workers, exposing everything as a Python SDK, and much more. I'm very spoiled to have this team <sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup> and really proud of where we've gotten so far. BUT it's also clear that we're extremely early in this journey — we're working on infrastructure at a very low level and it's going to take a lot of time to support all the cool stuff we want to get to. I will be posting a lot more about it!</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>I'm deliberately vague about what exact role I mean here: take it to mean data engineers, data scientists, ML engineers, analytics engineers, and maybe more roles. The current division of responsibility is still in flux, and I think it's a mistake to overfit to what we have today. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>I think it's interesting that Snowflake was (arguably) the first major database to <em>only</em> run in the cloud, and it sort of makes sense from this view. OLTP databases typically need to run in multiple places: in prod, in dev, and in CI. An OLAP database like Snowflake doesn't have this portability requirement. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I looked at some code recently where the author had put the entire virtual env in a zip, and the notebook included steps that patched installed site-packages. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>To be clear: I would still recommend every data person to learn a lot about “traditional” software engineering! But let's not treat it as the “right” way to do things. <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Unfortunately account registration still isn't open, because we aren't <em>quite</em> ready for it! <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>The term <em>serverless</em> is applied waaaaay too liberally by vendors today, so I struggle a bit with the term, to be honest. <a href="#fnref:6" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>Pushing and pulling OCI images is incredibly inefficient. Only a very small fraction of the image content is ever read, and there's an extremely high degree of overlap between even unrelated images. We're exploiting this by running our own file system in FUSE which we expose to containers. But under the hood, we use a content-addressed storage system. This means we don't have to copy around large images, because we can fetch individual files lazily when they are needed, and we can also achieve a very high cache efficiency for these files. I hope to write a blog post about this in the future! <a href="#fnref:7" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>Just one random thing: we have 7 people on the team and 5 <a href="https://en.wikipedia.org/wiki/International_Olympiad_in_Informatics">IOI</a> gold medals, which is a fairly irrelevant metric, but I think it's still cool. <a href="#fnref:8" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
<h1><a href="https://erikbern.com/2022/10/19/we-are-still-early-with-the-cloud.html">We are still early with the cloud: why software development is overdue for a change</a> (2022-10-19)</h1>
<p>This is in many respects a successor to a
<a href="/2021/04/19/software-infrastructure-2.0-a-wishlist.html">blog post I wrote last year</a>
about what I want from software infrastructure, but the ideas morphed in my head into something sort of wider.</p>
<h2 id="the-genesis">The genesis</h2>
<p>I encountered AWS in 2006 or 2007 and remember thinking that it's crazy — why would anyone want to put their stuff in someone else's data center?
But only a couple of years later, I was running a bunch of stuff on top of AWS.</p>
<p>Back then, AWS had something like two services: EC2 and S3. Today, that number is closer to 350: <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></p>
<p><img src="https://erikbern.com/assets/aws_services.png" alt="aws services"></p>
<p>And today, the cloud is obviously here… I mean, despite
<a href="https://a16z.com/2021/05/27/cost-of-cloud-paradox-market-cap-cloud-lifecycle-scale-growth-repatriation-optimization/">what some people may think</a> about cloud adoption <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup>
it's clear that building technology is vastly different today than it was a decade ago, and the cloud deserves a big part of the credit for it.</p>
<h2 id="i-think-we-might-be-early-though">I think we might be early though?</h2>
<p>In some sort of theoretical abstract platonic form type thing, the cloud should offer us</p>
<ul>
<li>Infinite scalability</li>
<li>Less time spent on infrastructure</li>
<li>Fewer constraints</li>
<li>Lower costs.</li>
</ul>
<p>Here's a random assortment of things I feel like we <em>should</em> have, if the cloud had truly delivered. But we don't:</p>
<ul>
<li>When I compile code, I want to fire up 1000 serverless containers and compile tiny parts of my code in parallel.</li>
<li>When I run tests, I want to parallelize all of them. Or define a grid with 1000 combinations of parameters, or whatever.</li>
<li>I never ever again want to think about IP rules. I want to tell the cloud to connect service A and B!</li>
<li>Why is Bob in the ops team sending the engineers a bunch of shell commands they need to run to update their dev environment to support the latest Frobnicator version? For the third time this month?</li>
<li>Why do I need to SSH into a CI runner to debug some test failure that I can't repro locally?</li>
</ul>
<p>I could go on!</p>
<p>The current state doesn't strike me as a slam dunk improvement along every axis. Most egregiously, why have the feedback loops for writing code become <em>longer</em>? And the difference between local and prod environments <em>larger</em>? I don't know… I look at modern programming and cloud computing and it feels like we have most of the building blocks, but we're still so far from the ideal. I can't help getting the feeling that we basically just did this so far: <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p><img src="https://erikbern.com/assets/docker.jpeg" alt="docker"></p>
<p>Docker is an incredible piece of technology for many reasons.
But just using it the way the meme above implies suggests we didn't fundamentally change development to take advantage of this new magic technology; we just sort of think of it as an <em>adapter</em>. We've seen Cloud 1.0 and maybe <a href="https://www.jeremydaly.com/is-learning-serverless-really-that-hard/">Cloud 1.5</a>, but there's a lot more remaining for us to reach Cloud 2.0.</p>
<h2 id="adapters-and-toys">Adapters and toys</h2>
<p>I spent seven years at Spotify so let me tell you a story. It might seem tangential but hear me out:</p>
<p>Music consumption in 2008 was going in two directions. iTunes had successfully shifted the consumption of music from physical media to digital, but the consumption model of <em>owning</em> content remained the same.
At the same time, Napster et al. showed the promise of the cloud but were unable to license content and do it legally.</p>
<p>Increasing broadband penetration initially enabled music downloading, but quickly also made the location of the music bits irrelevant.</p>
<p>Spotify then (imo!) pulled off a massive Pareto improvement — they offered a service that was better than both iTunes and piracy for consumers, and somehow also had a business model that was sufficiently producer-friendly. I remember the magic moment from the first time I tried Spotify — it was like having 30M tracks <em>on my own computer.</em></p>
<p><img src="https://erikbern.com/assets/locomotive.jpeg" alt="locomotive">
<em>A steam locomotive pulling horse carriages:</em> <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup></p>
<p>I find this useful to illustrate how music consumption went through a couple of shifts in rapid succession:</p>
<p>The first shift is trying to put the old stuff into the new thing. This is the “lift and shift” or “adapter” type approach. These things take advantage of the new tech to some extent, but only partially, and are stuck with legacy technology. The second shift involves products that are native to the new technology, but look like “toys”. Maybe there's no obvious business model, or the tool solves just one particular use case. But it gives some conceptual insight into what's coming.</p>
<p>The exciting thing happens when the toys evolve into something that is obviously better than the legacy technology, and also better than the adapters. Online streaming services basically represent a third and final shift: take the toys and turn them mainstream.</p>
<h2 id="cloud-is-coming-for-your-workflow">Cloud is coming for your workflow</h2>
<p>Somewhat ironically, software development is one of a vanishingly small subset of knowledge jobs for which the main work tool hasn't moved to the cloud. We still write code locally, thus we're constrained to things that work in the same way both locally and in the cloud. Thus, adapter tools like Docker.</p>
<p>I get this. I <em>love</em> writing code just on my laptop because of the fast feedback loops it creates. My laptop is full of weird-ass scripts and half-finished projects, and I haven't encountered any other tool that lets me iterate on these things as quickly, maybe with the exception of <code>ssh</code> into a devbox, which is something we've done for about 50 years now: <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></p>
<p><img src="https://erikbern.com/assets/remote_login.jpeg" alt="remote login"></p>
<p>We could of course put the whole development environment in a VM in the cloud, but I'm not sure that's much more than the “lift and shift” I already talked about — in the sense that we're not taking advantage of all the cloud stuff — scaling up/down (including to zero), flexible resource needs, pay-as-you-go, etc. I think we're overdue for some larger change in how we develop, deploy, and run code.</p>
<p>That doesn't mean someone is confiscating your beloved local development environment! Or I mean, it sort of does, but it won't happen until the new tools are <em>better</em> — which frankly is not the case today, outside of a few use cases. Like I said, the fast feedback loop of local coding is hard to beat <em>today</em>, but there's a lot of good reasons why an environment native to the cloud would let you iterate faster eventually. And there are other benefits too, like reducing/removing big environment differences which represent a massive tax on productivity.</p>
<h2 id="to-be-free-we-need-to-break-free-of-our-past">To be free, we need to break free of our past</h2>
<p>Our stacks are based on 50 years of technology. On top of that, we also have a lot of legacy portability needs: we needed the same software to run in the cloud, locally, and on-prem. These complex interdependencies will take a while to break free of.</p>
<p>Rethinking these abstractions to be native to the new world lets us start over and redefine them around what we actually need. Do we need IP addresses, CIDR blocks, and NATs, or can we focus on which services have access to which resources? Do we need to reserve resources in advance (like having a cluster of n workers, each with x GB of RAM), or can we let the runtime expand and shrink in a split second? You can probably guess which option I would put my money on.</p>
<p>These low-level primitives will still be there of course, under the hood, and some people will still think about hardware interrupts and dangling pointers. But just like most developers don't think about these things, the abstraction layers will keep moving up. I'm excited for a world where a normal software developer doesn't need to know about CIDR blocks to <a href="https://twitter.com/bernhardsson/status/1575546247519166464">connect a Lambda with an RDS instance</a>.</p>
<h2 id="vertical-toys">Vertical toys</h2>
<p>Building and running software was convoluted enough pre-cloud, and has gotten a lot more complex over the years. I think it's quite likely that the easiest way to start over is to focus on particular use cases and optimize the experience end-to-end. “Repackage the cloud” for different end users, be it frontend developers, data scientists, and so on. I mean, I'm extremely biased here, but I'm essentially <a href="https://modal.com">doing this for data teams</a>, so I'm obviously a believer in this.</p>
<p>You don't have to look far to see some pockets of this. Analytics engineers who run all their SQL in a runtime in the cloud (the data warehouse). Data scientists who develop all their code in notebooks in the cloud. Some (very small subset of) backend developers who run all their code in Lambda containers, even during development. The broad userbase of people building and hosting apps on tools like Replit.</p>
<p>You might look at them and think of them as examples of “toys”, or niche use cases, but that's sort of the point that I'm making — it might look insignificant to the larger set of software engineers. It might not look like the software engineering <em>you</em> are doing, and it probably won't quite look like what software development will look like in the future. But it's very possible these people are in fact ahead of the curve.</p>
<h2 id="welcome-cloud-20">Welcome, Cloud 2.0</h2>
<p>This was just a long way of saying: the clouds are here, and a lot has already happened, but the bulk of what it means for building software is yet to come.</p>
<p>I also want to point out that I wrote a whole blog post about how we're early with the cloud, and I didn't even mention the massive number of companies that aren't even in the cloud. Besides the massive changes to the cloud itself, there's so much growth just there. And there's something like an order of magnitude <em>more software engineers</em> today than 20 years ago, which will further drive demand, etc etc etc. It's going to be a fun decade for tech!</p>
<h2 id="addendum">Addendum</h2>
<p>This was on the front page of Hacker News and <a href="https://news.ycombinator.com/item?id=33269092">got a bunch of comments</a>.</p>
<p>Separately, if this post resonates with you, and you're working on data and want better infrastructure, you might be interested in <a href="https://modal.com">Modal</a>, which is what I'm working on right now. We take your code and run it in the cloud for you, while letting you retain the productivity of writing code locally. You can scale things out, schedule things, run things on GPUs, and a bunch of other things.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>This chart is pretty crude — I just looked at the earliest version number for each service in <a href="https://github.com/boto/botocore">botocore</a>. But it's a bit imprecise, e.g. EC2 was launched in 2006, but the oldest API version is from 2014. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>And despite <a href="https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0">some companies</a> with <a href="https://twitter.com/bernhardsson/status/1582761705469534208">different margin structures</a> <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I would love to attribute this to the right person! <a href="https://imgflip.com/i/24ac74">This</a> might to be the earliest source. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Credit to <a href="https://twitter.com/paulg/status/1580174713656995840">Paul Graham</a> for tweeting this. <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>Credit to <a href="https://twitter.com/cyndemoya/status/1172289968137392128">Cynde Moya</a> for tweeting this. <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
<h1><a href="https://erikbern.com/2022/04/05/sigma-driven-project-management-when-is-the-optimal-time-to-give-up.html">σ-driven project management: when is the optimal time to give up?</a> (2022-04-05)</h1>
<p>Hi! It's your friendly project management theoretician. You might remember me from blog posts such as <a href="/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html">Why software projects take longer than you think</a>, which is a blog post I wrote a long time ago positing that software project completion times follow a log-normal distribution.</p>
<p>Just a bit of a refresher if you don't want to re-read that whole post. What does it mean that project completion time has a log-normal distribution? If a project is estimated to take one month to complete, it will sometimes take half a month, sometimes two months. We can define the “blowup factor” as the ratio of actual to estimated time. Then the assumption is basically that the <em>logarithm of the blowup factor follows a normal distribution</em>, in particular a normal distribution with zero mean and some standard deviation σ (the Greek letter “sigma”).</p>
<p>We can plot a normal distribution in terms of its <em>cumulative distribution function</em>. What does this mean? It means the probability that the project has finished, as a function of time. See the chart below:</p>
<p><img src="https://erikbern.com/assets/proj-mgmt-sigma/cdf.png" alt="cdf"></p>
<p>You can see that it's 50-50 (median outcome) that we have completed the project at the point that corresponds to 100% of the original estimate (the dashed line). But let's contrast two different values of σ:</p>
<table>
<thead>
<tr>
<th align="right">time</th>
<th align="right">σ = 1.0</th>
<th align="right">σ = 1.4</th>
</tr>
</thead>
<tbody>
<tr>
<td align="right">0%</td>
<td align="right">0%</td>
<td align="right">0%</td>
</tr>
<tr>
<td align="right">50%</td>
<td align="right">24%</td>
<td align="right">31%</td>
</tr>
<tr>
<td align="right">100%</td>
<td align="right">50%</td>
<td align="right">50%</td>
</tr>
<tr>
<td align="right">200%</td>
<td align="right">76%</td>
<td align="right">69%</td>
</tr>
<tr>
<td align="right">400%</td>
<td align="right">92%</td>
<td align="right">84%</td>
</tr>
<tr>
<td align="right">∞</td>
<td align="right">100%</td>
<td align="right">100%</td>
</tr>
</tbody>
</table>
<p>So just as an example of how to read this table: if σ = 1.4 then in 84% of the outcomes, you are finished with the project within 400% of the original estimate.</p>
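<p>Under the model, the probability of being done by time t (measured as a multiple of the estimate) is Φ(ln(t)/σ), where Φ is the standard normal CDF, so the table is easy to reproduce. A quick sketch:</p>
<pre><code class="language-python"># Reproducing the table above: P(done by time t) = Φ(ln(t) / σ).
import numpy as np
from scipy.stats import norm

for sigma in (1.0, 1.4):
    for t in (0.5, 1.0, 2.0, 4.0):
        p = norm.cdf(np.log(t) / sigma)
        print(f"σ = {sigma}, t = {t:.0%}: P(done) = {p:.0%}")
</code></pre>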
<p>So where does the σ come from? My thesis for this blog post is that <strong>σ is an inherent property of the type of risk you have in your project portfolio, and that different values for σ warrants very different types of project management.</strong> Low σ means low uncertainty and means we should almost always finish projects. High σ means high uncertainty — more like a research lab — and means large risks of a huge blowup, which also means we should abandon lots of projects.</p>
<h1 id="all-models-are-wrong-etc">All models are wrong, etc…</h1>
<p>The general gist of the model is something like this:</p>
<ul>
<li>The actual time it takes to finish a project has a log-normal distribution</li>
<li>Every project has the same value if it succeeds <sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup></li>
<li>Once we start to work on a project, we get no feedback until suddenly it finishes</li>
<li>At any point in time, we may choose to (a) keep working on this project (b) abandon it</li>
</ul>
<p>This is obviously a very crude model! It's a bit like, you're down in the mine blasting rock looking for one super big diamond. Finding the diamond is a very “binary” event in the sense that either we found it or not — there's no partial credit, and nothing “learned” up until that point. However, if we've been down in one mine looking for a diamond for, I don't know, ten years, then maybe we should reassess. Maybe it's time to go to a different mine?</p>
<p>So let's focus on the decision of finishing or abandoning a project, which roughly comes down to: once something is late, is it still worth working on it? Are you getting closer or further away from success? <sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> <sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<h1 id="how-much-business-value-are-you-creating-my-friend">How much business value are you creating, my friend?</h1>
<p>The <em>business value per time</em> is basically <sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup> the success ratio <em>per time spent</em> (which is, roughly, the probability density function).</p>
<p>I'm going to rescale it so that we always start at $$ y=1 $$ for reasons I'll get back to shortly. Here's what it looks like: <sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup></p>
<p><img src="https://erikbern.com/assets/proj-mgmt-sigma/marginal_roi.png" alt="marginal roi"></p>
<p>What's going on here? Working on a project has increasing marginal business value in the beginning, which intuitively makes sense because we're getting closer to finishing it. But if we haven't finished it at some point, it's somewhat likely we ran into a “monster” project that's going to take a massive amount of time to finish, much more than we initially thought. So the business value starts to decline at some point (for high-σ projects rather quickly). Which raises the question: are we actually still working on something valuable?</p>
<h1 id="drop-it-like-its-lower-marginal-roi">Drop it like it's lower marginal ROI</h1>
<p>Presumably, we picked this project from a crop of potential projects, where the top one beat the second one by a small margin. So at some point, once the business value per time drops <em>below</em> where it started, we end up in a place where abandoning the top-ROI project and switching to the second-best one makes sense. That's why it's interesting to compare the <em>current marginal ROI</em> with the <em>initial marginal ROI</em>.</p>
<p>Graphically, this happens in the previous chart at the dashed line $$ y = 1 $$. Let's record when the curves intersect $$ y = 1 $$ and put those points back into the first chart in this post — the cumulative distribution function:</p>
<p><img src="https://erikbern.com/assets/proj-mgmt-sigma/cdf_with_stop.png" alt="cdf with stop"></p>
<p>These points are wild! Like, we start with curves that graphically aren't <em>too</em> different, but when we solve for the optimal time to abandon a project, we end up with wildly different results. I find that pretty amazing and worth putting in a table:</p>
<table>
<thead>
<tr>
<th align="right">σ</th>
<th align="right">When to give up</th>
<th align="right">Project finish ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td align="right">0.6</td>
<td align="right">1492%</td>
<td align="right">100%</td>
</tr>
<tr>
<td align="right">0.8</td>
<td align="right">674%</td>
<td align="right">99%</td>
</tr>
<tr>
<td align="right">1.0</td>
<td align="right">320%</td>
<td align="right">88%</td>
</tr>
<tr>
<td align="right">1.2</td>
<td align="right">153%</td>
<td align="right">64%</td>
</tr>
<tr>
<td align="right">1.4</td>
<td align="right">72%</td>
<td align="right">41%</td>
</tr>
</tbody>
</table>
<p>So just as an example, if σ = 1.0, then we should give up at 320% of the original estimate. If we follow this policy, then we finish about 88% of all projects.</p>
<p>But if σ = 1.4, then we should give up at 72% of the original estimate, and if we do that, we finish only 41% of all projects.</p>
<p>These are pretty huge differences!</p>
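<p>All the code for this post is on GitHub (see the footnotes), but here's an independent sketch of one way to reproduce the table. My reading of footnote 4 (an assumption, not the post's exact code): the marginal ROI of continuing at time t, conditioned on the project not being done yet, is the hazard rate pdf(t) / (1 - cdf(t)), and we give up once it falls back below the best average ROI of starting a fresh project (the y = 1 line):</p>
<pre><code class="language-python"># A sketch reproducing the "when to give up" table. The decision rule here
# is one reading of footnote 4: give up once the marginal ROI of continuing,
# conditioned on not being done yet (the hazard rate pdf / (1 - cdf)), drops
# back below the best achievable average ROI of a fresh project.
import numpy as np
from scipy.stats import lognorm

t = np.linspace(0.01, 50, 500_000)    # time, as a multiple of the estimate
for sigma in (0.6, 0.8, 1.0, 1.2, 1.4):
    dist = lognorm(s=sigma)           # median blowup factor = 1
    cdf, pdf = dist.cdf(t), dist.pdf(t)
    r0 = (cdf / t).max()              # best average ROI of starting fresh
    hazard = pdf / dist.sf(t)         # marginal ROI of continuing at time t
    peak = hazard.argmax()
    give_up = t[peak + np.argmax(hazard[peak:] < r0)]
    print(f"σ = {sigma}: give up at {give_up:.0%}, "
          f"finishing {dist.cdf(give_up):.0%} of projects")
</code></pre>
<p>Under that reading, the output should land within a percent or so of the table above.</p>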
<h1 id="project-management-depends-on-sigma">Project management depends on σ</h1>
<p>What we've established so far is that high-uncertainty project management implies a high percentage of abandoned projects.</p>
<p>This seems to pass a rough sanity check with reality! Any sufficiently research-like project will have a large risk of blowing up. For that reason, we should also be willing to give up on a high % of these projects. The optimal way to manage planning, resource allocation, and other things is wildly different:</p>
<h2 id="low-sigma-management">Low-σ management</h2>
<ul>
<li>Low uncertainty</li>
<li>Near-100% of all projects finish</li>
<li>Very accurate estimates</li>
<li>Perfectly forecasted timelines for milestones</li>
<li>Every day, monitor project completion and make sure it's tracking</li>
</ul>
<h2 id="high-sigma-management">High-σ management</h2>
<ul>
<li>High uncertainty</li>
<li>Most projects are abandoned</li>
<li>Estimates are pointless</li>
<li>Resources are allocated towards ideas with potential</li>
<li>There's going to be lots of sunk costs</li>
<li>Every day is <a href="https://aws.amazon.com/executive-insights/content/how-amazon-defines-and-operationalizes-a-day-1-culture/">day one</a></li>
</ul>
<h1 id="is-software-different">Is software different?</h1>
<p>I've kept it super general so far and you can apply it to almost anything — digging for dinosaur fossils, or painting a house (question for the reader: which one of these is low-σ and which one is high-σ?)</p>
<p>But let's talk about software for a second. Why is it so hard to predict? Here's my theory: <em>because anything predictable suggests something is redundant and should be factored out.</em></p>
<p>If it takes an engineer one day to build one API integration, it's not going to take them 100 days to build 100 API integrations, because on day 3 or whatever, the engineer will build an API-integration-maker framework that lets them build API integrations faster. <sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup> This will lower the total effort a lot, but the total uncertainty a bit less. The σ — the relative predictability of the task in logarithmic terms — will go up.</p>
<p>In general, this is how we software engineers have done it for 50 years now. Everyone's job is simultaneously building features and building tools that make it easier to build features. We end up with layers and layers of abstractions, and each layer reduces the work we have to spend in the layer below. This is obviously great for productivity! It does, however, mean that software projects will be hard to estimate, and a lot of software projects will be abandoned.</p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>We don't really lose any generality making this assumption since it essentially cancels out. Same with the expected time spent, which is why I just talk about it in terms of % of the initial estimate. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>This is a bit of a <a href="https://en.wikipedia.org/wiki/Lindy_effect">Lindy effect</a>: <em>[…] is a theorized phenomenon by which the future life expectancy of some non-perishable things, like a technology or an idea, is proportional to their current age.</em> <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>It also reminds me to some extent of a blog post I wrote a long time ago about <a href="/2016/04/04/nyc-subway-math.html">how long it takes to wait for the NYC subway</a>. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>It's <em>roughly</em> the increase per time in the probability of finishing, which is the derivative of the cumulative distribution function (CDF) with respect to time, which is the probability density function (PDF).</p>
<p>It turns out this isn't <em>exactly</em> right. Early on, the CDF is convex, which means that you can project a better slope by aiming for a point further out. So the “ROI” ends up being:</p>
<center>$$ \max_{t' > t} \frac{\mathrm{cdf}(t') - \mathrm{cdf}(t)}{ t' - t} $$</center>
<p>Once the derivative is decreasing, this is maximized as $$ t' \rightarrow t $$, which turns the expression above into the derivative, and the derivative of the CDF is just the PDF. That's reassuring! <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>All the code is <a href="https://github.com/erikbern/proj-mgmt-sigma">available on Github</a> <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>And on day 21 they will open source it and a bunch of other people will join and collaborate, and on day 42 they will create a startup and raise money and build API-integrations as a service, or something. Point is, you factor out common functionality not just across projects, but also across teams and companies. This all reduces predictability! <a href="#fnref:6" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
<h1><a href="https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html">Storm in the stratosphere: how the cloud will be reshuffled</a> (2021-11-30)</h1>
<p><img src="https://erikbern.com/assets/cloud-atmosphere-layers.jpeg" alt="cloud atmosphere layers"></p>
<p>Here's a theory I have about cloud vendors (AWS, Azure, GCP):</p>
<ol>
<li>Cloud vendors<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> will increasingly focus on the lowest layers in the stack: basically leasing capacity in their data centers through an API.</li>
<li>Other pure-software providers will build all the stuff on top of it. Databases, running code, you name it.</li>
</ol>
<p>We currently have cloud vendors that offer end-to-end solutions from the developer experience down to the hardware:</p>
<p><img src="https://erikbern.com/assets/cloud-aws-stack.png" alt="cloud aws stack"></p>
<p>What if cloud vendors focus on the lowest layer, and other (pure software) vendors on the layer above?</p>
<p><img src="https://erikbern.com/assets/cloud-new-stack.png" alt="cloud new stack"></p>
<p>Feel free to bring this up in five years to make me embarrassed about how wrong I turned out to be. But let me walk you through my thinking—I think some of it is quite well illustrated through the story of Redshift.</p>
<h1 id="redshift-and-what-happened">Redshift and what happened</h1>
<p>Redshift is a data warehouse (aka OLAP database) offered by AWS. Before Redshift, it was the dark ages. The main player was Teradata, which had an on-prem offering. Startups said no to SQL and used Hadoop—SQL was kind of lame back then, for reasons that in hindsight appear absurd. I'm very happy we're out of this era.</p>
<p>Anyway, one vendor was a company called ParAccel. AWS <a href="https://news.ycombinator.com/item?id=18992467">licensed their technology</a>, rebranded it Redshift, and launched in 2012.</p>
<p>Redshift at the time was the first data warehouse running in the cloud. It was a brilliant move by AWS, because it immediately lowered the bar for a small company to start doing analytics. You didn't have to set up any infrastructure yourself, or write <a href="/2018/08/30/i-dont-want-to-learn-your-garbage-query-language.html">custom mapreduce</a> and reload the jobtracker all day. You could spin up a Redshift cluster in AWS, feed it humongous amounts of data, and it would … sort of just work.</p>
<h2 id="enter-snowflake">Enter Snowflake</h2>
<p>Snowflake<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> is a $100B+ publicly traded company. Basically their entire offering is a data warehouse that looks fairly similar to Redshift.<sup id="fnref:3"><a href="#fn:3" class="footnote-ref" role="doc-noteref">3</a></sup></p>
<p>If you looked at Redshift in 2012, there were a lot of things that favored it. AWS had large economies of scale, had control of the underlying substrate (EC2), and could make larger investments in building the software. Maybe because of the value of lock-in, they could even subsidize the development of Redshift and make up the money through other products. Anyway, this is what it looks like nine years later<sup id="fnref:4"><a href="#fn:4" class="footnote-ref" role="doc-noteref">4</a></sup>:</p>
<p><img src="https://erikbern.com/assets/cloud-db-ranking.png" alt="cloud db ranking"></p>
<p>What happened? More generally: what are the forces that favor a company like Snowflake?<sup id="fnref:5"><a href="#fn:5" class="footnote-ref" role="doc-noteref">5</a></sup> And what does that mean for other cloud products?</p>
<h1 id="what-if">What if….?</h1>
<p>There's some sort of folk wisdom that the lowest layer of cloud services is a pure commodity service. So in order to make money you need to do at least one of:</p>
<ol>
<li>Make money higher up in the stack.</li>
<li>Use services higher up in the stack to lock customers in. Then make money lower in the stack.</li>
</ol>
<p>I think there's some truth to these, at least looking historically, but there are some interesting trends blowing in the other direction:</p>
<ul>
<li>The “software layer on top” is getting incredibly competitive. There are so many startups going after it, fueled by cheap VC money and willing to burn billions of dollars on building software.</li>
<li>Cloud vendors might be pretty happy making money just in the lowest layer. Margins aren't so bad and vendor lock-in is still pretty high.</li>
</ul>
<h2 id="startups-are-coming-for-the-cloud">Startups are coming for the cloud</h2>
<p>There have never been this many companies going after services that traditionally belonged to the cloud vendors:</p>
<p><img src="https://erikbern.com/assets/cloud-services.png" alt="cloud services"></p>
<p>What's going on? Probably a confluence of a lot of things. If I were tired, I would just shrug and say something like “startup circle of life, whatever”. And I think this is roughly one factor, but I see at least three different ones:</p>
<ol>
<li>Incentives at big companies often make it hard to ship crazy new ideas. At the same time, VCs are pouring money into the segment. If you're an ambitious person, do you go work at AWS? Or do you join an early-stage startup, or create your own? It's expected that innovation shifts away from big companies to startups.</li>
<li>Software vendors can build for all the cloud vendors at the same time. I think this was a real benefit for Snowflake, since a lot of their early customers were banks who care about multi-cloud, but more generally it also expands the addressable market beyond the reach of any single cloud vendor.</li>
<li>A lot of the successful cloud products started out as internal services. This has been an amazing source of products that have been battle-tested at Amazon, Google, and Microsoft scale, and it makes sense that those tools are a great match for their big enterprise customers. But the flipside of the extreme focus on scale, reliability, and configurability is that the developer experience has become an attack vector, in particular when you look at mid-market and smaller customers who may care more about improving developer productivity. Slightly smaller companies like Uber, Netflix, and Airbnb have a history of teams leaving to commercialize internal tools (often through the intermediate step of open sourcing them). Somewhat subjectively and anecdotally, these tools tend to have a much higher focus on developer experience.</li>
</ol>
<h2 id="maybe-owning-the-lowest-layer-isnt-so-bad">Maybe owning the lowest layer isn't so bad?</h2>
<p>Let's say a customer is spending $1M/year on Redshift. That nets AWS about<sup id="fnref:6"><a href="#fn:6" class="footnote-ref" role="doc-noteref">6</a></sup> $500-700k in gross profits, after paying for EC2 operational cost and depreciation. If that customer switches their $1M/year budget to Snowflake, then about $400k<sup id="fnref:7"><a href="#fn:7" class="footnote-ref" role="doc-noteref">7</a></sup> goes back to AWS, making AWS about $200k in gross profits.</p>
<p>That seems kind of bad for AWS? I don't know, we ignored a bunch of stuff here. Snowflake's projected<sup id="fnref:8"><a href="#fn:8" class="footnote-ref" role="doc-noteref">8</a></sup> 2022 research and development costs are 20% of revenue, and their sales and marketing costs are 48%! For a million bucks of revenue, that's roughly $700k combined. Translated back to AWS, maybe AWS would have spent $300-400k for the same thing? Seems reasonable.</p>
<p>Now the math suddenly adds up to me. AWS basically ends up with the same bottom line impact, but effectively “outsources” to Snowflake all the cost of building software and selling it. That seems like a good deal for them!</p>
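<p>To make that concrete, here's the same back-of-envelope math as a tiny Python sketch. Every number in it is a rough estimate pulled from the paragraphs and footnotes above, not an actual financial figure:</p>
<pre><code class="language-python"># Back-of-envelope: what AWS makes when a $1M/year warehouse budget
# goes to Redshift directly vs. to Snowflake running on top of AWS.
budget = 1_000_000

# Scenario 1: the customer runs Redshift.
redshift_margin = 0.6                  # rough midpoint of the 50-70% estimate
profit_redshift = budget * redshift_margin            # about $600k gross profit

# Scenario 2: the customer runs Snowflake; Snowflake pays AWS for compute.
snowflake_infra_bill = 400_000         # the ~$400k that flows back to AWS
ec2_margin = 0.5                       # reported EC2 gross margin
profit_snowflake = snowflake_infra_bill * ec2_margin  # about $200k gross profit

# But in scenario 1, AWS also carries the software and sales costs that
# Snowflake carries in scenario 2 -- guessed at $300-400k above.
avoided_costs = 350_000
print(profit_redshift - avoided_costs)  # about $250k net of software costs
print(profit_snowflake)                 # about $200k with no software costs at all
</code></pre>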
<h2 id="what-about-lock-in">What about lock-in?</h2>
<p>The other argument for why AWS should build their own software services is that it increases lock-in. So maybe Redshift in itself isn't a cash cow, but it decreases the churn on EC2.</p>
<p>I'm not so sure? I spent six years as a CTO and moving from one cloud to another isn't something I even remotely considered. My company, like most, spent far more money on engineer salaries than the cloud itself. Putting precious eng time into a cloud migration isn't worth it unless cloud spend starts to become a significant fraction of your gross margins. Which is true for some companies! But those are in the minority.</p>
<p>A significant factor in all of this is that existing infra provides significant “gravity”. It's not like you can just pick whichever cloud vendor has the cheapest database and run your DB there. You want to run things in the same cloud provider<sup id="fnref:9"><a href="#fn:9" class="footnote-ref" role="doc-noteref">9</a></sup> and in the same data center<sup id="fnref:10"><a href="#fn:10" class="footnote-ref" role="doc-noteref">10</a></sup>. Looking at sign-up flows for cloud products gives us a hint:</p>
<p><img src="https://erikbern.com/assets/cloud-sign-up-flows.png" alt="cloud sign up flows"></p>
<p>The screenshots above show the onboarding for Snowflake, Confluent, and MongoDB (Atlas). They all ask:</p>
<ol>
<li>What's your cloud vendor?</li>
<li>What region?</li>
</ol>
<p>Note that the <em>only</em> options for the first question are AWS, GCP, and Azure.</p>
<p>The other side of the equation of a potential cloud migration is: how much money can you save? And I think the truth is it's never going to be substantial, since <a href="https://en.wikipedia.org/wiki/Co-opetition_(book)">no one wants to start a price war</a>. Being in a fairly stable oligopoly seems pretty nice and cozy, and if I were a cloud vendor, I wouldn't try to rock the boat.</p>
<h1 id="the-cloud-in-2030">The cloud in 2030</h1>
<p>We're roughly 10 years into the shift to the cloud, and even though it feels like it's transformed how we build software, I think we're still just getting started. I really wouldn't be surprised to wake up in a world where <a href="https://twitter.com/irvinebroque/status/1461002308209152009">most developers don't interact with cloud vendors</a>.</p>
<p>Big transformations tend to happen in two stages. The first step happens when some new technology arrives and people adapt to it in the simplest way that lets them retain their conceptual models from the existing world. The real transformation happens later, when you rethink the consumption model because the new world opens up new ways to create value. The way we consumed music didn't change materially when Apple started selling songs online. The real transformation happened when providers like Spotify realized the whole notion of ownership didn't matter anymore.</p>
<p>If you think about it from that angle, the last 10-15 years look a bit like a dumb “lift and shift”. Crudely speaking, we took computers and we put them in the cloud. Is that really the right abstraction for where we are? I don't think so. I think new higher level tools will let us focus on building application code and not worry about the underlying infrastructure.</p>
<h2 id="startups-are-coming-for-your-code">Startups are coming for your code</h2>
<p>The forces I've talked about have been most clear when you look at Snowflake vs Redshift, but you can see them in other places too. Transactional databases are another very exciting area. But where I think we'll see the most change is in how software vendors will increasingly run customer code.</p>
<p>This isn't exactly a new idea—Heroku launched in 2007, and AWS Lambda in 2014. Kubernetes has been this interesting trend in the last few years in what I think is still essentially an inevitable march towards a fully “serverless” world, whatever <a href="https://twitter.com/isamlambert/status/1460659840108220417">that means to you</a>.</p>
<p>I wonder if a weird corollary of this is… maybe it's actually really good for the planet? The computers sitting in the cloud are ridiculously underutilized—my guesstimate is that average CPU utilization is maybe 10%? If I had a Ph.D. in bin packing, I'd go looking for a job at some serverless infrastructure provider right now.</p>
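<p>To illustrate why bin packing is the relevant Ph.D. topic: even a greedy heuristic like first-fit decreasing gets you most of the way. A toy sketch, with made-up workload sizes rather than real utilization data:</p>
<pre><code class="language-python">import random

random.seed(0)
# Toy model: 1,000 workloads, each using some fraction of one machine's CPU.
demands = [random.uniform(0.01, 0.25) for _ in range(1000)]

# First-fit decreasing: sort workloads by size, place each one on the
# first machine that still has room, else provision a new machine.
machines = []  # remaining capacity of each machine
for d in sorted(demands, reverse=True):
    for i, free in enumerate(machines):
        if d <= free:
            machines[i] -= d
            break
    else:
        machines.append(1.0 - d)

print(f"machines needed: {len(machines)} (vs {len(demands)} at one workload each)")
print(f"average utilization: {sum(demands) / len(machines):.0%}")
</code></pre>
<p>Pooling workloads like this is exactly what multi-tenant serverless providers get to do, and it's the difference between running at ~10% utilization and running close to full.</p>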
<p>One way to tell the story in 2030 looking backwards is that the cloud vendors needed software running on top of their infrastructure, so they had to provide that themselves at first, in order to drive cloud adoption. Luckily they already had a bunch of internal stuff they could ship! But eventually the market matured, and they could focus on the place in the stack where they had the strongest advantage.</p>
<h2 id="predictions">Predictions</h2>
<ul>
<li>The cloud market will grow to $1T/year in revenue. Ok, that's almost entirely noncontroversial.</li>
<li>Most engineers will never interact directly with cloud vendors, but through services on top of those.</li>
<li>The database market (OLAP, OLTP, you name it) will be dominated by vendors running on top of cloud vendors, where the underlying layer is completely abstracted away.</li>
<li>We will have some amazing new runtimes, finally figuring out the developer experience problems that are currently holding serverless solutions back.</li>
<li>We will see a lot of partnerships between startups and cloud vendors, where a cloud vendor may concede an area and try to be a preferred partner with a startup instead.</li>
<li>Kubernetes will be some weird thing people loved for five years, just like Hadoop was 2009-2013, but the world will move on.</li>
<li>Resource utilization in the cloud will be much better, and engineers will spend an order of magnitude less time thinking about resource allocation and provisioning.</li>
<li>IBM has finally given up on “hybrid multi-cloud”.</li>
<li>YAML will be something old jaded developers bring up after a few drinks. You know it's time to wrap up at the party at that point.</li>
</ul>
<p>This generated <a href="https://news.ycombinator.com/item?id=29411566">a bunch of comments on Hacker News</a>, most of which strongly disagree! Looking forward to seeing what the world looks like in 10 years.</p>
<p><em>Thanks to <a href="https://twitter.com/josh_wills">Josh Wills</a>, <a href="https://twitter.com/akshat_b">Akshat Bubna</a>, and <a href="https://twitter.com/sarahcat21">Sarah Catanzaro</a> for feedback on an earlier version of this post!</em></p>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>I'm ignoring the CDN world a bit here. It's very clear <em>right now</em> that Cloudflare is doing an amazing job owning the stack all the way from developer experience to networking equipment. But frankly I don't see any long-term foundational difference operating 300 small data centers vs 30 large ones. Cloudflare has done exceptionally well staying ahead of the innovation game, but I suspect the same economic forces that apply to AWS et al eventually apply to them too. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>A fun thing is I randomly had lunch with the Snowflake founders in 2012 and they offered me a job the next day. The company was like… 10 people in total? <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>There was one major architectural difference between Snowflake and Redshift. A very underappreciated tech shift is how much faster networks got around that time. Up until that point, the wisdom was “move compute to data”, but Snowflake bet early on a full decoupling of storage and compute. AWS <a href="https://www.businesswire.com/news/home/20161130006129/en/AWS-Launches-Amazon-Athena">launched Athena</a> in 2016 which was based on <a href="https://prestodb.io/">Presto</a>, not Redshift, and <a href="https://aws.amazon.com/about-aws/whats-new/2017/04/introducing-amazon-redshift-spectrum-run-amazon-redshift-queries-directly-on-datasets-as-large-as-an-exabyte-in-amazon-s3/">launched Redshift Spectrum</a> in 2017 which lets you query data in S3 through Redshift. As a bizarre coincidence, Redshift just launched a <a href="https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-redshift-serverless/">serverless product today</a>, which is something they probably should have done a long time ago. <a href="#fnref:3" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>This is from <a href="https://db-engines.com/en/ranking">DB rankings</a>. <a href="#fnref:4" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>A pretty long <a href="https://larskamp.medium.com/why-we-sold-intermix-io-to-private-equity-in-a-shifting-market-5bdb3e4e30a4">history about Redshift vs Snowflake</a>, which also points out that AWS unintentionally ended up incentivizing their sales teams to recommend Snowflake to customers. <a href="#fnref:5" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>The margins on EC2 are <a href="https://www.cnbc.com/2021/09/05/how-amazon-web-services-makes-money-estimated-margins-by-service.html">reportedly about 50%, and about 60% for AWS as a whole</a>. Just comparing instance pricing <a href="https://aws.amazon.com/redshift/pricing/">for Redshift</a> and <a href="https://aws.amazon.com/ec2/pricing/">for EC2</a> for equivalent instance types also gives a good idea of the Redshift markup above EC2. <a href="#fnref:6" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>Snowflake is a public company and you can compute their margins from their <a href="https://investors.snowflake.com/news/news-details/2021/Snowflake-Reports-Financial-Results-for-the-Second-Quarter-of-Fiscal-2022/default.aspx">latest quarterly report</a>. <a href="#fnref:7" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>From <a href="https://s26.q4cdn.com/463892824/files/doc_financials/2022/q2/Q2-FY22-Snowflake-Investor-Presentation_Final-(2).pdf">Snowflake's investor presentation</a> <a href="#fnref:8" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>Security is another topic that drives this, although security to some extent works <em>against</em> startups: the bar to adopt additional AWS products is often lower than for completely new vendors. <a href="#fnref:9" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:10" role="doc-endnote">
<p>Cloudflare is <a href="https://blog.cloudflare.com/aws-egregious-egress/">actively going after the very high AWS egress fees</a> and it's going to be interesting to see how that plays out. <a href="#fnref:10" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
What is the right level of specialization? For data teams and anyone else.2021-07-23T00:00:00Zhttps://erikbern.com/2021/07/23/what-is-the-right-level-of-specialization.html<p>This isn't so much a blog post as an elaboration of a tweet I posted the other day:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I think this specialization of data teams into 99 different roles (data scientist, data engineer, analytics engineer, ML engineer etc) is generally a bad thing driven by the fact that tools are bad and too hard to use</p>— Erik Bernhardsson (@bernhardsson) <a href="https://twitter.com/bernhardsson/status/1417664482776690692?ref_src=twsrc%5Etfw">July 21, 2021</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>This seems to have resonated with a lot of people, but for whatever reason, it ended up being a lot more polarizing than I thought! There was a fair amount of misunderstanding of what I meant, so I just wanted to expand this into a slightly longer argument:</p>
<h2 id="specialization-is-a-good-thing">Specialization is a good thing</h2>
<p>I'm all for specialization! Society has come a long way from <a href="https://en.wikipedia.org/wiki/Subsistence_agriculture">subsistence agriculture</a> and that's almost objectively a good thing. The economy organizes people into different trades and different professions and lets people benefit from their <a href="https://en.wikipedia.org/wiki/Comparative_advantage">comparative advantage</a>.</p>
<p>There were a few replies (and <a href="https://twitter.com/tunguz/status/1417827377376202757">subtweets</a>) misunderstanding my tweet as an argument against specialization. Which is a bit frustrating, because I think it's pretty useless to argue for or against specialization in the abstract. Clearly, neither extreme is good? Like, subsistence agriculture isn't great, but you could imagine going to the other extreme and organizing society around people doing exactly one thing well and nothing else.</p>
<p>A restaurant could have one chef who only chops onions, another one who only souffles things, another one who only makes apple tarts. If this sounds dumb, then it's because it's intentionally hyperbolic! My point is, neither extreme is good, so the question is: what's the <em>right level</em> of specialization?</p>
<h2 id="what-are-some-drawbacks-of-specialization">What are some drawbacks of specialization?</h2>
<p>Not an exhaustive list:</p>
<ul>
<li>Resource allocation. If you have a chef who only chops onions, they are probably idle most of the time. That sounds bad! If they are more versatile, they can jump around and do a larger set of things, depending on what's needed at the moment.</li>
<li>Transaction costs. If every project involves coordinating 1,000 specialists, and each of those specialists has their own backlog with their own prioritization, then (a) cycle time would shoot up, with a lot of cost in terms of inventory and lost learning potential, and (b) you would need a ton more project management and administration to get anything done.</li>
</ul>
<h2 id="specialization-is-probably-driven-a-lot-by-bad-tools">Specialization is probably driven a lot by bad tools</h2>
<p>So I think the question is, what is the right level of specialization? A bunch of people replied to me saying you need different roles because (I'm just picking one example) some people are better at training models, while some people are really good at figuring out how to wrangle Kubernetes and all of that stuff to get models deployed. <em>Which is exactly the point I was trying to make</em>. It seems fair that, if tools didn't require so much knowledge to use (I'm looking at you, Kubernetes), then on the margin, the need for specialization would be less.</p>
<p>I'm super interested in this because I spend about 12 hours every day thinking about tools in the data science space. People are spending way too much time working on things that have nothing to do with their business. We have come a long way, but I still see people wasting <em>way</em> too much time debugging YAML, waiting for deployments, or begging the SRE team for help.</p>
<h2 id="its-dangerous-when-people-lose-sight-of-the-goal">It's dangerous when people lose sight of the goal</h2>
<p>I often think of people as (and this is an unfair crude generalization etc) roughly on a spectrum between <em>tools-oriented</em> and <em>goal-oriented</em>. Some people have their favorite tools, and that's what they like to use. They make their whole career about honing a craft with those skills. Other people are more entrepreneurial, and don't care about what tools they use: they care about the ultimate goal.</p>
<p>I think tools-oriented people can be valuable in certain contexts, like if you need some super deep expertise on some topic. If you're trying to build a lithium mine in Angola, you might want to find experts in lithium mining and Angolan mineral rights.</p>
<p>But a lot of the time, experts can also be a huge liability, because they are overly biased towards picking tools that they have deep skills in. If you hire the world's foremost expert in functional data structures for your e-commerce startup, you probably shouldn't be surprised if that person wants to use functional data structures? And maybe that's fine, if you are convinced that 100% of your problem can be expressed as operations on functional data structures, but more likely that's not a pragmatic perspective, and you end up picking suboptimal tools for the job.</p>
<p>When this ends up segmenting different life cycles of a product, I think it gets even more sketchy. So much of the total cost of building tech products comes post launch. But so does the opportunity to start iterating on it and learning from it! Adding hand-off points because of specialization feels like putting up a <a href="https://www.investopedia.com/terms/c/chinesewall.asp">Chinese Wall</a> (in the business sense) between two functions that constrains the information flow and obstructs the value.</p>
<p>And what I also see to some extent is a bit of an entitlement attitude in some developers. They aren't interested in doing the last 10% of the work that you need to get 100% of the value. Which frankly, I don't really understand, because here's the opportunity to shine. Let's say you built a model that can save a gazillion trillion dollars for your company. If there was a tool to press a button to put it into production, why wouldn't you want that? You can do it yourself without having to coordinate with other teams, and you can be the hero of the day!</p>
<p>(I'm writing this about deploying ML models, but that's really just one example… there's a lot more going on in the data world: pipelines, reporting, monitoring, …).</p>
<p>I mean, I think we're very far from this, tools-wise, but we should aspire to get there! Let's not argue against this world with vague appeals to the benefits of specialization. Let's instead think about what that world would look like, and what tools we would need to get there, and then <em>let's build those tools</em>. This is basically what I'm obsessed with and I think it's a somewhat ambitious perspective, but I think there's an incredible opportunity!</p>
Building a data team at a mid-stage startup: a short story2021-07-07T00:00:00Zhttps://erikbern.com/2021/07/07/the-data-team-a-short-story.html<p>I guess I should really call this a parable.</p>
<p>The backdrop is: you have been brought in to grow a tiny data team (~4 people) at a mid-stage startup (~$10M annual revenue), although this story could take place at many different types of companies.</p>
<p>It's a made up story based on n-th hand experiences (for n ≤ 3), and quite opinionated. It's a story about teams and organization, not the tech itself. As a minor note, I deliberately use the term “data scientist” to mean something very broad.</p>
<p><img src="https://erikbern.com/assets/data-story-mainframe.jpeg" alt="mainframe"></p>
<h2 id="july-1-morning">July 1: morning</h2>
<p>It's your first day as head of the data team at SuperCorp! The CEO gave a quick but passionate pitch during your interview process about how the world is changing and how the company needs to keep up with all the crazy data stuff going on. The whole exec team is super psyched.</p>
<p>The first few hours, you get access to all the major systems. You start browsing around in the Git repo and discover some interesting code. It looks like a neural network for churn prediction. You start to parse through the code but you're interrupted by a calendar notification that you have a 30 minute intro meeting with the Chief Marketing Officer.</p>
<p>The CMO is super energetic. “We're so excited you're here”, she says. “I recently talked to my buddy at HyperCorp who runs marketing and they are working with a vendor to use AI for user segmentation. Pretty cool! I can't wait for you to sink your teeth into it.” After a bit of chit chat, you start to probe into the data practices of the marketing team. “How is the customer acquisition cost looking?”, you ask. “Well…”, the CMO says, “pretty awesome actually. Our data scientist ran the numbers, and our online ad cost per click keeps going down.”</p>
<p>You're a bit confused because you were told all data scientists would report into the data team, but apparently other functions have their own data scientists? You make a note to follow up.</p>
<p>The CMO continues: “The real problem is that the growth team aren't converting all the traffic we're driving to the site.”</p>
<p>You ask if there's a dashboard to look at to see the conversion funnel but the CMO says it's the growth team's job to convert the leads.</p>
<p>Later that day, you spend time with some of the product managers. There's just been a big redesign of the start page and the lead PM of that effort is really excited because the number of user registrations went up by 14%. You ask them if the difference is statistically significant but you get a blank stare back. “That's not my job to figure out, that's your team”, the PM says. “Last time we asked them, they said they didn't have the data, and it would take months for them to get it.”</p>
<p>For whatever reason, you can tell that the product manager has more to say, so you let her continue.</p>
<p>“Besides, amazing stuff isn't built on incremental changes. We decided not to A/B test this change because sometimes you need to make big bets that take you out of your local maxima. Steve Jobs didn't A/B test the iPhone launch! My team crushed it delivering this launch two days before the deadline and that's what matters!”</p>
<p>You try to look busy by scribbling down some notes in your notebook.</p>
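<p>As an aside, the significance check the PM waved off is a few lines of code. Here's a sketch of a two-proportion z-test, with hypothetical traffic numbers since the story doesn't give any:</p>
<pre><code class="language-python">from math import erf, sqrt

# Hypothetical numbers: visitors and registrations, before vs. after redesign.
n_old, reg_old = 20_000, 1_000   # 5.0% conversion
n_new, reg_new = 20_000, 1_140   # 5.7% conversion, i.e. +14% relative

p_old, p_new = reg_old / n_old, reg_new / n_new
p_pool = (reg_old + reg_new) / (n_old + n_new)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_old + 1 / n_new))
z = (p_new - p_old) / se

# Two-sided p-value from the normal CDF.
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(f"z = {z:.2f}, p = {p_value:.4f}")  # significant at these sample sizes
</code></pre>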
<p>You spend the rest of the day talking to your new team. It's a small team of only three people, but you have been given a budget to grow it to 10 by the end of the year. The people in your team are clearly excited by your arrival. They walk you through what they built so far. There's the neural network for churn prediction that you saw earlier. There's a notebook with an implementation of a whole recommendation system for finding related items to buy. There's a lot more stuff, some of it quite cool.</p>
<p>You notice a lot of the code starts with very complicated preprocessing steps, where data has to be fetched from many different systems. There appear to be several scripts that have to be run manually, in the right order, to run some of these things.</p>
<p>You ask the team why none of this has been launched in production. The data team looks frustrated. “When we talk to the engineers, they say it's a very large project to get this to production-level. The product managers have put it on the backlog, but they keep pushing it off because other things keep coming up. We need executive support for this.”</p>
<h2 id="july-1-afternoon">July 1: afternoon</h2>
<p>Later in the day, you talk to the head of supply chain. He doesn't seem as excited as the CMO. “Frankly, I don't know if I need help from the data team,” he says. “We don't have those types of problems. What I need is business analysts. I have a whole team, and they are spending hours and hours every day working on a very complex model. They don't even have time to answer basic questions I have. I have a whole spreadsheet full of questions I'm dying for answers to.”</p>
<p>You look at the spreadsheet and discover things like: <em>What's the conversion rate of customers who filed a support ticket and got a ticket resolution in <1h versus the conversion rate for the customers who got a ticket resolution in >= 1h? Break down by order value in buckets of $100 intervals.</em></p>
<p>When you ask about the “model”, it turns out it's a very complex thing in Google Sheets with lots of VLOOKUPs and data that has to be copy-pasted into the right tab in the right format. The data is updated daily and the output of the model determines the team priorities for the day. Not just that, but they rely on the spreadsheet to calculate payments to the vendors.</p>
<p>You go home that day and pour a large glass of whiskey.</p>
<center>🥃</center>
<h2 id="whats-happening-so-far">What's happening so far?</h2>
<p>This is basically a (somewhat cynical) depiction of things that may happen at a lot of companies early in the data maturity stage:</p>
<ul>
<li>Lack of data, and fragmented data
<ul>
<li>The product is poorly instrumented so data often doesn't exist in the first place</li>
<li>A fragmentation of data systems, with data spread out over many different ones</li>
<li>Brittle business processes driven by data but with little or no automation</li>
</ul>
</li>
<li>An unclear expectation of what the data team's job is supposed to be
<ul>
<li>Data scientists hired to do R&D and figure out some way to deploy AI or whatever — as a result not having any clear business goal</li>
<li>Data team complaining about it being hard to productionize ML, yet the product team doesn't really seem to care about the feature</li>
<li>People in need of “English-to-SQL translators”</li>
</ul>
</li>
<li>A product team not trained to be data driven
<ul>
<li>Product managers not thinking about data as a tool for building better features</li>
<li>A lack of alignment between what product teams want to build versus what data teams have</li>
</ul>
</li>
<li>A culture that fundamentally is at odds with being data driven
<ul>
<li>A culture of celebrating shipping, versus celebrating measurable progress and learnings</li>
<li>To the extent teams actually use metrics, they are inconsistent, poorly measured, and in some cases at conflict with other teams</li>
</ul>
</li>
<li>No data leadership
<ul>
<li>A fractured data org with various data people reporting into other functional areas</li>
<li>Other departments not getting the help they need, so they work around the data team and hire lots of analysts</li>
<li>Lack of standardizations of toolchain and best practice</li>
</ul>
</li>
</ul>
<p>Wow! This is depressing! Let's talk about what you can actually do to break out of this.</p>
<h2 id="july-8">July 8</h2>
<p>The next week you start setting a new direction for the data team. One of the people in the data team turns out to have a bit more experience with infrastructure, so you put him in charge of setting up a centralized data warehouse. Right now, you just need the fastest route to get data into one place. The plan is basically to just dump the production database tables into the data warehouse every hour.</p>
<p>It turns out the framework you use for ad tracking on the frontend makes it easy to export the huge event logs into the data warehouse, so you set that up as well.</p>
<p>You make a mental note that this is tech debt you're going to have to revisit later.</p>
<p><img src="https://erikbern.com/assets/data-story-infra.png" alt="basic data flow">
<em>Fig 1: Extremely crude distillation of how the data gets into the data warehouse</em></p>
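<p>A minimal sketch of what that crude flow can look like in code, assuming a Postgres production database and a warehouse that can load CSV files from S3. The table names and the bucket are made up:</p>
<pre><code class="language-python">import io

import boto3
import psycopg2

# Crude hourly sync: full dump of each table to S3, then a COPY/LOAD on the
# warehouse side. No incremental loads, no schema handling -- this is exactly
# the tech debt you noted to revisit later.
TABLES = ["users", "orders", "order_items"]   # hypothetical table names
s3 = boto3.client("s3")

conn = psycopg2.connect("dbname=prod")        # ideally a read replica
for table in TABLES:
    buf = io.StringIO()
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", buf)
    s3.put_object(
        Bucket="supercorp-warehouse-staging",  # hypothetical bucket
        Key=f"hourly/{table}.csv",
        Body=buf.getvalue(),
    )
    # The warehouse then runs its own COPY/LOAD from that S3 key.
</code></pre>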
<p>You work with the recruiting team to define a profile for a generalist data role that emphasizes core software skills, combined with a deep empathy for business needs. For now, you remove all the mentions of artificial intelligence and machine learning from the job posting.</p>
<p>You spend a bit more time with the various data people that do not report to you. The data scientist in the marketing team is a young person and you can tell she's super excited to talk to you. “I've always wanted to become a data scientist, and I can't wait to learn from you” she says.</p>
<p>Later that day, you call your friend who runs a coding bootcamp and ask them if they have any great SQL training classes. They have, so you set something up for later that month.</p>
<p>You start working on a presentation for the product team about A/B testing and how it works. You showcase many examples of tests with unexpected outcomes from your previous experience, and you make parts of the presentation a bit interactive where the audience has to guess whether test or control won.</p>
<p>You track down the CEO's executive assistant and get some time on her calendar later that week. Your goal is to figure out a few metrics she wants reported on weekly in an automatic email.</p>
<p>Later that week, you talk to a few of the business analysts in the supply chain team and you realize they are also reasonable people, but they seemed scarred from previous interactions with the data team.</p>
<p>One of them has experience using SQL in his past job. He has a question for you about conversion rates that you realize should be possible to answer with the few tables that are already replicated to the data warehouse, so you give him access and tell him to give it a shot. You don't really know what to expect, but figure there's nothing to lose.</p>
<p>You set up weekly 1:1s with a number of key people across the org that need data. The point is to find data gaps and opportunities and dispatch it to the data scientists. Some of your data scientists are a bit disappointed since the research work needs to get deprioritized. “We need to focus on delivering business value as quickly as possible”, you say, but you add that “we might get back to the machine learning stuff soon… let's see”.</p>
<h2 id="september-1-morning">September 1: morning</h2>
<p>It's been three months, but you feel like you're starting to make progress with some of the stuff. In your weekly 1:1s with various stakeholders, you keep finding huge blind spots and opportunities for data to make a difference. You use these things as a forcing function for a lot of core platform work. In particular, many pipelines need to be built to produce “derived” datasets. There's a high upfront cost of those analyses, but subsequent analyses are <em>much</em> easier once the right datasets have been built.</p>
<p>You have started opening up access to the data warehouse internally to other teams in other departments. Some of the people are starting to pick up SQL and doing a lot of basic analyses themselves. An early win is that one of the junior product managers discovers that the conversion rate on iOS Safari is extremely bad. It turns out there's a frontend bug with local storage that's a one-liner to fix.</p>
<p>When you're thinking about all the progress you've made already, you're suddenly interrupted by an email from the head of supply chain. He's pissed. Apparently nothing is working with their model and it's a big problem for them.</p>
<p>You immediately send a Slack message to the person you know there. He's the business analyst who eagerly started writing SQL when you gave him access. He's super stressed. “The table in the database changed, and suddenly the SQL query we use to populate the spreadsheet generates nonsense output.”</p>
<p>When you look at the SQL query, you almost spit out your coffee. It's a 500-line query. The author of the query seems apologetic but at the same time a bit annoyed. “We kept coming to you several times asking for help with these questions”, he says, “and you told us you didn't have resources, so we built it ourselves”.</p>
<p>The data scientist in your team who gets assigned the monster SQL query isn't happy. “That team is stupid for writing those queries… we told them this would happen. These MBA types are useless. Besides, I was hired to work on machine learning, not to debug SQL queries”, he says. You're desperate so you try to dangle a carrot in front of him. “Please try to do what you can, and I promise we'll find some cool machine learning problem for you later this month”, you say.</p>
<h2 id="september-1-afternoon">September 1: afternoon</h2>
<p>Later that day, you're in a meeting going through recent launches. The product manager of the checkout team goes through a major overhaul of the credit card flow. But when you ask him if they saw any improvements on relevant metrics, he is confused. “We haven't had time to look into that”, he says. You're disappointed because just a crude analysis would have been very easy to do for one of the data scientists.</p>
<p>At least, later that day you feel a bit better. The data scientist in the marketing team emails you and says she's talked to her manager. The CMO is totally fine if she reports to you, but makes it clear that “I need her 100% of the time dedicated to marketing.” You loop in HR and ask them to update the internal systems to reflect the management change. Even though she's obviously very junior, you have been impressed with her ability to grasp complex business problems.</p>
<p>You wrap up work that day at 9pm and pour a big glass of red wine for yourself. The bottle is already open and you don't want the wine to go to waste, so you drink the rest of it too.</p>
<center>🍷</center>
<h2 id="whats-happening">What's happening?</h2>
<p>You're starting to lay the most basic foundation of what is most critically needed: all the important data, in the same place, easily queryable. Opening up SQL access and training other teams to use it means a lot of the “SQL translation” goes away.</p>
<p>The flipside is, some teams will go too far with their newfound freedom. It's tempting to prevent this by putting very strict guardrails on access to data, but this can often have more drawbacks. People are generally mostly rational, and do things that generate positive ROI for the business, but they might not understand what the data team can build for them. That's your job to demonstrate!</p>
<p>With the checkout team you see something similar: there was a simple analysis your team could have done, but it didn't happen, because that team didn't know who to ask.</p>
<p>These are primarily <em>organizational</em> challenges. Teams don't know how to work with the data team. You're probably a bottleneck, even though you don't realize it. Other teams will build around the data team. Lots of “simple” analyses are not getting done.</p>
<p>What I think makes most sense to push for is a <em>centralization of the reporting structure</em>, while keeping the <em>work management decentralized</em>.</p>
<p>Why? Primarily because it creates a much tighter feedback loop between data and decisions. If every question has to go through a central bottleneck, transaction costs will be high. On the other hand, you don't want to decentralize the <em>management</em>. Strong data people want to report into a manager who understands data, not into a business person.</p>
<p><img src="https://erikbern.com/assets/data-story-org-1.png" alt="decentralized data org">
<em>Fig 2: Data team with centralized backlog and centralized management</em></p>
<p>Instead, push out the resource management to other teams. Give them a handful of data people to work with, and let them collaborate directly. Those data people will be able to iterate much quicker, and will also develop valuable domain skills. This reduces the need for other teams to work around the data team and hire their own resources.</p>
<p><img src="https://erikbern.com/assets/data-story-org-2.png" alt="decentralized data org">
<em>Fig 3: Data team with decentralized backlog but centralized management</em></p>
<p>A good thing is that, to some extent, your results drive organizational centralization by themselves: the junior data scientist in the marketing team moves into your team because she <em>wants</em> to work for you.</p>
<h2 id="september-2">September 2</h2>
<p>Your data team at this point has grown to six people. One of them is super busy with the infrastructure related to the data warehouse. For the other five, you assign each one of them to a team:</p>
<ul>
<li>One of them is assigned to the onboarding product team.</li>
<li>The second is assigned to the supply chain team.</li>
<li>The third is assigned to the checkout team.</li>
<li>You already have the data scientist from the marketing team working on marketing.</li>
<li>The final person is assigned to support the CEO and help with investor/board decks.</li>
</ul>
<p>You send out an email to a large group of people outlining this change, and make it very clear who people should work with for their data needs. As you hire people going forward, you are planning to assign them to different teams throughout the company. Mostly product/engineering teams, but in some cases other teams.</p>
<h2 id="january-3">January 3</h2>
<p>You start the day with a frustrating email. One of your data scientists has decided to leave. “I'm going to AcmeCorp to join their new machine learning team”, he writes. You're not going to try to persuade him to stay. Frankly, he hasn't seemed particularly happy for a while, and you don't have much work he would be excited about.</p>
<p>You have a bunch of new people in the team that are more excited. Most of them are people who know a bit of software engineering, a bit of SQL, but most importantly have a deep desire to find interesting insights in the data. You think of them as “data journalists” because their goal is to find “the scoop” in the data.</p>
<p>One particular member of your team is working directly with the onboarding team. She talks to her product manager pretty much every day and the team loves her for the insights she's discovered. For instance, there was a big friction point where the onboarding flow asked for the customer's address even though it wasn't needed. Removing that step increased conversion rate by 21% in a subsequent A/B test. It wasn't easy to find this initially because the data model in the database was very complex and she had to build a set of ETL jobs to “flatten” the data into tables that are easier to query. But a bunch of Python jobs chained together did the trick.</p>
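<p>Pipelines like hers are rarely fancy. Here's a sketch of what “a bunch of Python jobs chained together” might mean in practice, with invented table and column names:</p>
<pre><code class="language-python">import psycopg2

# Each step materializes one flattened, easy-to-query table from the raw
# replicated tables. Steps run in order; later ones can build on earlier ones.
STEPS = [
    ("user_facts", """
        CREATE TABLE user_facts AS
        SELECT u.id AS user_id,
               u.created_at::date AS signup_date,
               MIN(o.created_at)  AS first_order_at,
               COUNT(o.id)        AS order_count
        FROM users u
        LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY 1, 2
    """),
    ("onboarding_funnel", """
        CREATE TABLE onboarding_funnel AS
        SELECT user_id, step_name, MIN(occurred_at) AS first_seen_at
        FROM frontend_events
        WHERE event_type = 'onboarding_step'
        GROUP BY 1, 2
    """),
]

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    for table, sql in STEPS:
        cur.execute(f"DROP TABLE IF EXISTS {table}")
        cur.execute(sql)
</code></pre>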
<p>Later that day, there's a quarterly review of all the major projects that happened. It's a big deal, and the CEO is in the room. She's excited about all the progress that's happening.</p>
<p>When the turn comes to the growth initiatives, the lead PM presents a new splashy landing page redesign that they launched. The PM points out several times that the team of 20 engineers was working overtime to hit the deadline and they did it. She walks everyone through the amazing job the designers did. It's beautiful. The CMO was very involved with the product because they made a big bet on direct mail as a part of the redesign. Everyone looks at the CEO to see what she thinks.</p>
<p>She's quiet for a while but then opens her mouth. “What are the metrics so far? Do we know if the customer acquisition cost went down?” she says, and you smile to yourself, because you've been hoping for this question.</p>
<p>The PM who put together this slide says they actually ran an A/B test and there are numbers in the appendix of the presentation. It shows a jumbled picture. Some of the metrics went up, and some went down. There's no result that shows a significant outcome. There's a table summarizing some early numbers for customer acquisition costs but the numbers look quite bad. The CMO emphasizes that the numbers are “still baking” and that for this type of campaign, it can take many months for users to transact.</p>
<p>You send a Slack message on your phone to someone in the data team that they should do a cohort plot of these things instead next time.</p>
<h2 id="whats-going-on">What's going on?</h2>
<p>The good news is that the product team is starting to experiment with A/B tests. The bad news is that it's ignoring the results and that projects seem mostly driven by milestones and artificial deadlines. The <em>excellent</em> news is the CEO is pushing for teams to use data as the truth.</p>
<p>Once there is an organizational pressure to be more data driven, this is a time to accelerate the way the data team works with other teams. In particular, people at the highest level will start to focus more on metrics, and it's your responsibility that the data team works with them on it. One simple thing that goes a long way is to work with every team and make sure they have their own dashboard with the top set of metrics they care about.</p>
<p><img src="https://erikbern.com/assets/data-story-service.png" alt="data team services">
<em>Fig 4: Different services for different layers of the org drives the most progress</em></p>
<h2 id="april-1">April 1</h2>
<p>Almost all of the old machine learning work done by the data team has gone nowhere, with one exception. One of the data scientists who works with the inventory product team gets really interested in the earlier work on recommendations. She's one of the new team members you hired, who has a bit more of a generalist background. She picks up the notebook with the recommendation system work, and is able to turn it into a small Flask app deployed internally.</p>
<p>The product manager for the inventory team is ecstatic when she sees it. “How do we ship it?” she asks. One of the metrics the team tracks is average order value and she thinks this could drive big improvements.</p>
<p>A quick estimation reveals it's still a big problem to make it scale, but your data scientist has an idea. “What if we only launch it for 1% of all customers?” she says. “We could have it powered by a dumb cron job and pregenerate all the recommendations in a database. I think I can hack something together in a few days.” Everyone's excited about it so she gets started.</p>
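<p>The “dumb cron job” pattern really is this simple, which is the point: precompute top-N recommendations offline, store them in a plain table, and serve reads from that. A sketch, where the recommender call and the table are stand-ins:</p>
<pre><code class="language-python">import psycopg2

def recommend(customer_id, n=10):
    """Stand-in for the model from the notebook; returns top-n item ids."""
    return []  # plug the real recommender in here

ROLLOUT_PCT = 1  # a stable 1% bucket of customers

conn = psycopg2.connect("dbname=prod")
with conn, conn.cursor() as cur:
    # Deterministic bucketing: the same customers stay in the test group.
    cur.execute("SELECT id FROM customers WHERE id %% 100 < %s", (ROLLOUT_PCT,))
    customer_ids = [row[0] for row in cur.fetchall()]

    cur.execute("TRUNCATE precomputed_recs")  # hypothetical table
    for customer_id in customer_ids:
        for rank, item_id in enumerate(recommend(customer_id)):
            cur.execute(
                "INSERT INTO precomputed_recs (customer_id, item_id, rank) "
                "VALUES (%s, %s, %s)",
                (customer_id, item_id, rank),
            )
# At request time the app just reads:
#   SELECT item_id FROM precomputed_recs WHERE customer_id = ... ORDER BY rank
</code></pre>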
<p>You have been spending a bit more time with the supply chain team and discovered a lot more gigantic SQL queries used for various business-critical things. They break a lot but your team is rewriting a lot of it into proper pipelines run by code. The head of supply chain wants more of your team. “Once you started getting involved, my team of business analysts are getting so much more done,” he says. “I'd do anything for you to hire more data scientists to support me!”</p>
<h2 id="ok-whats-happening-here">OK, what's happening here?</h2>
<p>First of all, there's a glimmer of hope for some of the cool machine learning work. It seems like the product team is finally excited about launching the recommendation system as a small test. It was previously stuck because the product engineering team couldn't estimate the work and didn't want to commit to it, but the data team didn't have the practical software skills to bring it to a state where the remaining productionizing work felt tangible.</p>
<p>What resolved this was the data team taking it one step further and actually building a demo. Not only does it bring it closer to production, it also shows the potential more clearly.</p>
<p>It's easy for data teams to feel defeatist when these projects stall, like they were hired to do this cool AI stuff, but now the executive support went away. In practice I think it's more common they just didn't take it upon themselves to get the work to a place where it demonstrates value and is reasonably easy to ship.</p>
<p>The other thing is, note what's happening with the supply chain team. The journey is roughly:</p>
<ol>
<li>That team started out with their own “business analysts” (outside the data team) but needed the data team to run queries for them to get data</li>
<li>Those business analysts are starting to run queries themselves with the help of the data team</li>
<li>They start to build up “shadow tech debt” (in this case monster SQL queries) which first causes a bunch of friction with the data team</li>
<li>The data team starts embedding into the team and helping them get to a better place</li>
<li>Because of the embedding, the need for business analysts goes down and the need for data scientists goes up</li>
</ol>
<p>Note that you took on a lot of “tech debt” earlier when you started dumping the production database tables straight into the data warehouse. Data consumers downstream will have SQL queries that break a lot. Over time, you're going to have to add some sort of layer in between, that takes the raw data from the production database and translates it into various derived datasets that are more stable and easier to query. This will be a LOT of work to do right. It's probably also needed for security reasons: you need to strip out lots of PII in the production data.</p>
<h2 id="july-1">July 1</h2>
<p>It's the planning meeting for Q3. Previously, these have often turned into big debates about what “bets” the company is making for the next few quarters. This time, you start by going through the company's high level key results. Each team then has sub-metrics that constitute a more granular breakdown of the top level metrics.</p>
<p>It's clear that your work with the product management team has paid off. The PMs often justify their investment in various projects by talking about what they learned running tests, or what they discovered in the data.</p>
<p>A major win is that one of your data scientists working with the checkout team discovered a major bug where users would hit the back button at the confirmation page which would cause the cart object to end up in a bad state. When they fixed this, there was a dramatic improvement in conversion rate.</p>
<p>Another insight has been that traffic from different ad campaigns has very different conversion profiles once users land on the site. It turned out some campaigns had very cheap clicks that converted terribly. Some other campaigns were quite expensive but those users converted extremely well.</p>
<p>Since you're now tracking UTM parameters and tying them to account creation, you can measure the conversion rate from ad click to purchase. This wasn't possible before all the data was brought into the same data warehouse and normalized so you can query it easily. Working with the marketing team, the main KPI is now end-to-end customer acquisition cost, rather than cost per click.</p>
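<p>This is the kind of query that only becomes possible once clicks, accounts, and purchases live in the same warehouse. A sketch against an invented schema, assuming the UTM campaign is stored on the account at signup:</p>
<pre><code class="language-python">import psycopg2

# End-to-end CAC per campaign: ad spend divided by customers who purchased.
SQL = """
SELECT spend.utm_campaign,
       spend.total_cost / NULLIF(buyers.n, 0) AS cost_per_customer
FROM (
    SELECT utm_campaign, SUM(cost) AS total_cost
    FROM ad_clicks
    GROUP BY 1
) spend
LEFT JOIN (
    SELECT a.utm_campaign, COUNT(DISTINCT p.user_id) AS n
    FROM accounts a
    JOIN purchases p ON p.user_id = a.user_id
    GROUP BY 1
) buyers ON buyers.utm_campaign = spend.utm_campaign
ORDER BY cost_per_customer
"""

with psycopg2.connect("dbname=warehouse") as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for campaign, cac in cur.fetchall():
        print(campaign, cac)
</code></pre>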
<p>More exciting news: the 1% recommendation system test has done exceptionally well. Even though it's a very substantial project to scale it up to 100% of users, the CEO greenlights the project.</p>
<p>Of course, not all outcomes are positive. There's been a lot of tests that didn't work out. One of the first slides describes a test you ran where shipping is baked into the price rather than charged separately.</p>
<p>It's quiet in the room for a while until the CEO starts talking. “What did you learn from this?” she asks. This leads to an engaging conversation where the outcome is to run a series of follow-up experiments to get to the bottom of what happened.</p>
<p>You go home that night and pop a bottle of champagne.</p>
<h2 id="what-just-happened">What just happened?</h2>
<p>You made it. You have transformed the organization to be truly data-native. The data team works cross functionally with lots of different stakeholders. Data and insights are used for planning. Data drives business value rather than sitting in a standalone lab with unclear goals. The company is working in an iterative way instead of big “waterfall” style planning, with quick data-driven feedback cycles. Metrics are defined in such a way that people feel a responsibility for generating business value. The data culture is driven both from above (the CEO pushing for it) as well as from below (people in the trenches). It's OK to fail if at least you learned something from it.</p>
<p>Congratulations — you deserve that champagne!</p>
<center>🥂</center>
<p><a href="https://news.ycombinator.com/item?id=27777594">Comments on Hacker News</a></p>
Software infrastructure 2.0: a wishlist2021-04-19T00:00:00Zhttps://erikbern.com/2021/04/19/software-infrastructure-2.0-a-wishlist.html<p>Software infrastructure (by which I include everything ending with *aaS, or anything remotely similar to it) is an exciting field, in particular because (despite what the neo-luddites may say) it keeps getting better every year! I love working with something that moves so quickly.</p>
<p>In the last few months, I've thought a lot about where it's going in the next 5-10 years and a wishlist has taken shape in my head. It's very opinionated! As in, you may not agree with these. That's fine — these are fundamentally <em>predictions</em> I'm making, or at least a wishlist. I'm OK if I'm right about some but not all of these. Let's dive right in.</p>
<h1 id="built-for-delight">Built for delight</h1>
<p>You know how crappy software is crappy in ways that are so blatantly obvious to the user that you wonder why it was released? A touchscreen interface that's super laggy, or an appointment booking app that forces you to go in and out of possible dates and fill in all information before it tells you if it's available. We've all seen janky stuff like that, and they are generally janky in the same way: it feels like <em>no one actually used the product</em> after it was built, and said like, hey, this is kind of annoying, maybe we should make it more intuitive?</p>
<p>In 99% of the cases, I imagine they ended up in this situation because someone spelled out a long checklist of requirements, but there was nothing on the checklist to make sure <em>the experience is delightful</em>. Like, someone started with a wall full of post-it notes going “as a user, I want to …”. Which I think logically makes sense — you can define a requirement that users should be able to do x, y, z, but you can't <em>define</em> that the experience shouldn't suck.</p>
<p>Anyway, I feel like this applies to like 90% of software infrastructure products.</p>
<p>I mean, as a user, I can set up a static website in AWS, but it takes 45 steps in the console and 12 of them are highly confusing if you never did it before. And it's also super slow to do it, and any time I make a mistake, I end up in some weird state where maybe I broke something terribly and I might have to start over. It's <em>sad</em> this is the current state of infrastructure.</p>
<p>There's a lot to learn from how the best companies build consumer products. How they use data to identify friction points, and constantly experiment with changes to make things easier. I have a lot of hope here that natural selection will favor the products that are easy to get started with and fun to use. The first step is, we just need more alternatives and not just a handful of big semi-monopolies. Can't wait.</p>
<h1 id="truly-serverless">Truly serverless</h1>
<p>We are, like what, 10 years into the cloud adoption? Most companies (at least the ones I talk to) run their stuff in the cloud. So why is software still acting as if the cloud doesn't exist?</p>
<ul>
<li>The word <em>cluster</em> is an anachronism to an end-user in the cloud! I'm already running things <em>in the cloud</em> where there's elastic resources available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me.</li>
<li>I don't ever want to <em>provision</em> anything in advance of load.</li>
<li>I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using.</li>
<li>Serverless doesn't mean it's a burstable VM that saves its instance state to disk during periods of idle.</li>
</ul>
<p>I could go on, but I won't. I'm dreaming of a world where things are <em>truly</em> serverless. As in, I don't want to think about future resource needs, I just want things to magically handle it. The good news is I think we're actually getting closer to this dream every year!</p>
<p>The beauty of this is that a lot of the configuration stuff goes away magically. The competitive advantage of most startups is to deliver business value through business logic, not capacity planning and capacity management!</p>
<p>Not just that, but multi-tenancy is actually truly a “free lunch” from a resource utilization point of view, so any opportunity to pool resources represents a true win-win bargain. At the scale of data centers worldwide, it's big — it depends on who you are, but you could either get excited about the gigatons of CO<sub>2</sub> saved, or the increased corporate net income margin (I guess I like both!)</p>
<h1 id="fast">Fast</h1>
<p>I don't mean fast as in, serving requests fast. We have software that does a great job doing this! Honestly, I think it's mind blowing how good it is: you can run functions at the edge and get response times around the world on the order of milliseconds.</p>
<p>The speed that's not there is setting up infrastructure. If I make a change in the AWS console, or if I add a new pod to Kubernetes, or whatever, I want that to happen in seconds. I'm not asking for milliseconds! Just <em>please</em> at least get it to less than a second. If we can serve requests in milliseconds then I have no doubt that we can get it there. We have the tech to boot VMs and containers basically instantaneously.</p>
<p>The speed matters because this is a serious waste of time for engineers. I feel like I've wasted years waiting for some infrastructure change to kick in. I'll get back to this topic in a second because I think it's an important one!</p>
<h1 id="ephemeral-resources">Ephemeral resources</h1>
<p>Almost all infrastructure I've worked with treats resources as something meant to exist indefinitely. If I create a database in the cloud, it sticks around, and unless I do anything, it will clutter up the console forever and I will pay money for it forever.</p>
<p>I used to think this was fine! My justification was that, well, if you want to run a test suite, just run the database yourself locally, maybe in a container. This is fine for some stuff, but I've come to think that it's probably pretty bad:</p>
<ul>
<li>It's a <em>lot</em> of work to build your own replica of the infrastructure so you can run it locally.</li>
<li>The development-production delta gets bigger. There's always going to be subtle differences in how cloud infra works vs when you run it locally.</li>
<li>A lot of cloud infra is proprietary and impossible to run locally!</li>
</ul>
<p>My deep desire is to make it easy to create ephemeral resources. Do you need a database for your test suite? Create it <em>in the cloud</em> in a way so that it gets garbage collected once your test suite is done. Run your tests against the cloud infra!</p>
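<p>Concretely, here's the kind of test I'd love to be able to write. The <code>cloud_db</code> client below is entirely hypothetical; the point is the shape of the workflow:</p>
<pre><code class="language-python">import uuid

import pytest
import cloud_db  # hypothetical client for a cloud that does this well

@pytest.fixture
def db():
    # Create a real, production-like database just for this test session,
    # with a TTL as a belt-and-suspenders guarantee it gets cleaned up.
    handle = cloud_db.create(name=f"test-{uuid.uuid4().hex[:8]}", ttl="15m")
    try:
        yield handle
    finally:
        handle.destroy()  # garbage-collect as soon as the tests are done

def test_signup_creates_user(db):
    db.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    (count,) = db.execute("SELECT COUNT(*) FROM users")[0]
    assert count == 1
</code></pre>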
<p>My dumb take is I feel like the debate in the last 5 years roughly looks like this:</p>
<p><img src="https://erikbern.com/assets/test_in_production.png" alt="test in prodouction"></p>
<p>(To be 100% clear, what I'm advocating for <em>in this blog post</em> isn't necessarily pushing out changes in front of users immediately, although I generally support that for <em>other</em> reasons not covered in this blog post: go read <a href="https://charity.wtf">everything written by Charity Majors</a>. I'm saying — let me use production-like infrastructure as much as possible throughout the process of building and testing code.)</p>
<p>The point about ephemeral resources gets 100x more powerful in conjunction with the previous point about letting me create resources quickly. The general pattern of how code gets built is that infrastructure has been decoupled from logic, and the logic is tested independently. Slightly simplified, you can think of the development process as a set of nested loops where the cycle time of each loop gets exponentially worse at each level:</p>
<p><img src="https://erikbern.com/assets/programming_loops.png" alt="programming loops"></p>
<p>At each loop level, the stakes get higher and the feedback cycle gets slower. This has an extremely strong relationship to productivity! The key thing to note is how important it is to <strong>shift concerns from outer loops into inner loops.</strong> Getting iteration speeds down by an order of magnitude has dramatic impacts on getting things done.</p>
<p>Having fast ephemeral cloud infra resources would let us move a lot of the infrastructure concerns from the outermost loop to the innermost loop. This lets you get feedback in seconds or at least minutes, rather than hours or more.</p>
<h1 id="code-not-configuration">Code not configuration</h1>
<p>There are at least 4 ways I can think of that you can interact with infrastructure:</p>
<ol>
<li>Web interface</li>
<li>Local configuration, then run some command-line client that talks to the system</li>
<li>APIs and you have to build the client yourself</li>
<li>Client libraries</li>
</ol>
<p>The first one is great to have but generally only for getting started. Once you have something set up, you typically move away from it as a way to make changes, and maybe only use it for monitoring etc.</p>
<p>Local configuration seems to be the general next step. It's fine for a while, but half the time you realize that:</p>
<ul>
<li>Actually, I want this framework to be controlled by another framework at a higher level. In this case, you have two (both bad) options: expose configuration for both frameworks, or have the outermost framework generate configuration dynamically for the other framework.</li>
<li>You need to generate resources dynamically, maybe in a for-loop or whatever.</li>
</ul>
<p>Now suddenly you move from YAML to YAML generated using Jinja or Handlebars or whatever. Slowly, you start adding custom functions to those template languages to make it easier to generate configuration. Eventually, it evolves into its own super-custom DSL with its own documentation.</p>
<p>This is super annoying! 10 times out of 10, I prefer to have everything accessible through a nice little client library. This library might in turn be a simple wrapper around a solid API. Now I can write my own for-loops! I can generate things dynamically! I don't have to learn a custom DSL! The world is a happy place again.</p>
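<p>To make this concrete, here's a toy sketch. The <code>infra</code> module is entirely made up (pretend it's a thin Python wrapper around your provider's API), but note how plain for-loops and variables replace the templated-YAML DSL:</p>
<pre><code>import infra  # hypothetical client library wrapping a solid infrastructure API

cluster = infra.Cluster(region="us-east-1")

# Generate resources dynamically with a plain for-loop: no Jinja,
# no Handlebars, no custom DSL to learn.
for env in ["dev", "staging", "prod"]:
    replicas = 10 if env == "prod" else 2
    cluster.add_service(
        name="api-" + env,
        image="mycompany/api:latest",
        replicas=replicas,
    )

cluster.apply()  # diff against the current state and reconcile
</code></pre>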
<h1 id="built-for-productivity">Built for productivity</h1>
<p>I wanted to wrap this up under sort of a meta-point which isn't really a point in itself but more of a mindset change and maybe a corollary of all the other points.</p>
<p>Infrastructure <em>feels like</em> it's been built to solve hard scalability and reliability problems. There's some amazing infra out there, and I'm in awe of how much hard thinking must have gone into it. But things are rarely built to <em>optimize for developer productivity</em>. I think that long term, the tools that “win” are generally the tools that optimize directly for that. Actually, it's not just productivity, it's also quality, and these tools push the quality-productivity tradeoff “up and to the right”:</p>
<p><img src="https://erikbern.com/assets/programming_tradeoff_curve.png" alt="programming tradeoff curve"></p>
<p>My point is, the new tradeoff curve lets you “cash out” the improvements in different ways: maybe purely as higher quality, maybe purely as higher productivity, maybe a bit of both.</p>
<p>To me, this represents a <em>massive</em> opportunity gap over the next 5-10 years. I can't wait for <a href="/2020/12/16/giving-more-tools-to-software-engineers-the-reorganization-of-the-factory.html">engineers to unleash another order of magnitude of productivity</a>. There's so much software waiting to get built!</p>
<p><img src="https://erikbern.com/assets/paradise.png" alt="paradise"></p>
What's Erik up to?2021-04-01T00:00:00Zhttps://erikbern.com/2021/04/01/whats-erik-up-to.html<p>I joined <a href="https://better.com">Better</a> in early 2015 because I thought the team was crazy enough to actually change one of the largest industries in the US. For six years, I ran the tech team, hiring 300+ people, probably doing 2,000+ interviews, and according to GitHub I added 646,941 lines of code and removed 339,164. But I also got married, had two kids, bought an apartment and renovated it! From time to time, there were some <em>intense</em> periods of hard work.</p>
<p>Better has gotten to a pretty amazing place. It's one of the largest startups in NYC (by several metrics, like valuation or headcount) and it has a world class engineering team that makes me insanely proud. I think it will keep being super successful in the future as it deconstructs the housing market down to its core principles (people want somewhere to live!) and builds something that solves these things with a much better experience and less waste.</p>
<p>Anyway, after six years of Better, I'm moving on to do something new. I always had an itch to do my own thing, and I was originally planning to do that back in 2015 when I left Spotify. But Vishal (the CEO of Better) convinced me to spend a year or two with him, and learn how the sausage is made. How to raise money, how to work with the board, how to run a company, and all that stuff. I joined and I got stuck for a lot longer than I thought!</p>
<p>At some point, I decided that I needed to do my own thing now, or I never would. Since Jan 1 this year, <a href="https://www.linkedin.com/in/dianeyu/">Diane Yu</a> has been running the tech team. We were super lucky to find her and I feel confident that she's going to take Better to the next level.</p>
<p>With Diane at Better, I took a 6 week semi-vacation in the Caribbean with my family, working 2-3 days a week and taking care of two crazy girls the rest of the time.</p>
<p><img src="https://erikbern.com/assets/daughters.jpeg" alt="kids"></p>
<p>Anyway, since a few weeks ago, I'm back in NYC ready for some serious crunch mode now.</p>
<h2 id="so-whats-next">So what's next?</h2>
<p>So what am I doing next? I'm extremely psyched to share with the world that I will be joining Oracle to work as a Senior Implementation Engineer, on Blockchain solutions for mid-size maritime companies!</p>
<p><img src="https://erikbern.com/assets/maritime_blockchain.png" alt="maritime blockchain"></p>
<p>The opportunities are unlimited. The Suez blockage shows how vulnerable we are, and I'm 1000% convinced it would have been prevented if they had used a decentralized P2P network powered by huge FPGA-rigs in Inner Mongolia computing quadrillions of hash values every second.</p>
<p>Sorry, just kidding… it's April Fools day, after all. But I <em>am</em> leaving Better, for real.</p>
<p>So what am I focused on next? I'm extremely determined that I want to start my own thing (meaning, don't try to hire me, it's probably a waste of time), and it's highly likely it will be something in the data engineering/science tools/infra space.</p>
<h2 id="why-data">Why data?</h2>
<p>I've spent most of my career working in data in some shape or form. At Spotify, I was entirely focused on it. At Better, I was a CTO with responsibility for all of technology, but for a long time I was also the manager of the data team, and I always kept dabbling with data problems in my spare time.</p>
<p>Data as a subfield of software engineering has a crazy growth rate. 10 years ago, companies didn't have big data teams. But these days, a lot of startups have 20%+ of their tech team dedicated to data (engineering/science). At the same time, I look at backend/frontend engineers and see how many amazing tools they have gotten in the last 10 years. Not just tools, but also processes. I truly think they are an order of magnitude more productive than they used to be.</p>
<p>I don't see this productivity revolution for data teams! I see teams wasting lots of time on nonsense tasks, like writing YAML or waiting for jobs to finish. But I also see other teams going around data teams and building up their own “analyst” teams. To me this reflects the extreme demand for people who are <em>actually</em> skilled at working with data. How can we make them more productive? I think a lot of it boils down to <a href="/2020/12/16/giving-more-tools-to-software-engineers-the-reorganization-of-the-factory.html">giving them better tools</a>.</p>
<p>I'm interested in about 100 subproblems in this field and I'm hacking on prototypes for several of these things, but just to pick a few:</p>
<ul>
<li>Workflow scheduling. I built Luigi a long time ago, which I think was kind of the first widely used one. Later, Airflow became the de facto standard. But I still don't think it's anywhere close to where it could be in terms of turbocharging data teams.</li>
<li>All the work <em>around</em> workflow scheduling. Lineage, documentation, observability, et cetera. I think there's an opportunity to solve a lot of those problems at the same time.</li>
<li>The chasm between SQL and Python. What to use for what, and how to increase the interoperability.</li>
<li>How to run data jobs. I don't think Kubernetes is the answer. It's hard to work with, not great for ad-hoc jobs, and fundamentally in the age of cloud, why can't we have a much more elastic pool of resources available?</li>
<li>Visualization. I kind of want something that's both matplotlib and WYSIWYG at the same time?</li>
</ul>
<p>I could keep adding to this list for hours; these are just some things off the top of my head.</p>
<p>Anyway, feel free to drop me a line if you're interested in anything related to this. For the next few months, I'm going to be knee deep in code hacking on prototypes, so I love any break where I can talk to people who care about these things! I don't want to put my email here, but I'm sure you can figure out how to reach me.</p>
<h1>🍻</h1>
Giving more tools to software engineers: the reorganization of the factory2020-12-16T00:00:00Zhttps://erikbern.com/2020/12/16/giving-more-tools-to-software-engineers-the-reorganization-of-the-factory.html<p><img src="https://erikbern.com/assets/power-loom.jpeg" alt="power loom"></p>
<p>It's a popular attitude among developers to rant about our tools and how broken things are. Maybe I'm an optimistic person, because my viewpoint is the complete opposite! I had my first job as a software engineer in 1999, and in the last two decades I've seen software engineering change in ways that have made us orders of magnitude more productive. Just some examples from things I've worked on or close to:</p>
<ul>
<li>Spotify built a whole P2P architecture in C++ in order to distribute streaming music to listeners, something which today is a trivial problem (put the data on a CDN).</li>
<li>I used to write custom mapreduce jobs to pull basic stats, then wait for hours for those jobs to finish. Today these would be SQL queries in a data warehouse that would run in a few seconds.</li>
<li>I once ran a web shop and spent a week implementing credit card payments. Today it's 15 minutes using Stripe.</li>
</ul>
<p>Not to mention the changes in developer <em>processes</em>:</p>
<ul>
<li>Unit tests were really rare in the industry — I first encountered them working at Google in 2006.</li>
<li>When git was new, git-flow was the prevalent workflow, and many developers I worked with spent an inordinate fraction of their time (like, 20-30%) just <em>rebasing</em>.</li>
<li>CI systems were fragile, CD basically didn't exist, and deployments were scary manual affairs.</li>
</ul>
<p>These are just examples — I could go on all day. Much like the classic <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet">No Silver Bullet</a> paper on software productivity, none of these things in themselves were a dramatic improvement. But productivity improvements add up, and <em>they add up on a logarithmic scale,</em> meaning it's not unreasonable that we see orders of magnitude of improvements when your time scale is decades.</p>
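<p>(Back-of-the-envelope with made-up numbers: even a modest compounding rate gets you an order of magnitude over a couple of decades.)</p>
<pre><code># Made-up rate: a 15% productivity improvement per year, compounded
# over 20 years, is already more than an order of magnitude.
print(1.15 ** 20)  # ~16.4x
</code></pre>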
<p><small>(Note: I'm going to use the term “tool” throughout this post to refer to all kinds of things: frameworks, libraries, development processes, infrastructure.)</small></p>
<h2 id="the-insatiable-demand-for-software">The insatiable demand for software</h2>
<p>An extremely crude classification of the world is that</p>
<ul>
<li>There are some things that have great software</li>
<li>There are some things that have mediocre software</li>
<li>There are some things that have no software</li>
</ul>
<p>This is pretty obvious so far, so I'll keep going: <em>if there was an ability to produce infinite amounts of software instantaneously at zero cost, the two last buckets would go away</em>. Mediocre software exists because someone wasn't able to hire better engineers, or they didn't have time, or whatever. Things without software… it's kind of everything? I mean, why don't your shoes have a step tracker built in? Why does Mark in Accounting still generate PDF invoices and email them to clients? Why doesn't… I could keep going.</p>
<p>We've obviously been going in this direction for a long time, and my point is that there's <em>so much left that can use software.</em> Why hasn't it happened already? Because someone made the economic decision that <a href="/2020/03/10/never-attribute-to-stupidity-that-which-is-adequately-explained-by-opportunity-cost.html">the cost of building that software was too high</a>. It costs money and time to hire engineers, train them, and get things done. So how do we think about that cost?</p>
<h2 id="supply-demand-of-software-engineers">Supply-demand of software engineers</h2>
<p>Let's go back to my points about software engineer productivity going up over time. In economic terms, it means that the value of the output per time goes up. On the demand side, it means that the cost of building software goes down. What previously used to take 1,000 hours now takes 100 hours.</p>
<p>If demand was fixed, it would mean mass unemployment and lower salaries for engineers, but demand isn't fixed! As I've implied earlier, lower costs of building software means new opportunities open up. It previously wasn't worth paying for 1,000 hours of engineering effort to build your gizmo, but it might very well be worth 100 hours. There's enormous <em>latent demand</em>. You can even afford to pay engineers a bit more <em>per hour</em>.</p>
<p>These things can all happen at the same time:</p>
<ul>
<li>More software gets built</li>
<li>Software engineer salaries go <em>up</em></li>
<li>The number of software engineers grows</li>
</ul>
<p>This might be counterintuitive, so it's worth contrasting this with something where demand <em>is</em> fixed. Let's say some process is invented that increases the output of diaper factories by 10% (keeping all costs the same) and that invention is immediately available to all diaper manufacturers. What's the result? There will be a temporary surplus of diapers on the market, and the market will converge to a new equilibrium where the supply shrinks to meet demand. In the process, some factories will close, and prices of diapers will go down.</p>
<p>The difference with software engineers is that <em>demand grows when the cost goes down.</em> Your aunt's dental practice didn't use to have a website, but now it's worth having a whole online booking system. The school your kids go to suddenly has an app to send updates to parents. I don't know — my point is, with price going down, more demand opens up.</p>
<p>It's not just across the whole economy that this happens; it can also happen at a micro scale. If you have a team of data scientists and you introduce a tool that makes them 2x more productive? Great — you just doubled the ROI of each incremental hire, and you should go and hire <em>more</em> people. I've never seen a company where data scientists run out of work: there's always more things to analyze.</p>
<p>Will this go on forever? I guess in some end state, we'll hit something that looks like the singularity, where software itself builds software so fast that demand can't grow with supply. That's at the point where I'd be worried about software engineer unemployment. But for the next few decades, I think it's a pretty decent bet that we'll see the number of software engineers growing and growing.</p>
<h2 id="how-to-become-a-software-engineer">How to become a software engineer</h2>
<p>The growth of software as a field and the growth of salaries obviously in itself attracts more people to the field, but I think there's also another thing that's happening.</p>
<p>Decades ago, software engineering was hard because you had to build everything from scratch and solve all these foundational problems. You need storage to build something to serve 1M concurrent users? Wade through papers about consistent hashing, CAP theorem, CRDTs and what not, roll up your sleeves, and prepare yourself for 100,000 lines of hard core C++.</p>
<p>Today, these problems are “mostly” solved, and you can use off-the-shelf tools for most of it. It's not <em>easy</em> though! Like I mentioned earlier, there is no silver bullet, and there's a million tools out there, and you need to know based on best practices how to stitch it all together. But knowing this is a <em>different type of hard.</em></p>
<p>Software engineering a few decades ago favored the people that were deep abstract thinkers — who could stitch together complicated software from atoms. To me it looks a lot more like a <em>craft</em> today — it's a lot more about learning what tool should be used for what job. I mean, it still very much favors deep abstract thinkers, but <em>relatively</em> speaking, that skill is a smaller differentiating factor than it used to be.</p>
<p>I think this changes the equation on the supply side. The barriers to becoming a software engineer are <em>different</em> today, and that's opened up a larger pool of talent. This unlocks new supply, which means even more software is created.</p>
<h2 id="the-software-engineer-role-in-the-knowledge-factory">The software engineer role in the knowledge factory</h2>
<p>Building software always has been, and still is, a very expensive endeavor for companies. At the average company, you still have a backlog that's seemingly endless. The software side of the factory is still often a bottleneck of the company.</p>
<p>What do you do when you have a widget maker that's a bottleneck in a manufacturing plant? You make sure that the bottleneck runs at <em>full utilization</em> at any time. This means you centralize the resource management of the widget maker — you put controls on the inputs, and put a lot of effort into making sure what widget gets made in what order.</p>
<p>Most companies still very much operate roughly like that. But the companies that (in my opinion) have exceptionally productive engineering teams are organizing themselves in slightly different ways. They tend to decentralize prioritization, and work directly with product and business stakeholders in tight iterations. When a resource isn't the bottleneck any more, you can achieve vastly higher iteration speeds by spreading out resource allocations to many different teams. To be clear, I don't mean decentralizing in terms of management. What I mean is decentralizing the <em>backlogs</em> into small teams that work directly on business needs.</p>
<p>In this model, you don't have a marketing team put something on the backlog of the engineering team. You have a team of cross functional people who own acquisition, where some people are traditional marketers and some people are engineers. You can imagine this with almost any function throughout a typical company: customer support, finance, or anything else.</p>
<p>A historical parallel that I find super fascinating is <a href="https://www.bbc.com/news/business-40673694">why electricity took so long to change manufacturing</a>. Factories in the age of steam engines were built around power distribution from the almighty steam engines. Energy was the precious resource, so it's natural to think about manufacturing plants as built around energy distribution. Electricity changed this, and decentralized energy generation, but it took a really long time for manufacturing plants to realign and take advantage of this.</p>
<p><small>(Note 1: This isn't a perfect analogy since steam power wasn't just the precious resource, it was also hard to build small steam engines.)</small></p>
<p><small>(Note 2: The <em>main</em> point of the article I linked to is that innovations often take time to unleash productivity, because the first attempts to use the new technology often tries to retrofit it into legacy structures. For instance: internet first created DVD-by-mail, but the real <em>internet-native</em> innovation was streaming video.)</small></p>
<h2 id="bottlenecks-in-the-knowledge-factory">Bottlenecks in the knowledge factory</h2>
<p>I've been talking a lot about tools that make engineers more productive, but that's not the entire story here. There are clearly also lots of tools used by non-tech people to get their own things done, and that's great! But I also see a huge number of tools built so that <em>people don't have to work with engineers.</em> Why? A few reasons:</p>
<ol>
<li>Iteration speed: The cost of explaining to an engineer what you need makes it not worth doing it</li>
<li>The engineering resources are not available (or too expensive, or whatever)</li>
<li>You just need a fraction of an engineer but that market does not exist</li>
<li>There's some special domain knowledge needed to build something</li>
<li>Engineers are weird and smell funny</li>
</ol>
<p>Ok, the last point was just a dumb joke. I think the other points cover most of it though.</p>
<p>Out of these, I think the first one (iteration speed) is a 100% valid reason. As an example, I encourage all business people to learn SQL so they can run queries themselves.</p>
<p>What I think is unfortunate is the second one (no engineering resources available). A lot of us have probably encountered manual pipelines of people sending around Excel files with macros, copy-pasting new data into the spreadsheet every morning, or something similar. This is sometimes referred to as <a href="https://en.wikipedia.org/wiki/Shadow_IT">“shadow tech”</a>. Had there been dedicated engineering resources, the total cost of building and owning those things would probably be much lower.</p>
<p>But the fact that non-engineers are building technology validates that there's demand for engineers. At many companies, engineers just can't keep up with the demand. So, through better tools, over time, more needs can be served. Companies at the forefront of engineer productivity will probably see less of these issues: engineers will be involved early on and work on business problems.</p>
<p>The third point (fractional resources) is probably true at small companies. If you're a dentist, you're not going to hire an engineer to build you a booking system. Luckily, they benefit from the increased output industry-wide: there might be a whole new ecosystem of dentist software to buy (because building it gets much cheaper).</p>
<p>The last point (special knowledge) has some validity. I often see business people building manual workflows that later get taken over and automated by engineers, as a sort of a first line attack squad to get a basic prototype running. The counterpoint is that with increased decentralization, engineers will increasingly develop subject-matter experience. A lot of companies have dedicated data science and data engineering resources to the HR and Finance teams, as an example.</p>
<h2 id="the-great-productivity-inequality">The great productivity inequality</h2>
<p>An interesting corollary of all of this is that it creates a positive feedback loop where some companies fall behind even further:</p>
<ul>
<li>Lack of adoption of new tools means falling behind the companies leveraging those tools.</li>
<li>Higher salaries for software engineers means these companies are priced out of the higher end of the hiring market.</li>
<li>A failure to realign the factory means lower iteration speed.</li>
<li>A lack of engineers means a temptation to adopt tools to build technology without using engineers, with associated costs that are much larger.</li>
</ul>
<p>In contrast, the companies at the forefront of this will see their software engineer productivity surge and their iteration speeds improve.</p>
<p>I have spent six years in the mortgage industry and I have seen these trends play out very clearly in front of me. The biggest laggards are desperately adopting <a href="https://en.wikipedia.org/wiki/Robotic_process_automation">RPA</a> and duct taping together off-the-shelf <a href="https://en.wikipedia.org/wiki/Point_of_sale">POS</a> and <a href="https://en.wikipedia.org/wiki/Customer_relationship_management">CRM</a> software. The slightly better companies have their own engineering teams, but seem to fail to realign the process factory to be a tech-driven company. My company <a href="http://better.com/">Better</a> is implicitly a bet on everything I mentioned in this post, plus maybe a bunch of other trends as well.</p>
<h2 id="wrapping-it-up">Wrapping it up</h2>
<p>This was a lot of words for something that I think could be put succinctly into a causal graph:</p>
<p><img src="https://erikbern.com/assets/software-engineer-productivity.png" alt="software engineer productivity"></p>
<p>This is clearly not an <em>exhaustive</em> theory of everything that happens with software. There are many other trends, like internet erasing physical moats, and software creating more economies of scale. But this explains a lot of it, I think.</p>
<p>A theory is useless without an ability to generate forecasts or policy recommendations, so here are some things I would bet on:</p>
<ul>
<li>It's a good time to be a software engineer</li>
<li>The market for tools for software engineers will keep growing</li>
<li>Many “legacy” companies will fall behind the productivity gains</li>
<li>New entrants will threaten these companies and succeed to a large extent</li>
<li>Every company should think about how to realign how they build technology to focus on decentralization and higher iteration speed, embedding engineers throughout the factory</li>
<li>In the long run, it won't be a good idea for companies to adopt tools with the only purpose of building technology without engineers</li>
<li>Software engineers should wholeheartedly adopt tools that make them more productive: it makes them more valuable</li>
<li>Companies with high-productivity engineering teams will have faster-growing engineering teams (because of the higher ROI of hiring more engineers)</li>
</ul>
<p>This blog post is a confluence of a number of previously unrelated thoughts in my head. Some of it might be super obvious to readers, some of it not. Hopefully there was something in the latter category you found useful!</p>
Developer experience as a competitive advantage2020-10-06T00:00:00Zhttps://erikbern.com/2020/10/06/developer-experience-as-a-competitive-advantage.html<p>I've spent a ton of time looking at different software providers, both as a CTO and as a <del>nerd</del> “advanced” consumer who builds stuff in my spare time. In the last 10 years, an order of magnitude more products have appeared that cater directly to developers, through APIs, SDKs, and tooling. I'm pretty psyched about this trend. As the cost of building software goes down, that drives up the demand for software engineers. That then drives up the demand for even more products built <em>for software engineers</em>. That then drives down the cost of building software even more!</p>
<p>This flywheel seems like an excellent thing for our economy. You don't need to look too far to find a startup built on top of AWS that probably wouldn't have existed 10 years ago because it was too hard/expensive.</p>
<p>Back in the day, a lot less existed, and it was hard to buy software. If you bought it, it was shrink-wrapped on-prem software that came with expensive support contracts. Maybe you had to go through your company's procurement team and they would come up with a checklist, go to Gartner and look at vendors and all that stuff. Maybe I exaggerate, I don't know — I've suppressed those memories.</p>
<p>These days, some developer in your team discovers a solution to something they are struggling with, goes and signs up for a trial, and demonstrates a working prototype of an integration to their team in half an afternoon. This “consumerization” of developer services explains a lot of what's changing. Where everything previously went through a procurement team, developers now buy things themselves. This changes the sales process from one typically driven by a commissioned sales force to one primarily driven by self-service (but often still supported by a sales team for larger contracts).</p>
<h2 id="crappy-tech-vendors">Crappy tech vendors</h2>
<p>Consider a typical API product. Stuff you might have to do to get a prototype integration working:</p>
<ul>
<li>Schedule a meeting with a sales person</li>
<li>Wait x days to get API tokens</li>
<li>Get a PDF with integration docs</li>
<li>Deploy a service inside your production environment</li>
<li>Deal with firewalls, routing, DNS config, and certificates</li>
<li>Configure your authentication provider to work with it</li>
<li>Browse through confusing (often auto-generated) API docs that don't seem to be written for humans</li>
<li>Webhooks?</li>
<li>PGP Keys?</li>
<li>Environment variables?</li>
<li>…</li>
</ul>
<p><img src="https://erikbern.com/assets/tech-vendor.png" alt="tech vendor">
<em>This is my terrible mental image of every crappy tech vendor's website. I've spent six years in the mortgage industry, I've seen it all. Just give me an API token and docs!</em></p>
<p>It often feels like the developer experience is pretty much an afterthought, and I <a href="https://twitter.com/bernhardsson/status/1277673939435651074">keep making fun of it</a>. I imagine it's not just me but probably <em>millions</em> of developers spending an inordinate amount of time wading through confusing API docs. And this stuff is like, seriously hard! Mess it up and you might <a href="https://www.zdnet.com/article/capital-one-fined-80-million-for-2019-hack/">get slapped with an $80M fine</a> for <a href="https://krebsonsecurity.com/2019/08/what-we-can-learn-from-the-capital-one-hack/">misconfiguring IAM policies</a>. The cost of poor developer experience is real.</p>
<h2 id="growth-hacking-the-developer">Growth hacking the developer</h2>
<p>If you have spent any time in the last 10 years at any moderately successful consumer-facing startup, you have probably been exposed to their growth team. Growth teams iterate the living daylight out of the landing pages, onboarding funnel, account registration flow, remarketing campaigns, cart abandonment emails, and 1,000 other things that drives conversion rates.</p>
<p>Say what you want about it, but the end result is that it's about 100x easier to open an account at an online bank than it used to be, or buy bread, or send flowers to your parents, or whatever you want to do online. The other day, I took my toddlers to the playground and was running around chasing them while ordering gardening equipment on my phone.</p>
<p>Acquisition for a developer-focused business is just like any other online service these days. All the tricks in the playbook apply: SEO, retargeting, content marketing, and much more. It doesn't matter if the end goal of the conversion funnel is buying flowers or <em>hitting the sandbox API with a request</em>. The principles are the same.</p>
<h2 id="the-stack-overflow-momentum-effect">The Stack Overflow momentum effect</h2>
<p>Writing code is like 30%+ googling error messages and finding posts by other people who already solved the same issue. This suggests there's a strong virtuous cycle once you get momentum online. More users of a piece of software generate more Stack Overflow posts (and GitHub stars, blog posts, and what not), which improves the developer experience, which then wins more developers. These dynamics are set up to create winner-takes-all markets.</p>
<p>As it says in the Bible, the <a href="https://en.wikipedia.org/wiki/Matthew_effect">rich get richer and the poor get poorer</a>. If I were an investor in SaaS products (I am not!), I would ask for the number of Stack Overflow posts as the leading indicator of success. I'm like 75% serious.</p>
<p>You see this a lot in a different form. Companies have developer evangelists, community managers, or whatever they call it. They sponsor hackathons and run meetup groups, and you keep bumping into their colorful t-shirts at conferences.</p>
<h2 id="self-service-and-freemium">Self-service and freemium</h2>
<p>To increase the conversion rate, free tiers are often available. For instance, AWS has a free EC2 tier that you can sign up for, which lets you run basic things in the cloud. Running a free tier as a loss-leader is a bet that enough of those users will convert in the future.</p>
<p>But done right, it's actually more than that. Free users can trigger some sort of viral growth. They might generate questions and answers on Stack Overflow. They might write stuff on their blogs. They might tell their friends and coworkers — developers are actually far more social than we like to admit!</p>
<h2 id="open-source-as-a-freemium-model">Open source as a freemium model</h2>
<p><em>Open source</em> often ends up being the best example of a freemium tier (or I guess, alternatively, an acquisition channel). Feel free to run MongoDB or Kafka on your laptop, and build stuff on top of it. Once you get hooked on it, these companies offer super powerful cloud-hosted solutions that they make a ton of money on. Confluent (Kafka creator) is a $4.5B business — incredible! 👏</p>
<p>Although I think there's a great window of opportunity to turn open source software into business opportunities <em>right now</em>, I do <a href="https://twitter.com/bernhardsson/status/1311482732850155522">have some reservations</a> about it in the long run. This might have to be another blog post one day.</p>
<h2 id="developer-experience-as-a-competitive-advantage">Developer experience as a competitive advantage</h2>
<p>Wrapping it up, what do I think is going to happen? I don't know, but here's where I would put my money:</p>
<ul>
<li>There's going to be a group of companies who understand this trend and go all in on it. Stripe, Twilio, and Netlify are three great examples.</li>
<li>There's going to be a bunch of dinosaurs that die out. Any direct Stripe competitor that <em>requires</em> you to talk to a salesperson, or has a super confusing SOAP API, or whatever.</li>
<li>Some companies will struggle to reorient themselves, and it could go either way. AWS feels like the million ton gorilla here. Some of their services are fantastic, but many are of highly questionable quality. Will they be able to bring that spark of joy to developers or will they get unbundled and outcompeted? Time will tell.</li>
<li>The market for products for software developers will grow by another 10x, easily</li>
<li>The <em>quality</em> of those products will be vastly better. It will be <em>easy</em> to get started, and <em>fun</em> to use them.</li>
<li>What's going to be for sale out there? I think the opportunities are endless, but here are some general categories of things:
<ul>
<li>Pretty much everything that isn't core business logic should rationally be pulled out of your codebase and sold back to you at a fraction of the cost.</li>
<li>Probably anything you can do in the real world. Ship things. Pay people. Sue someone. Get an apartment cleaned.</li>
<li>Get data on anything. The atmospheric methane concentration in Bangladesh. The current water flow of the Mississippi. All the parking tickets for a person.</li>
<li>Anything you have to set up and <em>run</em> on your own infrastructure (example: Kubernetes)</li>
<li>Much better closed-source versions of the tools in the previous category (example: Datadog).</li>
<li>Any special internal infrastructure/tools that so far only Google and the like have had.</li>
<li>The great AWS unbundling: any product that AWS has which is just OK but not great.</li>
<li>Productivity tools for developers. Tools that let you remove time wasted not solving problems.</li>
</ul>
</li>
</ul>
<p>If there's any persistent lesson we've learned about the future, it's that it's a lot harder to predict than we think. But I think it's a pretty safe bet that software engineers will have some amazing tools in 2030 that make today's tools feel antiquated. Looking forward to it!</p>
Mortality statistics and Sweden's "dry tinder" effect2020-09-23T00:00:00Zhttps://erikbern.com/2020/09/23/mortality-statistics-and-swedens-dry-tinder-effect.html<p>We live in a year of about 350,000 amateur epidemiologists and I have no desire to join that “club”. But I read <a href="https://www.aier.org/article/swedens-high-covid-death-rates-among-the-nordics-dry-tinder-and-other-important-factors/">something about COVID-19 deaths</a> that I thought was interesting and wanted to see if I could replicate it with data. Basically the claim is that Sweden had an exceptionally “good” year in 2019 in terms of influenza deaths, causing more deaths to be “overdue” in 2020.</p>
<p>This post is not an attempt to draw any scientific conclusions! I just wanted to see if I could get my hands on any data and visualize it. I'm going to share some plots and leave it to the reader to draw their own conclusions, or run their own experiments, or whatever they want to do!</p>
<p>As it turns out, the <a href="https://www.mortality.org/">Human Mortality Database</a> has some really awesome statistics about “short-term mortality fluctuations” so let's see what we can do with it!</p>
<p><em>*Rolls up sleeves.*</em></p>
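<p>(If you want to play along at home: the dataset is a flat CSV, so getting it into pandas takes a few lines. A minimal sketch follows; the column names and the number of preamble rows are from memory and may be slightly off.)</p>
<pre><code>import pandas as pd

# Short-Term Mortality Fluctuations (STMF) file from mortality.org.
# Assumed layout: a couple of preamble rows, then columns including
# CountryCode, Year, Week, Sex, and RTotal (an annualized death rate).
df = pd.read_csv("stmf.csv", skiprows=2)

both = df[df["Sex"] == "b"]  # both sexes combined
nordics = both[both["CountryCode"].isin(["SWE", "DNK", "NOR", "FIN", "ISL"])]

# Convert the annualized rate to deaths/day per 1M people, then smooth
# with a ~1-year (52-week) rolling average per country.
rate = nordics.set_index(["CountryCode", "Year", "Week"])["RTotal"]
per_day_per_1m = rate * 1e6 / 365
smooth = per_day_per_1m.groupby(level="CountryCode").rolling(52).mean()
</code></pre>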
<p>Let's first look at the most basic time series plot. We'll start with the Nordics:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/nordics.svg" alt="nordics"></p>
<p>There's a lot of seasonality! And a lot of noise! Let's make it a bit easier to follow trends by looking at rolling 1 year averages:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/nordics-1y-rolling-avg.svg" alt="nordics 1y rolling avg"></p>
<p>Phew, that's a bit easier on my poor eyes. As you can see, it's not an unreasonable claim that Sweden had a “good year” in 2019 — overall death rates dropped from 24 to 23 deaths/day per 1M. That's a pretty huge drop! Until looking at this chart, I had never realized that death rates were so volatile from year to year. I also would never have anticipated that death rates are so seasonal:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/nordics-seasonality.svg" alt="nordics seasonality"></p>
<p>Unfortunately the dataset doesn't break out causes of death, so we don't know what's driving this. Amazingly, from a cursory online search, there seems to be <a href="https://www.demogr.mpg.de/books/drm/003/2.pdf">no research consensus</a> on why it's so seasonal. It's easy to picture something about people dying in cold climates, but interestingly the seasonality isn't much different between, say, Sweden and Greece:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/sweden-greece-seasonality.svg" alt="nordics seasonality"></p>
<p>What's also interesting is that the beginning of the year contains most of the variation in what counts as a “bad” or a “good” year. You can see that by looking at year-to-year correlations in death rates broken down by quarter. The correlation is much lower for quarter 1 than for other quarters:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/scatterplot.svg" alt="scatterplot"></p>
<p>(I only used data up until 2018-2019 in this scatterplot, since COVID-19 causes a weird cluster of points.)</p>
<p>I'm still super confused. My only two guesses for why there's so much seasonality and year-to-year variation would be:</p>
<ol>
<li>Some winters are really mild, some are really bad</li>
<li>Influenza season hits differently in different years</li>
</ol>
<p>But not a ton of people die of influenza, so it doesn't seem likely. What about cold weather? I guess plausibly it could lead to all kinds of things (people stay inside, so they don't exercise? Etc). But I don't know why it would affect Greece as much as Sweden. No idea what's going on.</p>
<h2 id="mean-reversion-two-year-periodicity-or-dry-tinder">Mean reversion, two-year periodicity, or dry tinder?</h2>
<p>I was staring at the rolling 1 year death statistics for a really long time and convinced myself that there's some sort of negative correlation year-to-year: a good year is followed by a bad year, is followed by a good year, etc. This hypothesis sort of makes sense: if influenza or bad weather (or anything else) provides the “final straw”, then maybe a “good year” just postpones all those deaths to the next year. So if there truly was this “dry tinder” effect, then we would expect a <em>negative correlation</em> between the change in death rates of two subsequent years.</p>
<p>Let's look again at the Nordics:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/nordics-1y-rolling-avg.svg" alt="nordics 1y rolling avg"></p>
<p>Let's look at Germany/Switzerland/Austria, for which the mortality stats barely budged:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/german-speaking-1y-rolling-avg.svg" alt="german speaking countries 1y rolling avg"></p>
<p>UK, Belgium, and Netherlands, which have much bigger increases in mortality:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/northwestern-europe-1y-rolling-avg.svg" alt="northwestern europe 1y rolling avg"></p>
<p>I mean, looking at the chart above, it clearly <em>feels</em> like there's some sort of 2 year periodicity with negative correlations year-to-year. Italy, Spain, and France:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/southern-europe-1y-rolling-avg.svg" alt="southern europe 1y rolling avg"></p>
<p>So is there evidence for this? I don't know. As it turns out, there <em>is</em> a negative correlation if you look at changes in death rates: a positive change in death rate from year T to T+1 is negatively correlated with the change in death rate between T+1 and T+2. But if you think about it for a bit, this actually doesn't prove anything! A completely random series <a href="https://gist.github.com/erikbern/0509a6f22fd0927f07ef48908a700f43">would have a similar behavior</a> — it's just mean-reversion! If there's a year with a very high death rate, then by mean reversion the next year should have a lower death rate, and vice versa, but that's a statistical artifact of the noise, not evidence of any underlying mechanism.</p>
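<p>(You can convince yourself of this in a few lines of numpy: take pure iid noise, with no “dry tinder” dynamics whatsoever, and consecutive changes still come out with a correlation of about −0.5.)</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)   # pure iid noise: no dynamics at all
dx = np.diff(x)                # year-over-year "changes in death rate"

# Correlation between consecutive changes comes out at about -0.5:
# a big jump up tends to be followed by a drop, purely from mean reversion.
print(np.corrcoef(dx[:-1], dx[1:])[0, 1])
</code></pre>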
<p>If I look at the change in death rate between year T and T+2 vs the change between year T and T+1, there's actually a positive correlation, which doesn't quite support the dry tinder hypothesis.</p>
<p><img src="https://erikbern.com/assets/mortality-stats/correlation.svg" alt="correlation"></p>
<p>I also fit a regression model: $$ x(t) = \alpha x(t-1) + \beta x(t-2) $$. The best fit turns out to be roughly $$ \alpha = \beta = 1/2 $$ which is entirely consistent with looking at random noise around a slow-moving trend: our best guess based on two earlier data points is then simply $$ x(t) = ( x(t-1) + x(t-2) )/2 $$.</p>
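<p>(For the curious, fitting $$ \alpha $$ and $$ \beta $$ is a small least-squares problem. A sketch, with a random stand-in for the real death-rate series:)</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
x = 24 + 0.5 * rng.normal(size=40).cumsum()  # stand-in for a yearly death-rate series

# Fit x(t) = alpha * x(t-1) + beta * x(t-2) by ordinary least squares.
A = np.column_stack([x[1:-1], x[:-2]])       # lagged predictors x(t-1), x(t-2)
(alpha, beta), *_ = np.linalg.lstsq(A, x[2:], rcond=None)
print(alpha, beta)
</code></pre>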
<p>If we had found that $$ \alpha < 0 $$, that would have implied that a “good” year tends to be followed by a “bad” year, and vice versa. This would be my most “strict” interpretation of the “dry tinder” hypothesis, and it's not what we're finding.</p>
<p>However, the solution we find has a bit of a two-year periodicity. You can turn the recurrence relation $$ x(t) = ( x(t-1) + x(t-2) )/2 $$ into the polynomial equation $$ x^2 = \frac{1}{2} x + \frac{1}{2} $$. If I'm not mistaken, this is called the “characteristic polynomial” and its roots tell us something about the dynamics of the system. The roots are -1/2 and 1, and the negative root implies a two-year damped oscillating behavior. So at least that shows <em>something</em> along the lines of what we're looking for. I think this implies that a two-year average might be a better way to smooth it, and at least qualitatively it looks that way:</p>
<p><img src="https://erikbern.com/assets/mortality-stats/nordics-1y-2y-rolling-avgs.svg" alt="nordics 1y+2y rolling avgs"></p>
<p>A fun thing is that we can actually use this method to forecast the curves forward (I added “last week” as a third term in the regression):</p>
<p><img src="https://erikbern.com/assets/mortality-stats/forecast.svg" alt="forecast"></p>
<p>My confidence in these predictions is roughly zero.</p>
<h2 id="appendix">Appendix</h2>
<p>This is not a proof of anything! This is obviously <em>extremely far</em> from the scientific standards required for publication. So why am I posting this? Mostly because</p>
<ol>
<li>I thought the Human Mortality Database was a really cool public dataset.</li>
<li>These mortality statistics were sort of surprising, at least to me.</li>
<li>I haven't posted much on my blog and felt compelled to write something!</li>
</ol>
<p>On the last topic, I'll try to get back into a regular habit. Sorry!</p>
How to set compensation using commonsense principles2020-06-08T00:00:00Zhttps://erikbern.com/2020/06/08/how-to-set-compensation-using-commonsense-principles.html<p>Compensation has always been one of the most confusing parts of management to me. Getting it right is obviously <em>extremely</em> important. Compensation is what drives our entire economy, and you could look at the market for labor as one gigantic resource-allocating machine in the same way as people look at the stock market as a gigantic resource-allocating machine for investments.</p>
<p>Yet, almost everything I learned about compensation comes from practical experience, so what follows is a blog post I wish I could have read ten years ago! I wrote it from the perspective of a manager at a tech startup, but hope there's some parts that generalize.</p>
<h2 id="goodbad-compensation-systems">Good/bad compensation systems</h2>
<p>I'll start by stating what I think are goals and anti-goals:</p>
<ul>
<li>A good system doesn't waste money hiring new people when you can pay to keep existing people.</li>
<li>A bad system keeps people below the salary that you would give them to keep them.</li>
<li>A good system pre-empts people interviewing elsewhere.</li>
<li>A bad system gives raises mostly when people interview elsewhere and get offers.</li>
<li>A good system phases out underperformers or makes it clear they will never get a raise until they step it up.</li>
<li>A bad system gives raises to people who are already paid more than you would pay to hire them today.</li>
<li>A good system pays people on an absolute scale, where salary history is not a factor.</li>
<li>A bad system thinks of salary increases in relative terms (e.g. top performers get +5%).</li>
<li>A good system doesn't pay more just because someone interviews with many companies.</li>
<li>A bad system determines offers to new hires mostly based on their other offers.</li>
<li>A good system pays only for the value someone is creating, not for irrelevant things.</li>
<li>A bad system pays based on years of experience or pedigree.</li>
<li>A good system gives the biggest raises to humble people who never asked for it, but deserve it.</li>
<li>A bad system gives raises to people when they ask for it, and gives more when people complain more.</li>
<li>A good system understands how to use equity to motivate people.</li>
<li>A bad system penalizes people for trading off salary for equity.</li>
<li>A good system incentivizes people to do whatever adds the most value to the company.</li>
<li>A bad system incentivizes people to go into management in order to get a raise.</li>
<li>A good system has a clear link between performance and salary that people understand.</li>
<li>A bad system makes people frustrated because they don't understand why some people make more than others.</li>
</ul>
<p>Phew, that was a lot! Most of it should be sort of self-evident, or at least it feels so to me. But I've seen a lot of companies mess it up! What are some <em>principles</em> we can use to avoid all of these traps? Hold my beer as I go through:</p>
<h2 id="principle-1-align-with-market">Principle 1: align with market</h2>
<p>Ignoring for a second that productivity is basically impossible to quantify, let's say we made a scatter plot of employees at the company <em>CrazyBananaCo</em>, where the x-axis is how much value they create and the y-axis is their cost.</p>
<p><img src="https://erikbern.com/assets/salary/figure-1.png" alt="figure 1"></p>
<p>Just to make it a bit simpler, let's say the chart only includes people who spent at least 1 year at the company, so that we can roughly ignore the “ramp time”. There are about 99 other simplifications here, some of which I address at the end of this post, but the key point is that this is a <em>toy model</em> that still lets us say a lot about compensation management!</p>
<p>Two quantities are important benchmarks for someone's salary:</p>
<ul>
<li>Market rate: if this person started here today, what would we have to pay to get them?</li>
<li>Replacement cost: how much are we willing to pay in order to get this person to stay?</li>
</ul>
<p>The market rate is not a perfect “price”, so I picture it more as a <em>range</em>. The replacement cost will be a bit higher than the market rate because of the extra cost of hiring/training the person. It's rational that we would rather pay a bit more to get someone to <em>stay</em> versus hiring an identical person coming in from scratch.</p>
<p>By <em>productivity</em>, I really mean <em>value</em>, so the diagonal line represents the curve where the net benefit of an employee is exactly $0. In general, <em>market rate ≤ replacement cost ≤ value</em>.</p>
<p>Once we calibrate this against market/replacement, hopefully we'll end up with something like this:</p>
<p><img src="https://erikbern.com/assets/salary/figure-2.png" alt="figure 2"></p>
<p>If your compensation is aligned with the market, then no one should be below the market range (give them a raise before they quit), and no one should be above replacement cost (then they should be replaced). Ideally pretty few should be above the market range but below replacement cost (the purple zone), although I consider that zone a bit of a special “insurance card” you can buy, like if someone on your team has some deep domain knowledge that's absolutely crucial <em>right now</em> so you don't have time to hire a replacement and train them.</p>
<p>But how do you know the market rate and the replacement cost? You really don't ever know it, but the best way to get a good pulse on it is through recruiting. A good offer acceptance rate is in the 70-80% range: too low and you might be underpaying; too high and you might be overpaying. If you make a lot of offers, you should develop a pretty good understanding of the market from making offers and learning about what other companies are offering. There are also datasets such as <a href="https://radford.aon.com/surveys">Radford</a>, which you have to pay for, and <a href="https://levels.fyi">levels.fyi</a>, which is a free database of self-reported salaries.</p>
<h2 id="principle-2-create-consistency">Principle 2: create consistency</h2>
<p>Let's consider the company <em>SuperFairCo</em>, which has a super nice consistent pay distribution.</p>
<p><img src="https://erikbern.com/assets/salary/figure-3.png" alt="figure 3"></p>
<p>Let's say SuperFairCo is considering hiring Meg—an engineer highly desired in the market. Other companies are offering a lot as well. But we still have an opportunity to hire her at a cost lower than her value. Should we do it?</p>
<p><img src="https://erikbern.com/assets/salary/figure-4.png" alt="figure 4"></p>
<p>Meg would generate surplus value, so from that perspective it might seem like a good trade. But there's a subtle catch. Consider these situations:</p>
<ol>
<li>Joe in the HR team accidentally shares a spreadsheet with everyone's salaries. Mike notices someone with less experience than him is making more money.</li>
<li>Anna goes to her manager and asks for more money. She's currently making $150,000 and the company values her contributions as $200,000. She's asking for $190,000 or else she will quit.</li>
</ol>
<p>The last one is especially tricky. If you comply, you're still benefitting from having Anna on the team, but you just created a culture where people who complain to their manager get more money and people who don't complain get no raise. That seems like bad incentives!</p>
<p>The only solution to all of this is to make sure salaries are <em>consistent</em>. A useful litmus test is: <strong>if all salaries were made fully transparent, would people be upset?</strong> If so, you might have an inconsistency problem.</p>
<p>Our hypothetical offer to Meg comes with two big hidden costs:</p>
<ol>
<li>The heartburn of creating a very inconsistent pay scale (with all the bad feelings if people find out)</li>
<li>The hypothetical dollar costs of raising salaries for many other people if you ever want to get the pay scale back to consistency:</li>
</ol>
<p><img src="https://erikbern.com/assets/salary/figure-5.png" alt="figure 5"></p>
<p>On the other hand, a consistency-based model implies the salary range for Meg's offer is much tighter, and that SuperFairCo is willing to lose her even at a salary that generates surplus value for the company. And this is why the company is probably better off declining Anna's request for a raise.</p>
<p><img src="https://erikbern.com/assets/salary/figure-6.png" alt="figure 6"></p>
<p>I kind of glossed over what “consistency” exactly means here, but just to clarify: <em>if person A is more productive than person B, then A should have a higher salary than B.</em> I think that's reasonable!</p>
<h2 id="corollaries-of-a-consistency-based-framework">Corollaries of a consistency-based framework</h2>
<p>I think the consistency aspect largely explains why companies have a tighter “cap” on salary than you would otherwise expect. They might not be willing to pay above market salary for someone, even though their salary is below replacement cost, and even though they generate a lot of surplus value.</p>
<p>The theory also implies that companies will be really risk-averse about offers. There's an asymmetry in the “total inconsistency”, where overpaying often creates a lot more inconsistencies than underpaying. Rationally a company may thus lowball such an offer.</p>
<h2 id="putting-it-to-practice-the-salary-calibration-process">Putting it to practice: the salary calibration process</h2>
<p>Let's discuss how to actually calibrate salaries. We'll consider another startup MegaHyperCo with a somewhat inconsistent salary distribution:</p>
<p><img src="https://erikbern.com/assets/salary/figure-7.png" alt="figure 7"></p>
<p>Let's first aim for consistency. One way to do it is to simply give the smallest raise to everyone in order to bring the total inconsistency to zero:</p>
<p><img src="https://erikbern.com/assets/salary/figure-8.png" alt="figure 8"></p>
<p>The method I used in the chart above isn't meant to be taken extremely literally. And note that this example started with a somewhat extreme distribution—ideally, you never deviate very far from consistency, and so the inconsistency-driven raises should be small. But hopefully you get the point of what I'm doing here!</p>
<p>A major benefit of the consistency-based calibration is that it doesn't require modeling out a person's <em>value</em>. All we need is to be able to rank people. It's almost impossible to argue that Abdul who's an associate engineer making $120,000 is exactly 69% as productive as Pierre the senior engineer making $175,000. But we can often <em>rank</em> Pierre vs Abdul.</p>
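<p>This “smallest raise to restore consistency” idea is simple enough to sketch in a few lines of Python. A toy version, assuming all we have is the rank ordering (names and numbers made up):</p>
<pre><code>def consistency_raises(employees):
    """employees: (name, salary) pairs ordered from least to most productive.
    Returns the minimal raises that make salaries non-decreasing in rank."""
    raises, floor = {}, 0
    for name, salary in employees:
        target = max(salary, floor)  # at least what anyone ranked below makes
        if target != salary:
            raises[name] = target - salary
        floor = target
    return raises


team = [("Abdul", 120_000), ("Dana", 150_000), ("Pierre", 145_000)]
print(consistency_raises(team))  # {'Pierre': 5000}
</code></pre>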
<p>We should also calibrate against market salary. Usually, markets don't move much, so the calibrations will be minor, but occasionally when markets move quickly, this could mean changing a larger set of salaries (say 50%+ of the team). You definitely shouldn't just give everyone a blanket x% raise! Instead, do it in a way that targets the underpaid employees:</p>
<p><img src="https://erikbern.com/assets/salary/figure-9.png" alt="figure 9"></p>
<p>Note that the two models (figures 8 and 9) recommend raises to roughly the same set of employees:</p>
<p><img src="https://erikbern.com/assets/salary/figure-10.png" alt="figure 10"></p>
<p>What do you do with people who are overpaid? I've never <em>lowered</em> anyone's salary, and if they are vastly overpaid (above their replacement cost) then they should probably be let go instead. But it occasionally happens that someone is <em>slightly</em> overpaid, but still generating surplus value. If you're not letting them go then you don't have much of a choice here: <em>keep</em> their salary the same for multiple years in a row, potentially indefinitely. If that person makes a fuss and ends up quitting, that's OK!</p>
<p>This system is greatly simplified by introducing <em>levels</em>. By bucketing everyone into, say, 5-10 levels with clearly defined expectations, you can build salary ranges for each level and use those for calibration as well. There are many other benefits of levels (and some drawbacks) but I'll save that for another time!</p>
<p>I have done something similar to this every six months for the past four years and it works reasonably well. A year ago I started bringing in my reports (who each manage 20-30 people), and we spend a few hours calibrating until we feel like the final compensation data is reasonably consistent. I try to target about 30% of the team getting a raise each year, so roughly 15% every six months.</p>
<h2 id="other-considerations">Other considerations</h2>
<p>What happens when someone asks for a raise? In almost all cases, the answer should be <em>no.</em> You should do the salary calibration process in a way that people get the salaries they deserve, but not more. The only exception to this rule is if their performance is way out of range and there's a real retention risk—essentially you're just doing an early off-cycle calibration.</p>
<p>I also strongly recommend against paying more to new managers than individual contributors. This incentivizes people to go into management for the wrong reasons, and makes it very hard for people to step back if they decide management isn't what they wanted to do. In my experience, something like half of all managers realize they were happier not managing, and you don't want to lock people in.</p>
<h2 id="other-types-of-compensation">Other types of compensation</h2>
<p>So far, I've simplified the problem a bit and talked as if base salary is the only consideration. There are of course many other things that constitute total compensation, so let's consider what I think are the most common ones:</p>
<ol>
<li>Base salary</li>
<li>Bonus</li>
<li>Equity</li>
<li>Sign-on bonus</li>
</ol>
<p>Base salary is straightforward so let's skip to the other three.</p>
<h2 id="bonus">Bonus</h2>
<p>An odd thing about bonuses is how different they are across jobs. Most salespeople are heavily commission-based. Most software engineers are compensated in stock and salary, with little/no bonus. Why? Is this just some dumb artifact of how the markets for those roles have evolved?</p>
<p>Of course not. You can easily <em>quantify</em> a salesperson's contribution to the company, and paying them commission aligns their compensation with that contribution. It's extremely hard to quantify the contribution of software engineers. This is sort of an obvious point, but people forget about it! Why do traders and salespeople have bonuses? Why don't many other jobs? An excellent short read on this is <a href="https://www.amazon.com/Measuring-Managing-Performance-Organizations-Robert/dp/0932633366">Measuring and Managing Performance in Organizations</a>.</p>
<p>I don't have as much experience with bonuses, but my observation is that bonuses tend to work only in these three circumstances:</p>
<ol>
<li>The work result is quantifiable and aligned with the business. This is true for salespeople, but also for quant developers and some other jobs.</li>
<li>The bonus is paid ad-hoc for hard hustle, maybe lots of overtime to hit a deadline. Management can dangle a bonus in front of employees to get something done quicker. This should probably only be a rare thing.</li>
<li>The bonus is roughly guaranteed (maybe even on the offer letter), and set to a fixed amount in advance. It's revoked only if the employee is severely underperforming. This seems to be the culture of Google and a few big companies, but I could be wrong and I'm hesitant to say much more since I don't know.</li>
</ol>
<p>My feeling is that for many professions, a bonus tied to performance reviews or ratings is blunt and arbitrary, and should be avoided. For jobs where performance takes a long time to observe, sustained good performance should be reflected in higher salary+equity instead. I've always been skeptical about paying performance bonuses to software engineers other than in truly <em>exceptional</em> cases.</p>
<h2 id="equity-compensation">Equity compensation</h2>
<p>Equity is a key part of the compensation at startups. Some of the arguable benefits:</p>
<ol>
<li>Financially, it preserves cash for cash-strapped startups, and offers another currency so they can compensate top talent.</li>
<li>Economically, it aligns incentives between employees and the company, although I'm skeptical this matters past the first 5-15 people.</li>
<li>Psychologically, employees will rationalize their holdings and root for the company. I think this one might be underestimated, but it's hard to tell? There could also be some kind of “survivor bias”—people who deeply believe in your biz idea will be more likely to join.</li>
<li>Practically, employees will be tied to the job because they don't have the money to exercise options. IMO this is kind of a dumb point because (a) if the company is doing well then you can usually arrange financing, and (b) it's probably not in the company's best interest to keep disgruntled employees who are only staying because of the equity.</li>
</ol>
<p>My highly speculative perception is that employees will view $1 of equity as having something like the following value:</p>
<p><img src="https://erikbern.com/assets/salary/equity-value.png" alt="equity value"></p>
<p>What's going on here?</p>
<ul>
<li>In the early days, people will take a huge salary discount to get something like $100,000 in equity in a promising super-early startup. Everyone in the tech industry has a friend (or a friend of a friend) who made a bajillion dollars in some acquisition. At this point, you can probably get away with paying substantially lower than the market rate in terms of base salary.</li>
<li>Once the company matures, you're not going to get these early crazy people, and people will view equity as some weird paper money that may or may not be worth money in many years. At this point, you're going to have to raise the base salary substantially.</li>
<li>As you approach the IPO, people are excited to get into something that will have a “pop” once it goes public, and that they can sell relatively soon.</li>
<li>Once the company goes public, equity compensation is pretty much liquid, just like cash, with some annoying constraints (trading windows, lockup periods, etc).</li>
</ul>
<p>There are also differences in risk aversion and other things to be attuned to—some of which depend on personality, some on the person's life situation.</p>
<p>What's the point of all of this? When you look at compensation, you should be smart about the breakdown between salary and equity. Optimizing for what people value increases your bargaining power.</p>
<p>In my opinion, the biggest thing companies mess up with equity compensation is to <em>not explain it very well</em>. This causes people to discount its value, which means companies have to pay more salary to compensate for it. Companies should think more about this!</p>
<h2 id="sign-on-bonus">Sign-on bonus</h2>
<p>I'm not going to write too much about this one. I used to think sign-on bonus was the world's dumbest thing. Why pay more in the first year when an employee is the least productive? But from a company's point of view, it offers a critical advantage: you can retain salary consistency, while still being market competitive.</p>
<p>This applies especially in cases where there's significant uncertainty about a candidate. A $150,000 salary with a $20,000 sign-on bonus is essentially like a salary that drops from $170,000 to $150,000 the second year. If the person is awesome, you can always raise the salary back. If they aren't, then you keep them at their new “lower” salary, and keep your team's compensation consistent.</p>
<h2 id="total-compensation">Total compensation</h2>
<p>When you try to get salary consistency, you should really look at the total compensation number, not just base salary. Here's my opinionated view on how to do this:</p>
<ul>
<li>Base salary is obviously part of it.</li>
<li>Equity is part of it, divided by the vesting period, and <strong>using the original value of the grant,</strong> not the present value. And for options, subtract the strike price. I also often think it's worth putting a small discount (10-30%) on the value to compensate employees for the risk.</li>
<li>Bonus should only be a part of it if it's the “roughly guaranteed” type of bonus.</li>
</ul>
<p>Let's look at some examples. Say the value per share is $4 and the strike price is $1, so the value per option is $3. We'll discount that to $2.5 for the purpose of compensating employees for the risk. The shares vest over four years.</p>
<p>Let's consider two offers that are roughly equivalent from the company's point of view:</p>
<ol>
<li>$100,000 in base salary, with 50,000 shares: total value is $100,000 + 50,000 · $2.5 / 4 = $131,250</li>
<li>$110,000 in base salary, with 34,000 shares: total value is $110,000 + 34,000 · $2.5 / 4 = $131,250</li>
</ol>
<p>Two years later, the new value per option isn't $2.5 but $10. What happens to the total compensation? Using the <em>value at grant</em>, the compensation is still the same. Using the <em>present value</em>, the total compensation is now:</p>
<ol>
<li>$100,000 in base salary, with 50,000 shares: total value is $100,000 + 50,000 · $10 / 4 = $225,000</li>
<li>$110,000 in base salary, with 34,000 shares: total value is $110,000 + 34,000 · $10 / 4 = $195,000</li>
</ol>
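<p>Here's a minimal sketch that reproduces both calculations, using the made-up numbers and discounting convention from above:</p>
<pre><code class="language-python">def total_comp(base, shares, value_per_option, vesting_years=4):
    """Annualized total compensation for an option grant."""
    return base + shares * value_per_option / vesting_years

offers = [(100_000, 50_000), (110_000, 34_000)]
for base, shares in offers:
    at_grant = total_comp(base, shares, 2.5)  # discounted value at grant
    present = total_comp(base, shares, 10.0)  # value two years later
    print(f"base={base}: at grant {at_grant:,.0f}, present {present:,.0f}")
# base=100000: at grant 131,250, present 225,000
# base=110000: at grant 131,250, present 195,000
</code></pre>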
<p>Let's say we had hired two employees, one who picked each option. Is either of them underpaid compared to the other? One has a lower base salary, but the other one has a lower total compensation. To me it's clear that these two <em>should be considered to have the same compensation.</em> So the logical way to think about total compensation is to use the value at the grant date.</p>
<p>This is an underemphasized point. HR people generally like to use the present value (because it makes the numbers look more impressive), or exclude equity compensation completely. This applies to resources like <a href="https://levels.fyi">levels.fyi</a> and Radford as well (both use <em>present</em> value, afaik). So when you do cross-company calibrations, be careful looking at total compensation.</p>
<p>Of course, use your own best judgement. If someone came in early and is vesting trillions of dollars of shares each month, they probably aren't a retention risk, and you might be fine looking at the present value of the shares. Just be judicious!</p>
<h2 id="conclusion">Conclusion</h2>
<p>I started writing this post thinking it would be a quick one, but it ballooned into 4,000 words and that's after <em>heavy</em> editing where entire sections were removed. There's a lot more to be said, for sure. But hopefully you'll find some of this useful!</p>
<h2 id="notes">Notes</h2>
<ul>
<li>Throughout this post, I assumed people's productivity is a <em>known exact</em> quantity. Of course, that's a gross simplification, but it's a toy model, OK :)</li>
<li>Likewise, I assumed the market rate is a pretty tight band. There's always going to be weird outliers: people having deep expertise in things you don't care about, etc.</li>
<li>Where does the market rate come from? In theory, the market cost should converge to the <em>second highest value</em> that any company would get out of a candidate. I think this convergence is extremely <em>slow</em>, but definitely happens in the long run.</li>
<li>Of course, total compensation isn't the only thing that matters to people. The ability to learn and grow, perks, lifestyle, titles, the office, and many other things matter as well! But none of this affects the mechanics of the models.</li>
<li>I was tempted to write about remote work, but I really don't have much experience with it. The question I'm fascinated by: what's the equilibrium?</li>
<li>Note that a consistency-based model doesn't necessarily imply that a company pays the same salary across different locations! They could legitimately pay 2x salary to employees in NYC vs Boise, ID. People may grumble a bit, but the company could say something half-hearted about adjusting anyone's salary if they move to NYC.</li>
<li>You could even state inconsistency in mathematical terms. One way would be to define the inconsistency loss $$ I = \sum_{i} \max_j (s_j - s_i) H(p_i - p_j) $$, where $$ s_i $$ are salaries, $$ p_i $$ are productivities, and $$ H $$ is just a function that is 1 for positive numbers and 0 for negative ones (aka the Heaviside step function). The loss is zero if you have no inconsistency whatsoever, but positive as soon as the productivity-salary curve is no longer <a href="https://en.wikipedia.org/wiki/Monotonic_function">monotonic</a>. (A direct translation into code follows this list.)</li>
<li>A fun theory! Consistency implies that roles with <em>more people</em> will have <em>less upside flexibility.</em> If Amazon gives a 5% raise to <em>one</em> of the warehouse workers, then they kind of have to give a raise to another 100,000 warehouse workers! But if they give a huge raise to an SVP of Mega-Technology, then that doesn't really introduce any inconsistency.</li>
<li>If someone is <em>slightly</em> overpaid, you of course shouldn't just keep their salary flat without having a blunt conversation with them about their performance and the expectations you have of them, maybe also putting them on a performance improvement plan.</li>
<li>I modeled the function of salary vs productivity as a sublinear relation. The astute reader might point out that in that case we should only hire people as far out on the x-axis as possible! That's maybe true, but there's more complexity to it, and <a href="/2019/02/21/headcount-targets-feature-factories-and-when-to-hire-those-mythical-10x-people.html">I've written about some of it</a> in the past.</li>
<li>I didn't talk about equity types, but there's a whole rabbit hole of RSUs vs ISOs vs NSOs and their legal/tax implications which would take a whole book to cover!</li>
</ul>
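<p>As promised in the notes above, here's the inconsistency loss as a minimal numpy sketch (the toy numbers are mine):</p>
<pre><code class="language-python">import numpy as np

def inconsistency_loss(salaries, productivities):
    """I = sum_i max_j (s_j - s_i) * H(p_i - p_j).

    Zero when salary is monotonic in productivity, positive otherwise.
    """
    s = np.asarray(salaries, dtype=float)
    p = np.asarray(productivities, dtype=float)
    loss = 0.0
    for i in range(len(s)):
        # H(p_i - p_j) keeps only colleagues j who are less productive than i
        gaps = (s - s[i]) * (p[i] > p)
        loss += max(gaps.max(), 0.0)
    return loss

# One person (productivity 3) out-earned by a less productive colleague:
print(inconsistency_loss([120, 175, 150], [1, 2, 3]))  # 25.0
</code></pre>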
Never attribute to stupidity that which is adequately explained by opportunity cost2020-03-10T00:00:00Zhttps://erikbern.com/2020/03/10/never-attribute-to-stupidity-that-which-is-adequately-explained-by-opportunity-cost.html<p><a href="https://en.wikipedia.org/wiki/Hanlon%27s_razor">Hanlon's razor</a> is a classic aphorism I'm sure you have heard before: <em>Never attribute to malice that which can be adequately explained by stupidity.</em></p>
<p>I've found that neither malice nor stupidity is the most common reason when you don't understand why something is a certain way. Instead, the root cause is probably just that <em>they didn't have time yet</em>. This happens all the time at startups (maybe a bit less at big companies, for reasons I'll get back to).</p>
<p>Some examples of things I hear all the time:</p>
<ul>
<li>I don't understand why <em>team X</em> isn't working on <em>feature idea Y</em>. It's such an obvious thing to do!</li>
<li>Why is <em>bug Z</em> still present? Hasn't it been known for a really long time? I don't understand why they aren't fixing it?</li>
<li>I don't get why the HR team still doesn't offer <em>perk W</em>. So many other companies have that!</li>
<li>I've told my manager that we need a process for <em>thing V</em> but it still hasn't happened!</li>
</ul>
<p>Of course, the reason these things never happened is that <em>something else was more important</em>. Quoting Wikipedia on <a href="https://en.wikipedia.org/wiki/Opportunity_cost">Opportunity cost</a>:</p>
<blockquote>
<p>When an option is chosen from alternatives, the opportunity cost is the “cost” incurred by not enjoying the benefit associated with the best alternative choice. The New Oxford American Dictionary defines it as “the loss of potential gain from other alternatives when one alternative is chosen.” In simple terms, opportunity cost is the benefit not received as a result of not selecting the next best option.</p>
</blockquote>
<p>Thus I've started telling people: <em>Never attribute to stupidity that which is adequately explained by opportunity cost.</em></p>
<h2 id="your-friends-arent-stupid-just-busy">Your friends aren't stupid, just busy</h2>
<p>It might seem obvious the way I put it that opportunity cost is the thing to blame. But it's not. Human psychology works in weird ways. People love to conclude that something wasn't done because they are <em>stupid</em>, or possibly lazy.</p>
<p>This happens about 95% of the time when you don't <em>know</em> a certain person/team. The team that works on the CI/CD system is just a faceless blob in a different office and they must be completely stupid for never fixing the super annoying button in the interface. What are they doing all day! When you <em>know</em> a certain team/person, the percentage drops a bit: maybe stupidity is the first thought only 40% of the time. I'm obviously just making up these numbers, but you get the point.</p>
<p><img src="https://erikbern.com/assets/who_wants_to_be_a_millionaire.png" alt="who wants to be a millionaire"></p>
<p>I spent many years working in a satellite office. A lot of the time when I had some deep technical disagreement about something, I'd fly to the main HQ and go out for dinner with the other team. I wouldn't even talk about technology, just about random stuff. Once they knew me, and realized I was a human (and not a faceless blob), most technical disagreements tended to go away. People are more likely to assume positive intent and not malice/stupidity/laziness.</p>
<h2 id="ruthless-prioritization">Ruthless prioritization</h2>
<p>Erring on the side of assuming opportunity cost as the cause precludes any evil/stupid/lazy narrative: the team/person probably just had other, more important priorities. But how did they pick those?</p>
<p>I'm not going to offer any smart advice on how to estimate effort and impact in your JIRA board. There are a trillion methods and tools and processes for doing that. But what I have come to believe is this: prioritization is the most value-creating activity in any company. Generating ideas and executing things is of course also important! But what I've seen set apart great teams from good ones is a brutal focus on <em>prioritization</em>. This means generating an absurd amount of ideas and throwing 99% of them out of the window, to focus on the 1% that have the highest impact.</p>
<p><img src="https://erikbern.com/assets/indiana_jones_choose_wisely.jpeg" alt="indiana jones"></p>
<p>Ideas will be generated much faster than there's bandwidth to execute on them, so you're doing something <em>right</em> if your backlog is growing indefinitely. A negative person on a mediocre team will complain that there's never time to work on <em>their favorite pet project X</em>. I've often heard things like “our backlog of features keeps growing so fast, how are we ever going to have time to invest in paying down tech debt?”</p>
<p>To me this reflects a misunderstanding of how product development <em>should</em> work. Backlogs <em>should</em> be growing indefinitely. What a <em>good</em> team will do is accept that, establish a good relationship between product and tech, and constantly keep reprioritizing. Maybe today it's shipping a bunch of features the business needs. Maybe tomorrow it's paying down some tech debt. If you have a shared framework for how to think about value and prioritization, it usually works out.</p>
<h2 id="opportunity-cost-matters-a-lot-less-at-bigcos">Opportunity cost matters a lot less at bigcos</h2>
<p>I mentioned in passing that opportunity cost is a likely cause at startups, but maybe less often at big companies. Why? Because a startup is often playing catch-up, building things that are mostly “obvious”. When you get to a very late stage, and you have a lot of money and a lot of developers, things get a lot trickier. You can't do X because it will cannibalize metric Y and upset advertisers. You can't do Z because that would be inconsistent with how Q works. And so on.</p>
<p><img src="https://erikbern.com/assets/why_doesnt_screenshot.png" alt="why doesnt screenshot">
<em>Ultra-scientific study of company priorities.</em></p>
<h2 id="final-statement">Final statement</h2>
<p>We started with one seemingly innocuous statement, but it quickly led to many corollaries: how to trust people, how to manage the backlog, how prioritization can generate tremendous value. Fun!</p>
<p>I keep quoting this paraphrased razor to people, but they often have no clue what I'm talking about. Hopefully going forward, there's this blog post to refer them to!</p>
<p>Edit: this post was on the front page of Hacker News and generated <a href="https://news.ycombinator.com/item?id=23058280">some comments</a>.</p>
How to hire smarter than the market: a toy model2020-01-13T00:00:00Zhttps://erikbern.com/2020/01/13/how-to-hire-smarter-than-the-market-a-toy-model.html<p>Let's consider a toy model where you're hiring for two things, and those two things are equally valuable. It's not very important what they are, so let's just call them “thing A” and “thing B” for now. For one set of abilities, the scatter plot looks like this:</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/plot.png" alt="plot"></p>
<p>The assumption here is that A and B are drawn from a 2D-Gaussian with a mild positive correlation. I crammed a whole lot of stuff into this plot: the scatter plot shows the distribution of A vs B, and the two histograms (on top and on the right) show the distribution over A and over B.</p>
<p>We're going to hire some people, so we look at a bunch of resumes and try to decide who's going to make it to the next stage. The best candidates are the ones that are great at both A and B, and we'll obviously bring them in. But some candidates are going to be good at A but not B, or vice versa. So you might choose to evaluate candidates on some combination of the two. For instance, bring in people for which $$ A + B > k $$ where $$ k $$ is some constant:</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/plot2.png" alt="plot"></p>
<p>We can already see something interesting here, which is that the candidates we bring in exhibit a <em>negative correlation between thing A and B</em>, despite the mild positive correlation in the underlying population. This is something that's called <a href="https://en.wikipedia.org/wiki/Berkson%27s_paradox">Berkson's paradox</a>.</p>
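<p>This is easy to verify with a quick simulation. Below is a minimal sketch (my own, not the code behind the plots above); the correlation and the cutoff are arbitrary choices:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.2], [0.2, 1.0]]  # mild positive correlation between A and B
A, B = rng.multivariate_normal([0, 0], cov, size=100_000).T

print(np.corrcoef(A, B)[0, 1])  # ~0.2 in the full population
hired = A + B > 1.5             # screen on the sum, as above
print(np.corrcoef(A[hired], B[hired])[0, 1])  # negative: Berkson's paradox
</code></pre>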
<p>I've <a href="https://erikbern.com/2015/04/07/norvigs-claim-that-programming-competitions-correlate-negatively-with-being-good-on-the-job.html">written about a particular example of this previously</a>: Google found that experience with programming competitions was <em>negatively</em> correlated with actual work success. This would happen if we relabel the x-axis above to say “general interview feedback” and the y-axis as “programming competition success”. The problem isn't that programming competition success is somehow bad: it might have a strongly positive correlation with future work performance. The problem is that Google probably <em>overweighted</em> programming competition success in their hiring process versus other things that would more accurately predict future work performance. This caused an “artificial” negative correlation between those two qualities <em>among the group that were hired</em>.</p>
<p>An <a href="https://www.nber.org/papers/w24343.pdf">interesting paper</a> claims a <em>negative</em> correlation between sales performance and management performance for salespeople promoted into managers. The conclusion is that <em>“firms prioritize current job performance in promotion decisions at the expense of other observable characteristics that better predict managerial performance”</em>. While this paper isn't about hiring, it's the exact same theory here: the x-axis would be something like “expected future management ability” and the y-axis “sales performance”.</p>
<p>This problem of overweighting is a consistent theme throughout this post and we'll get back to it!</p>
<h2 id="it-gets-worse-the-market-forces">It gets worse: the market forces</h2>
<p>Because you're hiring in a market with many other players, the <em>really</em> good candidates may simply have so many options that they are going to go to whatever company they want to and make a zillion dollars. Let's say A and B are equally valued by you, as well as by the market. We end up with something like this:</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/plot3.png" alt="plot"></p>
<p>Look at the green segment here: now there's an <em>even stronger negative correlation</em> between A and B.</p>
<h2 id="recruiting-is-like-buying-a-home">Recruiting is like buying a home</h2>
<p>This isn't only for recruiting, and I think this negative correlation is intuitively more clear in the context of buying a home. You might value an extra bathroom as worth +$50,000, and an extra bedroom as +$100,000, but <em>so does the market</em>. As a result, given your budget, you'll see a negative correlation between having an extra bathroom and having an extra bedroom, because the market prices you out of having <em>both</em>.</p>
<p>But this also presents an opportunity. The tradeoffs force you to focus on <em>the things that you value more than the market does</em>. Maybe you don't think a 4th floor walk-up is more than a -$10,000 penalty for you, but the market values it at -$20,000: then in fact you should hone in and target exactly those apartments. As we will see with recruiting, the trick is to figure out <em>your own preference versus the market's.</em> Let's dig into a few cases.</p>
<h2 id="a-few-case-studies">A few case studies</h2>
<p>A common human bias is to interpret confidence as a sign of competence, and as a first order approximation, let's say the market prices both equally. But let's assume that <em>we</em> decide we're not going to succumb to the same bias as the market as a whole. We conclude that the market overvalues confidence, and that we're going to do everything we can to eliminate this bias from our interviewing process. While competence isn't easy to observe, let's say we can get very close by having a carefully thought out interview process. The resulting chart is something like this:</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/plot_confidence.png" alt="confidence"></p>
<p>Our cutoff is going to be a vertical line, since we only care about competence, not confidence. The “market cutoff” is going to be the diagonal line. The people we end up considering will be the green triangle.</p>
<p>Here's the weird thing though: our group of people that <em>we</em> consider will:</p>
<ul>
<li>On average have <em>low</em> confidence</li>
<li>Exhibit a <em>negative</em> correlation with confidence and competence</li>
</ul>
<p>The same phenomenon arises in more complex situations. Let's say that, everything else equal, it <em>is</em> better to hire someone from a fancy school, but that the <em>market overvalues it</em>. In contrast, let's say we value general competence slightly more than the market. The market cutoff will be a 45 degree line, but our cutoff will be a line with a different angle. We end up with something like this:</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/plot_fancy_school.png" alt="confidence"></p>
<p>There's a similar conclusion here: the candidates we're interested in actually went to less fancy schools than average.</p>
<p>More generally, the conclusion when you're hiring in a competitive market is that <em>even if you think some quality is desirable, if you think the market overvalues that quality, you should look at the other side of the spectrum</em>. This goes back to my example about buying a home.</p>
<h2 id="a-model-for-finding-the-best-candidates">A model for finding the best candidates</h2>
<p>The model so far is easy to understand and helps us explore a few tradeoffs when hiring, but I think it falls short in a few areas. Let's create a slightly more complex model that is a bit less intuitive but, I think, slightly more realistic. This section is a bit more math-heavy, so feel free to skip it if that's not your thing.</p>
<p>What we really want to optimize for is <em>our estimated value</em> divided by the <em>market's value</em>, since the market (roughly) determines the salary. Assumptions:</p>
<ul>
<li>The value to the company is $$ v_c = \exp(\alpha_c x + \beta_c y) $$ where $$ (\alpha_c, \beta_c) $$ is a vector with parameters we pick</li>
<li>The market value (i.e. the salary) is $$ v_m = \exp(\alpha_m x + \beta_m y) $$ where $$ (\alpha_m, \beta_m) $$ is a vector with parameters that the market values</li>
<li>The quantity we're trying to optimize is $$ z = v_c / (v_m + k) $$ where $$ k $$ is a constant.</li>
<li>The constant (which I set to $$ k = 1 $$) represents some combination of
<ol>
<li>We don't want to just take the market price for candidates: we pay them a fair base wage that ramps up with market demand.</li>
<li>The cost of hiring/onboarding/training.</li>
</ol>
</li>
</ul>
<p>Throughout the next few plots, let's set the “market vector” for all of them to $$ (\alpha_m, \beta_m) = (1, 1)$$ i.e. the market values the two things equally. We then vary how much <em>we</em> care about those things.</p>
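<p>As a minimal sketch of this model (my own condensed re-implementation, not necessarily matching the code linked in the appendix; the usage lines jump ahead to the competence-only case discussed below):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.2], [0.2, 1.0]]  # same mildly correlated population as before
x, y = rng.multivariate_normal([0, 0], cov, size=100_000).T

def z(alpha_c, beta_c, alpha_m=1.0, beta_m=1.0, k=1.0):
    """Our value per unit of market-driven cost: z = v_c / (v_m + k)."""
    v_c = np.exp(alpha_c * x + beta_c * y)  # value to us
    v_m = np.exp(alpha_m * x + beta_m * y)  # market value (salary)
    return v_c / (v_m + k)

score = z(alpha_c=1.0, beta_c=0.0)      # we only care about x (competence)
best = score > np.quantile(score, 0.9)  # the decile we go after
print(x[best].mean(), y[best].mean())   # above-average x, below-average y
</code></pre>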
<p>First of all, let's say we value A and B equally, but we care about them a bit less than the market. We bucket the candidates into three different buckets based on the value $$ z = v_c / (v_m + k) $$. If we plot this, we see the same negative correlation (the blue line):</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/exp_model_0.75_0.75.png" alt="plot"></p>
<p>On the other hand, if we decide we're going to <em>outbid the market</em>, we can set $$ (\alpha_c, \beta_c) = (2, 2) $$:</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/exp_model_2.00_2.00.png" alt="plot"></p>
<p>Things get more interesting if we decide we don't care about quantity B, but the market still does. This would correspond to the competence-confidence case where the x-axis is competence and the y-axis is confidence. The market values <em>both</em>, but we are smarter and we only value the former. We set $$ (\alpha_c, \beta_c) = (1, 0) $$:</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/exp_model_1.00_0.00.png" alt="plot"></p>
<p>Similarly to what we saw earlier, if we go after the “best” people (in terms of the quantity $$ z = v_c / (v_m + k) $$), then we actually end up hiring the people that have less-than-average confidence. This is because the market systematically underprices those.</p>
<p>In the other case, when we think things A and B are both good, but we think the market <em>overvalues</em> thing B, we can set $$ (\alpha_c, \beta_c) = (1, 0.5) $$. This would correspond to the example where B is the fanciness of the school they went to.</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/exp_model_1.00_0.50.png" alt="plot"></p>
<p>The outcome is somewhat similar. Even though we prefer candidates from fancy schools, we still end up being better off hiring people from “average” schools. Again, this is because the market systematically <em>undervalues</em> those. The fact that we consider B a “good” quantity is less relevant than the fact that we consider it <em>less valuable</em> than the market does.</p>
<p>A different case would be a company that <em>only</em> pays attention to some superficial measurement <em>at the cost of things that matter more</em>. For instance, let's say the x-axis is “task-relevant experience” and the y-axis is “fanciness of their degree”. This situation isn't so contrived: I've talked to recruiters at bigcorp enterprises where there's a strong mandate to only hire Ph.D.s but where (my guess) the interview process is pretty noisy. We set $$ (\alpha_c, \beta_c) = (0, 1) $$:</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/exp_model_0.00_1.00.png" alt="plot"></p>
<p>The interesting observation is the histogram at the top: the blue distribution of hired candidates ends up having <em>much lower task-relevant experience</em> than the ones that were rejected. Not a hiring strategy I would endorse. There is however one odd twist to this. Let's say we keep ignoring feature A (task-relevant experience), but we throw all our money into outbidding basically everyone else based on B (fanciness of their degree):</p>
<p><img src="https://erikbern.com/assets/hiring-tradeoffs/exp_model_0.00_3.00.png" alt="plot"></p>
<p>Because there's some small correlation between A and B, at this point we actually <em>do</em> end up getting higher-than-average skills in terms of A. The cost is that we're going to have to out-spend everyone else. A far more cost-conscious way would instead be to pay a medium premium for A and a small one for B, rather than a huge one for B but none for A.</p>
<h2 id="conclusions-and-opportunities">Conclusions and opportunities</h2>
<p>If you ask the average recruiter how they find people, it's usually some type of Boolean search on LinkedIn, and when you ask them how they grade resumes, it's typically some combination of having CS degrees from fancy schools, having the exact experience with the tech stack you have (down to frameworks), etc. God forbid if someone has a gap on their resume, or if they need visa sponsorship.</p>
<p>What my model implies is that there's an “arbitrage opportunity” here. In fact, it's a bit of a silver lining to the fact that the market is biased. Are companies systematically putting a premium on something? Then bet against them! Go after the underdogs. If every hiring manager acts in their own rational self-interest (which unfortunately, they don't) then over time these biases will vanish and the market will converge towards efficiency.</p>
<p>You might have your own personal preferences here, and I'm not going to judge you, but here are a few thoughts on things that may be undervalued by the market:</p>
<ul>
<li>Candidates from non-fancy schools</li>
<li>Candidates who didn't go to school at all and are self-taught, or have some non-traditional path into the field</li>
<li>Candidates who didn't get a CS degree</li>
<li>Candidates who never worked at any well-respected company</li>
<li>Candidates who are low confidence or “interview poorly”</li>
<li>Candidates that could experience discrimination for other reasons, like being from an underrepresented group, or not fitting some stereotype of what a software engineer should look like</li>
<li>Candidates who need visa sponsorships</li>
<li>Candidates who don't have experience with your exact tech stack, but a strong generalist foundation (this is especially prevalent in complex industries, which I <a href="https://twitter.com/bernhardsson/status/1197515498537267200">tweeted</a> about a few weeks ago)</li>
<li>Candidates who left the workforce for a while to take care of family</li>
<li>… many more things</li>
</ul>
<p>I want to be extra clear about what my conclusion here is. I'm not saying you should think of it as a bad thing that someone is coming from a fancy school. Everything else equal, it's typically a good thing! What I'm saying is that if you're hiring, then you will be <em>more successful going after candidates that the market undervalues.</em> And this doesn't just apply to measurable things (what school they went to), but also things that people subconsciously value (e.g. confidence of a candidate). On the other hand, <em>overvaluing</em> things (that are less predictive of future work performance) can lead you to hire worse candidates.</p>
<p>In all these cases, it turns out <em>your preference versus the market's preference</em> matters more than your preference in itself.</p>
<p>All of this stuff may or may not sound obvious to you!</p>
<h2 id="appendix">Appendix</h2>
<ul>
<li>I only talked about tradeoffs between two traits (A and B), but the model extends well to many more. Computer screens are two-dimensional, though, and it's hard to plot more than two dimensions!</li>
<li>These models may seem a bit arbitrary to you, and of course every model is a somewhat arbitrary simplification of reality! I'm not claiming that it's perfect.</li>
<li>That being said, both models gave the answer my intuition told me on basically the first attempt: the conclusions are very robust to the inputs, and I spent almost zero time trying to pick the right parameters.</li>
<li>Even without the screening step and without a competitive market, you might still end up with a negative correlation for the people you make an offer to. I've mentioned this in the past talking about <a href="/2016/10/25/pareto-efficiency.html">Pareto frontiers</a>. It's a slightly different phenomenon, though.</li>
<li>I don't mean to pick on recruiters, and there are some absolutely outstanding ones that I've worked with in my life!</li>
<li>I accidentally published a draft of this a few days ago, sorry about that!</li>
<li>The (very simple) code <a href="https://github.com/erikbern/hiring-model">is on Github</a>, as always.</li>
</ul>
What can startups learn from Koch Industries?2019-12-19T00:00:00Zhttps://erikbern.com/2019/12/19/what-can-startups-learn-from-koch-industries.html<p>I recently finished the excellent book <a href="https://www.amazon.com/Kochland-History-Industries-Corporate-America/dp/1476775389">Kochland</a>. This isn't my first brush with Koch—I read <a href="https://www.amazon.com/Science-Success-Market-Based-Management-Largest/dp/0470139889/ref=asc_df_0470139889/">The Science of Success</a> by Charles Koch himself a couple of years ago.</p>
<p>Charles Koch inherited a tiny company in 1967 and turned it into one of the world's largest ones. That's impressive! A quick disclaimer, just to get it out of the way: you may know the Koch brothers as the climate deniers who funded the Tea Party. I don't understand this disconnect between being so brilliant in one field and extremely ignorant in another. But my curiosity tells me there's something worth learning from most notable people, <em>despite what I may think of their opinions</em>, and Koch Industries turns out to be a particularly interesting case study.</p>
<p>Let's go through a few ways I think startups (and any companies) can take notes from what Charles Koch has accomplished.</p>
<p><img src="https://erikbern.com/assets/koch.png" alt="koch"></p>
<h2 id="measuring-results">Measuring results</h2>
<p>Aligning incentives with what's in the long-term interest of the business is an extremely hard thing to get perfect, especially at a large scale. A lot of it comes down to how you <em>measure results</em>. Ideally, you want to measure (and reward) the performance of individuals in a way that's 100% correlated with business value.</p>
<p>Koch's approach to a lot of this boils down to holding <em>business units</em> and even individual <em>plants</em> accountable for their profits and losses. This creates essentially 100% alignment between the management of those units/plants and what the company wants (maximize profits)! If you also give those managers full freedom to pursue their objective in any way they want, then you have <em>both</em> full accountability and agency.</p>
<p>There are many caveats to applying this to startups, since:</p>
<ul>
<li>The vast majority of all startups make no money</li>
<li>Startup teams are too entangled to have their own P&L</li>
<li>Startups need to invest in crazy ideas with low chance but high expected outcome</li>
</ul>
<p>… BUT, I've also seen several success stories where you put a smart person in charge of a key metric (say, acquisition costs) and tell them that from now on they own it, and they need to do whatever they can to get it down. Accountability+agency can work extremely well.</p>
<h2 id="creating-a-culture-of-entrepreneurship">Creating a culture of entrepreneurship</h2>
<p>One tenet of Charles Koch's school of thought, called <em>Market-Based Management</em> (MBM), is “principled entrepreneurship”: let everyone in the company think about their job as a business owner would. Every company should make it clear to everyone that it's their duty to:</p>
<ul>
<li>Understand how what they are doing adds business value</li>
<li>Think entrepreneurially and suggest improvements when they see it</li>
</ul>
<p>By decentralizing the ability to observe and suggest improvements, you get the people close to the markets to make decisions and react quickly. Every company should take notes.</p>
<p>Koch Industries made some mistakes along the way. There's a trap if you push too much for it. Kochland, p. 192:</p>
<blockquote>
<p>Every Koch business leader was expected to create their own Value Creation Strategy. They needed to look for new companies to buy, new plants to build, and expansion projects for existing plants. […] Business leaders knew that Charles Koch would cut or increase their bonus pay based on the Value Creation Strategies they delivered. This change rippled out through the ranks. Deals were proposed and sent to Wichita—everybody wanted a big acquisition under their belt.</p>
</blockquote>
<p>I have seen the unintended consequence of people being rewarded for <em>suggesting and pitching</em> a deal, but not held accountable for <em>making the deal work</em> in the long term. The results are not great. Koch seems to have learned this lesson and cracked down on the dealmaking bonanza, after a disastrous acquisition of Purina Mills.</p>
<h2 id="encouraging-long-term-thinking">Encouraging long-term thinking</h2>
<p>As I just mentioned, you have to be ruthless about holding people accountable for actually delivering, and not just delivering the slideware. Koch learned it the hard way. Kochland, p. 199:</p>
<blockquote>
<p>And these people were not the right kind of people. Koch had begun to stock its ranks with MBA students from the best business schools around the country. Brad Hall spent a lot of his time trying to unteach these kids what they learned at Northwestern University or Harvard. And there was a cultural element as well. Many executives inside Koch Industries saw a type of freelance culture was growing among the young guns. They were looking out for themselves, not the company.</p>
</blockquote>
<p>Creating long-term alignment between the employees and the company is crucial, or you will hire careerists who are self-optimizing. When you promote based on who pitches the best rather than who delivers the best results, you will reward short-term behavior. This is a classic <a href="https://en.wikipedia.org/wiki/Principal%E2%80%93agent_problem">agency problem</a>. I suspect <em>moving to Wichita</em> was in an odd way one of the ways Koch might have filtered out short-termist thinkers.</p>
<p>Throughout a set of sweeping changes, Charles Koch reformed the company. Kochland, p. 221:</p>
<blockquote>
<p>After the purge was complete, Charles Koch didn't replace his leaders with fresh employees who were hired from the best business schools or other companies. Instead, he promoted loyalists who knew the Koch way. […] The new team was composed entirely of men who were steeped in Charles Koch's values and who were imbued with the lessons of Koch University. These were the people who spoke the language of Market-Based Management. Charles Koch promoted players from his own farm team into the big leagues.</p>
</blockquote>
<p>Hiring “careerists” can be dangerous at any company, but there are also ways to make sure the system is set up to promote people with a longer-term view. Startups typically award stock options vesting over four years as a way to align the interests of early employees with the company. One of the reasons I've always been skeptical of bonuses is that they favor short-term thinking, as opposed to equity grants or increases in total compensation, which are awarded based on sustained high performance.</p>
<h2 id="fixing-the-unintended">Fixing the unintended</h2>
<p>Consider a hedge fund paying its traders a fraction of the profit: that will incentivize them to make dangerous bets of the type:</p>
<ul>
<li>A 90% chance the fund makes $100M.</li>
<li>A 10% chance the fund loses $10B.</li>
</ul>
<p>The expected value is very negative (0.9 · $100M − 0.1 · $10B = −$910M), but a smart trader should take this bet every year: they will walk away with a huge bonus year after year until bad luck strikes and they are fired (but get to keep their bonuses).</p>
<p>Most incentive systems end up having similar flaws, and this is another thing Koch Industries learned the hard way. Kochland, p. 215:</p>
<blockquote>
<p>Oil gaugers interpreted Koch's push for “continuous improvement” as a reason to give Koch's customers a bad deal. The refinery managers had interpreted Koch's push for “profit centers” as a reason to dump pollution into wetlands and delay investments that would have reduced pollution. The common teachings of MBM had too often turned into a language of groupthink, prompting managers to persecute whistle-blowers rather than heed their important warnings. MBM's focus on growth had encouraged irresponsible acquisitions that piled up losses, and public failures like the collapse of Purina Mills.</p>
</blockquote>
<p>How do you fix something like this? In the case of hedge fund traders, you can introduce certain risk metrics, and require that those can't exceed certain predefined values. Trade what you want, and you get a share of the profit, but you <em>cannot exceed certain risk limits</em>.</p>
<p>Charles Koch's invention is something he calls <em>10,000% compliance</em> (100% compliance, 100% of the time). This wasn't just a directive, but also a part of the <em>culture and belief system</em>. When you know that the company cares and you know that your coworkers know and they know you know, then these rules are far more effective.</p>
<h2 id="creating-a-corporate-culture">Creating a corporate culture</h2>
<p>I live in New York City, which is 10+ million people living on top of each other. If you suddenly stop on the sidewalk and someone is walking behind you, they will very likely yell at you. This might seem harsh, but I'd argue it's actually the very basis on which so many people can coexist so close to one another. Having strong cultural tenets (for instance: “you need to be careful about other people's space”) comes with a flip side: if those cultural norms are violated, there will be swift punishment. So in a weird way, it comes from the fact that people care.</p>
<p>Culture works in other ways, too. It can be a shared belief system that people can fall back on. If I work at a startup, and one department is obsessed with lowering costs while the other one is obsessed with growing volume, then they are making decisions based on different assumptions. A shared belief system helps people coordinate without having to coordinate explicitly. Does this company believe in quality at any cost? Are we frugal? Do we hire “the best”? Kochland, p. 568:</p>
<blockquote>
<p>Senior leaders at Koch Industries phrased everything they said in the vocabulary of Market-Based Management. One of Charles Koch's indisputable accomplishments over the preceding thirty years was creating an organization where every employee—to a person—publicly subscribed to the same intricately encoded philosophy. Division heads who came to Wichita spoke in terms of mental models and discovery processes and the five dimensions. They talked about integrity. Decision rights. Challenge processes. Experimental discovery. Virtues and talents. They weren't the dog whistles or catchphrases. They were the internal vocabulary of Kochland. Learning them was the first condition to winning a seat at the table.</p>
</blockquote>
<p>Regardless of what you think of Market-Based Management, the fact that a <em>whole 100,000+ person organization agreed on a shared set of principles</em> implies that a lot of decision making gets vastly simpler. There is something impressive about this.</p>
<h2 id="final-words">Final words</h2>
<p>Has Charles Koch figured it out and cracked the management code? Definitely not. Has he thought a lot about incentives and culture? Absolutely. Does saying that mean I subscribe to his worldview? No way.</p>
<p>I always dig a good book about corporate dynasties and Kochland was a great one. My understanding of Koch went from nonexistent to superficial, and some of this post may be of questionable accuracy. Please excuse any liberties I took trying to apply it to startups!</p>
We're hiring at Better2019-12-09T00:00:00Zhttps://erikbern.com/2019/12/09/hiring-at-better.html<p>Just a quick note that my team is always hiring at <a href="http://better.com/">Better</a>. A lot of new people have been joining the team here in NYC lately—the tech team has actually grown from 35 to 60 in just ~3 months. We are primarily looking for senior software engineers and/or engineering managers. But we would love to talk if you have less experience too! Our main tech stack is mostly TypeScript and Python.</p>
<p>We are also growing our data team and if you're interested in different data problems, I would love to chat. The team is only six people but will be growing very quickly. The most important thing I look for is curiosity about business and product—an entrepreneurial drive that compels you to dig into numbers and understand how everything fits together. The team works on the core data infrastructure as well as exploring hypotheses with product managers and other stakeholders. Most of the work is in Python, with a fair amount of SQL as well. If you know a bit of machine learning or statistics, that's helpful, but not required. We do a fair amount of ML/stats at Better, but I consider those to be tools among many others, and we don't have any roles for doing ML/stats full-time.</p>
<p>We are also looking for managers to come in and manage parts or all of the data team. If you have experience with both data and management, definitely reach out!</p>
<p>Finally, we're also looking for product managers. The product team (under new leadership by <a href="https://www.linkedin.com/in/devangt/">Devang Thakkar</a>) is growing quickly (from 5 to 15 this year). We'd love to talk to you, especially if you have experience with deep automation problems (e.g. a background in logistics, supply chain, operations research, or similar), OR if you have experience working with customer-facing applications and you love figuring out how to make customers love your products more, increase conversion rates, etc.</p>
<p>Just as a quick recap of what we're doing at Better: the mortgage industry is a huge, very broken industry. 7 million Americans get a mortgage every year, and the average experience is like getting a root canal done on your finances. We are rethinking the industry from the ground up, by going after it as a tech startup—automating the whole process, putting it online, getting rid of the traditional commission-driven sales-centered model, and fundamentally offering a much better user experience as well. As a result of everything we're doing, we are also able to offer the lowest rate for a large segment of the US population.</p>
<p><img src="https://erikbern.com/assets/better-office-view.png" alt="better office view">
<em>It could be you on this photo!</em></p>
<p>Feel free to email me at erik at better.com, or reach out to me <a href="https://twitter.com/bernhardsson">on Twitter</a> or any other platform of your choice. You can also <a href="https://grnh.se/c55e10c71">apply directly</a>!</p>
<h2 id="by-the-way">By the way</h2>
<ul>
<li>I just moved my blog from Jekyll+S3+Cloudfront to Hugo+Netlify. A lot of minor things broke in the process, that I've been fixing as I discover them. Please let me know if you see anything funky!</li>
<li>I know I haven't posted much lately. Being a startup CTO with two toddlers has gotten the best of me. But I have a few things coming down the pipe soon!</li>
</ul>
Buffet lines are terrible, but let's try to improve them using computer simulations2019-10-16T00:00:00Zhttps://erikbern.com/2019/10/16/buffet-lines-are-terrible.html<p><a href="https://better.com">My company</a> has a buffet every Friday, and the lines grow to epic proportions when the food arrives. I've suspected for <em>years</em> that the “classic” buffet line system is a deeply flawed and inefficient method, and every time I'm stuck in the line has made me more convinced.</p>
<p>This <em>is</em> a hill I'm willing to die on, and it's also a place where I have a secret weapon: a well-executed computer simulation that solves this problem once and for all has an opportunity to unleash a 0.1% improvement in GDP. Consider all that time lost in lines! So I got to work. A few hours later (a bit too many hours!) I had built a simulation framework for buffet lines.</p>
<p>Note: throughout the rest of the post, there will be some huge animated gifs. Out of respect for your bandwidth, I have cut the gifs to be only 10s each, but there are also links to full 150s videos (mp4).</p>
<h2 id="how-do-you-simulate-a-buffet-line">How do you simulate a buffet line?</h2>
<p>I wrote a <a href="https://github.com/erikbern/buffet">Python script</a> to simulate buffet lines. The setup is pretty simple, although there are a lot of assumptions being made:</p>
<ul>
<li>There are 10 items on the buffet</li>
<li>Any person wants a random 40% of those items</li>
<li>Each person comes in from the left and exits to the right</li>
<li>They take the food in order from left to right</li>
<li>Each person walks according to a <a href="https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm">shortest path</a> towards the next goal</li>
<li>There is a rate (the “arrival rate”) at which new people enter on the left</li>
</ul>
<h2 id="classic-method">“Classic” method</h2>
<p>With the “classic” method, I'm referring to the single-line method where everyone is waiting in one big line that passes by all the items. This method works well at low arrival rates. For instance at 0.3 people per second, here's the state after 100 seconds (see <a href="/assets/buffet/classic-0.3.mp4">full video here</a>):</p>
<p><img src="https://erikbern.com/assets/buffet/classic-0.3.gif" alt="classic 0.3"></p>
<p>My simulation script produces these videos as a side effect, with random emojis for the food and people. I hope you enjoy it, because there will be a lot more! Focusing on the classic method for now: it turns out to be “unstable” as soon as the arrival rate goes up by just a little bit, in this case to 0.7 people per second (see <a href="/assets/buffet/classic-0.7.mp4">full video here</a>):</p>
<p><img src="https://erikbern.com/assets/buffet/classic-0.7.gif" alt="classic 0.7"></p>
<p>You can see the people accumulating on the far left side waiting for their turn. The line of people in fact just keeps growing and growing. It turns out that this method has a critical maximum capacity of somewhere between 0.3 and 0.7 people per second. What do I mean by that? Every queue system has some <em>upper capacity</em> that determines the highest sustainable arrival rate. Above that, the queue keeps growing longer and longer, because fundamentally the arrival rate is higher than the rate at which people are exiting the system.</p>
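<p>To illustrate just the capacity concept (not the buffet geometry), here's a crude single-server queue sketch with a made-up service rate. It's not the grid simulation above, but it shows the same instability: the line stays short below capacity and grows without bound above it.</p>
<pre><code class="language-python">import random

def line_length(arrival_rate, service_rate=0.5, seconds=2_000, seed=0):
    """Toy single-server queue: how long is the line after `seconds`?"""
    random.seed(seed)
    waiting = 0
    for _ in range(seconds):
        if random.random() < arrival_rate:  # someone joins the line
            waiting += 1
        if waiting and random.random() < service_rate:
            waiting -= 1                    # someone finishes getting food
    return waiting

for rate in [0.3, 0.5, 0.7]:
    print(rate, line_length(rate))
# 0.3 stays short, 0.5 hovers near the critical point, 0.7 keeps growing
</code></pre>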
<p>There's no easy way to estimate the capacity of the system other than varying the arrival rate and seeing at what point the system breaks down. We do that by simulating the system at different rates. Let's consider some more methods next!</p>
<h2 id="rogue-method">“Rogue” method</h2>
<p>A marginally better model is to let everyone go in any direction they want, including skipping, but potentially also going back to the left. This method has a slightly higher capacity, but breaks down around an arrival rate of 0.7 people per second (see <a href="/assets/buffet/rogue-0.7.mp4">full video here</a>):</p>
<p><img src="https://erikbern.com/assets/buffet/rogue-0.7.gif" alt="rogue 0.7"></p>
<p>The lines in the video above are the shortest paths towards the next food item each person wants, or towards the exit on the far right.</p>
<p>The problem is that people get “stuck”. Maybe this is a problem with the simulation, since my simulated puppets are dumb and won't consider the fact that they are blocking each other. The rogue method <em>does</em> in fact have a slightly higher capacity than the classic method, but maybe with other drawbacks (like getting elbowed by some panicked, starved soul).</p>
<p>Let's move on to some slightly smarter methods.</p>
<h2 id="dont-go-backwards-method">“Don't go backwards” method</h2>
<p>One maybe not obvious observation from the previous system is that the possibility of <em>going backwards</em> (to the left) causes these blockages. If we made it <em>impossible</em> to go backwards, then people would never go further right than their next food item. This would mean people getting the food are never blocked from going to their next item. This simple rule change turns out to increase the capacity of the buffet substantially, to somewhere around 1.0 arrivals per second (see <a href="/assets/buffet/skippable-0.8.mp4">full video here</a>):</p>
<p><img src="https://erikbern.com/assets/buffet/skippable-0.8.gif" alt="skippable 0.8"></p>
<p>This method tends to cause a bit of an accumulation of people at the first food item, especially at a higher arrival rate: <a href="/assets/buffet/skippable-1.0.mp4">see video here</a>. As the rate goes up even more, this jam becomes a complete gridlock: see a <a href="/assets/buffet/skippable-2.0.mp4">full video here</a>.</p>
<p>In practice, this system is pretty intuitive. Just keep walking towards your next food item, and if you don't want a particular one, just skip it and walk around anyone in between. This basically avoids the “frustrating” points of the classic method, where you're blocked behind someone loading up on food that you don't even want, just because you want something <em>later</em>.</p>
<h2 id="perpendicular-lines-method">“Perpendicular lines” method</h2>
<p>One method is to make the people line up in lines perpendicular to the food. This seems to handle up to roughly 1.0 arrivals per second (see <a href="/assets/buffet/vline-0.8.mp4">full video here</a>):</p>
<p><img src="https://erikbern.com/assets/buffet/vline-0.8.gif" alt="vline 0.8"></p>
<p>This system might seem very orderly, although a drawback is that the long perpendicular lines cause a lot of extra walking: this drives up the total time it takes for people to get food, which means more people are around at any point in time, causing delays for everyone else. For a video of what happens when the arrival rate goes up to 2.0, <a href="/assets/buffet/vline-2.0.mp4">see here</a>.</p>
<h2 id="which-system-is-the-most-optimal">Which system is the most optimal?</h2>
<p>Given the four methods I've just outlined, the question is: which one is the most efficient? For reasons I've alluded to above, it's not trivial to measure capacity just by simulating. But to spare you the ruminations, I'm jumping straight to the conclusion. If my simulations are correct (a big <em>if</em>), the list looks as follows:</p>
<ol>
<li>“Don't go backwards” method (best)</li>
<li>“Perpendicular lines” method</li>
<li>“Rogue” method</li>
<li>“Classic” method (worst)</li>
</ol>
<h2 id="lets-dig-a-bit-deeper">Let's dig a bit deeper</h2>
<p>Analyzing these methods is somewhat <del>annoying</del> tricky. We have a system with some arrival rate, each person spends some time in the system, and then exits. The (average) time it takes for a person to enter, get all their food, and exit, is called the “cycle time”. There is also an upper limit to the capacity of the method. As the arrival rate gets closer and closer to the capacity, the <em>cycle time goes to infinity</em>. When the arrival rate <em>exceeds</em> the capacity, more and more people accumulate indefinitely, with no end in sight.</p>
<p>I varied the arrival rates by increments of 0.05, from 0.05 to 2.0. If we vary the arrival rate and look at the <em>number of people in the system</em> over time for a particular method, it looks something like this:</p>
<p><img src="https://erikbern.com/assets/buffet/queue_skippable.png" alt="skippable queue size"></p>
<p>What are the colors here? I devised a hypothesis test to see if the queue is “stable” or growing over time. The red lines are the simulations where the arrival rate seems to exceed the capacity: the queue size grows indefinitely in those cases.</p>
<p>Another way to slice the data is to look at the distribution for the time it takes to “finish” for each person. I ended up using my own library <a href="https://github.com/better/convoys">convoys</a> for this.</p>
<p><img src="https://erikbern.com/assets/buffet/cohorts_skippable.png" alt="skippable cohort"></p>
<p>In the chart above, I grouped the arrival rates into slightly larger buckets so it's easier to follow. From both the charts above, we can see that the critical rate for this method seems to be around 0.8-1.0 arrivals per second. Below that, everyone gets their food eventually. Above it, the system turns into gridlock. From the cohort chart (the second one), you can see that less than 100% of the people that enter the system ever exit. Poor them!</p>
<p>Comparing all methods for a single rate shows that the “Rogue” method and the “Don't go backwards” method are roughly similar up until ~0.6 arrivals per second:</p>
<p><img src="https://erikbern.com/assets/buffet/cohorts_0.45-0.60.png" alt="0.45-0.60 cohort"></p>
<p>However, at 0.8 arrivals per second, the “Rogue” method breaks down into rubbish. The “Don't go backwards” method is still holding strong!</p>
<p><img src="https://erikbern.com/assets/buffet/cohorts_0.65-0.80.png" alt="0.65-0.80 cohort"></p>
<p>Comparing all simulations shows that the “Don't go backwards” method seems to work well for most arrival rates. In the chart below, that method is the <em>most</em> towards the lower right side, which is the best place to be (low cycle time despite a high arrival rate):</p>
<p><img src="https://erikbern.com/assets/buffet/stats.png" alt="stats"></p>
<h2 id="what-else-can-we-do">What else can we do?</h2>
<p>Unrelated to the method itself, there are other things we can change to increase the throughput. Moving the food so that it's accessible from <em>both sides</em> increases throughput a ton. With the “Don't go backwards” method, this lets us handle an arrival rate of 1.5 people/s easily (see <a href="/assets/buffet/skippable-layout2-1.5.mp4">full video here</a>):</p>
<p><img src="https://erikbern.com/assets/buffet/skippable-layout2-1.5.gif" alt="skips layout2 1.5"></p>
<p>There are more experiments to be run here: consider it the next frontier of buffet line research.</p>
<h2 id="notes">Notes</h2>
<ul>
<li>If I had a million dollar grant, the next step would be to actually implement this with real humans and study their behavior. After all, this is more empirical research than it is theoretical.</li>
<li>Writing the simulation in Python was probably a bad idea in retrospect: it turns out to be incredibly slow to run Dijkstra's shortest path algorithm on a large grid. Some simulations took more than a day to run (on a c2.8xlarge in AWS).</li>
<li>A ton of the time it took to implement the simulation went into chasing weird behaviors that emerged from the modeling assumptions. For instance, I initially modeled people as circular, but they would get stuck in a “hexagonal” pattern. Changing them to be square resolved the issue.</li>
<li>The hypothesis test I mentioned to see if the queue is “stable” is something I cooked up myself: draw samples from a Gamma-Poisson mixture and see if the imbalance between arrivals and exits is at least as extreme as what we observed. No idea if it's the best way but I really only use it to color lines in a graph so who cares.</li>
<li>In order to determine the average latency in the scatter plot, I use <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little's law</a> and compute it from the average queue size and the arrival rate (see the sketch after this list).</li>
<li>The <a href="https://github.com/erikbern/buffet">code is on Github</a>, as always.</li>
</ul>
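<p>For completeness, here's roughly what that Little's law step looks like, with made-up numbers standing in for the actual simulation logs:</p>
<pre><code class="language-python">import numpy as np

# Hypothetical per-second counts of people in the system, standing in
# for the real simulation output.
rng = np.random.default_rng(0)
queue_sizes = rng.poisson(12, size=1500)
arrival_rate = 0.8   # people per second

# Little's law: avg time in system = avg number in system / arrival rate
avg_cycle_time = queue_sizes.mean() / arrival_rate
print(f"average cycle time: {avg_cycle_time:.1f} s")
</code></pre>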
Miscellaneous unsolicited (and possibly biased) career advice2019-09-26T00:00:00Zhttps://erikbern.com/2019/09/26/misc-unsolicited-career-advice.html<p>No one asked for this, but I'm something like ~12 years into my career and have had my fair share of mistakes and luck, so I thought I'd share some of what I've learned.</p>
<p>Honestly, I feel like I've mostly benefitted from luck. Some of the things I did on a whim turned out to be excellent choices many years later. Others were clear blind spots in hindsight. If I could give my 12-years-younger self a bunch of career advice, here is some of it.</p>
<p><img src="https://erikbern.com/assets/double_rainbow.jpeg" alt="double rainbow"></p>
<h2 id="choosing-a-company">Choosing a company</h2>
<p>This roughly boils down into:</p>
<ul>
<li>Pick the fastest growing company you can find. My own personal development was always highly correlated with the company's growth. That's really just an empirical observation, but here's an attempt at explaining this relationship: Stagnant companies are zero sum. If your peer gets a promotion, that means the slot is taken and you don't get the promotion. If you get to work on a cool project, someone else can't. In contrast, fast-growing companies have this Ponzi-scheme quality where <em>everyone gets promoted! everyone can work on interesting projects!</em> On top of that, there's this fundamental mismatch between labor supply and demand internally, where there's a “pull” in every direction. I started managing at 25 and was running a machine learning team soon thereafter, despite having no formal background in machine learning. Why? Not because in the universe of people, I was the best one (far from it). But because there was <em>no one better around and we didn't have time to find one.</em> Someone just has to step up and do it.</li>
<li>When you're young, care more about building human capital quickly and not so much about financial capital. The human capital will pay much larger dividends over your lifetime.</li>
<li>Go where other smart people are. That's where you're going to build your human capital the fastest.</li>
<li>When I say smart, I really mean <em>people you can learn from.</em> I've worked with some very smart people who I didn't learn from and it was a waste of time.</li>
<li>Don't pick a company because your friends and family think it's cool.</li>
<li>Consider going into an industry where there aren't many smart people. A team of smart people in an industry without a lot of smart people can move mountains. A team of smart people where everyone else is smart will have a harder time.</li>
<li>Don't be afraid of going into a weird industry. No one thought the music industry seemed like a great place when I joined Spotify in 2008. I bet people said the same thing about the taxi industry in 2011. Etc. Things always look cool in hindsight.</li>
<li>In general, do a lot of internships. It's not just a great experience to have, it's also a great way to learn what <em>you</em> like to do. School is partly just a way to postpone all your major life decisions and learn a bit more about what you want to do.</li>
<li>When you're no longer learning, then it's time to do something else.</li>
</ul>
<h2 id="building-human-capital-vs-building-superficial-markers">Building human capital vs building superficial markers</h2>
<p>I largely think your internal human capital is the only thing that matters in the long run. If for whatever reason someone about to graduate from <em><insert elite university here></em> breaks their foot on their graduation day, and then misses it, and then for whatever (dumb hypothetical) reason never gets their diploma, does that make them less productive? Not at all. Another example: there's some evidence that <a href="https://www.sciencedirect.com/science/article/pii/S0167268119302495">people with tattoos</a> are short-sighted. So you can argue that tattoos are a negative external marker. But if <em>you</em> go get a tattoo, it doesn't change <em>anything</em> about your human capital.</p>
<p>Your title is one thing people pay too much attention to. People have a misconception that your title will open more doors. It won't. In fact, it might even hurt. For instance, as soon as you start managing people, your options actually <em>go down,</em> since people assume you want to stick to a management role, but there's less demand in the market for management roles (most companies I know prefer to hire ICs and promote them from within).</p>
<p>It's also really dangerous to get carried away by something that's celebrated in whatever world you're in. For instance, in academia, tenure is the ultimate goal. At a big company, your worth is roughly measured in how many people you manage. It's easy to think that those things are universally valuable, but they often don't mean a ton outside. If you're applying to an early stage startup, they probably don't care (or it might even close doors) that you are managing 300 people at a large industrial firm with a huge HQ in a suburb.</p>
<p>That being said, sometimes these markers are useful to get somewhere. The key trick is to focus on the <em>lowest cost highest impact</em> ones. For instance, getting a PhD is an insanely high cost marker, as is going all-in and getting a perfect GPA. Those aren't bad things in themselves, but the <em>signal value compared to the investment</em> is much lower than for other things. On the other end of the spectrum, things with high signal value compared to the investment include having built an open source project (that people use), having won some award, having written things that got published, or having started your own company (even though it failed), among many other things.</p>
<h2 id="acquiring-new-skills">Acquiring new skills</h2>
<p>Some thoughts in no particular order:</p>
<ul>
<li>There are things that you can learn yourself, and things you need to do in order to learn. Those are very different things! Skills in the latter category will often be more valuable in the long term because they will be rarer. These are things like building a startup, or managing people, or building some super complicated distributed system that handles 1M messages per second. Things in the former category are things like learning how to build deep neural networks, or coding in Rust, or iOS development. If you truly want to pick up one of those things, don't expect an opportunity to get paid while learning it. Just learn it yourself first.</li>
<li>Read all the time. Just whatever books about technology or business or management. History is great too. Even fiction is great. Also read a million blogs and follow people on Twitter. There's so many smart people out there saying a lot of smart things so that you don't have to figure everything out yourself. Reading a lot helps you build a <em>mental model of the world</em> which will help you later.</li>
<li>It's easy to confuse wanting to learn something with wanting to <em>have learned</em> something. Honestly if you don't enjoy the process of learning a particular thing, you're probably never going to be very good at it. Sorry.</li>
<li>… that being said, just to counter the somewhat cynical tone, I believe most things <em>can</em> be turned into a fun learning experience, if you just gamify it for yourself. For instance, work on some projects on the side and mess around with things. If you want to learn Clojure, build a dumb webapp and deploy it to an EC2 instance. Or whatever it is.</li>
<li>Figure out where you want to be on the spectrum that I call <em>tools-oriented vs goal-oriented</em>. The former category is if you are super into deep learning or functional programming or distributed systems or something else and you want to get really deep in the rabbit hole and become an expert. In that case, you're usually better off at a large company where you can truly go deep. On the other side of the spectrum, if the thing that really excites you is to build business value, then go work at a startup.</li>
<li>It's <em>totally fine</em> to spend all your time on something if you want to. Of course, it's great to have a social life and sometimes relax, but it's also OK to go home after work and stay up until 1am hacking on a side project. This could be a wonderful thing if you enjoy it and are learning things.</li>
</ul>
<h2 id="underrated-skills">Underrated skills</h2>
<p>Some things going back I really wish I would have spent more time learning when I was younger:</p>
<ul>
<li>Communication skills. When I moved to the US, I realized I had completely underinvested in my language skills. I spent the first year reading a ton of novels, underlining words I didn't know and writing their meaning in the margin. I even recorded myself pronouncing things and listened to it. It's not like my English was terrible when I came, but I wish I'd taken it more seriously earlier.</li>
<li>Presentation/sales skills. I massively failed at this for so many years, thinking that all that matters is to solve hard technical problems. I should have spent at least 10% of that energy and time just trying to get senior people across the org excited about the things I was building. Good ideas don't sell themselves.</li>
<li>Self-sufficiency. By that I mean: are you able to deliver business value by building something across the whole stack, without having to rely on other people/teams to help you? If you can do this, you can iterate much quicker. You can also build a prototype of something and demo it. For instance, I recommend that any aspiring data scientist set up an AWS account, deploy a web service, and learn the whole thing.</li>
<li>Statistics. Seriously, I really wish I had studied more of it in school. Basically goes for <em>anyone</em> in the STEM field, IMO.</li>
</ul>
<h2 id="what-else">What else?</h2>
<ul>
<li>There's a million other things and these are things biased by my own experiences. Don't take my advice too seriously!</li>
<li>The picture at the top was the view from my office a few days ago. No idea what it has to do with the content.</li>
<li>This is getting old, but I really feel bad about the super low frequency of blog posts these last six months. I went through a tricky phase with a newborn kid, running both product and tech at a startup, buying and renovating an apartment, and other random things. But most of those are now behind me!</li>
</ul>
Modeling conversion rates using Weibull and gamma distributions2019-08-05T00:00:00Zhttps://erikbern.com/2019/08/05/modeling-conversion-rates-using-weibull-and-gamma-distributions.html<p><em>This is a blog post originally featured on the <a href="https://better.engineering">Better engineering blog</a>. If you want to link to this article or share it, please go to the <a href="https://better.engineering/2019/07/29/modeling-conversion-rates-and-saving-millions-of-dollars-using-kaplan-meier-and-gamma-distributions/">original post URL</a>! Separately, I'm sorry it's been so long with no posts on this blog. Between kids, moving, and being a startup CTO, I've been busy. I have a few posts coming down the pipe though, so stay tuned…</em></p>
<p>Lots of companies need to analyze conversion rates. Maybe you want to understand how many people purchased a widget out of the people that landed on your website. Or how many people upgraded to a subscription out of the people that created an account. Computing a conversion rate is often fairly straightforward and involves nothing more than dividing two numbers.</p>
<p>So what else is there to say about it? There is one major catch we had to deal with at <a href="https://better.com">Better</a>. When there is a <em>substantial delay</em> until the conversion event, this analysis suddenly gets vastly more complex.</p>
<p>To illustrate what I am talking about, we can look at the conversion rate for borrowers coming to Better.com to get a mortgage, defined in the most simplistic way, dividing the number of converted users by the total cohort size:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/conversion_rate.png" alt="conversion rate"></p>
<p>This looks really bad: is the conversion rate truly going down over time? But that's not right: it only looks like it is going down because we have given the later users less time to “bake”.</p>
<p>Adding another line shows how confusing the definition of conversion rate is. Let's look at <em>time until conversion</em> (right y-axis) as a function of the user cohort:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/conversion_rate_and_time_to_conversion.png" alt="conversion rate and time to conversion"></p>
<p>Ok, so the conversion rate is going down over time, but users are converting much faster? Clearly, this is a bogus conclusion, and yet again we are looking at it the wrong way.</p>
<p>(Side note, but throughout this blog post, the y scale is intentionally removed in order for us not to share important business metrics.)</p>
<h2 id="the-basic-way-conversion-at-time-t">The basic way: conversion at time T</h2>
<p>There are a few ways we can resolve this. One way is to look at <em>conversion rate at T = 35 days</em>, or some similar cutoff. That way we can compare and see if conversion rates are going up or down:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/conversion_rate_at_time_t.png" alt="conversion rate at time t"></p>
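<p>In code, this cutoff method is just a filter and a division. Here's a minimal sketch, assuming a hypothetical DataFrame with a <code>signup_date</code> column and a (possibly missing) <code>conversion_date</code> column — the names and dates are placeholders:</p>
<pre><code class="language-python">import pandas as pd

# Hypothetical data; NaT means the user hasn't converted (yet).
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2019-01-02", "2019-01-05", "2019-02-10"]),
    "conversion_date": pd.to_datetime(["2019-01-20", None, "2019-04-01"]),
})
T = pd.Timedelta(days=35)
now = pd.Timestamp("2019-06-01")            # assumed "current" date

eligible = df[df.signup_date <= now - T]    # only cohorts at least T old
converted = (eligible.conversion_date - eligible.signup_date) <= T
print(converted.mean())                     # NaT comparisons count as False
</code></pre>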
<p>Sadly, this also has a pretty severe issue: we can't compute conversion rates for anything more recent than 35 days ago. Back to the drawing board!</p>
<h2 id="why-does-any-of-this-matter">Why does any of this matter?</h2>
<p>It might be worth taking a step back and considering what types of issues this is causing. At Better, we spend a significant amount of money (millions of dollars) on various types of paid user acquisition. This means that we buy leads/clicks from some source, and drive traffic to our website. Some of those are high intent, some of them are low intent. Some of them can take <em>many months to convert.</em> This makes it challenging to answer a seemingly simple question: <em>what's the cost of user acquisition per channel?</em></p>
<p>If we put ourselves in a position where we have to wait many months to measure the efficacy of an acquisition channel, it takes forever to iterate and improve our acquisition, and a lot of money gets thrown out the window on bad channels. So, let's consider a few better options, culminating in a somewhat complex statistical model we built.</p>
<h2 id="introducing-cohort-models">Introducing cohort models</h2>
<p>A much better way is to look at the conversion on a <em>cohorted basis</em>. There are a number of different ways to do this, and I've written a whole <a href="https://erikbern.com/2017/05/23/conversion-rates-you-are-most-likely-computing-them-wrong.html">blog post</a> about this. I'm going to skip a lot of the intermediate steps, and jump straight to what I consider the best next step: using a <a href="https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator">Kaplan-Meier estimator</a>. This is a technique developed over 60 years ago in the field of <a href="https://en.wikipedia.org/wiki/Survival_analysis">survival analysis</a>.</p>
<p>Computing a Kaplan-Meier estimator for each weekly cohort generates curves like this:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/kaplan_meier.png" alt="kaplan meier"></p>
<p>The insight here is to switch from using the x-axis for the time, and instead let <em>each cohort be its own line</em>. These curves help us with a few things:</p>
<ul>
<li>✅ We can compare curves for cohorts that have been “baking” for a long time and curves that just started.</li>
<li>✅ We don't have to throw away information by picking an arbitrary cutoff (such as “conversion at 30 days”).</li>
<li>✅ We can see some early behavior much quicker, by looking at the trajectory of a recent cohort.</li>
</ul>
<p>For a wide variety of survival analysis methods in Python, I recommend the excellent <a href="https://lifelines.readthedocs.io/en/latest/">lifelines</a> package. As a side note, survival analysis is typically concerned with mortality/failure rates, so if you use any off-the-shelf survival analysis tools, your plots are going to be “upside down” from the plots in this post.</p>
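<p>With lifelines, the estimator itself is only a few lines. A minimal sketch on made-up data (the numbers are placeholders; in real use, the durations of non-converted users are their observation times so far):</p>
<pre><code class="language-python">import numpy as np
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
durations = rng.gamma(2.0, 10.0, size=1000)   # days until conversion/censoring
converted = rng.random(1000) < 0.3            # did the conversion event happen?

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=converted)
# Survival curves are "upside down" for conversions, so flip them:
conversion_curve = 1 - kmf.survival_function_   # P(converted by time t)
print(conversion_curve.tail())
</code></pre>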
<p>Kaplan-Meier also lets us estimate the uncertainty for each cohort, which I think is always best practice when you plot things!</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/kaplan_meier_with_uncertainty.png" alt="kaplan meier with uncertainty"></p>
<p>The nice thing about Kaplan-Meier is that it lets us operate on <em>censored</em> data. This means that for a given cohort, we're not going to have observations beyond a certain point for certain members of that cohort. Some users may not have converted yet, but may very well convert in the future.</p>
<p>This is most clear if we segment the users by some other property. In the case below I've arbitrarily segmented users by the first letter of their email address. These two groups contain users on a spectrum between:</p>
<ul>
<li>Some users that just came to our site and have essentially no time to convert</li>
<li>Some users that have had plenty of time to convert</li>
</ul>
<p><img src="https://erikbern.com/assets/convoys-blog-post/kaplan_meier_by_email.png" alt="kaplan meier by email"></p>
<p>Dealing with censoring is a huge focus for survival analysis and Kaplan-Meier does that in a formalized way.</p>
<h2 id="so-far-so-good">So far, so good</h2>
<p>Ok, so this is great: we are now checking lots of the boxes, but IMO not quite all:</p>
<ul>
<li>✅ Can deal with censored data</li>
<li>✅ Can give us uncertainty estimates</li>
<li>❌ Can extrapolate: it would be amazing if we could look at the early shape of a cohort curve and make some statements about what it's going to converge towards.</li>
</ul>
<p>So, let's switch to something slightly more complex: parametric survival models! Take a deep breath, I'm going to walk you through this somewhat technical topic:</p>
<h2 id="parametric-survival-models">Parametric survival models</h2>
<p>I was working on a slightly simpler cohort chart initially, and my first attempt was to fit an <a href="https://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a>. The inspiration came from <a href="https://en.wikipedia.org/wiki/Markov_chain#Continuous-time_Markov_chain">continuous-time Markov chains</a> where you can model the conversions as a very simple transition chart:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/exponential_markov_chain.png" alt="exponential markov chain"></p>
<p>In the chart above, we can only <em>observe</em> transitions to the <em>converted</em> state. A lack of an observed transition does not necessarily mean no conversion: the person is <em>either</em> dead, or will convert but has not converted <em>yet</em>. This transition diagram actually describes a very simple differential equation that we can solve to get the closed form. I will spare you the details in this blog post, but the form of the curve that we are trying to fit is:</p>
<p>$$ F(t) = c\left(1 - e^{-\lambda t}\right) $$</p>
<p>This gives us two unknown parameters for each cohort: $$ c $$ and $$ \lambda $$. The former explains the conversion rate that the cohort converges towards, the latter explains the speed at which it converges. See below for a few examples of hypothetical curves:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/exponential_curves.png" alt="exponential curves"></p>
<p>Note that the introduction of the parameter $$ c $$ departs a bit from most of the traditional survival analysis literature. Exponential distributions (as well as Weibull and gamma, which we will introduce in a second) are commonplace when you look at failure rates and other phenomena, but in all cases that I have encountered so far, there is an assumption that <em>everyone converts eventually</em> (or rather, that everyone dies in the end). This assumption is no longer true when we consider <em>conversions</em>: not everyone converts in the end! That's why we have to add the $$ 0 \leq c \leq 1 $$ parameter.</p>
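<p>To make this concrete, here's a sketch of fitting $$ c $$ and $$ \lambda $$ by maximum likelihood on synthetic data. This is a simplified stand-in for the kind of fitting we do (the convoys package described later in this post implements the real thing), not its actual code:</p>
<pre><code class="language-python">import numpy as np
from scipy.optimize import minimize

# Synthetic data: true conversion rate 0.4, true rate λ = 0.05/day.
rng = np.random.default_rng(1)
n, true_c, true_lam = 2000, 0.4, 0.05
will_convert = rng.random(n) < true_c
t_convert = rng.exponential(1 / true_lam, n)
t_censor = rng.uniform(0, 100, n)          # users observed for varying times
converted = will_convert & (t_convert < t_censor)
t = np.where(converted, t_convert, t_censor)

def neg_log_likelihood(params):
    c, lam = params
    # Converted at t: density c·λ·exp(-λt).
    # Still unconverted at t: probability 1 - c·(1 - exp(-λt)).
    ll = np.where(converted,
                  np.log(c) + np.log(lam) - lam * t,
                  np.log(1 - c * (1 - np.exp(-lam * t))))
    return -ll.sum()

res = minimize(neg_log_likelihood, x0=[0.5, 0.01],
               bounds=[(1e-4, 0.999), (1e-4, 1.0)])
print(res.x)   # should land near (0.4, 0.05)
</code></pre>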
<h2 id="weibull-distributions">Weibull distributions</h2>
<p>It turns out that exponential distributions fit certain types of conversion charts well, but most of the time, the fit is poor. <a href="https://ragulpr.github.io/2016/12/22/WTTE-RNN-Hackless-churn-modeling/">This excellent blog post</a> introduced me to the world of <a href="https://en.wikipedia.org/wiki/Weibull_distribution">Weibull distributions</a>, which are often used to model <em>time to failure</em> or similar phenomena. The Weibull distribution adds one more parameter $$ p > 0 $$ to the exponential distribution:</p>
<p>$$ F(t) = c\left(1 - e^{-(t\lambda)^p}\right) $$</p>
<p>Fitting a Weibull distribution seems to work really well for a lot of cohort curves that we work with at Better. Let's fit one to the dataset we had earlier:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/weibull_distribution.png" alt="weibull distribution"></p>
<p>The solid lines are the models we fit, and the dotted lines the Kaplan-Meier estimates. As you can see, these lines coincide very closely. The nice thing about the extrapolated lines is that we can use them to forecast their expected final conversion rate. We can also fit uncertainty estimates to the Weibull distribution just like earlier:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/weibull_distribution_with_uncertainty.png" alt="weibull distribution with uncertainty"></p>
<p>The ability to extrapolate isn't just a “nice to have”, but it makes it possible to make assumptions about final conversion rates <em>much earlier,</em> which in turn means our feedback cycle gets tighter and we can learn faster and iterate quicker. Instead of having to wait months to see how a new acquisition channel is performing, we can get an early signal very quickly, and make business decisions faster. This is extremely valuable!</p>
<h2 id="gamma-and-generalized-gamma-distributions">Gamma and generalized gamma distributions</h2>
<p>For certain types of cohort behavior, it turns out that a <a href="https://en.wikipedia.org/wiki/Gamma_distribution">gamma distribution</a> makes more sense. This distribution can be used to model a type of behavior where there is an initial time lag until conversion starts. The <a href="https://en.wikipedia.org/wiki/Generalized_gamma_distribution">generalized gamma distribution</a> combines the best of Weibull and gamma distributions into one single distribution that turns out to model almost any conversion process at Better. Here is one example:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/generalized_gamma_distribution.png" alt="generalized gamma distribution"></p>
<p>The generalized gamma conversion model has just four parameters that we need to fit (three coming from the distribution itself, one describing the final conversion rate). Yet, it seems to be an excellent model that fits almost any conversion behavior at Better. See below for a gif where I fit a generalized gamma model to a diverse set of database queries comprising different groups, different milestones, and different time spans:</p>
<p><img src="https://erikbern.com/assets/convoys-blog-post/lots_of_cohorts.gif" alt="lots of cohorts"></p>
<h2 id="introducing-convoys">Introducing convoys</h2>
<p><a href="https://github.com/better/convoys">Convoys</a> is a small Python package to help you fit these models. It implements everything shown above, as well as something we haven't talked about so far: regression models. The point of regression models is to fit more powerful models that can predict conversion based on a set of features and learn that from historical data. We use these models for a wide range of applications at Better.</p>
<p>Convoys came out of a few different attempts at building the math to fit these models. The basic math is quite straightforward: fit a probability distribution times a “final conversion rate” using <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimation</a>. We rely on the excellent <a href="https://github.com/HIPS/autograd">autograd</a> package to avoid taking derivatives ourselves (very tedious!) and <a href="https://docs.scipy.org/doc/scipy/reference/optimize.html">scipy.optimize</a> for the actual curve fitting. On top of that, convoys supports estimating uncertainty using <a href="https://emcee.readthedocs.io/en/latest/">emcee</a>.</p>
<p>You can head over to the <a href="https://better.engineering/convoys/">documentation</a> if you want to read more about the package. Just to mention a few of the more interesting points of developing convoys:</p>
<ul>
<li>For a while, convoys relied on Tensorflow, but it turned out it made the code more complex and wasn't worth it.</li>
<li>To fit gamma distributions, we rely a lot on the <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gammainc.html">lower regularized incomplete gamma function</a>. This function has a <a href="https://github.com/tensorflow/tensorflow/issues/17995">bug in Tensorflow</a> where the derivative is incorrect, and it's not supported in autograd. After a lot of banging my head against the wall, I added a simple numerical approximation. Cam Davidson-Pilon (author of lifelines mentioned earlier) later ran into the exact same issue and made a <a href="https://github.com/CamDavidsonPilon/autograd-gamma">small Python package</a> that we're now using.</li>
<li>In order to regularize the models, I have found it useful to put very mild priors on the variance of some of the parameters using an inverse gamma distribution. This ends up stabilizing many of the curves fit in practice, while introducing a very mild bias.</li>
<li>When fitting a regression model, we have separate parameters $$ c_i $$ and $$ \lambda_i $$ for each feature, but shared $$ k $$ and $$ p $$ parameters for the generalized gamma distribution. This is a fairly mild assumption in real world cases and reduces the number of parameters by a lot.</li>
</ul>
<p>Convoys is semi-experimental and the SDK might change very quickly in the future, but we believe it has quite a wide range of applications, so definitely check it out if you are working on similar problems!</p>
<h2 id="finally">Finally…</h2>
<p>We are hiring! If you're interested in these types of problems, definitely <a href="https://boards.greenhouse.io/better/jobs/1424243">let us know</a>! We have a small but quickly growing team of data engineers/scientists in New York City who are working on many of these types of problems on a daily basis.</p>
Why software projects take longer than you think: a statistical model2019-04-15T00:00:00Zhttps://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html<p>Anyone who has built software for a while knows that estimating how long something is going to take is <em>hard</em>.
It's hard to come up with an unbiased estimate of how long something will take, when fundamentally the work in itself is about <em>solving</em> something.
One pet theory I've had for a really long time, is that some of this is really just a statistical artifact.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I suspect devs are actually decent at estimating the *median* time to complete a task. Planning is hard because they suck at the *average*.</p>— Erik Bernhardsson (@bernhardsson) <a href="https://twitter.com/bernhardsson/status/862791164759617536?ref_src=twsrc%5Etfw">May 11, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Let's say you estimate a project to take 1 week. Let's say there are three equally likely outcomes: either it takes 1/2 week, or 1 week, or 2 weeks. The <em>median</em> outcome is actually the same as the estimate: 1 week, but the <em>mean</em> (aka <em>average</em>, aka <em>expected value</em>) is 7/6 = 1.17 weeks. The estimate is actually calibrated (unbiased) for the median (which is 1), but not for the mean.</p>
<p>A reasonable model for the “blowup factor” (actual time divided by estimated time) would be something like a <a href="https://en.wikipedia.org/wiki/Log-normal_distribution">log-normal distribution</a>. If the estimate is one week, then let's model the real outcome as a random variable distributed according to the log-normal distribution around one week. This has the property that the median of the distribution is exactly one week, but the mean is much larger:</p>
<p><img src="https://erikbern.com/assets/software-estimation/log_normal.png" alt="log normal"></p>
<p>If we take the logarithm of the blowup factor, we end up with a plain old normal distribution centered around 0. This assumes the median blowup factor is 1x, and as you hopefully remember, log(1) = 0. However, different tasks may have different uncertainties around 0. We can model this by varying the σ parameter which corresponds to the standard deviation of the normal distribution:</p>
<p><img src="https://erikbern.com/assets/software-estimation/normal.png" alt="normal"></p>
<p>Just to put some numbers on this: when log(actual / estimated) = 1 then the blowup factor is exp(1) = e = 2.72. It's equally likely that a project blows up by a factor of exp(2) = 7.4 as it is that it completes in exp(-2) = 0.14 i.e. completes in 14% of the estimated time. Intuitively the reason the mean is so large is that tasks that complete faster than estimated have no way to compensate for the tasks that take much longer than estimated. We're bounded by 0, but unbounded in the other direction.</p>
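<p>This is easy to sanity-check numerically. A quick sketch, drawing log-normal blowup factors with median 1 (i.e. μ = 0):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(42)
for sigma in (0.5, 1.0, 2.0):
    blowup = rng.lognormal(mean=0.0, sigma=sigma, size=1_000_000)
    print(sigma, np.median(blowup), blowup.mean())
# Medians stay ≈ 1, while means ≈ exp(σ²/2): about 1.13, 1.65, 7.39.
# The same numbers will show up in the tables below.
</code></pre>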
<p>Is this just a model? You bet! But I'll get to real data shortly and show that this in fact maps to reality reasonably well using some empirical data.</p>
<h2 id="software-estimation">Software estimation</h2>
<p>So far so good, but let's really try to understand what this means in terms of software estimation. Let's say we look at the roadmap and it consists of 20 different software projects and we're trying to estimate: how long is it going to take to complete <em>all of them</em>?</p>
<p>Here's where the mean becomes crucial. Means add, but medians do not. So if we want to get an idea of how long it will take to complete the sum of n projects, we need to look at the mean. Let's say we have three different projects in the pipeline with the exact same σ = 1:</p>
<table>
<thead>
<tr>
<th align="center"></th>
<th align="right">Median</th>
<th align="right">Mean</th>
<th align="right">99%</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Task A</td>
<td align="right">1.00</td>
<td align="right">1.65</td>
<td align="right">10.24</td>
</tr>
<tr>
<td align="center">Task B</td>
<td align="right">1.00</td>
<td align="right">1.65</td>
<td align="right">10.24</td>
</tr>
<tr>
<td align="center">Task C</td>
<td align="right">1.00</td>
<td align="right">1.65</td>
<td align="right">10.24</td>
</tr>
<tr>
<td align="center">SUM</td>
<td align="right">3.98</td>
<td align="right">4.95</td>
<td align="right">18.85</td>
</tr>
</tbody>
</table>
<p>Note that the means add up and 4.95 = 1.65*3, but the other columns don't.</p>
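<p>You can verify the SUM row with a quick Monte Carlo. The sum of log-normals has no nice closed form, so simulating is the honest way to get the median and the 99th percentile:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
# Three tasks, each with median 1 and σ = 1, summed:
totals = rng.lognormal(mean=0.0, sigma=1.0, size=(1_000_000, 3)).sum(axis=1)
print(np.median(totals), totals.mean(), np.percentile(totals, 99))
# ≈ 3.98, 4.95, 18.85 — matching the SUM row above
</code></pre>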
<p>Now, let's add up three projects with different sigmas:</p>
<table>
<thead>
<tr>
<th align="center"></th>
<th align="right">Median</th>
<th align="right">Mean</th>
<th align="right">99%</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Task A (σ = 0.5)</td>
<td align="right">1.00</td>
<td align="right">1.13</td>
<td align="right">3.20</td>
</tr>
<tr>
<td align="center">Task B (σ = 1)</td>
<td align="right">1.00</td>
<td align="right">1.65</td>
<td align="right">10.24</td>
</tr>
<tr>
<td align="center">Task C (σ = 2)</td>
<td align="right">1.00</td>
<td align="right">7.39</td>
<td align="right">104.87</td>
</tr>
<tr>
<td align="center">SUM</td>
<td align="right">4.00</td>
<td align="right">10.18</td>
<td align="right">107.99</td>
</tr>
</tbody>
</table>
<p>The means still add up, but are nowhere near the naïve 3-week estimate you might come up with. Note that the high-uncertainty project with σ=2 basically ends up <em>dominating</em> the mean time to completion. For the 99th percentile, it doesn't just dominate it, it basically absorbs all the other ones. We can do a bigger example:</p>
<table>
<thead>
<tr>
<th align="center"></th>
<th align="right">Median</th>
<th align="right">Mean</th>
<th align="right">99%</th>
</tr>
</thead>
<tbody>
<tr>
<td align="center">Task A (σ = 0.5)</td>
<td align="right">1.00</td>
<td align="right">1.13</td>
<td align="right">3.20</td>
</tr>
<tr>
<td align="center">Task B (σ = 0.5)</td>
<td align="right">1.00</td>
<td align="right">1.13</td>
<td align="right">3.20</td>
</tr>
<tr>
<td align="center">Task C (σ = 0.5)</td>
<td align="right">1.00</td>
<td align="right">1.13</td>
<td align="right">3.20</td>
</tr>
<tr>
<td align="center">Task D (σ = 1)</td>
<td align="right">1.00</td>
<td align="right">1.65</td>
<td align="right">10.24</td>
</tr>
<tr>
<td align="center">Task E (σ = 1)</td>
<td align="right">1.00</td>
<td align="right">1.65</td>
<td align="right">10.24</td>
</tr>
<tr>
<td align="center">Task F (σ = 1)</td>
<td align="right">1.00</td>
<td align="right">1.65</td>
<td align="right">10.24</td>
</tr>
<tr>
<td align="center">Task G (σ = 2)</td>
<td align="right">1.00</td>
<td align="right">7.39</td>
<td align="right">104.87</td>
</tr>
<tr>
<td align="center">SUM</td>
<td align="right">9.74</td>
<td align="right">15.71</td>
<td align="right">112.65</td>
</tr>
</tbody>
</table>
<p>Again, one single misbehaving task basically ends up dominating the calculation, at least for the 99% case. Even for the mean though, the one freak project ends up taking over roughly half the time spent on these tasks, despite all of these tasks having a similar median time to completion. To make it simple, I assumed that all tasks have the same estimated size, but different uncertainties. The same math applies if we vary the size as well.</p>
<p>Funny thing is, I've had this gut feeling for a while. Adding up estimates rarely works when you end up with more than a few tasks. Instead, figure out which tasks have the highest uncertainty – those tasks are basically going to dominate the mean time to completion.</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I have two methods for estimating project size:<br>(a) break things down into subprojects, estimate them, add it up<br>(b) gut feeling estimate based on how nervous i feel about unexpected risks<br>So far (b) is vastly more accurate for any project more than a few weeks</p>— Erik Bernhardsson (@bernhardsson) <a href="https://twitter.com/bernhardsson/status/1103871565685391360?ref_src=twsrc%5Etfw">March 8, 2019</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>A chart summarizes the mean and 99th percentile as a function of the uncertainty (σ):</p>
<p><img src="https://erikbern.com/assets/software-estimation/sigmas.png" alt="sigmas"></p>
<p>There is math to this now! I've started appreciating this during project planning: I truly think that adding up task estimates gives a really misleading picture of how long something will take, because you have these crazy skewed tasks that will end up taking over.</p>
<h2 id="wheres-the-empirical-data">Where's the empirical data?</h2>
<p>I filed this in my brain under “curious toy models” for a long time, occasionally thinking that it's a neat illustration of a real world phenomenon I've observed. But surfing around on the interwebs one day, I encountered an interesting dataset of <a href="https://github.com/Derek-Jones/SiP_dataset">project estimation and actual times</a>. Fantastic!</p>
<p>Let's do a quick scatter plot of estimated vs actual time to completion:</p>
<p><img src="https://erikbern.com/assets/software-estimation/scatter.png" alt="scatter"></p>
<p>The median blowup factor turns out to be <em>exactly</em> 1x for this dataset, whereas the mean blowup factor is 1.81x. Again, this confirms the hunch that developers estimate the <em>median</em> well, but the mean ends up being much higher.</p>
<p>Let's look at the distribution of the blowup factor. We're going to look at the logarithm of it:</p>
<p><img src="https://erikbern.com/assets/software-estimation/distribution.png" alt="distribution"></p>
<p>You can see that it's pretty well centered around 0, where the blowup factor is exp(0) = 1.</p>
<h2 id="lets-go-grab-the-statistics-toolbox">Let's go grab the statistics toolbox</h2>
<p>I'm going to get a bit fancy with statistics now – feel free to skip if it's not your cup of tea. What can we infer from this empirical distribution? You might expect that the logarithms of the blowup factor would distribute according to a normal distribution, but that's not quite true.
Note that the σs are themselves random and vary for each project.</p>
<p>One convenient way to model the σs is that they are sampled from an <a href="https://en.wikipedia.org/wiki/Inverse-gamma_distribution">inverse Gamma distribution</a>. If we assume (like previously) that the log of the blowup factors are distributed according to a normal distribution, then the “global” distribution of the logs of blowup factors ends up being <a href="https://en.wikipedia.org/wiki/Student%27s_t-distribution">Student's t-distribution</a>.</p>
<p>Let's fit a Student's t-distribution to the distribution above:</p>
<p><img src="https://erikbern.com/assets/software-estimation/distribution_plus_t.png" alt="distribution"></p>
<p>Decent fit, in my opinion! The parameters of the t-distribution also define the inverse Gamma distribution of the σ values:</p>
<p><img src="https://erikbern.com/assets/software-estimation/sigma_distribution.png" alt="sigma distribution"></p>
<p>Note that values like σ > 4 are incredibly unlikely, but when they happen, they cause a mean blowup of several thousand times.</p>
<h2 id="why-software-tasks-always-take-longer-than-you-think">Why software tasks always take longer than you think</h2>
<p>Assuming this dataset is representative of software development (questionable!), we can infer some more numbers. We have the parameters for the t-distribution, so we can compute the mean time it takes to complete a task, without knowing what the σ for that task is.</p>
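<p>Sketched in code (with stand-in data, since I'm not reproducing the dataset here), the fit and the percentile computations look roughly like this:</p>
<pre><code class="language-python">import numpy as np
from scipy import stats

# Stand-in for log(actual / estimated) per project; the real values
# come from the SiP dataset linked above.
rng = np.random.default_rng(7)
log_blowup = rng.standard_t(df=3, size=500)

df_, loc, scale = stats.t.fit(log_blowup)
for q in (0.5, 0.99, 0.9999):
    # Implied blowup factor at this percentile:
    print(q, np.exp(stats.t.ppf(q, df_, loc=loc, scale=scale)))
</code></pre>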
<p>While the median blowup factor imputed from this fit is 1x (as before), the 99th percentile blowup factor is 32x, but if you go to the 99.99th percentile, it's a whopping 55 <em>million</em>! One (hand wavy) interpretation is that some tasks end up being essentially impossible to do. In fact, these extreme edge cases have such an outsize impact on the <em>mean</em>, that the mean blowup factor of <em>any task</em> ends up being <em>infinite</em>. This is pretty bad news for people trying to hit deadlines!</p>
<h2 id="summary">Summary</h2>
<p>If my model is right (a big <em>if</em>) then here's what we can learn:</p>
<ul>
<li>People estimate the <em>median</em> completion time well, but not the mean.</li>
<li>The mean turns out to be substantially worse than the median, due to the distribution being skewed (log-normally).</li>
<li>When you add up the estimates for n tasks, things get even worse.</li>
<li>Tasks with the most uncertainty (rather than the biggest size) can often dominate the mean time it takes to complete all tasks.</li>
<li>The mean time to complete a task we know nothing about is actually <em>infinite</em>.</li>
</ul>
<h2 id="notes">Notes</h2>
<ul>
<li>This is obviously just based on one dataset I found online. Other datasets may give different results.</li>
<li>My model is of course also highly subjective, like any statistical model.</li>
<li>I would ❤️ to apply the model to a much larger data set to see how well it holds up.</li>
<li>I assumed all tasks are independent. In reality they might be correlated, which would make the analysis a lot more annoying but (I think) ultimately lead to similar conclusions.</li>
<li>The sum of log-normally distributed values is not another log-normally distributed value. This is a weakness of that distribution, since you could argue most tasks are really just sums of sub-tasks, and it would be nice if our distribution were <a href="https://en.wikipedia.org/wiki/Stable_distribution">stable</a> like that.</li>
<li>I removed small tasks (estimated time less than or equal to 7 hours) from the histogram since small tasks skew the analysis and there was an odd spike at exactly 7.</li>
<li>The <a href="https://github.com/erikbern/software-estimation">code is on my Github</a>, as usual.</li>
<li>There's some discussion <a href="https://news.ycombinator.com/item?id=19671673">on Hacker News</a> and <a href="https://www.reddit.com/r/programming/comments/bdrmm6/why_software_projects_take_longer_than_you_think/">on Reddit</a>.</li>
</ul>
Headcount goals, feature factories, and when to hire those mythical 10x people2019-02-21T00:00:00Zhttps://erikbern.com/2019/02/21/headcount-targets-feature-factories-and-when-to-hire-those-mythical-10x-people.html<p>When I started building up a tech team for <a href="https://better.com">Better</a>, I made a very conscious decision to pay at the high end to get people. I thought this made more sense: they cost a bit more money to hire, but their output usually more than compensates for it. Many fellow CTOs went for the other side of the spectrum. This was a mystery to me, until one day it all made sense.</p>
<h2 id="what-is-output">What is output?</h2>
<p>Before we get started, let me clarify what I mean by “output” or “productivity”. I don't mean an engineer just hammering on the keyboard shipping code at light speed. When I talk about it, I refer to a whole range of things, like helping your coworkers, introducing new frameworks, improving the process, and much more. I've <a href="/2016/01/08/i-believe-in-the-10x-engineer-but.html">written about this</a> in the past.</p>
<p>You can't really measure it, of course. But all managers try, when they set the salary of Alice to $110,000 and Bob to $115,000. So on some level, managers certainly believe they have <em>some</em> precise idea of the relative value of each engineer.</p>
<h2 id="headcount-goals">Headcount goals</h2>
<p>Let's dissect a classic management objective: headcount goals. In a typical engineering hiring process, a CTO (or other high-up person) figures out roughly how much they need to get done compared to how many engineers they have, then goes to the CFO and haggles a bit, then gets assigned a headcount number and a salary range for those people. That then gets distributed across the org recursively, and every hiring manager gets a target for how many people to hire.</p>
<p>Let's say the CTO is absolutely adamant that they need to grow the engineering team by 2x in a year. This bubbles down to a junior engineering manager. If you are running a decentralized interview process, you have now created a <em>great</em> agency problem where the junior manager is told their success at the company is partly measured by how well they reach their hiring goal. Of course they are going to lower their bar for who they hire!</p>
<p>Don't think this happens? I've <em>seen it</em>. I've seen how recruiting bars start slipping because of well-meaning people pushing for more resources. And how over time the average level of engineering talent slowly declines.</p>
<h2 id="solving-the-misalignment">Solving the misalignment</h2>
<p>The right solution to this is partly to make the interview process and decision <em>centralized</em>. No team should impose their own hiring standards because with aggressive headcount goals, everyone on that team will be incentivized to lower the bar.</p>
<p>But let's also ask a more fundamental question: <em>why headcount goals?</em> They make the underlying assumption that every engineer has roughly the same productivity. In reality, engineer productivity can be very dispersed.</p>
<p>So why not target a certain output level? Of course it's because engineers don't come with labels that say this one is a 2.3x engineer that costs $140,000 and that other one is a 4.5x that costs $180,000. You don't know! Let's first talk about this relationship though because I think it's important to understand.</p>
<h2 id="cost-as-a-function-of-productivity">Cost as a function of productivity</h2>
<p>What's maybe surprising is that cost as a function of productivity seems to be a sub-linear function. A 3x or 4x engineer might cost say 2x more. This is clearly not some law imposed by physics, but I think most people who have done some serious recruiting would concede that the relationship is something slightly less than linear.</p>
<p>For instance, let's say the cost of a $$k$$x engineer is $$k^{0.6}$$. So for a 2x engineer we pay 1.5x more and for a 10x engineer we pay 4x more. The choice of the exponent is a bit arbitrary here, but the point is to reflect that the cost scales <em>less than linearly</em>. Any exponent less than 1 works for the purpose of this argument, and note that an exponent larger than 1 would not exist in an efficient market. No one would hire a 2x engineer at 2.1x the cost – they would simply hire two 1x engineers.</p>
<p><img src="https://erikbern.com/assets/kx-engineer-cost.png" alt="kx engineer"></p>
<p>This seems like a no-brainer then. Why wouldn't everyone pay a ton more money to hire the most senior engineers? Let's throw headcount targets out the window and replace them with a <em>total output target</em>. Maybe we should even go as far as having a <em>total salary dollar target</em>, rather than headcount? Besides the challenge of convincing your CFO of this, it probably misaligns incentives even more.</p>
<p>Headcount targets usually come with salary bands that you agree on beforehand. This is another weird constraint if you think about it – if more expensive engineers have a higher ROI then why cap the cost (and thus the productivity)?</p>
<p>These are things that I've been struggling to understand. It turns out, you can formalize a simple model where it's rational to hire two 1x engineers instead of a 2x engineer even if the total cost is higher.</p>
<h2 id="feature-factories-and-task-overhead">Feature factories and task overhead</h2>
<p>There's one common argument for hiring “cheaper” engineering talent which is that a ton of tasks are straightforward, unsexy, or boring. Maybe an entry-level engineer doesn't mind tweaking WordPress themes all day, but a senior engineer needs more challenges. At the extreme end of this spectrum is a type of company often derided as a <em>feature factory</em>, where I suspect people imagine a sweat shop of super inexperienced engineers basically updating forms in HTML or adding tracking pixels.</p>
<p>I'm pretty unconvinced by this argument. A senior person will find opportunities to automate away the repetitive parts, paying for themselves.</p>
<p>However, there's a slight variant of this idea that I think actually does justify hiring less experienced engineers, which has to do with <em>task overhead.</em> Let's consider a toy model:</p>
<p>Let's say we have two engineers, one called Norm the normal engineer and one Twanda the 2x engineer. Let's say they both work at a company where Norm spends 50% of his time <em>actually</em> working, with the rest of the time lost as “task overhead”. Maybe a bunch of bookkeeping (going into Jira, creating Github pull requests, waiting for CI, etc). This is overhead that has to happen for <em>every task</em>.</p>
<p><img src="https://erikbern.com/assets/2x-engineer.png" alt="2x engineer"></p>
<p>How much more productive is Twanda compared to Norm? 2x? No! Twanda generates 4/3 as much value as Norm! And in general, if a 1x engineer spends $$ c $$ of their time doing actual work (and $$ 1-c $$ on “task overhead”), then a $$ k $$x engineer will have output factor $$ 1 / (c/k + 1-c) $$.</p>
<p>Note that spending time in meetings doesn't have the same impact. In a hypothetical company where 90% of all time is being spent in meetings, a 2x faster engineer would still get 2x more work done (in the 10% of time that isn't spent in meetings). What my model is talking about is <em>task-related overhead</em>.</p>
<p>You can see in this toy model that a lot of the productivity gains of a higher-output engineer will be diminished in an environment with high task overhead. You really benefit a lot more from more productive people if you minimize the amount of task overhead!</p>
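<p>To see how quickly the advantage evaporates, here's the output factor $$ 1 / (c/k + 1-c) $$ tabulated for a few overhead levels:</p>
<pre><code class="language-python"># Output factor of a k-x engineer when a 1x engineer spends c of their
# time on actual work (so the task overhead is 1 - c).
for overhead in (0.0, 0.2, 0.5, 0.8):
    c = 1 - overhead
    for k in (1, 2, 10):
        print(f"overhead={overhead:.0%} k={k}: {1 / (c / k + 1 - c):.2f}x")
</code></pre>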
<h2 id="the-cost-benefit-analysis-of-high-output-engineers">The cost-benefit analysis of high output engineers</h2>
<p>Now we have a bunch of the assumptions that let us calculate the <em>output per cost</em> of a $$k$$x engineer. We know the output factor $$ 1/(c + (1-c)/k) $$ and the cost $$ k^{0.6} $$, so the <em>output per cost</em> is:</p>
<p>$$ \frac{1}{\left(c + (1-c)/k\right)k^{0.6}} $$</p>
<p>For any given value of $$ c $$, we can solve for the optimal value of $$ k $$! Take the derivative with respect to $$ k $$ and set it to zero.
Because I'm a lazy person, I just <a href="https://www.wolframalpha.com/input/?i=solve+D%5B1%2F((c%2Fx%2B1-c)*x%5E0.6),+x%5D">plugged it into Wolfram Alpha</a>, and the optimal value of $$ k $$ as a function of $$ c $$ turns out to be</p>
<p>$$ k = \frac{2}{3}\frac{1-c}{c} $$</p>
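<p>If you'd rather check that numerically than trust the calculus, here's a quick sketch using scipy's bounded scalar minimizer:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">from scipy import optimize

def output_per_cost(k, c):
    # output factor divided by cost, per the formulas above
    return 1 / ((c + (1 - c) / k) * k ** 0.6)

for c in [0.0625, 0.2, 0.4, 0.8]:
    k_opt = optimize.minimize_scalar(
        lambda k: -output_per_cost(k, c),
        bounds=(1e-6, 100), method='bounded',
    ).x
    print(f'overhead {c:.0%}: best k = {k_opt:.2f}, '
          f'closed form = {2 / 3 * (1 - c) / c:.2f}')
</code></pre></div>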
<p>Let's plot the optimal value of $$ k $$ with respect to $$ c $$. I had to plot it on the log-scale for the shape to come out nicely:</p>
<p><img src="https://erikbern.com/assets/kx-engineer-productivity.png" alt="kx engineer"></p>
<p>Beautiful! Let's unpack this by picking a few points on the chart:</p>
<ul>
<li>Extreme case: if the overhead is 100% then the best value for money is to hire 0x engineers.</li>
<li>If the overhead is about 80% then the best value for money is to hire 0.2x engineers.</li>
<li>If the overhead is about 40% then the best value for money is to hire 1x engineers.</li>
<li>If the overhead is about 20% then the best value for money is to hire 3x engineers.</li>
<li>If the overhead is about 7% then the best value for money is to hire 10x engineers.</li>
<li>Extreme case: if the overhead is 0% then the best value for money is to hire ∞x engineers.</li>
</ul>
<p>So it's all about getting the overhead of work down.</p>
<h2 id="getting-the-most-value-out-of-your-tech-team">Getting the most value out of your tech team</h2>
<p>We talked a lot about the difference between engineers in terms of productivity vs cost and how to get the most value out of them. The good news is that there are really only two things it boils down to!</p>
<ol>
<li>Have a centralized recruiting process with a consistent high bar</li>
<li>Reduce the task overhead to a minimum</li>
</ol>
<p>If you don't have those things, there's no point trying to hire super senior people – in particular, you are probably <em>better off hiring average engineers.</em> <a href="https://medium.com/@xamat/cultural-overfitting-and-underfitting-or-why-the-netflix-culture-wont-work-in-your-company-af2a62e41288">Xavier Amatriain wrote a blog post</a> with somewhat similar conclusions: don't expect that you can cherry-pick elements of the Netflix culture and drop them into your startup. You might have to start with your development process and your hiring process!</p>
<p>If you had asked me before I wrote this blog post why some companies pay top dollar for engineers and others don't, I probably would have said that some companies are super tech focused, and so they can truly get value out of really expensive engineers, whereas some companies are a collection of scripts using some off-the-shelf framework, and an expensive engineer wouldn't make a huge difference.</p>
<p>I still think this is right, but I think the exact causality has more to do with the model posited in this post. As an example, Google (known for paying a lot) has the kind of challenges on which engineers can work independently for a very long time. That lowers the (amortized) task overhead, which means that they get more value out of an expensive (but more productive) engineer. Other companies have a large quantity of small projects (and thus a large task overhead), meaning they rationally shouldn't pay at the top of the market.</p>
<p>This all definitely strikes me as kind of “obvious” in hindsight, and maybe you feel the same. At least you now have some math to back it up!</p>
Data architecture vs backend architecture2019-01-10T00:00:00Zhttps://erikbern.com/2019/01/10/data-architecture-vs-backend-architecture.html<p><img src="https://erikbern.com/assets/refinery.jpeg" alt="refinery"></p>
<p>A modern tech stack typically involves at least a frontend and backend but relatively quickly also grows to include a data platform. This typically grows out of the need for ad-hoc analysis and reporting but possibly evolves into a whole oil refinery of cronjobs, dashboards, bulk data copying, and much more. What pushes things into the data platform is generally that a number of things are</p>
<ul>
<li>Not latency critical, so they can run much later, maybe up to 24h later (as opposed to reactive synchronous jobs sitting on a request-response cycle)</li>
<li>Easier to express as a batch job operating on a large dataset rather than operating on each request</li>
</ul>
<p>Reporting is a decent example. Let's say you need to import all transactions into your accounting system. Rather than doing it directly from the backend, it might be a lot easier to just write a script to do it every 24h.</p>
<p>Training machine learning models is another example. Let's say you are building a fraud detection system and you have a machine learning model to detect if some user action is fraudulent. Training the model might take an hour, but predictions are quick. It's much easier to re-train the model, say, every 24h, or even every week or month. You can then serialize the model and use that for predictions in your backend system.</p>
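<p>A minimal sketch of that retrain-and-serialize pattern (assume a hypothetical <code>load_labeled_actions</code> helper that pulls yesterday's labeled data out of the data store; the point is the shape, not the model):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import datetime
import pickle

from sklearn.linear_model import LogisticRegression

def retrain_fraud_model():
    # Hypothetical helper: pulls yesterday's labeled user actions
    # (feature matrix X, labels y) out of the data store
    X, y = load_labeled_actions(datetime.date.today() - datetime.timedelta(days=1))
    model = LogisticRegression().fit(X, y)  # slow-ish is fine, we're offline
    with open('fraud_model.pkl', 'wb') as f:
        pickle.dump(model, f)  # the backend deserializes this and calls .predict()
</code></pre></div>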
<p>At Spotify, the data platform started with royalty reporting, but we quickly rebuilt the toplists as a nightly data job. The data platform kept growing, in particular the music recommendation system, which became a humongous data pipeline. We retrained the core models every few weeks but typically regenerated personalized recommendations every night. That was frequent enough that people wouldn't run out of recommendations.</p>
<h2 id="why-bother">Why bother?</h2>
<p>Why bother with a data platform? Because things typically get 10x easier to build and ship. Pushing work out of the backend into a separate data platform helps with a few things:</p>
<ul>
<li>You don't have to worry about latency.</li>
<li>You can control the flow yourself (rather than being at the mercy of a user request waiting for a response)</li>
<li>You can generally write things in a much more fault tolerant way (batch processing is often easier to write as a set of idempotent operations)</li>
<li>Batch processing can be a lot more efficient (generating music recommendations for 1,000,000 users is maybe only 1,000 times more work than generating for 1 user)</li>
<li>If things fail, it's not the end of the world, since you often fix the bug within the next day or so and just re-run the job</li>
</ul>
<p>For instance, consider the basic feature of building a global toplist that updates itself in real time, say showing the top news articles on a news website. I'm willing to bet a substantial amount of money that it's orders of magnitude harder to do this purely in the backend compared to building a cron job in the data platform that updates it every hour or every day and pushes the result back into the backend.</p>
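<p>To make that concrete, here's a sketch of what the cron job version might look like. Everything here is made up: the warehouse handle, the <code>pageviews</code> table, and the internal endpoint are all stand-ins.</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import requests

def update_toplist(warehouse):
    # Full scan over the pageview logs -- totally fine here, since
    # nobody is waiting on a request-response cycle
    rows = warehouse.execute('''
        select article_id, count(*) as views
        from pageviews
        where ts > now() - interval '24 hours'
        group by article_id
        order by views desc
        limit 100
    ''')
    # Push the result back into the backend through an internal endpoint
    requests.post('http://backend.internal/v1/toplist', json=list(rows))
</code></pre></div>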
<h2 id="so-you-do-you-do-it-in-the-least-hacky-way">So you do you do it (in the least hacky way)?</h2>
<p>Of course, backend architecture is a bit more mature, and there are about 1,000 blog posts about best practices. <a href="https://martinfowler.com/bliki/">Martin Fowler</a> comes to mind, for instance. When we're building backend systems, we've been taught things like:</p>
<ul>
<li>Avoid integration databases (each system should have its own database, and two systems can never touch the same database)</li>
<li>Database queries should be simple (typically referring to an exact key and joining as few tables as possible, ideally zero)</li>
<li>Use transactions (and constraints/keys/etc) for data integrity</li>
<li>Lots of unit tests</li>
<li>Lots of integration tests</li>
<li>Decompose larger services (aka “monoliths”) into smaller ones (aka “microservices”)</li>
</ul>
<p>… aaaaanyway, you can throw all/most of this out the window when you go to the data platform!</p>
<p><img src="https://erikbern.com/assets/defenestration.jpeg" alt="defenestration"></p>
<p><em>Oil painting featuring the defenestration of Prague (1618)</em></p>
<h1 id="the-data-side-the-wild-west">The data side: the wild west</h1>
<p>What I've seen in terms of infrastructure is typically one of these things as a starting point:</p>
<ol>
<li>Backend logs are shipped to some data store</li>
<li>Backend production databases are dumped to some data store</li>
</ol>
<p>Back in the day, this data store was typically Hadoop (HDFS), although these days it's often some scalable database such as Redshift. There are lots of different solutions and I'm not here to opine on any of them.</p>
<p><img src="https://erikbern.com/assets/data-architecture.png" alt="a typical data architecture"></p>
<h2 id="notes-on-data-latency">Notes on data latency</h2>
<p>Anything in the data platform is typically delayed (at Spotify typically up to 24h or even more). If it's not delayed, then it should be considered delayed, meaning anything operating on the data should not be latency critical.</p>
<p>Can you have a data platform that operates on real-time database data rather than a delayed database dump? Yes, in theory. But I think it can encourage really bad practices. For instance, if you have cron jobs operating on the production database, then it's easy to do things like <em>writing data back to the database</em>, which now creates an integration database with multiple writers and consumers of the same table. This can be a mess! For this reason, the right separation should be that (a) cron jobs operate on delayed data and (b) any communication back to the backend system happens through internal endpoints.</p>
<p>And another word of caution: <em>don't give business people access to real-time data!</em> They will start to demand it for everything and you will have to support it forever.</p>
<p>Anyway, what happens on the data side? In my experience, and in my opinion, <em>anything goes.</em> Let's talk about some of those things:</p>
<h2 id="integration-databases">Integration databases</h2>
<p>Take, for instance, the traditional constraint that services should not touch each other's databases and that those layers should be respected. This avoids the so-called “integration database” antipattern, where you have multiple writers and readers of the same data. In the data world, that rule goes out the window. On the data side, I find it completely kosher to join across 3 different datasets from different services. Why is this acceptable? I think it boils down to three things:</p>
<ol>
<li>Schema changes that break any downstream consumers are not the end of the world. You are going to have to update some queries, that's it.</li>
<li>If things break, then you can typically fix it and re-run the job. If some report isn't generated at midnight UTC, you can fix the job in the next few hours and few will complain.</li>
<li>All queries are read-only. That means you never have to worry about transaction integrity, or reading inconsistent data.</li>
</ol>
<p>My conclusion is that integration databases are terrible for backend systems but great and fun in the data layer.</p>
<h2 id="mega-queries">Mega queries</h2>
<p>A backend system will have to deal with low-latency low-throughput queries, that usually touch only one user at a time. But that user is making a request and waiting for a response, so queries need to be <em>fast</em>. So how do you make queries fast?</p>
<ol>
<li>Avoid as many joins as possible. When needed, stick to very simple, indexed joins (typically left join from table A to B)</li>
<li>Every query should refer to a particular item id, eg. <code>... where user_id = 234</code> or similar.</li>
</ol>
<p>I get incredibly suspicious any time I see any queries in a backend system that are more complex than a couple of tables and one or two <code>where</code> conditions.</p>
<p>In the data world, those things don't matter at all. Almost all queries are going to cut “across” all/most of the data, and in fact most queries will boil down to “full table scans”. Queries spanning pages, looking like they were written by an early 20th century German philosopher? Bring it on!</p>
<p>This is basically the difference between OLTP and OLAP, which are two different types of query patterns. There's probably lots of literature if you're willing to go deep!</p>
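<p>To make the contrast concrete, here's roughly what the two query patterns look like side by side (table and column names are made up):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># OLTP: backend-style query -- indexed lookup for one user, must return in milliseconds
oltp = '''
    select * from orders
    where user_id = 234
    order by created_at desc limit 10
'''

# OLAP: data-side query -- full scans and joins across several services' tables,
# fine if it takes minutes
olap = '''
    select u.signup_month, count(distinct o.user_id) as buyers, sum(o.amount) as revenue
    from users u
    join orders o on o.user_id = u.id
    left join refunds r on r.order_id = o.id
    group by u.signup_month
'''
</code></pre></div>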
<h2 id="testing">Testing</h2>
<p>I'm writing this somewhat apologetically, maybe with a sense of guilt as I'm admitting some dark secret. I'm a big proponent of very thorough testing, but getting any reasonable amount of test coverage on the data side is… hard.</p>
<p>For a backend system, you're sort of implementing functions like $$ y = f(x) $$ where $$ x $$ is an input (say, a response from a 3rd party API) and $$ y $$ is the result (say, the API response transformed to an internal representation). That's really easy to test! Sometimes you have functions like $$ S_{n+1} = f(S_n, A) $$ where $$ S_{i} $$ is the “state” and $$ A $$ is some action. But the state and the action are super tiny and can be reasonably well represented inside a unit test.</p>
<p>On the data side, the $$ x $$ and $$ S_i $$ are <em>huge</em>. This means any unit test basically ends up being 99% just setting up all the input data. But because the input data is super high dimensional, it also means modeling edge cases becomes exponentially harder. Then, just to throw a wrench into this, data pipelines can often be nondeterministic (machine learning models) or have very subjective outputs, to the point where you can't just write up a bunch of assertions easily. For all these reasons, I've found that tests for data pipelines have pretty low fidelity (they catch few bugs) and high maintenance costs. Sad! I like to have a few basic tests to make sure things <em>run</em>, but verifying correctness might not always be worth it.</p>
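<p>For what it's worth, the kind of “make sure it runs” test I mean is about this ambitious (a sketch; <code>run_pipeline</code> stands in for whatever your job's entry point is):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">def test_pipeline_runs_on_tiny_input():
    # Three hand-written rows: not trying to cover edge cases,
    # just checking the job runs end to end and emits something
    rows = [
        {'user_id': 1, 'amount': 10.0},
        {'user_id': 1, 'amount': -3.5},
        {'user_id': 2, 'amount': 0.0},
    ]
    output = run_pipeline(rows)  # hypothetical entry point
    assert len(output) > 0
</code></pre></div>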
<h2 id="conclusion">Conclusion</h2>
<p>I've found it useful to push <em>as much as you can</em> out of the backend into the data platform. This includes lots of things like</p>
<ul>
<li>Sending (non-transactional) emails to users (like personalized marketing emails)</li>
<li>Generating search indexes</li>
<li>Generating recommendations</li>
<li>Reporting</li>
<li>Generating data for business people</li>
<li>Training machine learning models</li>
</ul>
<p>All of those things <em>could</em> be built into the backend system, but should probably be run as cronjobs in a data platform instead. This will reduce the complexity of your code by (roughly) an order of magnitude.</p>
<h1 id="side-note">Side note</h1>
<p>I just wanted to mention it's been three months since my last blog post. I'm sorry! My second daughter was born in November and life has been pretty busy (but fun!). It didn't help that my last two posts both hit Hacker News front page, putting the bar really high. I have a bunch of low key posts in my head I'm planning to post in the next couple of months. Keep an eye out!</p>
The hacker's guide to uncertainty estimates2018-10-08T00:00:00Zhttps://erikbern.com/2018/10/08/the-hackers-guide-to-uncertainty-estimates.html<p>It started with a tweet:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">New years resolution: every plot I make during 2018 will contain uncertainty estimates</p>— Erik Bernhardsson (@bernhardsson) <a href="https://twitter.com/bernhardsson/status/950065836194066433?ref_src=twsrc%5Etfw">January 7, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Why? Because I've been sitting in 100,000,000 meetings where people endlessly debate whether the monthly number of widgets is going up or down, or whether widget method X is more productive than widget method Y. For almost any graph, quantifying the uncertainty seems useful, so I started trying. A few months later:</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">I'm four months into this year and I gotta tell you – the rabbit hole of estimating uncertainty is DEEP <a href="https://t.co/wvSlsYskrt">https://t.co/wvSlsYskrt</a></p>— Erik Bernhardsson (@bernhardsson) <a href="https://twitter.com/bernhardsson/status/995359796302876674?ref_src=twsrc%5Etfw">May 12, 2018</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>I never studied statistics and learned it kind of “backwards” through machine learning, so I consider myself more of a hacker who picked up statistics along the way. Earlier this year I had some basic knowledge of bootstrapping and confidence intervals, but along the way I had to pick up a whole arsenal of tricks going all the way to Monte Carlo methods and inverse Hessians. It seemed useful to share some of the methods I've used the most, so I wrote this post!</p>
<p><em>Note: this post features some math which might not render properly outside the <a href="https://erikbern.com/2018/10/08/the-hackers-guide-to-uncertainty-estimates.html">canonical location</a> of the post.</em></p>
<h1 id="lets-get-started">Let's get started</h1>
<p>I don't believe in learning things without a concrete example, so let's generate some data. We're going to generate a fake time series where the dates range from 2017-07-01 to 2018-07-31. Let's say the observations are all the weights of an elephant.</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">generate_time_series</span>(k<span style="color:#f92672">=</span><span style="color:#ae81ff">200</span>, m<span style="color:#f92672">=</span><span style="color:#ae81ff">1000</span>, sigma<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>, n<span style="color:#f92672">=</span><span style="color:#ae81ff">50</span>,
start_date<span style="color:#f92672">=</span>datetime<span style="color:#f92672">.</span>date(<span style="color:#ae81ff">2017</span>, <span style="color:#ae81ff">7</span>, <span style="color:#ae81ff">1</span>)):
xs <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>linspace(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">1</span>, n, endpoint<span style="color:#f92672">=</span>False)
ys <span style="color:#f92672">=</span> [k<span style="color:#f92672">*</span>x <span style="color:#f92672">+</span> m <span style="color:#f92672">+</span> random<span style="color:#f92672">.</span>gauss(<span style="color:#ae81ff">0</span>, sigma) <span style="color:#66d9ef">for</span> x <span style="color:#f92672">in</span> xs]
ts <span style="color:#f92672">=</span> [start_date <span style="color:#f92672">+</span> datetime<span style="color:#f92672">.</span>timedelta(x)<span style="color:#f92672">*</span><span style="color:#ae81ff">365</span> <span style="color:#66d9ef">for</span> x <span style="color:#f92672">in</span> xs]
x_scale <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>linspace(<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">500</span>) <span style="color:#75715e"># for plotting</span>
t_scale <span style="color:#f92672">=</span> [start_date <span style="color:#f92672">+</span> datetime<span style="color:#f92672">.</span>timedelta(x)<span style="color:#f92672">*</span><span style="color:#ae81ff">365</span> <span style="color:#66d9ef">for</span> x <span style="color:#f92672">in</span> x_scale]
<span style="color:#66d9ef">return</span> xs, ys, ts, x_scale, t_scale
xs, ys, ts, x_scale, t_scale <span style="color:#f92672">=</span> generate_time_series()
</code></pre></div><p>Before we get started with anything, we need graphics. Let's plot to see what's going on!</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
pyplot<span style="color:#f92672">.</span>xlabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Date</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight of elephant (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_4_1.png" alt="png"></p>
<p>First of all, let's not fit any fancy models. We're just going to break it up into a few buckets and compute the mean within each bucket. But let's first pause and talk about uncertainty.</p>
<h1 id="distribution-of-the-data-vs-uncertainty">Distribution of the data vs uncertainty</h1>
<p>This sometimes makes me feel dumb, but I keep confusing what “uncertainty” means, and I think it's important to be hyper clear, because we're going to do all of them. There are multiple different things we can estimate the distribution of:</p>
<ol>
<li>The data itself. Given a certain time frame $$(t, t’)$$, what's the distribution of the elephant's weight during that time interval?</li>
<li>The uncertainty of some parameter, like the parameter $$k$$ in a linear relationship $$y = kt + m$$. <em>Or</em> the uncertainty of some estimator, like the <em>mean</em> of a number of observations.</li>
<li>The uncertainty of predicted quantities. So if we predict that for a date $$t$$ (possibly in the future), the elephant is going to weigh $$y$$ kg, we want to know the uncertainty of the quantity $$y$$.</li>
</ol>
<p>Hope that makes sense! Let's start with the most basic model – just break things up in buckets. I recommend the excellent <a href="https://seaborn.pydata.org/">Seaborn</a> package if we just want to get some basic idea of distributions and uncertainty estimates. Seaborn usually operates on dataframes so we'll need to convert:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">d <span style="color:#f92672">=</span> pandas<span style="color:#f92672">.</span>DataFrame({<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">x</span><span style="color:#e6db74">'</span>: xs, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">t</span><span style="color:#e6db74">'</span>: ts, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight (kg)</span><span style="color:#e6db74">'</span>: ys})
d[<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Month</span><span style="color:#e6db74">'</span>] <span style="color:#f92672">=</span> d[<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">t</span><span style="color:#e6db74">'</span>]<span style="color:#f92672">.</span>apply(<span style="color:#66d9ef">lambda</span> t: t<span style="color:#f92672">.</span>strftime(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">%</span><span style="color:#e6db74">Y-</span><span style="color:#e6db74">%</span><span style="color:#e6db74">m</span><span style="color:#e6db74">'</span>))
seaborn<span style="color:#f92672">.</span>boxplot(data<span style="color:#f92672">=</span>d, x<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Month</span><span style="color:#e6db74">'</span>, y<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_6_1.png" alt="png"></p>
<p>These last charts show the <em>distribution</em> of the dataset. Let's now try to figure out the uncertainty of a very common estimator: the <em>mean</em>!</p>
<h2 id="computing-the-uncertainty-of-the-mean-ndash-normal-distributions">Computing the uncertainty of the mean – normal distributions</h2>
<p>Under some mild assumptions (I'm going to get back to this in a sec and scrutinize it), we can compute the confidence intervals <em>of the mean estimator</em> as:</p>
<p>$$\bar{x} \pm 1.96\sigma / \sqrt{n}$$</p>
<p>Where $$\bar{x}$$ is the mean and $$\sigma$$ is the standard deviation, a.k.a. the square root of the variance. I don't think this formula is super important to remember, but I think it's somewhat useful to remember that the <em>size of the confidence interval is inversely related to the square root of the number of samples.</em> For instance, this is useful when you're running an A/B test – if you want to detect a 1% difference then you need something on the order of $$0.01^{-2} = 10,000$$ samples. (This is a rule of thumb, don't use it for your medical device software).</p>
<p>By the way – where does the 1.96 number come from? It's the 97.5th percentile of the standard normal distribution: an interval of $$\pm 1.96$$ standard deviations covers about 95% of the probability distribution.</p>
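<p>You don't even have to memorize it – scipy will happily confirm:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import scipy.stats

scipy.stats.norm.ppf([0.025, 0.975])
# array([-1.95996398,  1.95996398])
</code></pre></div>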
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">plot_confidence_interval</span>(observations_by_group):
groups <span style="color:#f92672">=</span> list(sorted(observations_by_group<span style="color:#f92672">.</span>keys()))
lo_bound <span style="color:#f92672">=</span> []
hi_bound <span style="color:#f92672">=</span> []
<span style="color:#66d9ef">for</span> group <span style="color:#f92672">in</span> groups:
series <span style="color:#f92672">=</span> observations_by_group[group]
mu, std, n <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>mean(series), numpy<span style="color:#f92672">.</span>std(series), len(series)
lo_bound<span style="color:#f92672">.</span>append(mu <span style="color:#f92672">-</span> <span style="color:#ae81ff">1.96</span><span style="color:#f92672">*</span>std<span style="color:#f92672">*</span>n<span style="color:#f92672">*</span><span style="color:#f92672">*</span><span style="color:#f92672">-</span><span style="color:#ae81ff">0.5</span>)
hi_bound<span style="color:#f92672">.</span>append(mu <span style="color:#f92672">+</span> <span style="color:#ae81ff">1.96</span><span style="color:#f92672">*</span>std<span style="color:#f92672">*</span>n<span style="color:#f92672">*</span><span style="color:#f92672">*</span><span style="color:#f92672">-</span><span style="color:#ae81ff">0.5</span>)
pyplot<span style="color:#f92672">.</span>fill_between(groups, lo_bound, hi_bound, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.2</span>,
label<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Confidence interval (normal)</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
observations_by_month <span style="color:#f92672">=</span> {}
<span style="color:#66d9ef">for</span> month, y <span style="color:#f92672">in</span> zip(d[<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Month</span><span style="color:#e6db74">'</span>], d[<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight (kg)</span><span style="color:#e6db74">'</span>]):
observations_by_month<span style="color:#f92672">.</span>setdefault(month, [])<span style="color:#f92672">.</span>append(y)
plot_confidence_interval(observations_by_month)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight of elephant (kg)</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>legend()
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_9_1.png" alt="png"></p>
<p>Note that this shows the uncertainty of the <em>mean</em> and that this is not the same thing as the <em>distribution of the data</em> itself. That's why you see far fewer than 95% of the points within the red shaded area. If we added more and more points, the red shaded area would get more and more narrow, whereas the blue dots would still have about the same range. However the true mean should in theory be within the red shaded area 95% of the time.</p>
<p>I mentioned earlier that the formula for confidence interval only applies under some mild assumptions. What are those? It's the assumption of <em>normality</em>. For a large number of observations, this is nothing to worry about, and this is due to the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">central limit theorem</a>.</p>
<h1 id="confidence-intervals-when-all-outcomes-are-0-or-1">Confidence intervals when all outcomes are 0 or 1</h1>
<p>Let's look at a type of dataset that I often work on: conversions. For the sake of the argument, let's say we're running an A/B test that has some impact, and that we're trying to understand the impact on conversion rate <em>by state</em>. Conversion is always 0 or 1. The code to generate this data set isn't super important, so don't pay too much attention to this:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">STATES <span style="color:#f92672">=</span> [<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">CA</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">NY</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">FL</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">TX</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">PA</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">IL</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">OH</span><span style="color:#e6db74">'</span>]
GROUPS <span style="color:#f92672">=</span> [<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">test</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">control</span><span style="color:#e6db74">'</span>]
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">generate_binary_categorical</span>(states<span style="color:#f92672">=</span>STATES, groups<span style="color:#f92672">=</span>GROUPS, k<span style="color:#f92672">=</span><span style="color:#ae81ff">400</span>,
zs<span style="color:#f92672">=</span>[<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0.2</span>], z_std<span style="color:#f92672">=</span><span style="color:#ae81ff">0.1</span>, b<span style="color:#f92672">=</span><span style="color:#f92672">-</span><span style="color:#ae81ff">3</span>, b_std<span style="color:#f92672">=</span><span style="color:#ae81ff">1</span>):
<span style="color:#75715e"># Don't pay too much attention to this code. The main thing happens in</span>
<span style="color:#75715e"># numpy.random.binomial, which is where we draw the "k out of n" outcomes.</span>
output <span style="color:#f92672">=</span> {}
e_obs_per_state <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>exponential(k, size<span style="color:#f92672">=</span>len(states))
state_biases <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>normal(b, b_std, size<span style="color:#f92672">=</span>len(states))
<span style="color:#66d9ef">for</span> group, z <span style="color:#f92672">in</span> zip(groups, zs):
noise <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>normal(z, z_std, size<span style="color:#f92672">=</span>len(states))
ps <span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span> <span style="color:#f92672">/</span> (<span style="color:#ae81ff">1</span> <span style="color:#f92672">+</span> numpy<span style="color:#f92672">.</span>exp(<span style="color:#f92672">-</span>(state_biases <span style="color:#f92672">+</span> noise)))
ns <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>poisson(e_obs_per_state)
ks <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>binomial(ns, ps)
output[group] <span style="color:#f92672">=</span> (ns, ks)
<span style="color:#66d9ef">return</span> output
</code></pre></div><p>For each state and each “group” (test and control) we generated $$ n $$ users, out of which $$ k $$ have converted. Let's plot the conversion rate per state to see what's going on!</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">data <span style="color:#f92672">=</span> generate_binary_categorical()
<span style="color:#66d9ef">for</span> group, (ns, ks) <span style="color:#f92672">in</span> data<span style="color:#f92672">.</span>items():
pyplot<span style="color:#f92672">.</span>scatter(STATES, ks<span style="color:#f92672">/</span>ns, label<span style="color:#f92672">=</span>group, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">400</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Conversion rate</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>legend()
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_14_1.png" alt="png"></p>
<p>How do we compute confidence intervals for these numbers? We could of course use the method I just covered further up where we compute the mean and the standard deviations. But in this case there's a trick we can use!</p>
<p>Since all outcomes are 0 or 1, and drawn with the same (unknown) probability, we know that the number of ones and zeros follows a <a href="https://en.wikipedia.org/wiki/Binomial_distribution">binomial distribution</a>. This means that the uncertainty around the underlying rate in a “$$k$$ out of $$n$$” scenario follows a <a href="https://en.wikipedia.org/wiki/Beta_distribution">Beta distribution</a>, which gives us confidence intervals directly.</p>
<p>I've personally benefitted from memorizing the formula for the confidence interval and think I probably use it more than the previous (Normal based) one. In particular all you need to remember is</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">n, k <span style="color:#f92672">=</span> <span style="color:#ae81ff">100</span>, <span style="color:#ae81ff">3</span>
scipy<span style="color:#f92672">.</span>stats<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>ppf([<span style="color:#ae81ff">0.025</span>, <span style="color:#ae81ff">0.975</span>], k, n<span style="color:#f92672">-</span>k)
</code></pre></div><pre><code>array([0.00629335, 0.07107612])
</code></pre>
<p>This will compute a 95% confidence interval if you plug in values for $$n$$ and $$k$$. In this case we see that if we have 100 website visitors and 3 of them purchased the product, then the range is 0.6%–7.1%. Let's try this for our dataset:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#66d9ef">for</span> group, (ns, ks) <span style="color:#f92672">in</span> data<span style="color:#f92672">.</span>items():
lo <span style="color:#f92672">=</span> scipy<span style="color:#f92672">.</span>stats<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>ppf(<span style="color:#ae81ff">0.025</span>, ks, ns<span style="color:#f92672">-</span>ks)
hi <span style="color:#f92672">=</span> scipy<span style="color:#f92672">.</span>stats<span style="color:#f92672">.</span>beta<span style="color:#f92672">.</span>ppf(<span style="color:#ae81ff">0.975</span>, ks, ns<span style="color:#f92672">-</span>ks)
mean <span style="color:#f92672">=</span> ks<span style="color:#f92672">/</span>ns
pyplot<span style="color:#f92672">.</span>errorbar(STATES, y<span style="color:#f92672">=</span>mean, yerr<span style="color:#f92672">=</span>[mean<span style="color:#f92672">-</span>lo, hi<span style="color:#f92672">-</span>mean],
label<span style="color:#f92672">=</span>group, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.7</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>, elinewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">50</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Conversion rate</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>legend()
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_18_1.png" alt="png"></p>
<p>Nice! 👌</p>
<h2 id="bootstrapping">Bootstrapping</h2>
<p>Another approach that can be useful is bootstrapping. It allows you to compute the same statistics without memorizing any formulas. The idea is to compute the mean, but do it for $$n$$ <em>bootstraps</em>, where each bootstrap is a random sample (with replacement) from our observations. For every bootstrap we compute a mean, and then we take the 2.5th and 97.5th percentiles of those means as the confidence interval:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">lo_bound <span style="color:#f92672">=</span> []
hi_bound <span style="color:#f92672">=</span> []
months <span style="color:#f92672">=</span> sorted(observations_by_month<span style="color:#f92672">.</span>keys())
<span style="color:#66d9ef">for</span> month <span style="color:#f92672">in</span> months:
series <span style="color:#f92672">=</span> observations_by_month[month]
bootstrapped_means <span style="color:#f92672">=</span> []
<span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1000</span>):
<span style="color:#75715e"># sample with replacement</span>
bootstrap <span style="color:#f92672">=</span> [random<span style="color:#f92672">.</span>choice(series) <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> series]
bootstrapped_means<span style="color:#f92672">.</span>append(numpy<span style="color:#f92672">.</span>mean(bootstrap))
lo_bound<span style="color:#f92672">.</span>append(numpy<span style="color:#f92672">.</span>percentile(bootstrapped_means, <span style="color:#ae81ff">2.5</span>))
hi_bound<span style="color:#f92672">.</span>append(numpy<span style="color:#f92672">.</span>percentile(bootstrapped_means, <span style="color:#ae81ff">97.5</span>))
pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
pyplot<span style="color:#f92672">.</span>fill_between(months, lo_bound, hi_bound, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.2</span>,
label<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Confidence interval</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight of elephant (kg)</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>legend()
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_20_1.png" alt="png"></p>
<p>Miraculously, this chart looks very similar to the one before! Just kidding – that was sort of expected :)</p>
<p>Bootstrapping is nice because it lets you dodge any questions about what probability distribution the data is generated from. It's basically plug and play, and works on almost everything, though it can be a bit slow.</p>
<p>☠ Be aware though that there's a <em>danger zone</em> of bootstrapping. My understanding is that bootstrapping will <em>converge</em> towards the correct estimates as the number of samples goes to infinity, but if you're working with small samples, you can get really wonky results. I generally never trust bootstrapping for anything less than say 50 samples, and you probably shouldn't do that either.</p>
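<p>If you want to see the wonkiness for yourself, here's a quick sketch: compute a bootstrapped confidence interval of the mean on a tiny sample and on a bigger one, and rerun it a few times.</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import random
import numpy

def bootstrap_ci(series, n_bootstraps=1000):
    means = [numpy.mean([random.choice(series) for _ in series])
             for _ in range(n_bootstraps)]
    return numpy.percentile(means, [2.5, 97.5])

print(bootstrap_ci([random.gauss(0, 1) for _ in range(5)]))    # jumps around between reruns
print(bootstrap_ci([random.gauss(0, 1) for _ in range(500)]))  # stable across reruns
</code></pre></div>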
<p>As a side note, Seaborn's <code>barplot</code> actually plots confidence intervals using bootstrapping:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">seaborn<span style="color:#f92672">.</span>barplot(data<span style="color:#f92672">=</span>d, x<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Month</span><span style="color:#e6db74">'</span>, y<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_23_2.png" alt="png"></p>
<p>Again, Seaborn is great for exploratory analysis, and some of its charts can do basic statistics.</p>
<h1 id="regression">Regression</h1>
<p>Let's crank it up a notch. We're going to fit a straight line to this cloud of points.</p>
<p>There's some smart ways you can implement linear regression so that it's extremely fast, but we're not going to use those methods because they don't generalize very well.</p>
<p>I'm going to do it in what I think of as the most general possible way. We're going to define a <em>model</em> (in this case a straight line), a <em>loss function</em> (squared deviations from this straight line) and then optimize it using a general-purpose solver (scipy.optimize.minimize).</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">xs, ys, ts, x_scale, t_scale <span style="color:#f92672">=</span> generate_time_series()
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">model</span>(xs, k, m):
<span style="color:#66d9ef">return</span> k <span style="color:#f92672">*</span> xs <span style="color:#f92672">+</span> m
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">l2_loss</span>(tup, xs, ys):
k, m <span style="color:#f92672">=</span> tup
delta <span style="color:#f92672">=</span> model(xs, k, m) <span style="color:#f92672">-</span> ys
<span style="color:#66d9ef">return</span> numpy<span style="color:#f92672">.</span>dot(delta, delta)
k_hat, m_hat <span style="color:#f92672">=</span> scipy<span style="color:#f92672">.</span>optimize<span style="color:#f92672">.</span>minimize(l2_loss, (<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>), args<span style="color:#f92672">=</span>(xs, ys))<span style="color:#f92672">.</span>x
pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
pyplot<span style="color:#f92672">.</span>plot(t_scale, model(x_scale, k_hat, m_hat), color<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">red</span><span style="color:#e6db74">'</span>,
linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight of elephant (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_26_1.png" alt="png"></p>
<h2 id="linear-regression-with-uncertainty-using-maximum-likelihood">Linear regression with uncertainty, using maximum likelihood</h2>
<p>We just fit $$k$$ and $$m$$ but there's no uncertainty estimates here. There's several things we could estimate uncertainties for, but let's start with the <em>uncertainty of the predicted values</em>.</p>
<p>We can do that by fitting a normal distribution around the line <em>at the same time</em> as we're fitting $$k$$ and $$m$$. I'm going to do this using <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">Maximum Likelihood</a>. If you're not familiar with this method, don't be scared! If there's <em>any</em> method in statistics that I've found as easily accessible (it's basic probability theory) and useful, it's this method.</p>
<p>In fact, minimizing squared loss (which we just did, in the previous snippet) is actually a special case of maximum likelihood! Minimizing the squared loss is the same thing as <em>maximizing the logarithm of the probability of all the data.</em> This is usually called the “log likelihood”.</p>
<p>So we already have an expression to minimize the squared loss. If we make the variance an unknown variable $$\sigma^2$$ we can fit that at the same time! The quantity we're going to try to minimize now turns into</p>
<p>$$ \frac{n}{2}\log{2\pi\sigma^2} + \frac{1}{2\sigma^2}\sum \left(y_i - \hat{y_i}\right)^2 $$</p>
<p>Where $$\hat{y_i} = kx_i + m$$ are the predicted values by our model. Let's try to fit that!</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">import</span> scipy.optimize
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">neg_log_likelihood</span>(tup, xs, ys):
<span style="color:#75715e"># Since sigma > 0, we use use log(sigma) as the parameter instead.</span>
<span style="color:#75715e"># That way we have an unconstrained problem.</span>
k, m, log_sigma <span style="color:#f92672">=</span> tup
sigma <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>exp(log_sigma)
delta <span style="color:#f92672">=</span> model(xs, k, m) <span style="color:#f92672">-</span> ys
<span style="color:#66d9ef">return</span> len(xs)<span style="color:#f92672">/</span><span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>numpy<span style="color:#f92672">.</span>log(<span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>numpy<span style="color:#f92672">.</span>pi<span style="color:#f92672">*</span>sigma<span style="color:#f92672">*</span><span style="color:#f92672">*</span><span style="color:#ae81ff">2</span>) <span style="color:#f92672">+</span> \
numpy<span style="color:#f92672">.</span>dot(delta, delta) <span style="color:#f92672">/</span> (<span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>sigma<span style="color:#f92672">*</span><span style="color:#f92672">*</span><span style="color:#ae81ff">2</span>)
k_hat, m_hat, log_sigma_hat <span style="color:#f92672">=</span> scipy<span style="color:#f92672">.</span>optimize<span style="color:#f92672">.</span>minimize(
neg_log_likelihood, (<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>), args<span style="color:#f92672">=</span>(xs, ys)
)<span style="color:#f92672">.</span>x
sigma_hat <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>exp(log_sigma_hat)
pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
pyplot<span style="color:#f92672">.</span>plot(t_scale, model(x_scale, k_hat, m_hat),
color<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">green</span><span style="color:#e6db74">'</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">5</span>)
pyplot<span style="color:#f92672">.</span>fill_between(
t_scale,
model(x_scale, k_hat, m_hat) <span style="color:#f92672">-</span> <span style="color:#ae81ff">1.96</span><span style="color:#f92672">*</span>sigma_hat,
model(x_scale, k_hat, m_hat) <span style="color:#f92672">+</span> <span style="color:#ae81ff">1.96</span><span style="color:#f92672">*</span>sigma_hat,
color<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">red</span><span style="color:#e6db74">'</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.3</span>)
pyplot<span style="color:#f92672">.</span>legend()
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight of elephant (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_28_2.png" alt="png"></p>
<p>The uncertainty estimate here isn't actually 💯 because it doesn't take into account the uncertainty of $$k$$, $$m$$, and $$\sigma$$ themselves. It's a decent approximation, but to get it right we need to estimate those things at the same time. So let's do it.</p>
<h1 id="bootstrapping-rebooted">Bootstrapping, rebooted</h1>
<p>So let's take it to the next level and try to estimate the uncertainty estimate of $$k$$ and $$m$$ <em>and</em> $$\sigma$$! I think this will show how bootstrapping is basically cookie cutter — you can plug it into almost anything in order to estimate uncertainties.</p>
<p>For every bootstrap estimate, I'm going to draw a line. We can also take all those lines and compute a confidence interval:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
xys <span style="color:#f92672">=</span> list(zip(xs, ys))
curves <span style="color:#f92672">=</span> []
<span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">100</span>):
<span style="color:#75715e"># sample with replacement</span>
bootstrap <span style="color:#f92672">=</span> [random<span style="color:#f92672">.</span>choice(xys) <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> xys]
xs_bootstrap <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>array([x <span style="color:#66d9ef">for</span> x, y <span style="color:#f92672">in</span> bootstrap])
ys_bootstrap <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>array([y <span style="color:#66d9ef">for</span> x, y <span style="color:#f92672">in</span> bootstrap])
k_hat, m_hat <span style="color:#f92672">=</span> scipy<span style="color:#f92672">.</span>optimize<span style="color:#f92672">.</span>minimize(
l2_loss, (<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>), args<span style="color:#f92672">=</span>(xs_bootstrap, ys_bootstrap)
)<span style="color:#f92672">.</span>x
curves<span style="color:#f92672">.</span>append(model(x_scale, k_hat, m_hat))
<span style="color:#75715e"># Plot individual lines</span>
<span style="color:#66d9ef">for</span> curve <span style="color:#f92672">in</span> curves:
pyplot<span style="color:#f92672">.</span>plot(t_scale, curve, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.1</span>, linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>, color<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">green</span><span style="color:#e6db74">'</span>)
<span style="color:#75715e"># Plot 95% confidence interval</span>
lo, hi <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>percentile(curves, (<span style="color:#ae81ff">2.5</span>, <span style="color:#ae81ff">97.5</span>), axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
pyplot<span style="color:#f92672">.</span>fill_between(t_scale, lo, hi, color<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">red</span><span style="color:#e6db74">'</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight of elephant (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_31_1.png" alt="png"></p>
<p>Whoa, what's going on here? This uncertainty is <em>very</em> different from the earlier plot. This seems confusing until you realize that they show two very different things:</p>
<ul>
<li>The first plot finds <em>one</em> solution of $$k$$ and $$m$$ and shows the uncertainty of the <em>predictions.</em> So, if you're asked what's the range of the elephant's weight in the next month, you can get it from the chart.</li>
<li>The second plot finds <em>many</em> solutions of $$k$$ and $$m$$, and shows the uncertainty of $$kx + m$$. So this answers a different question – what's the trend of the elephant's weight over time and what's the uncertainty <em>of the trend</em>.</li>
</ul>
<p>It turns out we can combine the two approaches and make it even more complicated by drawing bootstrap samples and fitting $$k$$, $$m$$, and $$\sigma$$ at the same time. Then for each of those estimates, we can predict new values $$y$$. Let's do it 😎.</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
xys <span style="color:#f92672">=</span> list(zip(xs, ys))
curves <span style="color:#f92672">=</span> []
<span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">4000</span>):
<span style="color:#75715e"># sample with replacement</span>
bootstrap <span style="color:#f92672">=</span> [random<span style="color:#f92672">.</span>choice(xys) <span style="color:#66d9ef">for</span> _ <span style="color:#f92672">in</span> xys]
xs_bootstrap <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>array([x <span style="color:#66d9ef">for</span> x, y <span style="color:#f92672">in</span> bootstrap])
ys_bootstrap <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>array([y <span style="color:#66d9ef">for</span> x, y <span style="color:#f92672">in</span> bootstrap])
k_hat, m_hat, log_sigma_hat <span style="color:#f92672">=</span> scipy<span style="color:#f92672">.</span>optimize<span style="color:#f92672">.</span>minimize(
neg_log_likelihood, (<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">0</span>), args<span style="color:#f92672">=</span>(xs_bootstrap, ys_bootstrap)
)<span style="color:#f92672">.</span>x
curves<span style="color:#f92672">.</span>append(
model(x_scale, k_hat, m_hat) <span style="color:#f92672">+</span>
<span style="color:#75715e"># Note what's going on here: we're _adding_ the random term</span>
<span style="color:#75715e"># to the predictions!</span>
numpy<span style="color:#f92672">.</span>exp(log_sigma_hat) <span style="color:#f92672">*</span> numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>normal(size<span style="color:#f92672">=</span>x_scale<span style="color:#f92672">.</span>shape)
)
<span style="color:#75715e"># Plot 95% confidence interval</span>
lo, hi <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>percentile(curves, (<span style="color:#ae81ff">2.5</span>, <span style="color:#ae81ff">97.5</span>), axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
pyplot<span style="color:#f92672">.</span>fill_between(t_scale, lo, hi, color<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">red</span><span style="color:#e6db74">'</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight of elephant (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_33_2.png" alt="png"></p>
<p>Nice! It's getting serious now 😅 – you can see a hyperbolic shape if you look closely!</p>
<p>The trick here is that for every bootstrap estimate of $$(k, m, \sigma)$$ we also need to draw random predictions. As you can see in the code, we're actually adding random normal variables to the predicted values of $$y$$. That's also why the shape ends up a bit squiggly.</p>
<p>Unfortunately bootstrapping turns out to be fairly slow for this problem – for each bootstrap sample, we need to refit the model from scratch. Let's look at another option:</p>
<h1 id="markov-chain-monte-carlo-methods">Markov chain Monte Carlo methods</h1>
<p>It's going to get a bit wilder now. I'm going to switch to some Bayesian methods, where we estimate $$k$$, $$m$$, and $$\sigma$$ by drawing samples. It's similar to bootstrapping, but MCMC has far better theoretical underpinnings (we are sampling from a “posterior distribution” using Bayes' rule), and it's often orders of magnitude faster.</p>
<p>For this, we're going to use a library called <a href="http://emcee.readthedocs.io/">emcee</a> which I've found pretty easy to use. All it needs is a log-likelihood function – which, as it turns out, we basically defined earlier: we just need to take the <em>negative</em> of the negative log-likelihood we already have.</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">import</span> emcee
xs, ys, ts, x_scale, t_scale <span style="color:#f92672">=</span> generate_time_series()
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">log_likelihood</span>(tup, xs, ys):
<span style="color:#66d9ef">return</span> <span style="color:#f92672">-</span>neg_log_likelihood(tup, xs, ys)
ndim, nwalkers <span style="color:#f92672">=</span> <span style="color:#ae81ff">3</span>, <span style="color:#ae81ff">10</span>
p0 <span style="color:#f92672">=</span> [numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>rand(ndim) <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(nwalkers)]
sampler <span style="color:#f92672">=</span> emcee<span style="color:#f92672">.</span>EnsembleSampler(nwalkers, ndim, log_likelihood,
args<span style="color:#f92672">=</span>[xs, ys])
sampler<span style="color:#f92672">.</span>run_mcmc(p0, <span style="color:#ae81ff">10000</span>)
</code></pre></div><p>Let's plot the sampled values for $$k$$ and $$m$$!</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#75715e"># Grab the last 10 from each walker</span>
samples <span style="color:#f92672">=</span> sampler<span style="color:#f92672">.</span>chain[:, <span style="color:#f92672">-</span><span style="color:#ae81ff">10</span>:, :]<span style="color:#f92672">.</span>reshape((<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, ndim))
pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
<span style="color:#66d9ef">for</span> k, m, log_sigma <span style="color:#f92672">in</span> samples:
pyplot<span style="color:#f92672">.</span>plot(t_scale, model(x_scale, k, m), alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.1</span>,
linewidth<span style="color:#f92672">=</span><span style="color:#ae81ff">3</span>, color<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">green</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weigh of elephant (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_39_1.png" alt="png"></p>
<p>There's more stuff that goes into these methods – the sampling is finicky and requires some handholding to work well. I don't want to get into all the specifics, and I'm a layman myself. But MCMC can often be several orders of magnitude faster than bootstrapping, and it also handles situations with less data much better.</p>
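<p>To give a flavor of what that handholding looks like, here's a small sketch of two sanity checks on the sampler above – I believe emcee exposes the walkers' acceptance fractions, and the usual advice is to discard an initial “burn-in” chunk of each chain. The numbers here are rules of thumb I made up, not gospel:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># Rough convergence sanity checks for the sampler above -- a sketch,
# with rule-of-thumb numbers rather than anything rigorous.
import numpy

# Fraction of proposed steps that were accepted, averaged over walkers.
# Values very close to 0 or 1 usually mean the sampler is struggling.
acceptance = numpy.mean(sampler.acceptance_fraction)
print('mean acceptance fraction: %.2f' % acceptance)

# Discard the first chunk of every chain as burn-in, so we only keep
# samples from after the walkers have (hopefully) settled down.
burn_in = 2000
kept = sampler.chain[:, burn_in:, :].reshape((-1, ndim))
</code></pre></div>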
<p>We end up with samples from the posterior distribution of $$k, m, \sigma$$. We can look at the probability distribution of these unknowns:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#75715e"># Grab slightly more samples this time</span>
samples <span style="color:#f92672">=</span> sampler<span style="color:#f92672">.</span>chain[:, <span style="color:#f92672">-</span><span style="color:#ae81ff">500</span>:, :]<span style="color:#f92672">.</span>reshape((<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, ndim))
k_samples, m_samples, log_sigma_samples <span style="color:#f92672">=</span> samples<span style="color:#f92672">.</span>T
seaborn<span style="color:#f92672">.</span>distplot(k_samples, label<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">k</span><span style="color:#e6db74">'</span>)
seaborn<span style="color:#f92672">.</span>distplot(m_samples, label<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">m</span><span style="color:#e6db74">'</span>)
seaborn<span style="color:#f92672">.</span>distplot(numpy<span style="color:#f92672">.</span>exp(log_sigma_samples), label<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">sigma</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>legend()
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_41_2.png" alt="png"></p>
<p>You can see that these distributions sort of center around $$k=200$$, $$m=1000$$, and $$\sigma=100$$, which is how we constructed the data in the first place. That's somewhat reassuring!</p>
<p>Finally, we can plot the full uncertainty of predictions using the same methodology as for the bootstraps:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">pyplot<span style="color:#f92672">.</span>scatter(ts, ys, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>, s<span style="color:#f92672">=</span><span style="color:#ae81ff">100</span>)
samples <span style="color:#f92672">=</span> sampler<span style="color:#f92672">.</span>chain[:, <span style="color:#f92672">-</span><span style="color:#ae81ff">4000</span>:, :]<span style="color:#f92672">.</span>reshape((<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>, ndim))
curves <span style="color:#f92672">=</span> []
<span style="color:#66d9ef">for</span> k, m, log_sigma <span style="color:#f92672">in</span> samples:
curves<span style="color:#f92672">.</span>append(
model(x_scale, k, m) <span style="color:#f92672">+</span>
numpy<span style="color:#f92672">.</span>exp(log_sigma) <span style="color:#f92672">*</span> numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>normal(size<span style="color:#f92672">=</span>x_scale<span style="color:#f92672">.</span>shape)
)
<span style="color:#75715e"># Plot 95% confidence interval</span>
lo, hi <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>percentile(curves, (<span style="color:#ae81ff">2.5</span>, <span style="color:#ae81ff">97.5</span>), axis<span style="color:#f92672">=</span><span style="color:#ae81ff">0</span>)
pyplot<span style="color:#f92672">.</span>fill_between(t_scale, lo, hi, color<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">red</span><span style="color:#e6db74">'</span>, alpha<span style="color:#f92672">=</span><span style="color:#ae81ff">0.5</span>)
pyplot<span style="color:#f92672">.</span>ylabel(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Weight of elephant (kg)</span><span style="color:#e6db74">'</span>)
</code></pre></div><p><img src="https://erikbern.com/assets/uncertainty/output_43_1.png" alt="png"></p>
<p>These Bayesian methods don't end here. In particular, there are several libraries built for these kinds of problems. It turns out that if you express the problem in a more structured way (not just as a negative log-likelihood function), you can make the sampling scale to large problems (as in, thousands of unknown parameters). For Python there's <a href="https://docs.pymc.io/">PyMC3</a> and <a href="http://pystan.readthedocs.io/en/latest/">PyStan</a>, as well as the slightly more experimental (?) <a href="http://edwardlib.org/">Edward</a> and <a href="http://pyro.ai/">Pyro</a>.</p>
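<p>To give a sense of what that more structured style looks like, here's roughly how I'd expect the same model to read in PyMC3. Treat it as an untested sketch rather than a drop-in replacement for the code above – the priors are arbitrary weak ones I made up:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># Hypothetical PyMC3 sketch of the same linear model. The priors are
# arbitrary weak ones; xs and ys are the data generated earlier.
import pymc3 as pm

with pm.Model():
    # The unknowns, with deliberately vague priors
    k = pm.Normal('k', mu=0, sd=1000)
    m = pm.Normal('m', mu=0, sd=1000)
    sigma = pm.HalfNormal('sigma', sd=1000)

    # The likelihood: y is normally distributed around k*x + m
    pm.Normal('y', mu=k * xs + m, sd=sigma, observed=ys)

    # Draw posterior samples, much like the emcee chains above
    trace = pm.sample(2000)
</code></pre></div>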
<h1 id="wrapping-up">Wrapping up</h1>
<p>I've taken you a bit down the rabbit hole – but it goes much further. In fact, forcing myself to estimate uncertainties for anything I do has been a great forcing function for learning a ton of statistics that I'd been deferring for too long, and I recommend it to anyone who is interested.</p>
<p>Making decisions based on data is hard! But if we were a bit more disciplined about quantifying the uncertainty, we might make better decisions. It's not easy to do this right now, but I <em>really</em> hope we'll see a popularization of these methods through more accessible tools.</p>
<p>Thanks to <a href="https://twitter.com/jim_savage_">Jim Savage</a> for some feedback on an earlier draft! All the code is available as a <a href="https://github.com/erikbern/uncertainty">notebook on Github</a>.</p>
<h1 id="finally">Finally</h1>
<p>I'm looking for data engineers to join my team at <a href="https://better.com">Better</a>! Feel free to reach out at erik at better dot com, or also DM on Twitter, or <a href="https://boards.greenhouse.io/better/jobs/960417">apply directly</a> if you want to. You can read a bit more about the role in a <a href="https://erikbern.com/2018/01/28/im-looking-for-data-engineers.html">previous blog post</a>.</p>
I don't want to learn your garbage query language2018-08-30T00:00:00Zhttps://erikbern.com/2018/08/30/i-dont-want-to-learn-your-garbage-query-language.html<p>This is a bit of a rant but I really don't like software that invents its own query language. There's a trillion different ORMs out there. Another trillion databases with their own query language. Another trillion SaaS products where the only way to query is to learn some random query DSL they made up.</p>
<p>I just want my SQL back. It's a language <em>everyone</em> understands, it's been around since the seventies, and it's reasonably standardized. It's easy to read, and can be used by anyone, from business people to engineers.</p>
<p>Instead, I have to learn a bunch of garbage query languages because everyone keeps trying to reinvent the wheel.</p>
<p>Take ORMs. Their alleged benefit is that they cut down development time. But instead of writing SQL, which <em>everyone knows,</em> I now have to scroll back and forth in some ORM documentation to figure out how to write my queries. On top of that, I have to spend time debugging why the ORM translated my query into some monstrosity that joins 17 tables using a full table scan. Instead of sticking to SQL, where it's reasonably easy to reason about performance (try to stick to <code>where</code> clauses on indexed columns, don't go bananas with <code>join</code>s, et cetera), I have to deal with an opaque translation layer that obscures the exact query. And I end up with bloated higher-level data classes rather than easy-to-understand tuples or dicts that contain the data in a dumb simple format that is trivial to introspect.</p>
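<p>To make the contrast concrete, here's the kind of “dumb simple” interaction I'm talking about – a sketch using psycopg2, though any DB-API driver looks about the same. The connection string, table, and column names are made up:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># A sketch of plain embedded SQL via the Python DB-API (psycopg2 here).
# The connection string, table, and columns are made up for illustration.
import psycopg2

conn = psycopg2.connect('dbname=mydb')
cur = conn.cursor()

# The query is right there in plain SQL -- no translation layer to
# second-guess -- and parameters are passed separately so the driver
# escapes them properly.
cur.execute(
    'SELECT id, email FROM users WHERE created_at > %s',
    ('2018-01-01',),
)
for user_id, email in cur.fetchall():  # plain tuples, trivial to introspect
    print(user_id, email)
</code></pre></div>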
<p>Not to mention there's like five thousand ORMs out there, so instead of learning SQL <em>once</em>, I have to learn 34 different ORMs. It's not like people learn an ORM <em>instead</em> of learning SQL anyway.</p>
<p>And all these SaaS products. Just to pick some tools from my company's stack:</p>
<ul>
<li>Splunk has <a href="https://www.splunk.com/en_us/resources/search-processing-language.html">SPL</a></li>
<li>Mixpanel has <a href="https://mixpanel.com/help/reference/jql">JQL</a></li>
<li>Rollbar has <a href="https://docs.rollbar.com/docs/rql">RQL</a></li>
<li>New Relic has <a href="https://docs.newrelic.com/docs/insights/nrql-new-relic-query-language/nrql-resources/nrql-syntax-components-functions">NRQL</a></li>
<li>Adwords has <a href="https://developers.google.com/adwords/api/docs/guides/awql">AWQL</a></li>
</ul>
<p>What's worse than data silos? Data silos that invent their own query language.</p>
<p>To be fair, some of these are SQL <em>flavors,</em> or at least pretend to be, but all with their own quirks that force me to unlearn everything I knew about SQL, to the point that it might as well be something completely different.</p>
<p>Then on top of that, every database seems to reinvent query languages. Mongo has its own <a href="https://docs.mongodb.com/manual/reference/sql-comparison/">terrible query language</a> that I never understood. Lucene has <a href="https://lucene.apache.org/core/2_9_4/queryparsersyntax.html">its own query language</a>. Etc.</p>
<p>What am I asking for? Not a whole lot. Just that:</p>
<ol>
<li>Every SaaS product should offer a plug-and-play thing so that I can copy all the data back into my own SQL-based database (in my case, Postgres/Redshift). I don't want to use their custom-made DSL. Maybe the European Union can mandate this as the next step after their <a href="https://en.wikipedia.org/wiki/Payment_Services_Directive">PSD2 open banking directive</a>.</li>
<li>There should be a 30 year moratorium on inventing new query languages.</li>
<li>Let's dispel the myth that ORMs make code cleaner. Join the embedded-SQL movement and discover a much more readable, much more straightforward way to query databases.</li>
</ol>
<p><img src="https://erikbern.com/assets/dsls.png" alt="dsls"></p>
<p>That's it. I realize I sound like an old crank, but that's a risk I'll take.</p>
<h2 id="addendum">Addendum</h2>
<p>This post got a fair amount of traffic so it must have resonated with a bunch of people. See the <a href="https://news.ycombinator.com/item?id=17890760">Hacker News discussion</a> and the <a href="https://www.reddit.com/r/programming/comments/9bxwba/i_dont_want_to_learn_your_garbage_query_language/">Reddit r/programming</a> comments.</p>
<p>Update(2021): this made it to the Hacker News front page for a second time and generated <a href="https://news.ycombinator.com/item?id=26410047">some more discussion</a>.</p>
Business secrets from terrible people2018-08-16T00:00:00Zhttps://erikbern.com/2018/08/16/business-secrets-from-terrible-people.html<p>I get bored reading management books very easily, and lately I've been reading about a wide range of almost arbitrary topics. One of the lenses I tend to read through is seeing how management styles differ across environments.</p>
<p>It turns out that some truly f—ng horrific people have <em>some</em> smart management ideas. This is maybe not surprising. If you have some twisted goals, you can't have incompetent leadership or you won't get anywhere. I mean, if you have a company like Google generating billions of dollars in cash flow every year, you could probably have almost <em>any</em> management style, and things will seem pretty fine. I find it a bit more impressive what it takes to run a crack-dealing gang consisting of 100s of poorly educated gangsters, trying to deal with supply shocks, fighting turf wars with neighboring gangs, and figuring out the best way to keep the police at bay.</p>
<p>So what are some of the things we can learn? Keep reading:</p>
<h2 id="the-taliban">The Taliban</h2>
<p>This is mostly based on the book <a href="https://www.amazon.com/Counterinsurgency-David-Kilcullen/dp/0199737495">Counterinsurgency</a> by David Kilcullen.</p>
<p>So the Taliban were (obviously) bad people, who committed some <a href="https://en.wikipedia.org/wiki/Taliban">truly horrific crimes against humanity</a>. What's surprising reading about them, is to the degree civilians supported the Taliban. They may have been brutal, but they were somewhat just and provided many of the functions that the central government failed to provide:</p>
<p><em>[…] you may think of people having their hands cut off for stealing, women being stoned for adultery, beheadings, and so on. And that does happen. But in fact, the bulk of the work of these courts concerns what in the official system would be considered commercial or civil cases rather than criminal ones. The Taliban courts issue title deeds and resolve land disputes, settle water and grazing disputes, handle inheritance and family law, and issue identity cards and even passports (in the name of the Islamic Emirate of Afghanistan). They deliver a local dispute resolution and mediation service, with a reputation for harsh but fair and swift justice.</em> (from the Counterinsurgency book)</p>
<p>So why did people support them? The impression I get is that, sure, no one <em>likes</em> the Taliban, but at least they provided <em>some</em> kind of stability. Picture living in a mostly lawless society, where occasionally bands of thieves come and steal your harvest. In such a society, people would have a much stronger preference for <em>one</em> band of thieves living off the land than <em>many</em>. It would then be in the self-interest of the thieves to “tax” the population in an optimal way (kind of like the <a href="https://en.wikipedia.org/wiki/Laffer_curve">Laffer curve</a>). Seeing states develop out of these primitive agreements to pilfer a little seems kind of plausible. (I remember reading a paper about the <a href="https://mason.gmu.edu/~atabarro/Lessons%20from%20Gurgaon.pdf">rise of Gurgaon</a> which sort of describes why it's so successful — it's a “monopoly of corruption” compared to other cities where many (uncoordinated) parties try to steal a little, leading to a worse outcome for everyone.)</p>
<p>What can we learn from them? There's a couple of things that stand out to me as smart moves by the Taliban (especially compared to the US military):</p>
<ul>
<li>They identified and targeted the influencers and key decision makers. The key decision makers were (a) the village elders, (b) mullahs, (c) teachers.</li>
<li>They understood what their pain points were and offered help, for instance <a href="https://landinfo.no/asset/3589/1/3589_1.pdf">offering protection against marauding local gangs</a>.</li>
<li>They listened to user feedback, at least ostensibly, by having a system to file formal complaints against the Taliban.</li>
<li>The Taliban seem reasonably pragmatic, despite their hardline image, often relenting to the villagers, and sheltering NGO projects when they were in the best interest of the local population.</li>
</ul>
<h2 id="theranos">Theranos</h2>
<p>This is mostly based on the book <a href="https://www.amazon.com/Bad-Blood-Secrets-Silicon-Startup/dp/152473165X">Bad Blood</a> by John Carreyrou.</p>
<p>Theranos was a fraud company. I'm still not sure if Elizabeth Holmes (CEO) <em>knew</em> what was going on. Was she delusional? Or an intentional fraudster? Or kept in the dark by her boyfriend/COO Sunny Balwani? Either way, what brought them into brief fame and glory was a super charismatic leader who relentlessly pushed a vision down the ranks.</p>
<p>Holmes wanted it to be awesome, so she declared to the world that it was awesome and told her employees that it was awesome. If people dissented, they were done. Hire the worst lawyer scumbags you can ever think of, and ruin the lives of your former employees.</p>
<p>But I think the interesting story here is – how do you push a vision so hard that you end up fooling everyone? Setting a vision and insisting on it (even though you know it's bullshit) can take you <em>very</em> far. Maybe even far enough for reality to catch up with the vision.</p>
<p>Maybe Elizabeth Holmes pushing her vision was just like one of those management techniques where you tell an engineer “I bet you can make that code 10x faster” even though you have no clue if it's possible? Maybe in an alternate universe, the engineers would have finally figured it out? And until then, maybe her “reality distortion field” was an effective way to bootstrap the company? And maybe her super stubborn focus and charisma was the right way to get a ton of brilliant scientists to join, who eventually would have figured it all out?</p>
<p>What ultimately led to the demise was the complete disconnect between the top and the bottom of the org chart. When vision goes in a one-directional flow, with no feedback loop tying it back, you're basically just running completely blind. When you're building a startup, the <em>tighter you can make the feedback loop, the faster you will innovate.</em> Elizabeth Holmes wasn't a part of any feedback loops. She clearly had no interest in what happened at her company.</p>
<p>On top of that, she seemed to have a complete lack of tact/class/style. From the book Bad Blood:</p>
<p><em>The resignations infuriated Elizabeth and Sunny. The following day, they summoned the staff for an all-hands meeting in the cafeteria. Copies of The Alchemist, Paulo Coelho's famous novel about an Andalusian shepherd boy who finds his destiny by going on a journey to Egypt, had been placed on every chair. Still visibly angry, Elizabeth told the gathered employees that she was building a religion. If there were any among them who didn't believe, they should leave. Sunny put it more bluntly: anyone not prepared to show complete devotion and unmitigated loyalty to the company should get “the fuck out”.</em></p>
<h2 id="the-black-kings">The Black Kings</h2>
<p>This is based on the book <a href="https://www.amazon.com/Gang-Leader-Day-Sociologist-Streets/dp/014311493X">Gang Leader for a Day</a> by Sudhir Venkatesh.</p>
<p>The book follows a crack-dealing gang in Chicago. Similar to the Taliban, the gang enjoys a fair amount of support among the local population, due to its ability to deliver justice swiftly, and because the police don't do much for the population anyway. Some leadership lessons include:</p>
<ul>
<li>Peace is much better for the business. Avoid fights with other gangs as much as possible. That's a negative-sum trade.</li>
<li>Avoid people dying from your product. That also draws negative attention.</li>
<li>How do you deal with rank-and-file employees who misbehave, for instance by diluting the crack? There's an upside if this is handled right, because these individuals might be entrepreneurial. So the first time, give them a warning. The second time, beat the living daylights out of them.</li>
<li>How do you get people to work doing illegal things for almost no money? You give them a dream that one day, they can get to the top. This means that as a leader you need to be seen as successful. Of course, it's a Ponzi scheme, but people buy into it.</li>
<li>You can find Pareto efficient outcomes with almost any party. For instance, the local cops might agree to look the other way if you deal in one particular building staircase rather than at a schoolyard. This is positive-sum thinking.</li>
<li>To become a leader in the Black Kings organization, you <em>had</em> to go to college. The beauty of this is that it avoids <a href="https://en.wikipedia.org/wiki/Adverse_selection">adverse selection</a>, where people may seek to be leaders for prestige. By enforcing that future leaders have to go to school, the Black Kings impose a certain bar that means future leaders are long-term thinkers and reasonably intelligent (qualities that schools select for as well).</li>
</ul>
<h2 id="al-qaeda">Al-Qaeda</h2>
<p>Mostly based on <a href="https://www.amazon.com/gp/product/159184682X/ref=dbs_a_def_rwt_hsch_vapi_taft_p1_i1">My Share of The Task</a> by Stanley McChrystal.</p>
<p>The most interesting takeaway I had from this book about how Al-Qaeda operates is that it was almost entirely decentralized. In fact, Al-Qaeda in Iraq was run by the lunatic <a href="https://en.wikipedia.org/wiki/Abu_Musab_al-Zarqawi">al-Zarqawi</a>, who was <a href="http://www.washingtonpost.com/wp-dyn/content/article/2006/10/01/AR2006100101083.html?noredirect=on">reprimanded by the central organization</a> many times for going too far (in particular, targeting Shia muslims). They seem to have grudgingly let him do his thing, partly because the central organization had few ways to impose accountability.</p>
<p>The organizational model strikes me as fairly close to a business franchise, and apparently <a href="https://www.nytimes.com/2014/01/26/sunday-review/the-franchising-of-al-qaeda.html">I'm not the only one</a> who noticed this. The central organization provides a strong brand, produces propaganda, and has training facilities for the new workforce. But on the ground, it's the individual franchises that run things, with little central control over the exact implementation.</p>
<p><a href="https://hbr.org/product/assessing-the-franchise-option/BH009-PDF-ENG">Quoting Harvard Business Review</a>, I think the Al-Qaeda model checks all of these boxes:</p>
<ol>
<li>The advantages of franchising include allowing the firm to overcome the resource constraints of limited capital and thin ranks of experienced managers.</li>
<li>Franchising also provides a means of trading off certain functions; franchisees are more efficient in performing functions whose average cost curve turns up relatively quickly.</li>
<li>It obviates the need for monitoring (and its attendant costs) because franchisees have invested their own capital and are motivated to work hard for profitability.</li>
<li>It offers substantial efficiencies in promotion and advertising by leveraging the value of a trademark and brand image.</li>
</ol>
<h2 id="post-scriptum">Post scriptum</h2>
<ul>
<li>I wish I had enough material to expand on more of these! Some of the more interesting leaders and organizations I'd love to learn from would be the Wehrmacht, Pablo Escobar, Baader-Meinhof, and Mao Zedong (in particular during the <a href="https://en.wikipedia.org/wiki/Long_March">Long March</a>).</li>
<li>It's been too long since my last blog post (two months) and I feel bad that this is something silly, non-technical, but this blog has always been an outlet for me to write about random silly things!</li>
<li>The title of this blog post was a nod to the (fictional) book Business Secrets from the Pharaohs, from Peep Show, the best TV series to come out of the UK.</li>
</ul>
<p><img src="https://erikbern.com/assets/business_secrets_of_the_pharaohs.jpeg" alt="pharaohs"></p>
New approximate nearest neighbor benchmarks2018-06-17T00:00:00Zhttps://erikbern.com/2018/06/17/new-approximate-nearest-neighbor-benchmarks.html<p>As some of you may know, one of my side interests is approximate nearest neighbor algorithms. I'm the author of <a href="https://github.com/spotify/annoy">Annoy</a>, a library with 3,500+ stars on Github as of today. It offers fast approximate search for nearest neighbors with the additional benefit that you can load data super fast from disk using mmap. I built it at Spotify to use for music recommendations where it's still used to power millions (maybe billions) of music recommendations every day.</p>
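<p>If you haven't seen it, the whole API is tiny – a typical session looks something like this (the dimensionality, tree count, and neighbor count are arbitrary illustration values):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># A minimal sketch of typical Annoy usage. The dimensionality, number
# of trees, and number of neighbors are arbitrary illustration values.
import random
from annoy import AnnoyIndex

f = 40  # dimensionality of the vectors
index = AnnoyIndex(f, 'angular')
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)  # 10 trees: more trees gives better accuracy, bigger index

# The mmap trick: save to disk, then other processes can load it fast
index.save('vectors.ann')
print(index.get_nns_by_item(0, 10))  # the 10 approximate nearest neighbors
</code></pre></div>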
<p>Approximate nearest neighbor search is very useful when you have a large dataset of millions of datapoints and you learn some kind of vector representation of these items. Word2vec might be the most well-known example of this, but there are plenty of other examples. For an introduction to this topic, check out an older series of blog posts: <a href="https://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1.html">Nearest neighbor methods and vector models</a>.</p>
<p>Anyway, at some point I got a bit tired of reading papers about various algorithms claiming to be the fastest and most accurate, so I built a benchmark suite called <a href="https://github.com/erikbern/ann-benchmarks">ann-benchmarks</a>. It pits a number of algorithms against each other in a brutal showdown. I recently Dockerized it and wrote about it <a href="https://erikbern.com/2018/02/15/new-benchmarks-for-approximate-nearest-neighbors.html">previously on this blog</a>. So why am I blogging about it just three months later? Well…there's a lot of water under the bridge in the world of approximate nearest neighbors, so I decided to re-run the benchmarks and publish new results. I will probably do this a few times every year, at my own questionable discretion.</p>
<h2 id="changes">Changes</h2>
<p>There were several new libraries added to this benchmark:</p>
<ul>
<li><a href="https://github.com/yahoojapan/NGT">NGT-Panng</a> from Yahoo! Japan, a graph-based search structure</li>
<li><a href="https://github.com/lmcinnes/pynndescent">pynndescent</a> which is also a graph-based search algorithm, in fact based on the same paper as k-graph</li>
<li><a href="https://github.com/teemupitkanen/mrpt">MRPT</a> which is based on random projects, like Annoy.</li>
</ul>
<p>On top of that, hnsw is included in three different flavors: one as a part of <a href="https://github.com/nmslib/nmslib">NMSLIB</a>, one as a part of <a href="https://github.com/facebookresearch/faiss">FAISS</a> (from Facebook), and one as a part of <a href="https://github.com/nmslib/hnsw">hnswlib</a>. I also dropped a few slow or semi-broken algorithms.</p>
<p>Another change this time is that I'm enforcing single-CPU queries. This made the benchmarks marginally slower, but I think it's the most “fair” way to compare. I think batching is not always applicable for real-world applications. Previously, I used a thread pool to saturate all CPUs on the instance, but there was some concern that this might affect certain algorithms in different ways. So I used Docker's ability to tie the container to a single CPU.</p>
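<p>Pinning a container to one core is a one-flag affair in Docker. Here's a sketch using the Docker SDK for Python, with a made-up image and command – the benchmark itself may wire this up differently, so take it purely as an illustration:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"># Sketch: run a container pinned to a single CPU core via the Docker
# SDK for Python. The image name and command are placeholders.
import docker

client = docker.from_env()
client.containers.run(
    'ann-benchmarks',          # hypothetical image name
    'run_algorithm.py annoy',  # hypothetical command
    cpuset_cpus='0',           # only allow execution on core 0
)
</code></pre></div>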
<h2 id="results">Results</h2>
<p>Without further ado, here are the results for the latest run. For the glove-100-angular dataset:</p>
<p><img src="https://erikbern.com/assets/ann-benchmarks-2018-06/glove-100-angular.png" alt="glove 100 angular"></p>
<p>sift-128-euclidean</p>
<p><img src="https://erikbern.com/assets/ann-benchmarks-2018-06/sift-128-euclidean.png" alt="glove 100 euclidean"></p>
<p>nytimes-256-angular</p>
<p><img src="https://erikbern.com/assets/ann-benchmarks-2018-06/nytimes-256-angular.png" alt="nytimes 256 angular"></p>
<p>gist-960-euclidean</p>
<p><img src="https://erikbern.com/assets/ann-benchmarks-2018-06/gist-960-euclidean.png" alt="gist 960 euclidean"></p>
<h2 id="results-summarized">Results: summarized</h2>
<p>By now, you're probably squinting at charts to figure out which library is the best. To save you the pain, I'm just going to summarize it into a somewhat subjective list:</p>
<ol>
<li>hnsw(nmslib)</li>
<li>hnswlib</li>
<li>hnsw(faiss)</li>
<li>kgraph</li>
<li>NGT-panng</li>
<li>pynndescent</li>
<li>SW-graph(nmslib)</li>
<li>annoy</li>
<li>flann</li>
<li>BallTree(nmslib)</li>
<li>mrpt</li>
<li>rpforest</li>
</ol>
<p>The various flavors of hnsw are all at the top, but that's partly because they were all built by the same person, <a href="https://github.com/yurymalkov">Yury Malkov</a>, who also wrote <a href="https://arxiv.org/abs/1603.09320">a paper</a> describing the approach.</p>
<p>pynndescent and kgraph are both based on <a href="http://wwwconference.org/proceedings/www2011/proceedings/p577.pdf">the same paper</a> so it's not surprising their performance is fairly similar.</p>
<p>For some reason, MRPT would crash when I ran it on angular data, and I gave up after spending some time investigating it. Hopefully the next benchmark will feature MRPT for angular data as well.</p>
<p>There's more goodies! <a href="https://github.com/maumueller">Martin Aumüller</a> and <a href="https://github.com/ale-f">Alexander Faithfull</a> have contributed code to export all the results to a website. <a href="http://vectors.erikbern.com/">I put it up on a temporary URL</a> for you to enjoy.</p>
<p>That's it! <a href="https://github.com/erikbern/ann-benchmarks">ann-benchmarks</a> <del>currently has almost 500 stars on Github, so I'd love it if you can pay it a visit and who knows… starring a repo just takes a second. Just saying!</del> just passed 500 stars on Github, meaning it's a legitimate project now! 🎉</p>
Missing the point about microservices: it's about testing and deploying independently2018-06-04T00:00:00Zhttps://erikbern.com/2018/06/04/missing-the-point-about-microservices.html<p>Ok, so I have to first preface this whole blog post by a few things:</p>
<ol>
<li>I really struggle with the term <em>microservices</em>. I can't put my finger on exactly why. Maybe because the term is hopelessly ill-defined, maybe because it's gotten picked up by the hype train. Whatever. But I have to stick to some type of terminology so let's just roll with it.</li>
<li>This blog post might be mildly controversial, but I'm throwing it out there because I've had this itchy feeling for so long and I can't get rid of it. I respect it if you want to disagree vehemently, and maybe there's something both of us can learn.</li>
<li>I have a weird story. My first “real” company, Spotify, used a service-oriented architecture from scratch. I also spent some time at Google, which used a service-oriented architecture. So basically since 2006 I've been continuously working in what people now call a “microservice architecture”. It didn't even <em>occur</em> to me that some people might want to build things as <em>monoliths</em>. So I guess I'm coming at it from a different direction than many others. Either way, there were particular non-standard reasons why Spotify and Google had to do this that I'll get back to later.</li>
</ol>
<p>Let's start by talking about iteration speed!</p>
<h2 id="whats-up-with-iteration-speed">What's up with iteration speed!</h2>
<p>I'm sort of obsessed with iteration speed. I've <a href="/2017/07/06/optimizing-for-iteration-speed.html">written about this</a> in the past and it deserves more posts in the future, but the quick summary is that iteration speed is always going to be the strongest competitive advantage in this industry. You can't really patent anything, and proprietary technology is often much less valuable than companies would like to admit. So what do you do? You start shipping new features quicker, you learn faster from users, and you run faster than your competitors (a.k.a. the <a href="https://stratechery.com/2017/snaps-apple-strategy/">“gingerbread man strategy”</a>).</p>
<h2 id="lets-talk-about-testing-and-deploying">Let's talk about testing and deploying</h2>
<p>There are of course many ways we can iterate faster, but for today let's focus on two particular aspects of it: <em>testing and deploying more often.</em> I'm a big proponent of continuous deployment. I'm also a huge proponent of fast test suites. Why? You have been reading this far without any graphics, so you deserve one. Tracing back all the dependencies and how it fits together, it looks something like this in my head:</p>
<p><img src="https://erikbern.com/assets/test_cycle.png" alt="test cycle"></p>
<p>So it seems like we could improve a lot of things if we could test and deploy things faster! Of course, there's a long series of steps to get there:</p>
<ol>
<li>Do you have fully automated tests? If not, write them, then come back.</li>
<li>Are deploys automated? If not, do them, then come back.</li>
<li>Are you deploying multiple times per day? If not, figure out how to get there, then come back.</li>
</ol>
<p>Anyway, if only there was a “trick” to test and deploy things faster… maybe splitting things up into small independent units… if only there was a way 🤔</p>
<h2 id="98-of-microservice-benefit-is-being-able-to-test-and-deploy-independently">98% of microservice benefit is being able to test and deploy independently</h2>
<p>By now it should be clear why splitting things up makes sense. But just as xy = 0 when y = 0 regardless of how large x is, don't expect that you can just break your sweet old monolith up into two services and derive tremendous value from that. It's rarely very valuable <em>unless you can test and deploy those parts independently.</em></p>
<p>Here's where I see so many blog posts where people are <em>missing the point:</em></p>
<ul>
<li>If you need to deploy two services to production in tandem, you're doing things wrong</li>
<li>If you need to run two services together in order to run tests, you're doing things wrong</li>
<li>If you end up with a microservice that can't be tested in isolation, you're doing things wrong</li>
<li>If you end up with a microservice that can't be deployed in isolation, you're doing things wrong</li>
</ul>
<p>Why are you doing wrong things? Because you're putting in <em>tons</em> of work separating out things into independent units, <em>without reaping the benefits of fast testing and deploying cycles</em>.</p>
<h2 id="of-course-things-get-harder">Of course, things get harder</h2>
<p>I'm not going to dwell on this and there's much that has been written, including <a href="https://plus.google.com/+RipRowan/posts/eVeouesvaVX">Steve Yegge's epic rant</a>. Testing things in isolation means each part needs to make assumptions about how the other parts will behave and mock them out properly. Deploying a new version of an API call can be annoying and has to be done in multiple smaller steps. Tracing requests can be a massive pain. I could go on all day.</p>
<p>But as Americans are fond of saying, <em>there are no free lunches.</em></p>
<h2 id="some-questionable-reasons-to-consider-microservices">Some questionable reasons to consider microservices</h2>
<p>I mentioned that 98% of the value is being able to test and deploy things independently. I think the other benefits are fairly marginal at best:</p>
<ol>
<li><em>Writing services in different languages.</em> I think this argument is mostly invoked by some junior dev who wants to implement a new system in Clojure. Great news for the poor person waking up at 2am getting paged because the shopping cart service is down.</li>
<li><em>Forcing applications into independent pieces so they don't sprawl into cobwebs of interdependencies.</em> I used to think this was a super strong argument! But clearly, some huge monolithic code bases are great. The Linux kernel shows that you can write highly modular code, all inside a single process (actually, the <a href="https://groups.google.com/forum/#!topic/comp.os.minix/wlhw16QWltI%5B1-25%5D">Torvalds–Tanenbaum flame war from 1992</a> is still highly relevant)</li>
<li><em>Scaling two pieces of software independently.</em> Not necessarily a strong reason, since you can also scale up a fat binary – look at Facebook.</li>
<li><em>Breaking up software with different performance characteristics.</em> This could occasionally be a valid argument, say if you have a Node-based webserver and you need to do something CPU-heavy. But it could in many cases be solved by something like background threads or “modes” – the same codebase is run both for worker processes and web server processes.</li>
</ol>
<h2 id="summary">Summary</h2>
<p>I'm obsessed with iteration speed and could write about 17 more blog posts about it. If there are any takeaways I want to leave you with, it's these:</p>
<ul>
<li>Automated testing is awesome.</li>
<li>Continuous deployment is really sweet.</li>
<li>Do those two things first.</li>
<li>Once you see engineers starting to twiddle thumbs waiting for tests to run, you know the time is right to split things up.</li>
<li>Microservices can be awesome for this.</li>
<li>Keep splitting until you no longer see engineers twiddling thumbs and all tests run blazingly fast.</li>
<li>Lean back, relax, and watch your company out-iterate all your competitors through a superior development process.</li>
</ul>
<p><img src="https://erikbern.com/assets/ninja.gif" alt="ninja"></p>
Interviewing is a noisy prediction problem2018-05-02T00:00:00Zhttps://erikbern.com/2018/05/02/interviewing-is-a-noisy-prediction-problem.html<p>I have done roughly 2,000 interviews in my life. When I started recruiting, I had so much confidence in my ability to assess people. Let me just throw a couple of algorithm questions at a candidate and then I'll tell you if they are good or not!</p>
<p>Over time I've come to the (slightly disappointing) realization that <em>knowing who's going to be good at their job</em> is an <em>extremely hard problem.</em> The correlation between who <em>did really well in the interview process</em> and who <em>performs really well at work</em> is really weak.</p>
<p>Confronted with this observation, I've started thinking about this process as an inherent noise-reduction problem. Interviewing should be thought of as information gathering. You should consciously design the process to be the most predictive of future job performance. Given that you have limited time to measure, you need to spend your time measuring things that have a <em>high signal-to-noise ratio</em> and things that have <em>low correlation</em> with each other.</p>
<h2 id="interviewing-as-a-prediction-problem">Interviewing as a prediction problem</h2>
<p><img src="https://erikbern.com/assets/job-interview-devil-wears-prada.jpeg" alt="devil wears prada">
<em>Job interview scene from Devil Wears Prada (2006)</em></p>
<p>Let's start by stating the problem. We're trying to predict <em>job performance</em> from a series of measurements (interviews). Those measurements are <em>noisy</em>, meaning that any individual measurement is not very predictive in itself, but hopefully all of them taken in aggregate can be predictive. We can also <em>choose</em> what we want to observe ahead of time, by coming up with an interview process where we think the aggregate judgement correlates the most with job performance. So we can choose to spend one hour on system design, one hour on algorithms, etc.</p>
<p><em>If</em> we had complete data about candidates and their future job performance, it would look something like this</p>
<table>
<thead>
<tr>
<th></th>
<th align="right">Algorithms</th>
<th align="right">System design</th>
<th align="right">Grades</th>
<th align="right">Github portfolio</th>
<th align="right">Actual job performance</th>
</tr>
</thead>
<tbody>
<tr>
<td>Abdullah</td>
<td align="right">5</td>
<td align="right">4</td>
<td align="right">2</td>
<td align="right">3</td>
<td align="right">5</td>
</tr>
<tr>
<td>Barbara</td>
<td align="right">1</td>
<td align="right">5</td>
<td align="right">1</td>
<td align="right">1</td>
<td align="right">3</td>
</tr>
<tr>
<td>Chloe</td>
<td align="right">2</td>
<td align="right">3</td>
<td align="right">5</td>
<td align="right">2</td>
<td align="right">2</td>
</tr>
<tr>
<td>David</td>
<td align="right">4</td>
<td align="right">1</td>
<td align="right">4</td>
<td align="right">5</td>
<td align="right">4</td>
</tr>
<tr>
<td>…</td>
<td align="right">…</td>
<td align="right">…</td>
<td align="right">…</td>
<td align="right">…</td>
<td align="right">…</td>
</tr>
</tbody>
</table>
<p>There are many simplifications in thinking about it this way. For instance, we don't necessarily know the <em>actual job performance</em> until candidates have already been hired and worked with us (and even then, that's a hard problem to estimate, but a <em>separate</em> hard problem). But let's stick to this way of thinking because it will become useful later. Note that it's not just a table, it's also a matrix! I will get back to this and do some magic linear algebra on it later.</p>
<p>Restating interviewing as a prediction problem might seem like a super obvious thing, but I think people often forget this point. I've read about 1,000 Hacker News comments complaining that interview questions about turning a binary tree upside down (or whatever) are stupid because no one would ever do that, or there's already a library for it, or something else. I think that's completely beside the point! The real question is: <em>does solving a problem about turning a binary tree upside down predict future job performance?</em></p>
<p>Interviewing is tricky – you only have a few hours to collect measurements. I have no moral objection against asking “unrealistic” questions, as long as I have some reasonable expectation that they predict future job performance. In fact, I think any sort of interviewing is going to be unrealistic, and I'm very happy that there are <em>some</em> decent ways to evaluate the performance of engineers. I have no idea how carpenters hire carpenters or how investment bankers hire investment bankers, but I wouldn't be surprised to hear that the interview processes they have are far less objective.</p>
<h2 id="the-depressing-science-on-interviews">The depressing science on interviews</h2>
<p>Let's review the research and see which measurements are most predictive. Turns out there's a bit of research suggesting interesting things. One thing that almost all of it seems to indicate is that overall, <em>interviewing is a lot less predictive than we want it to be</em>.</p>
<ul>
<li><a href="https://www.researchgate.net/profile/Stephanie_Payne2/publication/229503170_The_Incremental_Validity_of_Interview_Scores_Over_and_Above_Cognitive_Ability_and_Conscientiousness_Scores/links/545a551c0cf25c508c307d7a.pdf">Cortina, Goldstein et. al. (2000)</a> find that highly structured interviews add some signal on top of testing cognitive ability + conscientiousness, but that unstructured interviews are almost useless.</li>
<li><a href="http://blogg.hrsverige.nu/wp-content/uploads/2010/04/Stubborn1.pdf">Highhouse (2008)</a> finds that interviewers <em>perceive</em> of unstructured interviews as the <em>most</em> predictive, and intelligence tests as the <em>least</em> predictive, although the estimated actual effectiveness is actually the opposite.</li>
<li>The jury is still out on whether grades predict much: <a href="https://qz.com/382570/goldman-sachs-actually-google-gpas-arent-worthless/">Google claims no</a> and <a href="http://blog.alinelerner.com/lessons-from-a-years-worth-of-hiring-data/">this blog post</a> sees no relationship between grades and interview performance, whereas <a href="http://www.pnas.org/content/113/47/13354.abstract">this paper</a> claims that grades are “more predictive of important life outcomes than IQ”.</li>
<li>Some recent news articles include <a href="https://www.bloomberg.com/view/articles/2016-11-04/job-interviews-are-useless">Bloomberg calling interviews useless</a>, <a href="https://www.nytimes.com/2017/04/08/opinion/sunday/the-utter-uselessness-of-job-interviews.html">NYT calling them utterly useless</a>, and <a href="https://www.theguardian.com/lifeandstyle/2015/nov/22/why-job-interviews-are-pointless">The Guardian calling them pointless</a>. To be clear, they all attack the <em>unstructured</em> interview.</li>
</ul>
<p>A few studies seem to at least call out intelligence as a decent predictor:</p>
<ul>
<li><a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.459.1742&rep=rep1&type=pdf">Vinchur and Shlippmann (1998)</a> performs a meta-analysis on sales data and find that <em>“general cognitive”</em> ability is the strongest predictor for job rating, although a weak one for sales (N = 1200)</li>
<li><a href="http://www.hucama.se/uploads/1/6/5/0/16501994/g-factor_intellligence_1998.pdf">Gottfredson (1998)</a> finds that IQ is a good predictor for life outcomes overall.</li>
<li><a href="http://www.unc.edu/~nielsen/soci708/cdocs/Schmidt_Hunter_2004.pdf">Schmidt and Hunter (2004)</a>: <em>General mental ability predicts both occupational level attained and performance within one’s chosen occupation and does so better than any other ability, trait, or disposition and better than job experience</em> (here's <a href="https://80000hours.org/2013/05/intelligence-matters-more-than-you-think-for-career-success/">a summary</a> of a lot of the work by the authors).</li>
</ul>
<p>I'm not a scholar and I can't claim to have done an extensive literature review, but here's my biased summary:</p>
<ol>
<li>Intelligence tests seem to be the strongest predictors of job performance (although possibly offensive, and probably <a href="https://en.wikipedia.org/wiki/Griggs_v._Duke_Power_Co.">illegal in the US</a>)</li>
<li>… followed by structured interviews</li>
<li>… followed by unstructured interviews (which have very little predictive power).</li>
</ol>
<h2 id="picking-the-measurements-that-matter-the-most">Picking the measurements that matter the most</h2>
<p>Let's focus for a second on which measurements matter the most. Since we can't measure actual job performance, we need to measure other skills. Those measurements will be highly noisy. So we need to find measurements that (a) are decent proxies and (b) have as high a signal-to-noise ratio as possible.</p>
<p>I don't have a grand unified theory here. But I have a long list of ideas I think are helpful.</p>
<h3 id="bad-interview-signals">Bad interview signals</h3>
<ul>
<li>Judging by the literature review earlier in this post, <em>unstructured interviews</em> are almost useless. Avoid them.</li>
<li>Some interview questions spend 60 minutes focusing on a problem that relies on a single insight to solve. That's an extremely low signal-to-noise ratio! Try to avoid such problems.</li>
<li>Likewise, avoid long interview questions that depend on some kind of “trick” or “insight”. You get basically one bit of information out of such a question.</li>
<li>You want the highest signal-to-noise ratio, so you should design your interview to make the <em>candidate</em> talk as much as possible. Besides the time set aside for questions, the candidate should do 80-90% of the talking, not the interviewer. Every minute the interviewer spends talking is opportunity cost!</li>
<li>However, it's important the interviewer keeps driving the discussion. If the candidate is talking about something and it veers off into tangential territory, you need to bring it back asap. Again, opportunity cost!</li>
<li>Certain signals like “having a great Github portfolio” can be strong signals, but have tons of false negatives. Meaning that their presence can be great, but the <a href="https://www.benfrederickson.com/github-wont-help-with-hiring/">absence isn't necessarily a big deal</a>. I've hired many amazing engineers that had absolutely nothing on their Github, and that's fine!</li>
<li>I'm a bit skeptical of live coding as an interview, mostly because it's hard to do without forcing a certain environment or methodology onto a developer. As a result, you end up measuring mostly how familiar they are with the environment. But maybe it could be done well for certain tasks.</li>
<li>I don't think algorithms on a whiteboard are great, but mostly because <em>they take so much time</em>. Asking a candidate to implement <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> can take up a full hour, and again, the opportunity cost matters. Quick algorithm questions (max 5-10 min) can be fine, although another concern I have with algorithm questions is that junior CS grads score well for no other reason than having their algo class fresh in mind.</li>
<li>I'm not a big fan of take-home projects. They hurt people with families and commitments. They also tend to increase dropoff among the people who are in highest demand – exactly the people you don't want to lose. I still do homework assignments for a few roles where I haven't found a better interview format, but we limit them to 3-5 hours and we <em>pay for them.</em></li>
<li>Cultural interviews and the “beer test” are great if you want to hire people for no other reason than that they are similar to you. I don't do them. If a candidate is a wacko, you'll notice it anyway. I do take candidates out for lunch as a way for them to get to know us, but there's no evaluation going on.</li>
<li>I find questions like “what are your weaknesses” or “what is your superpower” to be silly. A good answer just means the candidate is verbal and can ramble on the spot. It favors verbal, extroverted people without reflecting their actual skills.</li>
<li>I don't think knowing a lot of obscure features of a particular language is super useful. Good engineers know a bunch of languages and can pick up new languages quickly. If you're interviewing for a backend role, don't spend 60 minutes probing into C++ template metaprogramming. The main exception to this rule is roles that are obviously tied to a particular tool (like CSS for frontend engineers).</li>
</ul>
<h3 id="good-interview-signals">Good interview signals</h3>
<ul>
<li>While I don't like long problems that rely on knowing a certain trick, I think it's great to have <em>many short interview questions</em> that rely on knowing particular things. If you can go through 20 such problems in one single interview, you increase the signal-to-noise ratio a lot!</li>
<li>I've experimented with many types of interviews, but one thing I really dig lately is <em>code reading</em>. I print out 10-15 code snippets (not just code, but also UNIX commands, regular expressions, SQL queries and many other things). I then go through and ask: what does this snippet do, how does it work, are there any bugs, etc.? Reading code (as opposed to writing) means I can cover a <em>lot</em> of ground extremely quickly, spending no more than a minute or two on each problem.</li>
<li>Homework assignments <em>can</em> be OK, although I think it's important to limit them to 3-5 hours, and definitely pay for them!</li>
<li>Consulting agreements (with the intent to hire) are great for the people that have that flexibility. It gives an opt-out on both sides after a certain time period.</li>
<li>I like system design questions (“how would you build this feature”). Anecdotally they seem to capture experience quite well.</li>
<li>While I'm skeptical about going deep into a <em>particular</em> language's features, I think it can be a good idea to cover many <em>different</em> languages and their features. Not everyone is proficient in every language, but decent engineers have at least one or two languages that they have a deeper knowledge of.</li>
<li>…however, some roles are intimately tied to one particular tool (like frontend engineers and JavaScript). In that case I think it actually does make sense to ask about a lot of obscure language features! But to get the highest possible signal-to-noise ratio, ask 20 different questions and don't spend more than a minute or two on each.</li>
<li>I have had some decent success using online coding tests like <a href="https://www.hackerrank.com/">Hackerrank</a>. They skew way too algorithmic IMO, but so far they seem to be a fairly good predictor of future job performance. The main issue is that senior candidates often find it offensive to do these tests, so I usually reserve them for more junior candidates.</li>
<li>I'm skeptical about trying to assess communication skills directly, but I do think there's a certain “clarity of thought” that intelligent people exhibit.</li>
<li>While the absence of cool projects on Github is not a big deal, the <em>presence</em> of projects is a very positive signal to me. Same goes for open source contributions, blog posts, or other things.</li>
<li>There are another 100 things I could list about how I evaluate a candidate's resume, but in the interest of brevity I'll leave it for another blog post!</li>
<li>A final word on intelligence tests. Literature suggests that they can have decent predictive power. I think “brain teasers” can work as a proxy for intelligence tests. I've never used them during interviews but I'd imagine they might have a certain place, like maybe if you're interviewing for a weird role where there's literally nothing else that you can measure. I don't think we should categorically throw them out.</li>
</ul>
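<p>As promised above, here's a hypothetical example of the kind of snippet I might print out for a code-reading interview. It's short, looks plausible, and contains a subtle bug: the last few windows are shorter than <code>k</code> but still get divided by <code>k</code>.</p>
<div class="highlight"><pre><code class="language-python">def moving_average(xs, k):
    # Intended: the average of each window of k consecutive elements.
    return [sum(xs[i:i + k]) / k for i in range(len(xs))]
</code></pre></div>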
<h2 id="combining-measurements">Combining measurements</h2>
<p><img src="https://erikbern.com/assets/job-interview-office-space.jpeg" alt="office space">
<em>Job interview scene from Office Space (1999)</em></p>
<p>Let's go back to the table of interview rankings and introduce some math. Stay calm, the math is not harmful! If you are squeamish about it, feel free to skim.</p>
<p>Let's assume all the entries in the matrix are known, and that we are trying to find the linear combination $$ \beta_1, \beta_2, \ldots, \beta_k $$ that best predicts <em>future job performance</em>. One way to phrase this problem is to say we want the $$ \beta $$'s that minimize the quadratic error. This is a classic linear regression problem:</p>
<p>$$ \min_\beta \left\| y - \mathbf{M}\beta \right\|^2 $$</p>
<p>The matrix $$ \mathbf{M} $$ is the matrix with interview scores, i.e.</p>
<p>$$\mathbf{M} = \left( \begin{matrix}
5 & 4 & 2 & 3 \\
1 & 5 & 1 & 1 \\
2 & 3 & 5 & 2 \\
4 & 1 & 4 & 5 \\
\ldots & \ldots & \ldots & \ldots
\end{matrix} \right) $$</p>
<p>And the vector $$ y $$ is the vector of actual job performance, i.e.</p>
<p>$$ y = \left( \begin{matrix} 5 & 3 & 2 & 4 & \ldots \end{matrix} \right)^T $$</p>
<p>This isn't a particularly hard problem to solve from a mathematical point of view (it even has a <a href="https://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation">closed form solution</a>; see the sketch after the list below). But it's going to be super noisy! Why?</p>
<ol>
<li>We don't have much complete data: at most one row per person we've ever hired.</li>
<li>Each interview is a super noisy measurement of the underlying ability.</li>
<li>You only collected data for the handful of interview types that you have been focusing on. Everything else is unknown.</li>
<li>There's a tricky survivor bias where we only observe the $$ y $$ for the people that we hire. This can lead to weird patterns due to something called <a href="https://en.wikipedia.org/wiki/Berkson%27s_paradox">Berkson's paradox</a> (like when <a href="https://erikbern.com/2015/04/07/norvigs-claim-that-programming-competitions-correlate-negatively-with-being-good-on-the-job.html">Google noticed that programming competitions were negatively correlated with job performance</a>).</li>
</ol>
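<p>As an aside, here's a minimal sketch of what fitting those $$ \beta $$'s might look like in code, using numpy's least-squares solver on the made-up scores from the matrix above (not real interview data):</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

# Rows are hired candidates, columns are interview types (made-up scores).
M = np.array([
    [5, 4, 2, 3],
    [1, 5, 1, 1],
    [2, 3, 5, 2],
    [4, 1, 4, 5],
], dtype=float)
y = np.array([5, 3, 2, 4], dtype=float)  # actual job performance

# Least squares: finds the beta that minimizes ||y - M beta||^2.
beta, *_ = np.linalg.lstsq(M, y, rcond=None)
print(beta)
</code></pre></div>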
<p>So how can we combine signals more efficiently?</p>
<ol>
<li>We should try to pick interviews that are proxies with <em>little correlation with each other</em>. If we do two interviews that have absolutely no correlation, but each of them is a proxy for actual job performance, then the aggregate error decreases (see the simulation sketch below).</li>
<li>If you have no idea which interviews are predictive, it at least seems reasonable to assume that all $$ \beta \ge 0 $$, i.e. no interview is negatively correlated with future job performance. From that point of view, it makes sense to cover <em>as many different topics as possible</em>. A bunch of them might not say anything about future job performance, some of them will, and since you don't know which ones, why not cover many different topics?</li>
<li>There's an inherent <a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff">bias-variance tradeoff</a>. It might be worth accepting a bit more bias in exchange for lower variance. This gets interesting when you think about the risk aversion of different companies.</li>
</ol>
<ul>
<li>A large company like Google can actually afford to take a lot more risk. They can effectively hire for expected value. For them, it makes sense to put all their weight on the few things that correlate the most with job performance (intelligence tests and various proxies for them, like algorithm puzzles).</li>
<li>A small startup might be super risk averse. The cost of hiring the wrong person is super high because it can completely derail their momentum. So they might hire for some expected risk-adjusted function of the future job performance. This effectively ends up distributing the $$ \beta $$ weights over <em>more</em> interviews. So rather than going all in on one or two interviews, it might be wise to focus on <em>many different things</em>. And I literally mean something like: spend 10 minutes on regular expressions, 10 minutes on HTML, 10 minutes on functional programming, 10 minutes on SQL, 10 minutes on AWS, etc. etc.</li>
<li>This also reminds me of <a href="https://en.wikipedia.org/wiki/Portfolio_optimization">portfolio optimization</a>. If you can tolerate more risk, invest everything in stocks. If you are risk averse, hedge your bets and diversify over many asset classes.</li>
</ul>
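<p>Coming back to the first point above, here's a minimal simulation sketch (a toy model I made up, not real interview data): each interview measures the candidate's true skill plus independent noise, and averaging $$ k $$ uncorrelated interviews shrinks the squared error roughly as $$ 1/k $$.</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_skill = rng.normal(size=n)  # hypothetical "future job performance"

def mean_squared_error(k):
    # k uncorrelated interviews, each = true skill + independent unit noise
    scores = true_skill[:, None] + rng.normal(size=(n, k))
    return np.mean((scores.mean(axis=1) - true_skill) ** 2)

for k in (1, 2, 5, 10):
    print(k, mean_squared_error(k))  # error shrinks roughly as 1/k
</code></pre></div>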
<h2 id="staying-objective">Staying objective</h2>
<p>Humans are prone to <a href="https://en.wikipedia.org/wiki/Confirmation_bias">confirmation bias</a>. We subconsciously form an opinion about things, and let that influence our decision making. <em>This is dangerous!</em> You start to like a particular candidate a lot for whatever superficial reason, you drop your guard, start giving them a bit more hints or give them the benefit of the doubt in a way that some other candidate wouldn't get.</p>
<p>The main remedy is to have a structured interview process. If everyone goes through (roughly) the same questions, your judgement becomes more objective. The benefit isn't just that it's more fair and you're not perpetuating biases – you also end up hiring better talent.</p>
<p>A separate trick I have for making a more neutral judgement is to ask myself: <em>what would I have to see in order to change my mind about this candidate?</em> If I start out super excited about a candidate and they nail three questions in a row about system design, I try to play devil's advocate: maybe this person lacks something else? So I switch to some completely different topic, like regular expressions or UNIX commands. Conversely, if someone isn't doing well, I think of a hypothetical question where they might win me back. Always try to poke holes in your own judgement.</p>
<p>Side note, but a fascinating study on objective interviews is how <a href="http://gap.hks.harvard.edu/orchestrating-impartiality-impact-%E2%80%9Cblind%E2%80%9D-auditions-female-musicians">blind orchestra auditions</a> became prevalent a few decades ago.</p>
<h2 id="sanity-checking-your-interview-process-would-your-best-engineers-do-well">Sanity checking your interview process: would your best engineers do well?</h2>
<p><img src="https://erikbern.com/assets/job-interview-trainspotting.jpeg" alt="trainspotting">
<em>Job interview scene from Trainspotting (1996)</em></p>
<p>I have mentioned a lot of interview formats that I think are better than others. On top of that, the most critical thing is that you <em>learn</em> from your interviews and continuously refine your interview setup. Over time, look at the people that you hired and ask yourself, <em>what are the strongest predictors of future job performance?</em></p>
<p>I find this to be a powerful sanity check. Look at the best engineers in your company and try to picture them going through your interview process. Would they do extremely well? Conversely, would the low performers do worse? You can also look at the current top performers at your company and ask yourself: <em>what type of interview would have selected these people?</em> If your interview process would have filtered out your best engineers, then you should probably reevaluate the process.</p>
<h2 id="some-final-thoughts">Some final thoughts</h2>
<p>Interviewing is one of those things where every year I realize how much I sucked at it a year ago. I think my ability to predict future job performance has increased over time, but more importantly I've learned the hard way how <em>hard it is.</em> If I were to write this same blog post in a year, I'm sure it would look slightly different. As with many other things, the most important thing is to keep learning and iterating.</p>
<h2 id="addendum-2018-06-04">Addendum (2018-06-04)</h2>
<p><a href="https://twitter.com/marcusf">Markus Frödin</a> pointed out a <a href="https://testingtalent.com/wp-content/uploads/2017/04/2016-100-Yrs-Working-Paper-on-Selection-Methods-Schmit-Mar-17.pdf">well-written meta-analysis</a> that revisits a lot of the job performance prediction literature and arrive at slightly different findings. Their main contribution is to correct for the “range restriction” problem – the bias you get from only measuring the people you hire. Their top predictor is intelligence tests, interestingly followed by “integrity tests”. Unstructured and structured interviews have fairly similar strength.</p>
<p>If I interpret the results correctly, maybe unstructured interviews <em>are</em> useful, but people put <em>too much weight</em> on them. That would cause the correlation between interview performance and job performance in the <em>hired</em> group to be zero, even though it could be positive across the whole population. Alternatively, unstructured interviews might be a great way to <em>turn down candidates</em>, but maybe not to discern within the “good enough” group. Anyway, my approach is always evolving!</p>
Waiting time, load factor, and queueing theory: why you need to cut your systems a bit of slack2018-03-27T00:00:00Zhttps://erikbern.com/2018/03/27/waiting-time-load-factor-and-queueing-theory.html<p><img src="https://erikbern.com/assets/queue.jpeg" alt="queue"></p>
<p>I've been reading up on operations research lately, including <a href="https://en.wikipedia.org/wiki/Queueing_theory">queueing theory</a>. It started out as a way to understand the very complex mortgage process (I work at <a href="https://better.com/">a mortgage startup</a>) but it's turned into my little hammer and now I see nails everywhere.</p>
<p>One relationship that turns out to be more complex than it looks is the one between <em>cycle time</em> and <em>throughput</em>. Here are some examples of situations where this might apply:</p>
<ul>
<li>What's a good CPU load for a database? If the CPU load is 50%, how much does that impact latency vs if CPU load is 25%?</li>
<li>What's the average time it takes to respond to an email as a function of how busy I am?</li>
<li>What's the relationship between the tech team's throughput (in terms of Trello cards per day or whatever) and the cycle time (time from adding a card until it's done)?</li>
<li>If we need a loan to be underwritten in at most 2 days, how many underwriters do we need?</li>
<li>I need to book a meeting with my manager. His/her calendar is 80% full. All meetings are 1h and s/he works 40h/week. How far out am I going to have to book the meeting?</li>
<li>Same situation as above but I need to book a meeting with 5 people, all having 80% full calendars (that are independent of each other). How far out?</li>
<li>Users file bugs to the developer. Assuming we put 1 developer full time on triaging/solving incoming bugs, and that keeps the person x% busy, what's the time until a bug gets resolved?</li>
</ul>
<p>In all these cases, it turns out that <em>running a system at a lower throughput can yield dramatic cycle time improvements</em>. For instance, whether I'm just “kind of busy” 😅 or SUPER SWAMPED 😰 anecdotally changes my email response time easily by 5-10x.</p>
<p>I've <a href="/2017/07/06/optimizing-for-iteration-speed.html">written about this in the past</a> but in an almost mythical way – I didn't understand the math behind these principles.</p>
<p>First of all, the relationship might seem nonsensical. If you have a garden hose, the throughput (liters/s of water) is completely independent of the cycle time (the <em>length</em> of the hose). If a database can handle 1000 queries per second and we're currently doing 500 queries/s (so 50% load), why is the latency higher than if we're doing 100 queries/s (10% load)?</p>
<p>The reason, like much else in life, is <em>variability</em>. If the database can be thought of as a worker that handles exactly one query at a time, there's going to be a queue that at any point in time holds some nonnegative number of queries. The number of queries sitting in the queue will vary over time. Due to chance, we might have 10 queries arrive at almost the same time. In that case the worker has to process each query serially and work its way through the queue.</p>
<p>As it turns out, we can simulate this fairly easily using <a href="https://en.wikipedia.org/wiki/Dynamic_programming">dynamic programming</a>. Let's say the time between queries is <a href="https://en.wikipedia.org/wiki/Exponential_distribution">exponentially distributed</a> and the time of the query itself is <a href="https://en.wikipedia.org/wiki/Log-normal_distribution">log-normally distributed</a> (the exact distributions here aren't super important, but there are some good reasons to model the world using those ones). You can simulate the whole system doing something like this:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">query_enqueued <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>cumsum(numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>exponential(
size<span style="color:#f92672">=</span>(n,), scale<span style="color:#f92672">=</span><span style="color:#ae81ff">1.</span><span style="color:#f92672">/</span>k))
query_time <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>random<span style="color:#f92672">.</span>lognormal(size<span style="color:#f92672">=</span>(n,))
query_finished <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>zeros((n, ))
<span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(n):
query_finished[i] <span style="color:#f92672">=</span> query_enqueued[i] <span style="color:#f92672">+</span> query_time[i]
<span style="color:#66d9ef">if</span> i <span style="color:#f92672">></span> <span style="color:#ae81ff">0</span>:
<span style="color:#75715e"># Can't finish before the previous query finished</span>
<span style="color:#75715e"># plus the time for this query</span>
query_finished[i] <span style="color:#f92672">=</span> max(
query_finished[i],
query_finished[i<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>] <span style="color:#f92672">+</span> query_time[i])
</code></pre></div><p>If you run this snippet with different values for k and plot the latency as a function of load factor, you get a chart that's super interesting (in my opinion!):</p>
<p><img src="https://erikbern.com/assets/wait-time.png" alt="wait time"></p>
<p>At 50% utilization, you get twice the latency you get at 0% utilization. Once you hit say 80% utilization, it goes up. And it goes up FAST. As you approach 100% utilization, the curve shoots off towards a vertical asymptote.</p>
<p>If you want to optimize for higher-percentiles (such as 90th or 99th) then just a little bit of load pushes the latencies up a LOT. Basically you are forced to run the system at a really low load factor to have some leeway.</p>
<p>The system described above is called an <a href="https://en.wikipedia.org/wiki/M/G/1_queue">M/G/1</a> queue using fancy notation. Turns out that for a slightly simpler case, where we have an exponential distribution instead of a lognormal one (M/M/1), we can <a href="http://www.win.tue.nl/~iadan/queueing.pdf">derive exact values</a> for these latencies. The key observation is that the number of events in the queue will have a <a href="https://en.wikipedia.org/wiki/Geometric_distribution">Geometric distribution</a>. The mean ends up being $$ 1 / (1 - f) $$ where $$ f $$ is the load factor. But my mom told me not to trust equations on the internet, so let's simulate it just to be on the safe side:</p>
<p><img src="https://erikbern.com/assets/wait-time-2.png" alt="wait time"></p>
<p>In case you can't see, there are two lines <em>exactly on top of each other</em> 💯.</p>
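<p>If you want to reproduce that check yourself, a minimal sketch could look like this (exponential service times with unit mean, so the load factor equals the arrival rate; the parameter values are just examples):</p>
<div class="highlight"><pre><code class="language-python">import numpy

n = 500_000
f = 0.8  # load factor (example value)

arrivals = numpy.cumsum(numpy.random.exponential(scale=1.0 / f, size=n))
service = numpy.random.exponential(scale=1.0, size=n)
finished = numpy.zeros(n)
finished[0] = arrivals[0] + service[0]
for i in range(1, n):
    # A query starts when it arrives or when the previous one finishes,
    # whichever comes later (single worker, FIFO queue).
    finished[i] = max(arrivals[i], finished[i - 1]) + service[i]

print(numpy.mean(finished - arrivals))  # should be close to 1 / (1 - f) = 5
</code></pre></div>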
<h2 id="one-use-case-ndash-get-more-done-and-be-less-stressed">One use case – get more done and be less stressed</h2>
<p>I talked a lot about this in a technical context but let's switch gears a bit and allow me to put on my corporate philosopher hat. It's a metaphorical hat that I like to wear, and when I wear it, I mutter things like: <em>the speed at which a company innovates is fundamentally limited by the size of the iteration cycle.</em> So what can you do to reduce the size of that cycle? The observation is that if I'm spending <em>100%</em> of my time on super urgent stuff, all highest priority, then with fairly mild assumptions, that stuff will have an <em>extremely long cycle time</em>.</p>
<p>So how do we prevent that? Let's split up work into “things that need to happen now” vs “things that can wait”. You really don't want to spend 100% of your time being in the critical path for all the company's major projects. It's exactly the times when I'm stuck in “reactive mode” where my backlog of stuff builds up. Instead, make sure that a big chunk is <em>important</em>, but has no imminent deadline attached to it.</p>
<p>My strategy is to take all the urgent things and delegate/automate them until the average load of “urgent things” is below 50%. It's possible some of the output is marginally lower quality, but the upside is that I can now improve the <em>latency</em> by an order of magnitude. I can respond to emails and Slack messages faster, I can find time for an unexpected meeting, etc., and more generally, <em>information can propagate faster.</em></p>
<h2 id="other-notes">Other notes</h2>
<ul>
<li>It makes sense to co-locate batch jobs with low-latency processing and let the latter take precedence at any time. In fact I think this was the intuition behind building <a href="https://research.google.com/pubs/pub43438.html">Google Borg</a>.</li>
<li>It might seem like parallelization would help, but in fact it makes the problem even harder. A 2x faster machine with a single request queue will always have lower latency than two 1x machines with their own request queues (see the sketch after this list). This in fact reminds me of a really old (2013) blog post that got widespread attention about <a href="https://genius.com/James-somers-herokus-ugly-secret-annotated">Heroku's routing</a>.</li>
</ul>
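<p>For the skeptical, here's a minimal simulation of that claim. It's a toy setup I made up: Poisson arrivals at 75% of total capacity, exponential job sizes, and round-robin assignment for the two slower machines.</p>
<div class="highlight"><pre><code class="language-python">import numpy

numpy.random.seed(1)
n = 200_000
arrivals = numpy.cumsum(numpy.random.exponential(scale=1.0 / 1.5, size=n))
work = numpy.random.exponential(scale=1.0, size=n)  # job sizes

# One 2x machine: a single FIFO queue, service times halved.
fast_wait = numpy.empty(n)
finish = 0.0
for i in range(n):
    finish = max(arrivals[i], finish) + work[i] / 2
    fast_wait[i] = finish - arrivals[i]

# Two 1x machines, each with its own queue, jobs assigned round-robin.
slow_wait = numpy.empty(n)
finishes = [0.0, 0.0]
for i in range(n):
    s = i % 2
    finishes[s] = max(arrivals[i], finishes[s]) + work[i]
    slow_wait[i] = finishes[s] - arrivals[i]

print(fast_wait.mean(), slow_wait.mean())  # the single fast machine wins
</code></pre></div>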
Lessons from content marketing myself (aka blogging) for five years2018-03-07T00:00:00Zhttps://erikbern.com/2018/03/07/lessons-from-content-marketing-myself-aka-blogging-for-five-years.html<p>I started writing this blog in late 2012, partly because I felt like it would help me improve my English and my writing skills, partly because I kept having a lot of random ideas in my head and I wanted to write them down somewhere. I honestly never cared too much about finding a particular niche, I just wanted to write down stuff that I found interesting. I set up a Wordpress blog on my crappy Swedish virtual private server.</p>
<p>2012 and 2013 were pretty slow, but a few of my blog posts would occasionally get picked up by some obscure online forum and get a few hundred hits in a day. <em>A few hundred hits in a day!</em> Back in the day, a sudden spike of say 200 users in a day was a massive deal. I kept writing stuff for more than two years without really getting many readers. Looking back, I sometimes ask myself <em>why?</em> But I kind of just wrote random things and was happy about it.</p>
<p>In late 2014, I wrote a blog post about my experiment <a href="https://erikbern.com/2014/11/29/deep-learning-for-chess.html">training a neural network to play chess</a>. It hit Hacker News front page and generated something like 50k page views in a few days. Since then, I've had several blog posts that “went viral” and generated a bunch of traffic.</p>
<ul>
<li><a href="https://erikbern.com/2014/11/29/deep-learning-for-chess.html">Deep learning for chess</a></li>
<li><a href="https://erikbern.com/2016/01/21/analyzing-50k-fonts-using-deep-neural-networks.html">Generating fonts using deep learning</a></li>
<li><a href="https://erikbern.com/2016/01/08/i-believe-in-the-10x-engineer-but.html">I believe in the 10x engineer, but…</a></li>
<li><a href="https://erikbern.com/2016/03/16/exploding-offers-are-bullshit.html">Exploding offers are bullshit</a></li>
<li><a href="https://erikbern.com/2016/04/04/nyc-subway-math.html">NYC Subway math</a></li>
<li><a href="https://erikbern.com/2016/12/05/the-half-life-of-code.html">The half life of code</a></li>
<li><a href="https://erikbern.com/2017/03/15/the-eigenvector-of-why-we-moved-from-language-x-to-language-y.html">The eigenvector of “why we moved from language x to language y”</a></li>
<li><a href="https://erikbern.com/2017/08/29/the-software-engineering-rule-of-3.html">The software engineering rule of 3</a></li>
</ul>
<p>It's not a ton of volume compared to the New York Times, but for me it's a lot. My top post (the eigenvector post) is at 128k page views so far.</p>
<h2 id="history-of-my-blog-as-told-by-google-analytics">History of my blog, as told by Google Analytics</h2>
<p><img src="https://erikbern.com/assets/blog-page-views.png" alt="blog page views"></p>
<p>This is monthly page views since 2013. I'm now averaging about 20,000 pageviews per month so about 600/day although it's extremely spiky. If I switch to daily granularity it looks kind of silly:</p>
<p><img src="https://erikbern.com/assets/blog-page-views-daily.png" alt="blog page views daily"></p>
<p>The average (as I mentioned) is about 600 but the maximum is more than 50k. This is clearly not Poisson-distributed but some really fat-tailed distribution!</p>
<p>I get about 50 page views per day from Google search results. Not sure if this number should be higher.</p>
<p><img src="https://erikbern.com/assets/google-traffic.png" alt="google traffic"></p>
<p>I also have 1k+ subscribers on Feedly. I don't know the exact number, since it just shows up as “1K” once I crossed the 1,000 limit. I think you can tell from the User-Agent strings of the access logs but it's not showing up for some reason.</p>
<h2 id="history-of-my-blog-as-told-by-pingdom">History of my blog, as told by Pingdom</h2>
<p>Looking at uptime and page view latency is a bit like understanding volcano eruptions by digging out ice from Greenland:</p>
<p><img src="https://erikbern.com/assets/blog-latency.png" alt="blog latency"></p>
<p>Almost all the red lines are caused by blog posts going viral. It's honestly something I'm pretty embarrassed by! In total the uptime is 99.85% which I guess isn't terrible, but weighted by traffic volume it's probably far less.</p>
<p>But here's the good news. The uptime since April 2017 is ONE HUNDRED PERCENT 💯 .</p>
<p><img src="https://erikbern.com/assets/blog-uptime-100.png" alt="blog uptime 100"></p>
<p>I've had a lot of downtime in the past, caused by traffic spikes. The worst one was when my font post landed on Hacker News in the middle of the night. I woke up and my blog had been down all night. It did generate a lot of traffic from other sites, but sadly very little from Hacker News.</p>
<p>Honestly I think the hardest problem about provisioning my blog for traffic spikes is fighting my Swedish modesty that always tells me no one is going to read my blog posts anyway. But some stuff on the technical side I've learned:</p>
<ul>
<li>Set up <a href="https://pingdom.com">Pingdom</a> to track uptime, with text message alerts.</li>
<li>Wordpress on a private server is slow as hell. I switched to <a href="https://jekyllrb.com/">Jekyll</a> and wrote a <a href="https://gist.github.com/erikbern/5e81bf7d68e9deab9c55">conversion script</a> to move everything. Honestly not a big fan of Jekyll, but seems like that's what most people use.</li>
<li>Jekyll on a private server is also kind of slow. Use something like <a href="https://pages.github.com/">Github pages</a> or (I do this) <a href="https://aws.amazon.com/cloudfront/">Cloudfront</a>+S3.</li>
</ul>
<p>In retrospect, I should have just used a static site generator and S3+Cloudfront from scratch. Don't run your own server, it's just a recipe for failure. Cloudfront is also nice because it lets you set up HTTPS termination very easily, but if you don't care about HTTPS, you can also just use <a href="https://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html">S3 and run it on HTTP</a>. I've also heard good things about Cloudflare. My AWS bill for Cloudfront and S3 is about ~$5-10/month these days, so not a lot of money.</p>
<p>At some point I also became obsessed with page speed. The choice of the very minimalistic theme made it easy, so after some minor tweaks (like inlining all CSS), I'm now proud to announce that I have a <a href="https://developers.google.com/speed/pagespeed/insights/?url=erikbern.com">99/100 score</a> on Pagespeed Insights ⚡.</p>
<h2 id="some-stuff-ive-learned-about-writing">Some stuff I've learned about writing</h2>
<ul>
<li>There's definitely some superlinear relationship between how much effort I put into things and how much traffic I get. Putting in 2x the effort can give you 10x or 100x the traffic, if you are writing something people care about.</li>
<li>As a corollary, I'm fairly sure you end up with far more hits if you spend twice as much effort on each article but publish half as often.</li>
<li>The title of the blog post matters a LOT. One of the most interesting blog posts I've written in my own opinion, was called <a href="https://erikbern.com/2017/02/01/language-pitch.html">Language pitch</a>. Someone posted it to the Hacker News frontpage, but it fell off almost immediately. Clickbait titles matter! I suspect you get a 5x traffic boost if you put “deep learning” or something similar in the title.</li>
<li>I added AMP to my site, but I'm not really sure it was worth it. The site is already ridiculously fast, so it's just annoying to deal with the slow updates from the AMP cache, the fact that MathJax doesn't work, etc.</li>
<li>HTTPS is also not super important for blogs, and it does slow down the requests a bit. But I like the philosophy of always using HTTPS and I think it might give a marginal boost to search engine rankings.</li>
<li>RSS is a no-brainer, and it baffles me that some blogs don't have it. Without it, you can't subscribe using something like <a href="http://feedly.com/">Feedly</a>.</li>
<li>I <em>think</em>, but have no evidence for it, that meta tags (like <a href="http://ogp.me/">OpenGraph</a> and <a href="https://developer.twitter.com/en/docs/tweets/optimize-with-cards/overview/markup">Twitter</a>) matter a lot. Once I set the metadata correctly to get <a href="https://developer.twitter.com/en/docs/tweets/optimize-with-cards/overview/markup">big-ass previews in Twitter</a>, I feel like I'm getting a lot more traffic from Twitter.</li>
<li>My blog and Twitter sort of reinforce each other. Days when I get a spike of visitors on my blog, I also get a lot of new Twitter followers. Conversely, tweeting a link to my blog generally drives a fair amount of traffic.</li>
<li>Something similar applies to my <a href="https://github.com/erikbern">Github projects</a>. Blogging about projects and linking from projects back to the blog seems to reinforce and drive traffic to both.</li>
<li>I set up a Mailchimp email collection form a month ago as an experiment. It automatically sends an email summary of all blog posts every week and took about <a href="https://kb.mailchimp.com/campaigns/blog-posts-in-campaigns/share-your-blog-posts-with-mailchimp">20 minutes to set up</a>. I've been pretty surprised by the number of people signing up. I'm getting about 2-3 per day, which doesn't seem like a whole lot, but that's potentially 500-1,000 people in a year. I got the idea from <a href="https://jvns.ca/blog/2017/12/28/making-a-weekly-newsletter/">Julia Evans</a>.</li>
<li>I haven't considered adding comments so far. I find the comments of blogs that I read to be mostly a distraction, and I also dread writing posts with few or no comments. But maybe one day.</li>
<li>Blogging seems to help a bit professionally. I don't think it's helped me directly (it certainly hasn't gotten me a job or gotten me promoted) but it does help me a bit in my current job with recruiting. A large fraction of people I talk to have seen my blog. I think it acts as some form of compound marketing, where it's sometimes easier to get a foot in the door with a candidate because they have read my blog in the past.</li>
<li>The original purpose of this blog was to improve my English/writing skills, and I'm fairly happy with it as such. Practicing how to put a thought into a few hundred words succinctly is a good skill to have.</li>
<li>A really nice side effect is that I can use this blog as an excuse to mess around with new stuff. For instance, writing the deep chess article was a great way to make myself learn Theano. It's a nice motivator for finishing my side projects.</li>
<li>I probably should focus this blog a bit more. Software engineering? Machine learning? Management? But I honestly wouldn't have written all this stuff if it wasn't for the fact that I enjoy writing about whatever I'm thinking about, so regular readers have to live with random side topics… e.g. this particular blog post :)</li>
</ul>
New benchmarks for approximate nearest neighbors2018-02-15T00:00:00Zhttps://erikbern.com/2018/02/15/new-benchmarks-for-approximate-nearest-neighbors.html<p>UPDATE(2018-06-17): There is a <a href="/2018-06-17-new-approximate-nearest-neighbor-benchmarks">later blog post with newer benchmarks</a>!</p>
<p>One of my super nerdy interests include approximate algorithms for nearest neighbors in high-dimensional spaces. The problem is simple. You have say 1M points in some high-dimensional space. Now given a query point, can you find the nearest points out of the 1M set? Doing this fast turns out to be tricky.</p>
<p>I'm the author of <a href="https://github.com/spotify/annoy">Annoy</a>, which has more than 3,000 stars on Github. Spotify used a lot of vector space models for music recommendations, and I think as people embrace vector space models more and more, we'll see more attention to fast approximate searches.</p>
<p>What's bothered me about the research is that there's a thousand papers about how to do this, but very little empirical comparison. I built <a href="https://github.com/erikbern/ann-benchmarks">ANN-benchmarks</a> to address this. It pits a bunch of implementations (including Annoy) against each other in a death match: which one can return the most accurate nearest neighbors in the fastest time possible. It's not a new project, but I haven't actively worked on it for a while.</p>
<p>Recently, two researchers (Martin Aumüller and Alexander Faithfull) published a <a href="http://www.itu.dk/people/maau/additional/sisap2017-preprint.pdf">paper</a> featuring ANN-benchmarks and were nice enough to include me as a co-author (despite not writing a single word in the paper). They contributed a ton of useful stuff into ANN-benchmarks which made me realize that my tool could be some kind of “golden standard” for all approximate nearest neighbor benchmarking going forward. So I decided to spend a bit more time making ANN-benchmarks <em>ridiculously</em> easy to use for researchers or industry practitioners active in this field.</p>
<p>My hope is that there is a group of people who care about approximate nearest neighbor search, and that everyone can stick to the same benchmarks going forward. That would be great, because it makes everyone's lives easier (kind of like how ImageNet made it easier for the deep learning crowd).</p>
<h2 id="whats-new-in-ann-benchmarks">What's new in ANN-benchmarks?</h2>
<p>What are the things I've added in the last few months? A bunch of stuff:</p>
<ul>
<li>All algorithms are now Dockerized. This means you don't have to install a bunch of stuff on the host computer, and deal with all the mess that that entails. All you need to do to add a new algorithm is to create a Dockerfile and some more config. Very nice!</li>
<li>It comes with pre-computed datasets. I've collected a bunch of different vector datasets (MNIST and many other ones), split in train and test sets, and computed the nearest neighbors for the test set. Everyone can just download the dataset and use it.</li>
<li>I finally got the <a href="https://travis-ci.org/erikbern/ann-benchmarks">Travis-CI tests working</a>, kind of (they're still a bit flaky).</li>
</ul>
<p>I re-ran all benchmarks, which took a few days and about $100 in EC2 costs. The results depend on what dataset you use, but are somewhat consistent. I'm not going to get into the details. All you need to know for now is that the further up and to the right is better. Feel free to check out <a href="https://github.com/erikbern/ann-benchmarks">ANN-benchmarks</a> for more info.</p>
<p>glove-100-angular:
<img src="https://erikbern.com/assets/glove-100-angular.png" alt="glove-100-angular"></p>
<p>sift-128-euclidean:
<img src="https://erikbern.com/assets/sift-128-euclidean.png" alt="sift-128-euclidean"></p>
<p>fashion-mnist-784-euclidean:
<img src="https://erikbern.com/assets/fashion-mnist-784-euclidean.png" alt="fashion-mnist-784-euclidean"></p>
<p>gist-960-euclidean:
<img src="https://erikbern.com/assets/gist-960-euclidean.png" alt="gist-960-euclidean"></p>
<p>In almost all the datasets the top 5 are in the following order:</p>
<ol>
<li>HNSW (hierarchical navigable small world) from <a href="https://github.com/searchivarius/nmslib">NMSLIB</a> (non metric search library) knocks it out of the park. It's over 10x faster than Annoy.</li>
<li><a href="https://github.com/aaalgo/kgraph">KGraph</a> is not far behind, which is another graph-based algorithm</li>
<li>SW-graph from NMSLIB</li>
<li>FAISS-IVF from <a href="https://github.com/facebookresearch/faiss">FAISS</a> (from Facebook)</li>
<li><a href="https://github.com/spotify/annoy">Annoy</a> (I wish it was a bit faster, but think this is still honorable!)</li>
</ol>
<p>In previous benchmarks, <a href="https://github.com/FALCONN-LIB/FALCONN">FALCONN</a> used to perform very well, but I'm not sure what's up with the latest benchmarks – it seems like a huge regression. If any of the authors are reading this, I'd love it if you could figure out what's going on. FALCONN is somewhat interesting because it's the only library I've seen that gets decent results using <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">locality sensitive hashing</a>. Other than that, I haven't been very impressed by LSH. Graph-based algorithms seem to be the state of the art, in particular HNSW. Annoy uses a very different algorithm: it recursively partitions the space using a two-means algorithm (see the toy sketch below).</p>
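<p>For the curious, here's a toy sketch of that two-means idea (my own simplification, nothing like Annoy's actual internals): find two rough cluster centers, then split the points by the hyperplane halfway between them. Annoy applies this recursively to build a tree.</p>
<div class="highlight"><pre><code class="language-python">import numpy as np

def two_means_split(points, iters=5):
    # Pick two random points as initial centers.
    a, b = points[np.random.choice(len(points), 2, replace=False)]
    for _ in range(iters):
        # Assign every point to the nearer center, then recompute the centers.
        near_a = np.linalg.norm(points - b, axis=1) > np.linalg.norm(points - a, axis=1)
        a, b = points[near_a].mean(axis=0), points[~near_a].mean(axis=0)
    # Split by the hyperplane equidistant from the two centers.
    normal = a - b
    offset = normal.dot((a + b) / 2)
    return points.dot(normal) > offset

points = np.random.normal(size=(1000, 20))
mask = two_means_split(points)
print(mask.sum(), (~mask).sum())  # roughly balanced halves
</code></pre></div>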
<h2 id="a-final-word">A final word</h2>
<p>Going forward, if I see a paper about fast approximate nearest neighbor queries and it doesn't include proper benchmarks against any of the top libraries, I'm not going to give a 💩! ANN-benchmarks makes it so easy that there's no excuse anymore!</p>
I'm looking for data engineers2018-01-28T00:00:00Zhttps://erikbern.com/2018/01/28/im-looking-for-data-engineers.html<p><img src="https://erikbern.com/assets/binary_globe.jpeg" alt="binary globe"></p>
<p>I'm interrupting the regular programming for a quick announcement: we're looking for data engineers at <a href="https://better.com">Better</a>. You would be the first one to join and would work a lot directly with me.</p>
<p>Some fun things you <em>could</em> work on (these are all projects I'm working on right now):</p>
<ul>
<li>Building a forecasting model <a href="https://github.com/pymc-devs/pymc3">using MCMC</a> to predict volume the next few months. Productionize it in the form of a job that posts an updated visualization to Slack every night.</li>
<li>Migrate our data warehouse to Redshift.</li>
<li>Write and productionize a web scraper to ingest a bunch of financial third party data.</li>
<li>Work with business people to figure out what to measure and define KPIs for the company.</li>
<li><a href="https://github.com/better/convoys">Fit Gamma distributions</a> to conversion data to understand the time lag and conversion rates.</li>
<li>Work on <a href="https://github.com/better/champy">mixed-integer optimization</a> problems to find the best loans for borrowers.</li>
<li>Work with the product team to understand our conversion funnel and friction points to the user.</li>
<li>Build an ETL pipeline that downloads phone call mp3s, transcribes them to text, and ingests them into the database.</li>
</ul>
<p>This position is very engineering-heavy at its core, and the main qualification is solid programming skills. I like to see at least a couple of years of professional software experience, and programming skills that go back at least five years.</p>
<p>Other than that, we don't require any particular skills, and certainly no fancy academic credentials. However, a curiosity about what makes a startup succeed is great, and if you know some basic statistics that doesn't hurt (but certainly isn't required).</p>
<p>What does Better do? I've written a blog post about <a href="/2017/02/17/why-i-went-into-the-mortgage-industry.html">why I went into the mortgage industry</a> if you're interested. In short, it's an <em>enormous</em> industry and also a very terrible one. We think we can fix it, by making mortgages fast, easy, and inexpensive. And we're not fixing it in a sketchy way (for instance by lending to people who can't afford the loan). We focus on the seven million Americans who get a normal plain mortgage every year, and we are making the experience better by solving all the pain points using technology.</p>
<p>Sounds interesting? Definitely let me know at erik at better.com!</p>
<p>(The binary globe picture is somewhat ironic. I have fascination for “data” stock photos.)</p>
Books I consumed in 20172018-01-17T00:00:00Zhttps://erikbern.com/2018/01/17/books-i-consumed-in-2017.html<p><img src="https://erikbern.com/assets/library.jpeg" alt="library"></p>
<p>Turns out having a toddler isn't super compatible with reading. I used to read ~100 books/year as a teenager, but it has slowly deteriorated to maybe 20-30 books, at most. And I don't even finish all of them because life is too short! Some books are just not that interesting. So what were some of the books worth mentioning?</p>
<h2 id="economics--venture-capital--misc">Economics / Venture capital / Misc</h2>
<p>I'm a sucker for behavioral economics, and I usually keep it pretty superficial. So I wasn't surprised that <a href="https://www.amazon.com/gp/product/039335279X/ref=oh_aui_detailpage_o08_s00?ie=UTF8&psc=1">Misbehaving</a> by Richard Thaler was a pretty nice book. It probably resurged in rankings after his well-deserved Nobel Prize in Economics at the end of last year.</p>
<p>A book on history and economics that keeps coming up is <a href="https://www.amazon.com/gp/product/1843763311/ref=oh_aui_detailpage_o04_s00?ie=UTF8&psc=1">Technological Revolutions and Financial Capital</a>. I'm always a bit skeptical of these books trying to find predictable pattern in history (reminds me of pseudoscientific <a href="https://en.wikipedia.org/wiki/Kondratiev_wave">Kondratiev waves</a> and crackpot stuff like that) but still, there are probably some legit patterns that we can learn from. Worth reading.</p>
<p><a href="https://www.amazon.com/gp/product/0544602315/ref=oh_aui_detailpage_o08_s00?ie=UTF8&psc=1">Unbanking of America</a> is a great book about financial services for poor people in America. It went a long way explaining the alternative banking systems they use and why it makes sense for them.</p>
<p><a href="https://www.amazon.com/gp/product/0470874449/ref=oh_aui_detailpage_o07_s00?ie=UTF8&psc=1">The Business of Venture Capital</a> is a pretty dry book about venture capital, and a bit dated, but it's definitely the best one out of a few books I read trying to wrap my head around how the industry works. A very different book was <a href="https://www.amazon.com/gp/product/039959101X/ref=oh_aui_detailpage_o03_s00?ie=UTF8&psc=1">Reset</a> by Ellen Pao. The book does not paint a rosy picture of behavior in the VC industry, but I also can't help but getting the feeling that Pao throughout the book is a bit… naïve? I read the book partly as a story of a high performing “work hard and everything will follow” person oblivious of the power structures and the politics that exists basically everywhere. As such, I think it was an interesting book, and something I can sympathize with myself.</p>
<p>A hyped up book that I couldn't <em>stand</em> was <a href="https://www.amazon.com/gp/product/0143121359/ref=oh_aui_detailpage_o08_s00?ie=UTF8&psc=1">The Beginning of Infinity</a>. I gave up after a few hundred pages. The author loves using the word “parochial” about five times on every page and goes on long rants about cosmology and many other things. I studied physics in school, but one thing I despise about physics people is they have a tendency to assume everything in physics generalizes to everything and that physics students are omnipotent mathemagicians who can sprinkle their mathematical models on basically every possible sociological/economical/whatever phenomenon. Similarly I gave up on <a href="https://www.amazon.com/gp/product/0231180721/ref=oh_aui_detailpage_o07_s00?ie=UTF8&psc=1">How Much Inequality Is Fair</a> which is some arbitrary mathematical model of equality the author cooked up that is <em>just one out of thousands of plausible models you can cook up</em>. But enough ranting about physics!</p>
<h2 id="operations-research">Operations research</h2>
<p>I read a whole pile of books on how to structure work efficiently. Partly because our mortgage process has a lot of parallels to a manufacturing pipeline, partly because software development has that too.</p>
<p><a href="https://www.amazon.com/gp/product/0884271951/ref=oh_aui_detailpage_o06_s01?ie=UTF8&psc=1">The Goal</a> is a classic “business fiction” book that is worth reading for anyone with a manufacturing pipeline. I think of it as the base case: low variability, <em>find the bottleneck</em> kind of process, and the “theory of constraints” probably works well in those cases. The whole book can be summed up as “find the bottleneck and sometimes you have to do slightly inefficient things to expand it”. <a href="https://www.amazon.com/gp/product/1935401009/ref=oh_aui_detailpage_o00_s00?ie=UTF8&psc=1">The Principles of Product Development Flow</a> build upon The Goal pretty nicely and handles the case of substantial variance (like in the case of software engineering).</p>
<p><a href="https://www.amazon.com/gp/product/0071326227/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1">Matching Supply with Demand</a> is a bit more theoretic and introduces some of the basic math of queue theory. <a href="https://www.amazon.com/gp/product/0060559535/ref=oh_aui_detailpage_o07_s00?ie=UTF8&psc=1">Reengineering the Corporation</a> goes through a few case studies of how to streamline work processes.</p>
<h2 id="business--management">Business / management</h2>
<p><a href="https://www.amazon.com/gp/product/1491973897/ref=oh_aui_detailpage_o06_s00?ie=UTF8&psc=1">The Manager's Path</a> is by far the best book I've read about technical management. Highly recommended to anyone who writes code and reads book.</p>
<p>Mostly out of curiosity, I read <a href="https://www.amazon.com/gp/product/0470139889/ref=oh_aui_detailpage_o00_s00?ie=UTF8&psc=1">The Science of Success</a> by the notorious Koch brothers. It's not a masterpiece, and I'm not going to recommend it, but I found myself agreeing with basically their whole theory of “market based management”. The book is basically a case for making more economically rational decisions, which is hard to disagree with.</p>
<p><a href="https://www.amazon.com/gp/product/1501135910/ref=oh_aui_detailpage_o01_s00?ie=UTF8&psc=1">Shoe Dog</a> was a great summary of the early days of Nike and I'm left with tremendous respect for the copious amounts of hustle going into it. Not sure what's going on with your supplier? Jump onto the next plane to Japan and show up at the factory next day. <a href="https://www.amazon.com/gp/product/1250096065/ref=oh_aui_detailpage_o00_s00?ie=UTF8&psc=1">Losing the Signal</a> was a fun story about Research In Motion (makers of Blackberry) and their decay after living in denial about the IPhone. <a href="https://www.amazon.com/gp/product/0385479506/ref=oh_aui_detailpage_o04_s00?ie=UTF8&psc=1">Co-opetition</a> was a pretty good book about “collusion” (despite some dumb blunders early on like predicting the IPhone will fail)</p>
<p>I got a bit tired of management books at some point so I turned to other areas for inspiration. I ended up reading <a href="https://www.amazon.com/gp/product/0960273603/ref=oh_aui_detailpage_o04_s00?ie=UTF8&psc=1">Attacks</a> by the infamous German general Rommel. I don't think I can recommend it, but it still left me with some thoughts. First of all, the guy is a massive psycho. But what strikes me is to what lengths Rommel goes to <em>lead by example.</em> At some point, he gets excited about putting his bayonet skills to practice when he rushes through the woods with his troops towards the French. He doesn't hesitate to crawl ahead of his platoon through the muddy trenches to ambush the enemy. Over time he's promoted several times but I think his obsessive attention to everything that matters on the battlefield makes him a lot more effective as a general. I've seen leaders in tech who stopped coding way too early and I see how their leadership skills end up suffering because their technical judgement was never fine tuned.</p>
<p>One military leader with a very different perspective is Stanley McChrystal. His book <a href="https://www.amazon.com/Share-Task-General-Stanley-McChrystal/dp/159184682X">My Share of the Task</a> describes a leader who does a great job de-bottlenecking information and setting the high-level strategy right (I also loved his book <a href="https://www.amazon.com/Team-Teams-Rules-Engagement-Complex/dp/1591847486">Team of Teams</a>). But the whole book leaves me with the feeling that he just never <em>gets it.</em> Never having been a foot soldier, he fails to empathize with his troops on the ground. The whole war in Afghanistan reads like a terrible startup without any product market fit (the Afghans don't <em>want</em> the Americans to help them) that keeps raising more money (asking for more troops).</p>
<p>The chasm between a general like McChrystal and Rommel is pretty striking as a top-down ivory tower style management vs a decentralized bottom up approach. There's some interesting things to learn from both perspectives.</p>
<h2 id="stuff-i-listened-to">Stuff I listened to</h2>
<p>My 2017 music consumption was fairly label-centric. I finally felt like Sweden regained its reputation as a reliable producer of electronic music. Both <a href="https://open.spotify.com/user/kornelkovacs/playlist/63jx9auELtv22jqOvYHbaB">Studio Barnhus</a> and <a href="https://open.spotify.com/user/brainsprain/playlist/7qXwFMsNzMrnaNKDP8p1WG">Northern Electronics</a> are labels releasing a long string of fine techno (and they couldn't be more different!). Another label I enjoyed quite a lot was <a href="https://open.spotify.com/user/htphinney/playlist/215TGFgN1aCZ94BBouUYKv">Lobster Theremin</a> from the UK.</p>
Plotting author statistics for Git repos using Git of Theseus2018-01-03T00:00:00Zhttps://erikbern.com/2018/01/03/plotting-author-statistics-for-git-repos-using-git-of-theseus.html<p>I spent a few days during the holidays fixing up a bunch of semi-dormant open source projects and I have a couple of blog posts in the pipeline about various updates. First up, I made a number of fixes to <a href="https://github.com/erikbern/git-of-theseus">Git of Theseus</a> which is a tool (written in Python) that generates statistics about Git repositories. I've <a href="https://erikbern.com/2016/12/05/the-half-life-of-code.html">written about it previously</a> on this blog. The name is a horrible pun (I'm a dad!) on <a href="https://en.wikipedia.org/wiki/Ship_of_Theseus">Ship of Theseus</a> which is a philosophical thought experiment about what happens if you replace every single part of a boat — is it still the same boat ⁉️ 🤔</p>
<p>So anyway, here's one of the plots you can generate for <a href="https://github.com/kubernetes/kubernetes">Kubernetes</a> — a somewhat arbitrarily picked repository.</p>
<p><img src="https://erikbern.com/assets/git-kubernetes.png" alt="k8s git"></p>
<p>So what's new? I've updated the color scheme a bit, but also added the option to plot author statistics:</p>
<p><img src="https://erikbern.com/assets/git-kubernetes-authors.png" alt="k8s git"></p>
<p>And it doesn't stop there! Here are some other minor updates:</p>
<ul>
<li>I published the <a href="https://pypi.python.org/pypi/git-of-theseus">whole thing to PyPI</a> which also means that the installation is far simpler: just run <code>pip install git-of-theseus</code>.</li>
<li>The pip package also installs binaries that let you run the analyses in a more straightforward way: just run <code>git-of-theseus-analyze</code> on the command line.</li>
<li>By default it now only analyzes files with extensions that indicate source code (by leveraging <a href="http://pygments.org/">pygments</a>)</li>
<li>You can also normalize stats using the <code>--normalize</code> flag. See plot below:</li>
</ul>
<p><img src="https://erikbern.com/assets/git-git-authors-normalized.png" alt="git git"></p>
<p>That's it! As I mentioned, I've got more where this came from. Some future blog posts will cover:</p>
<ul>
<li><a href="https://github.com/erikbern/ann-benchmarks">ann-benchmarks</a> which is a tool to benchmark approximate nearest neighbor methods. Very niche, but very useful within its niche. I just spent a lot of time precomputing datasets and Dockerizing all algorithms.</li>
<li><a href="https://github.com/better/convoys">convoys</a> a new tool I built to model and plot time-lagged conversion. Fun stuff with Gamma and Weibull distributions.</li>
<li><a href="https://github.com/better/champy">champy</a> which is a halfway implementation wrapper that lets you formulate and solve <a href="https://en.wikipedia.org/wiki/Linear_programming">linear programming</a>, <a href="https://en.wikipedia.org/wiki/Integer_programming">mixed integer programming</a>, and <a href="https://en.wikipedia.org/wiki/Constraint_programming">constraint programming</a> problems in a much nicer way (IMO) than any other library I've encountered. Don't hold your breath for this one — it's pretty far from being production-grade.</li>
</ul>
<p>EDIT(2018-01-16): added a few more notes</p>
Toxic meeting culture2017-12-29T00:00:00Zhttps://erikbern.com/2017/12/29/toxic-meeting-culture.html<p><img src="https://erikbern.com/assets/dogs-meeting.jpg" alt="dogs meeting"></p>
<p>I spent six years at a company that went from 50 people to 1500 and one contributing factor leading to my departure was that I went from a “maker” to a person stuck in meetings every day. It wasn't that I wanted to do that, but everyone else kept dragging me into meetings.</p>
<p>There are about 47 million blog posts about why meetings suck and I'm not going to pile more onto that heap. For the record, a well run meeting is great! But thinking back, there were a number of things that absolutely, completely sucked about the meeting culture. Rather than focusing on how to run meetings better, let's try to trace the issues back to the organizational culture that created this mess. Here's some armchair philosophizing:</p>
<h2 id="people-feeling-good-about-being-in-meetings">People feeling good about being in meetings</h2>
<p>People are hardwired to seek proxies: things you can do and feel good about right away. We do that because delayed gratification is hard. We're constantly at risk of failing the <a href="https://en.wikipedia.org/wiki/Stanford_marshmallow_experiment">marshmallow test</a> and succumbing to the urge for satisfaction right now. Software engineers like to crank out code, because years of coding have made it feel really good. So what's the issue with that? It leads to work that might not have anything to do with business value: vanity refactoring, spending too much time on technically interesting problems, etc.</p>
<p>So why do managers feel good about being in meetings? Because they seek vague proxies for feeling like they “got things done”. Coming out of a long day of meetings, full of discussions with people, validates their feeling of purpose in an organization.</p>
<p><em>How to solve:</em> promote people based on their actual work output and the value of their decisions.</p>
<h2 id="decision-meetings-that-turn-into-shitty-brainstorming-meetings">“Decision” meetings that turn into shitty brainstorming meetings</h2>
<p>It seems like there's <a href="http://tomtunguz.com/brainstorming/">substantial evidence</a> that it's a bad idea to put a bunch of people in a room to come up with anything creative. But people still try. Silly people!</p>
<p>I suspect the biggest losers in these poorly run meetings are (a) introverts, (b) people who combine lots of different signals into a single conclusion (aka <a href="https://en.wikipedia.org/wiki/The_Hedgehog_and_the_Fox">foxes</a>), and (c) people with language or cultural barriers. I suffer from all three of these, which sucks. Put a bunch of people in a room to brainstorm, and most likely some alpha-male Ivy League guy is going to argue for his thesis eloquently in a way that leaves you speechless. And it sucks for people like me who like to think for a bit and weigh different signals together. My first three years in the US, I struggled to get anything said in an unmoderated discussion. Americans love to interrupt and they expect you to! Anyway, I digress…</p>
<p><em>How to solve:</em> never have meetings with more than 4 people if you are planning to brainstorm to reach to a conclusion. Be incredibly suspicious of people having meetings with more than 4 people trying to make a decision.</p>
<h2 id="fear-of-leaving-people-out">Fear of leaving people out</h2>
<p>Any recurring meeting with senior people will slowly be debased by invitee inflation. But that devalues the meeting and wastes people's time.</p>
<p><em>How to solve:</em> start telling people: thanks for not inviting me to a meeting! Cap all recurring meetings to at most 10 times.</p>
<h2 id="load-factor-and-iteration-cycle">Load factor and iteration cycle</h2>
<p>One thing I noticed about many (especially ineffective) managers is that they basically let their calendar availability act as a prioritization filter. They would let their calendars slowly fill up until it was impossible to find any time at all. At that point people would either (a) accept that they have to wait until the week after next to make a decision, or (b) give up and try to figure out some other way to make progress.</p>
<p>This is similar to what happens in a system with high variability and a load factor near 100%: the cycle time shoots up towards infinity as the load factor approaches 100%. The average time it takes to make a decision might be a few hours or a day or two at 25% calendar load, but at 75% calendar load it's almost impossible to find a slot on everyone's calendars, so the time it takes to make decisions shoots up to <em>weeks</em>. Since decisions tend to be highly dependent on each other, you basically slow down decision making across the whole organization.</p>
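<p>To see how nonlinear this gets, here's a toy simulation (my own sketch, not real calendar data): assume each person's slots are busy independently with probability equal to the load factor, and count how long it takes to find a slot where everyone is free.</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import random

def days_until_meeting(num_people, load, slots_per_day=8, seed=None):
    # Each slot is free for a given person with probability (1 - load);
    # we need one slot where *everyone* is free at once.
    rng = random.Random(seed)
    day = 0
    while True:
        day += 1
        for _ in range(slots_per_day):
            if all(rng.random() > load for _ in range(num_people)):
                return day

for load in [0.25, 0.50, 0.75, 0.90]:
    trials = [days_until_meeting(4, load, seed=i) for i in range(300)]
    print('%d%% calendar load: %.1f days on average' % (load * 100, sum(trials) / len(trials)))
</code></pre></div><p>With four attendees the expected wait grows like 1/(1 - load)^4: roughly a day at 25% load, about a month at 75%, and years at 90% (at which point the toy model has clearly broken down – people give up instead).</p>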
<p><em>How to solve:</em> don't defer decisions to meetings. Make decisions on the spot, communicate it over email, and use the meeting to discuss it.</p>
Learning from users faster using machine learning2017-12-12T00:00:00Zhttps://erikbern.com/2017/12/12/learning-from-users-faster-using-machine-learning.html<p>I had an interesting idea a few weeks ago, best explained through an example. Let's say you're running an e-commerce site (I <a href="https://better.com">kind of do</a>) and you want to optimize the number of purchases.</p>
<p>Let's also say we try to learn as much as we can from users, using both A/B tests and basic slicing and dicing of the data. We are looking at how many people convert (buy our widgets), but a constant problem is that there's just <em>too much uncertainty</em>.</p>
<p>How can we learn faster? In particular, is there a way to incorporate <em>additional data</em> somehow? This struck me as a very universal problem, so I <a href="https://twitter.com/bernhardsson/status/931734871168503808">tweeted this</a> and got some interesting pointers.</p>
<h2 id="reading-material">Reading material</h2>
<p><a href="https://twitter.com/jeremystan">Jeremy Stanley</a> pointed out: “If you can place a dollar value on a unit of each metric, then you could use the total as a single metric”. I like that approach because it's super simple and it's easy to understand how it works.</p>
<p><a href="https://twitter.com/johnmyleswhite">John Myles White</a> sent a link to a blog post: <a href="http://www.deaneckles.com/blog/745_using-covariates-to-increase-the-precision-of-randomized-experiments/">Using covariates to increase the precision of randomized experiments</a>. I don't quite follow it, but my understanding is that it's more of a way to reduce noise caused by uneven assignment between the test and control group. <a href="https://twitter.com/eyadsibai">Eyad Sibai</a> pointed out a KDD paper: <a href="http://www.kdd.org/kdd2016/subtopic/view/boosted-decision-tree-regression-adjustment-for-variance-reduction-of-onlin">Boosted Decision Tree Regression Adjustment for Variance Reduction of Online Controlled Experiments</a>. The idea is simple. Instead of using say “purchased a widget” as an outcome metric, try to predict based on user attributes whether the user is going to purchase the widget. Then, use the deviation between the prediction and the real value (whether the user purchased the widget) as the target metric in the test. Using this approach, they reach the same level of statistical significance with 63% less data. Nice!</p>
<p><a href="https://twitter.com/shalituri">Uri Shalit</a> sent a link to a paper: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2818753/">Correlated Bivariate Continuous and Binary Outcomes: Issues and Applications</a>. The paper looks cool, but is way over my head, sadly.</p>
<p><a href="https://twitter.com/measurestuff">Adrian Palacios</a> sent a link to a new Netflix blog post: <a href="https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55">Innovating Faster on Personalization Algorithms at Netflix Using Interleaving</a>. This research focuses on a very narrow problem: ranking video recommendations. By scrapping A/B testing for “interleaving”, they show that they get to significance about two orders of magnitude (100x) faster! That's extremely impressive, but it's not clear to me if it generalizes to any other type of tests.</p>
<h2 id="a-simple-toy-model">A simple toy model</h2>
<p>Enough literature study. I wanted to try this in practice. My idea is: create a model that predicts whether someone is going to purchase a widget <em>given a lot of additional data</em>. And instead of using the actual target metric (what fraction of people bought widgets), we use the metric <em>predicted</em> by our machine learning model. So, for instance, we throw all kinds of features into the model as inputs and try to predict the target (did the user buy the widget or not?).</p>
<p>I experimented with a few different models, and the one that seemed to work best was the most basic model you could ever think of: <em>linear regression</em>. The inputs are all binary variables that denote whether the user hits certain pages in the conversion flow. In total there are 70 different features (we have a lot of special pages only some users hit). The target is a binary variable indicating whether the user converted. I train the model to minimize the squared loss.</p>
<p>One cool feature is that we can apply dropout to the inputs, which makes it possible to include the target itself as an input. It turns out doing that and using <em>extreme</em> dropout actually seems to work really well. I ended up dropping 90% of the inputs during training. The model is a few lines using Keras:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">inputs <span style="color:#f92672">=</span> keras<span style="color:#f92672">.</span>layers<span style="color:#f92672">.</span>Input(shape<span style="color:#f92672">=</span>(X<span style="color:#f92672">.</span>shape[<span style="color:#ae81ff">0</span>],))
inputs_dropout <span style="color:#f92672">=</span> keras<span style="color:#f92672">.</span>layers<span style="color:#f92672">.</span>Dropout(<span style="color:#ae81ff">0.90</span>)(inputs)
outputs <span style="color:#f92672">=</span> keras<span style="color:#f92672">.</span>layers<span style="color:#f92672">.</span>Dense(<span style="color:#ae81ff">1</span>, activation<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">linear</span><span style="color:#e6db74">'</span>)(inputs_dropout)
model <span style="color:#f92672">=</span> keras<span style="color:#f92672">.</span>models<span style="color:#f92672">.</span>Model(inputs<span style="color:#f92672">=</span>inputs, outputs<span style="color:#f92672">=</span>outputs)
model<span style="color:#f92672">.</span>compile(optimizer<span style="color:#f92672">=</span>keras<span style="color:#f92672">.</span>optimizers<span style="color:#f92672">.</span>SGD(lr<span style="color:#f92672">=</span><span style="color:#ae81ff">1e-2</span>), loss<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">mse</span><span style="color:#e6db74">'</span>)
model<span style="color:#f92672">.</span>fit(X, y, batch_size<span style="color:#f92672">=</span><span style="color:#ae81ff">16</span>, epochs<span style="color:#f92672">=</span><span style="color:#ae81ff">300</span>)
</code></pre></div><h2 id="results">Results</h2>
<p>I “simulated” a conversion rate A/B test by picking three random subsets of users of our site, chosen in some way I'm not going to disclose. We want to understand whether the conversion rate differs between the three subsets.</p>
<p>Let's first run the analysis the “traditional” way. We look at the fraction of users who make it through the entire conversion flow, and we plot the conversion rate with a confidence interval. As we get more and more data, the confidence interval for each group shrinks:</p>
<p><img src="https://erikbern.com/assets/funnel_model_reals.png" alt="pic"></p>
<p>Then, I train a model on data from earlier this year so that it's properly out of sample. Let's plot the same thing again, but replace all the “real” values with their <em>predicted</em> values. Each predicted value is obtained by feeding all 70 binary features for that user into the model.</p>
<p>We can see that the confidence intervals are <em>much</em> tighter and that the conversion rate curve is much smoother:</p>
<p><img src="https://erikbern.com/assets/funnel_model_preds.png" alt="pic"></p>
<p>This is actually pretty cool! In this case, it looks like we can get a confidence interval that's almost 50% smaller. Since the width of a confidence interval shrinks with the square root of the sample size, a half-width interval means we can get to statistical significance about 4x faster. If we plot the conversion rates and the confidence intervals for a larger set of groups, we can see that the uncertainty is consistently smaller using the predicted values:</p>
<p><img src="https://erikbern.com/assets/funnel_model_bars.png" alt="pic"></p>
<h2 id="conclusion">Conclusion</h2>
<p>Consider this blog post a bit of a wacky experiment – I think the outcome is super interesting, and worth thinking more about.</p>
<p>One downside of this model is that the reduction in variance doesn't come for free. We're basically trading a bit of <a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff">bias for variance</a>. The predicted conversion rate might have a tighter confidence interval, but it's no longer guaranteed to converge to the “correct” value. Is that acceptable? I don't know. Another issue is that you can have some kind of “covariate shift”, where the data distribution changes over time – for instance, what if your conversion flow completely changes?</p>
<p>I haven't spent enough time understanding this, and I haven't made up my mind if this tool is going to be something I'm planning to use for real data. This is admittedly a bit of a half-baked idea that I would love to get some feedback on!</p>
<h2 id="postscript">Postscript</h2>
<p>The almighty John Myles White <a href="https://twitter.com/johnmyleswhite/status/940927599726997504">tweeted</a> a link to this blog post but raised a number of concerns with the methodology. We had a bit of back and forth, but I think I was able to <a href="https://twitter.com/johnmyleswhite/status/941308333558063105">address</a> the concerns. I think it's useful to rehash some of the arguments for other people:</p>
<ul>
<li>Rather than focusing on some particular target metric (eg conversion rate), let's focus on the underlying “user value”.</li>
<li>“User value” is a continuous variable rather than a binary one, so it should have lower variance (everything else equal).</li>
<li>We define “user value” by running a regression model that finds a linear combination of many different user metrics in a way that predicts the original target.</li>
<li>So basically we learn to replace one value with a lower variance version of itself (but with slight bias).</li>
</ul>
<p>Some more notes:</p>
<ul>
<li>Why use a neural network for this? It's basically a linear regression, but we need some kind of regularization. I have a mild preference for dropout since it's dimensionless.</li>
<li>Why do we need regularization? Without it, the linear regression would just learn the trivial mapping where the target predicts the target and everything else is ignored.</li>
<li>You can see the bias-variance tradeoff pretty clearly here. If you set the dropout rate to 0, you get zero bias and high variance. As you increase the dropout rate, you increase the bias but lower the variance.</li>
<li><a href="https://twitter.com/SergeyFeldman">Sergey Feldman</a> also pointed out <a href="https://arxiv.org/pdf/1608.00060.pdf">this paper</a>: Double/Debiased Machine Learning for Treatment and Structural Parameters. Seems interesting.</li>
</ul>
Annoy 1.10 released, with Hamming distance and Windows support2017-11-26T00:00:00Zhttps://erikbern.com/2017/11/26/annoy-1.10-released-with-hamming-distance-and-windows-support.html<p>I've been a bit bad at posting things with a regular cadence lately, partly because I'm trying to adjust to having a toddler, partly because the hunt for clicks has set such a high bar for me that I feel like I have to post something Pulitzer-worthy. But things are always cooking, so let's break this pattern with a quick notice on something I've been working on!</p>
<h2 id="annoy-1100-is-out">Annoy 1.10.0 is out</h2>
<p><img src="https://erikbern.com/assets/2015/05/ann.png" alt="pic"></p>
<p><a href="https://github.com/spotify/annoy">Annoy</a> is a library I built at Spotify that helps your search for approximate nearest neighbors in high-dimensional spaces. This is super useful if you use vector models, which Spotify uses a lot. Every track/album/artist/playlist/user ends up being a vector in some high dimensional space (typically 40D, sometimes more). The problem is that <em>searching</em> in that space is a nontrivial art (if I recall correctly, it's expected but not proven to be to be NP-complete).</p>
<p>Annoy solves this issue by relaxing the search to be <em>approximate</em>. You can usually get 90% or 99% recall with only 1% of the runtime of an exhaustive search. This is great in many applications like recommendations where the cost of a false negative isn't the end of the world.</p>
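<p>If you haven't used it, the core API is tiny – roughly like this (adapted from the project README, with random data):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">from annoy import AnnoyIndex
import random

f = 40  # dimensionality, e.g. a 40D vector model
index = AnnoyIndex(f, 'angular')
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)  # 10 trees: more trees give better recall at query time
print(index.get_nns_by_item(0, 10))  # the 10 approximate nearest neighbors of item 0
</code></pre></div>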
<p>Annoy 1.10.0 features mind-altering things like <a href="https://en.wikipedia.org/wiki/Hamming_distance">Hamming distance</a> support, added by <a href="https://github.com/maumueller">Martin Aumüller</a>. Hamming distance is great when your vectors can be represented in binary form (every coordinate is either 0 or 1). This means vectors can be stored very efficiently as 64-bit integers, and distance can be computed using primitives like <a href="https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html">__builtin_popcountll</a>, which I think compiles down to a single instruction on modern CPUs. The tree-building method right now only considers axis-aligned splits (effectively making it a <a href="https://en.wikipedia.org/wiki/K-d_tree">k-d tree</a>), but I'm hoping to experiment with a few other heuristics at some point in the future.</p>
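<p>For intuition, here's what that trick looks like in a few lines of Python (my sketch, not Annoy's C++ internals):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">def hamming(a: int, b: int) -> int:
    # XOR leaves a 1 bit wherever the two patterns differ;
    # counting those bits gives the Hamming distance. In C this
    # count is a single popcount instruction.
    return bin(a ^ b).count('1')

u = 0b10110100  # two 8-bit binary "vectors"
v = 0b10010110
print(hamming(u, v))  # 2
</code></pre></div>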
<p>The other main thing that 1.10.0 adds is Windows support with a proper <a href="https://ci.appveyor.com/project/erikbern/annoy">CI pipeline</a>, contributed by <a href="https://github.com/tjrileywisc">Timothy Riley</a>. Annoy has had some semi-broken Windows support for a very long time, but several people have reported that it doesn't work. Since I haven't had access to any Windows machines, it's been tricky for me to debug. The Windows build only works on Python 3.6 (but quite frankly: I'm a big proponent of Py3 – and my sympathy for people on Py2 is very limited).</p>
<h2 id="what-else">What else?</h2>
<ul>
<li>I got an email saying Annoy powers a <a href="https://omdb.diracmaterials.org/">database for condensed matter physics</a>. See corresponding <a href="https://arxiv.org/abs/1710.11611">paper 1</a> and <a href="https://arxiv.org/abs/1709.03151">paper 2</a>. Always fun when things end up in unexpected fields.</li>
<li>There's a long list of companies using Annoy (including Spotify). Instacart <a href="https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2">is a new entry to that list</a>. They use it to recommend groceries.</li>
<li><a href="http://www.itu.dk/people/maau/additional/sisap2017-preprint.pdf">A paper about approximate nearest neighbor benchmarks</a> was recently accepted at <a href="http://www.sisap.org/2017/">SISAP</a> and the authors were nice enough to include me as a co-author. This relates to a similar open source project I have: <a href="https://github.com/erikbern/ann-benchmarks">ann-benchmarks</a>. There's a lot going on right now with that project that will be its own blog post in the future, but one thing worth mentioning so far is I've built a number of benchmark datasets for approximate nearest neighbors that I encourage you to use if you're interested!</li>
<li>What's up next for Annoy? There's a work in progress pull request for <a href="https://github.com/spotify/annoy/pull/246">threaded index building</a> which should speed things up a lot.</li>
<li>I'm speaking about Annoy at the <a href="https://pages.dataiku.com/egg2017-non-conforming-data-science-conference">EGG2017 conference</a> in NYC on Nov 30. Feel free to drop by and say hi! It will cover basically a slightly updated version of my series of blog posts from before: <a href="/2015/09/24/nearest-neighbor-methods-vector-models-part-1.html">part 1</a>, <a href="/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html">part 2</a>, and <a href="/2015/10/20/nearest-neighbors-and-vector-models-epilogue-curse-of-dimensionality.html">part 3</a>. Expect a number of dad jokes about dimensionality and slides like the one below.</li>
</ul>
<p><img src="https://erikbern.com/assets/2015/10/curse-of-dimensionality.png" alt="pic"></p>
Why conversion matters: a toy model2017-10-30T00:00:00Zhttps://erikbern.com/2017/10/30/why-conversion-matters-a-toy-model.html<p><img src="https://erikbern.com/assets/funnel.gif" alt="funnel"></p>
<p>There are often close relationships between top-level business metrics. For instance, it's well known that retention has a <a href="https://25iq.com/2017/01/27/everyone-poops-and-has-customer-churn-and-a-dozen-notes/">super strong impact</a> on the valuation of a subscription business. Or that the % of occupied seats is super important for an airline. A fun little <a href="https://en.wikipedia.org/wiki/Toy_model">toy model</a> that I came up with generates a curious relationship between conversion rates and revenue.</p>
<h2 id="the-intuition">The intuition</h2>
<p>Let's look at an e-commerce company. I'm working at a company that does mortgages online, for instance.</p>
<p>My intuition is roughly:</p>
<ol>
<li>Higher conversion rates means more volume. That's the obvious first order effect.</li>
<li>Higher conversion rates means better unit economics, meaning you can now acquire a bunch of customers that previously did not make sense. You can keep growing volume until the marginal acquisition cost catches up with the new break even point.</li>
</ol>
<p>Let's formalize this and make some assumptions. We first need to assume something about how acquisition cost increases with volume. Marginal CAC (customer acquisition cost) is not constant, because you have to spend a little more effort for every additional lead you acquire. So we want to pick some function that grows pretty slowly.</p>
<p>Note that it's very important not to confuse the <em>marginal</em> acquisition cost (the cost of acquiring user $$ n $$) with the <em>average</em> acquisition cost or the <em>total</em> acquisition cost. The total acquisition cost is the integral of the marginal acquisition cost so it will always grow <em>more</em> than linearly (since the marginal acquisition cost is non-decreasing).</p>
<h2 id="the-model">The model</h2>
<p>I'm going to go out on a limb here, and this is the model: the acquisition cost of lead $$ n $$ is proportional to $$ n^{0.4} $$. Let's go through the math and then get back to the (somewhat arbitrary) choice of function.</p>
<p>This function has the property that it's reasonably slow-growing. With $$ n^{0.4} $$, the cost of acquiring lead #2000 is roughly 32% more than acquiring lead #1000.</p>
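<p>To make the marginal-versus-total distinction from above concrete: under this model, the total cost of acquiring the first $$ n $$ leads is $$ \int_0^n m^{0.4} \, dm = n^{1.4}/1.4 $$, which indeed grows faster than linearly even though the marginal cost grows slowly.</p>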
<p>Note that I'm saying the cost of <em>lead</em> $$ n $$, not user. The difference is that not all leads convert into users – that's the conversion rate. So the marginal cost of acquiring a <em>user</em> is $$ n^{0.4} / r $$, where $$ r $$ is the conversion rate.</p>
<p>We're acquiring leads as long as we make money from them, so up until the marginal acquisition cost equals some constant (basically revenue minus auxiliary costs). So $$ n^{0.4} / r = C $$ and it follows that $$ n = (Cr)^{2.5} = \mathcal{O}( r^{2.5}) $$. The weird $$ \mathcal{O} $$ symbol is just fancy notation from computer science that means that you can ignore all the constants.</p>
<p>So this is pretty interesting. Basically it says if we improve the conversion rate by 20%, the total volume will increase by 58%. This is a highly nonlinear relationship between conversion rate and volume.</p>
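<p>Here's a quick numeric sanity check of that claim (my own back-of-the-envelope script; the constant C is arbitrary and cancels in the ratio):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">def breakeven_volume(r, C=50.0):
    # We acquire leads while the marginal cost n**0.4 / r is below C,
    # so volume grows until n = (C * r) ** 2.5
    return (C * r) ** 2.5

print(breakeven_volume(0.06) / breakeven_volume(0.05))  # ~1.58: +20% conversion, +58% volume
</code></pre></div>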
<h2 id="what-about-the-total-profit">What about the total profit?</h2>
<p>Ignoring all the fixed costs of running a company, the profit for each unit is some constant minus the acquisition cost:</p>
<p>$$ \int_0^n \left( C_1 - C_2 m^{0.4}/r \right) dm = \left[ C_1 m - C_2/1.4 m^{1.4}/r \right]_0^n = C_1 n - C_2/1.4 n^{1.4} / r $$</p>
<p>Plugging in the earlier expression for $$ n $$ we get</p>
<p>$$ = C_1 (Cr)^{2.5} - C_2/1.4 \left((Cr)^{2.5}\right)^{1.4} / r = C_3 r^{2.5} = \mathcal{O}(r^{2.5})$$</p>
<p>Magically this comes out to the same thing here – if you increase the conversion rate from say 5% to 6% (an increase by 20%), the total gross profit increases by 58%. If you <em>double</em> the conversion rate (which isn't entirely unreasonable for an early-stage startup with an unoptimized conversion funnel), then the gross profit goes up by 5.7x. Kind of sweet, and again the same nonlinear relationship between two variables.</p>
<p><img src="https://erikbern.com/assets/conversion-toy-model.png" alt="conversion chart"></p>
<p>Graphical interpretation above. The “flatter” the acquisition cost curve is, the bigger the dollar gain. That's because with a flat acquisition cost curve, a small improvement in conversion moves the breakeven quantity far to the right.</p>
<h2 id="more-ruminations">More ruminations</h2>
<p>You don't think $$ n^{0.4} $$ is reasonable? Fine, make it $$ \sqrt{n} $$. You can do the same thing but instead of $$ \mathcal{O}(r^{2.5}) $$, you end up with $$ \mathcal{O}(r^2) $$.</p>
<p>Of course, it doesn't have to be a polynomial – logarithms are fine, as are exotic animals like the <a href="https://en.wikipedia.org/wiki/Lambert_W_function">Lambert W function</a>. In fact, in all those cases, the results are even more dramatic. The key thing here is that if we take a slow-growing function and <em>invert</em> it, we end up with a fast-growing function. As long as we pick a function that grows <em>less than linearly</em>, we get a <em>superlinear</em> relationship between conversion rate and gross profit.</p>
<p>So what's the right choice of function? I really think this depends on the industry. If you're selling pet rabbit tiaras, your total market size is pretty tiny, and the acquisition cost is going to shoot up very drastically once you've exhausted it. The inverse of a function that shoots up is a function that stagnates, so your revenue as a function of conversion rate is almost flat.</p>
<p>But let's say you're selling mortgages (I am!) or groceries or gasoline. Then the market size is enormous and the acquisition cost will grow slowly going from user 1000 to user 10,000 to user 100,000. Anywhere we see a large company, that's a sign that the marginal acquisition cost has to grow quite slowly (otherwise it would be prohibitively expensive to compete with smaller companies) and thus as a result you can also see that conversion rates will matter in a highly nonlinear way.</p>
<p>Feel free to poke a hole in this theory and by all means please let me know if you do!</p>
On the Equifax breach and how to really prevent identity theft2017-09-26T00:00:00Zhttps://erikbern.com/2017/09/26/on-the-equifax-breach-and-how-to-really-secure-prevent-theft.html<p><img src="https://erikbern.com/assets/bar-code-tattoo.jpg" alt="bar code tattoo"></p>
<p>A funny thing about being a foreigner is how you realize people take broken things for granted. I'm going to go out on a limb here claiming that <em>the US has a pretty dumb banking system</em>. I could talk about it all day, but right now I want to focus on a very particular piece of it: <em>how to verify your identity online.</em></p>
<p>Of course, since the Equifax breach, people are freaking out about the fact that their SSNs are floating around all over the internet. That's bad. But what's even worse is that the system was built to fail like this eventually. The SSN is a terrible secret to authenticate with. The first three digits <a href="https://en.wikipedia.org/wiki/List_of_Social_Security_Area_Numbers">are based on the area</a>, credit checks usually allow one or even two digits to be wrong, and at some point we're going to have to recycle social security numbers from dead people. Sounds fun. So what should we do? I think the correct thing is to assume that <em>social security numbers are publicly known.</em></p>
<p>There's nothing weird about this. Sweden has a similar number, and while sharing is not recommended, it's certainly not a big deal.</p>
<h2 id="lets-talk-about-">Let's talk about 🇸🇪</h2>
<p>So how do you verify your identity online in Sweden? Through this wonderful thing called <a href="https://www.bankid.com/en/">BankID</a> – a service that claims 7.5M active users. This is out of a population of 10M people, meaning basically everyone has it.</p>
<p>How does it work? Basically as a two-factor authentication system. You install an app on your phone. Any time you need to identify yourself online, you usually start with your “SSN”:</p>
<p><img src="https://erikbern.com/assets/bank-id-1.png" alt="bank id 1"></p>
<p>After that, the website asks you to launch the app on the phone:</p>
<p><img src="https://erikbern.com/assets/bank-id-2.png" alt="bank id 2"></p>
<p>On the phone, you now have to approve the request by typing in a password:</p>
<p><img src="https://erikbern.com/assets/bank-id-3.jpg" alt="bank id 3"></p>
<p>Sweden's version of the IRS uses it if you want to pay your taxes online. You can even use it to buy stuff. The sickest thing, though, is that BankID has an API. So if you're some random third-party whatever provider and you want to verify a person's identity, you can integrate it. There are even a bunch of <a href="https://www.npmjs.com/browse/keyword/bankid">node packages</a>.</p>
<p>How do you get a BankID, though? Online banks offer to set you up. And online banks in Sweden always use two-factor authentication, usually through a physical device that you have to pick up at a bank branch (which you have to visit in person, showing an ID card or driver's license… the card carries biometric information, so it's very hard to forge).</p>
<p>So why does this work? Basically everyone in Sweden has a bank account. There's only a handful of banks, which are pretty much colluding to some extent, but on the other hand the government has regulated all their fees down to basically zero, meaning they don't make all their money <a href="https://www.amazon.com/Unbanking-America-Middle-Class-Survives-ebook/dp/B01912OYO0">screwing lower income people</a>. So I think Sweden ended up in some kind of weird Nash equilibrium where there's so few of them that collaborating on an ID service is not very hard, and they are regulated enough to realize they might as well take their fees and try to build useful consumer products out of it. It's not just this. They also built <a href="https://en.wikipedia.org/wiki/Swish_(payment)">their own version of Venmo</a> that something like 70% of the population uses.</p>
<p>Anyway, I think the probability that this will ever happen in the US is roughly zero, sadly. The banking industry is too fragmented, and many people don't have bank accounts, so we can rule out that player. The federal government would never do it, though it might adopt such a system if someone else built it. So realistically, maybe the only player that could pull something off is a state like California – more specifically, maybe the <a href="http://www.taxes.ca.gov">Tax Service Center</a> or the <a href="https://www.dmv.ca.gov/portal/dmv">DMV</a>. But I don't know, and I'm not going to describe a comprehensive launch plan here. All I can do is dream of a time when the US actually does digital infrastructure efficiently.</p>
The number of letters in the word for each number2017-09-06T00:00:00Zhttps://erikbern.com/2017/09/06/the-number-of-letters-in-the-word-for-each-number.html<p>Just for fun, I generated these graphs of the number of letters in the word for each number. I really spent about 10 minutes on this (ok…possibly also another 40 minutes tweaking the plots):</p>
<p><img src="https://erikbern.com/assets/num-letters-en.png" alt="en"></p>
<p>More languages!!</p>
<p><img src="https://erikbern.com/assets/num-letters-es.png" alt="es"></p>
<p>I love how Spanish has a few super compact words: “cien mil” for 100,000 for instance. Only eight letters, versus English “one hundred thousand” (20 letters).</p>
<p><img src="https://erikbern.com/assets/num-letters-fr.png" alt="fr"></p>
<p>I don't know much about French, but I think it has some kind of weird system based on 20s – which, by the way, Danish has too.</p>
<p><img src="https://erikbern.com/assets/num-letters-de.png" alt="de"></p>
<p>If your stereotype of German is long words, you won't be disappointed. <em>Siebenhundertsiebenundzwanzigtausendsiebenhundertsiebenundzwanzig</em>. But I also think that fascination is somewhat misguided – German (and many languages like Swedish) just compounds words where other languages would put a space in between. Big deal.</p>
<p>But anyway, speaking of stereotypes, look at the <em>regularity</em> of this chart. <em>Ordnung muss sein.</em> Turns out the reason is mostly that the German words for multiples of ten all have the same length: zwanzig, dreißig, vierzig, fünfzig, …</p>
<p><img src="https://erikbern.com/assets/num-letters-ar.png" alt="ar">
<img src="https://erikbern.com/assets/num-letters-ru.png" alt="ru"></p>
<p>Overall, I kind of like the jagged form of the curves… there's something fractal about it.</p>
<p>Roman numerals… because I don't have anything better to do:</p>
<p><img src="https://erikbern.com/assets/num-letters-ro.png" alt="ro"></p>
<p>Finally here's the <em>cumulative average length</em> of each language, all on one chart:</p>
<p><img src="https://erikbern.com/assets/num-letters-avg.png" alt="avg"></p>
<p>It's a bit interesting to note that English has longer words than any of the other languages. And Arabic seems most compact, which is sort of interesting.</p>
<p>All of this was done using the <a href="https://github.com/savoirfairelinux/num2words">num2words</a> Python library. Full code below:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">from</span> num2words <span style="color:#f92672">import</span> num2words
<span style="color:#f92672">from</span> matplotlib <span style="color:#f92672">import</span> pyplot
<span style="color:#f92672">import</span> numpy
<span style="color:#f92672">import</span> roman
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">l</span>(lang):
<span style="color:#66d9ef">return</span> <span style="color:#66d9ef">lambda</span> i: num2words(i, lang<span style="color:#f92672">=</span>lang)
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">r</span>(i):
<span style="color:#66d9ef">if</span> i <span style="color:#f92672">==</span> <span style="color:#ae81ff">0</span>: <span style="color:#66d9ef">return</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">nulla</span><span style="color:#e6db74">'</span>
<span style="color:#66d9ef">if</span> i <span style="color:#f92672"><</span> <span style="color:#ae81ff">5000</span>: <span style="color:#66d9ef">return</span> roman<span style="color:#f92672">.</span>toRoman(i)
<span style="color:#66d9ef">else</span>: <span style="color:#66d9ef">return</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">'</span><span style="color:#f92672">.</span>join(c <span style="color:#f92672">+</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#ae81ff">\u0305</span><span style="color:#e6db74">'</span> <span style="color:#66d9ef">for</span> c <span style="color:#f92672">in</span> roman<span style="color:#f92672">.</span>toRoman(i<span style="color:#f92672">/</span><span style="color:#f92672">/</span><span style="color:#ae81ff">1000</span>)) <span style="color:#f92672">+</span> \
(i<span style="color:#f92672">%</span><span style="color:#ae81ff">1000</span> <span style="color:#f92672">and</span> roman<span style="color:#f92672">.</span>toRoman(i<span style="color:#f92672">%</span><span style="color:#ae81ff">1000</span>) <span style="color:#f92672">or</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">'</span>)
data <span style="color:#f92672">=</span> []
<span style="color:#66d9ef">for</span> lang, func, language, color <span style="color:#f92672">in</span> [
(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">ro</span><span style="color:#e6db74">'</span>, r, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Roman</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">orange</span><span style="color:#e6db74">'</span>),
(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">en</span><span style="color:#e6db74">'</span>, <span style="color:#66d9ef">lambda</span> i: num2words(i)<span style="color:#f92672">.</span>replace(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74"> and</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">'</span>), <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">English</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">blue</span><span style="color:#e6db74">'</span>),
(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">fr</span><span style="color:#e6db74">'</span>, l(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">fr</span><span style="color:#e6db74">'</span>), <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">French</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">red</span><span style="color:#e6db74">'</span>),
(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">de</span><span style="color:#e6db74">'</span>, l(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">de</span><span style="color:#e6db74">'</span>), <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">German</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">black</span><span style="color:#e6db74">'</span>),
(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">es</span><span style="color:#e6db74">'</span>, l(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">es</span><span style="color:#e6db74">'</span>), <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Spanish</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">yellow</span><span style="color:#e6db74">'</span>),
(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">ar</span><span style="color:#e6db74">'</span>, l(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">ar</span><span style="color:#e6db74">'</span>), <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Arabic</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">green</span><span style="color:#e6db74">'</span>),
(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">ru</span><span style="color:#e6db74">'</span>, l(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">ru</span><span style="color:#e6db74">'</span>), <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Russian</span><span style="color:#e6db74">'</span>, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">purple</span><span style="color:#e6db74">'</span>)]:
words <span style="color:#f92672">=</span> [func(i) <span style="color:#66d9ef">for</span> i <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">1000000</span>)]
fig <span style="color:#f92672">=</span> pyplot<span style="color:#f92672">.</span>figure(dpi<span style="color:#f92672">=</span><span style="color:#ae81ff">288</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">9</span>, <span style="color:#ae81ff">7</span>))
ax <span style="color:#f92672">=</span> fig<span style="color:#f92672">.</span>add_subplot(<span style="color:#ae81ff">111</span>)
lens <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>array([len(word) <span style="color:#66d9ef">for</span> word <span style="color:#f92672">in</span> words])
ax<span style="color:#f92672">.</span>semilogx(lens, color<span style="color:#f92672">=</span>color)
data<span style="color:#f92672">.</span>append((language, color, lens))
<span style="color:#66d9ef">for</span> p <span style="color:#f92672">in</span> range(<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">6</span>):
lo, hi <span style="color:#f92672">=</span> <span style="color:#ae81ff">10</span><span style="color:#f92672">*</span><span style="color:#f92672">*</span>p, <span style="color:#ae81ff">10</span><span style="color:#f92672">*</span><span style="color:#f92672">*</span>(p<span style="color:#f92672">+</span><span style="color:#ae81ff">1</span>)
<span style="color:#66d9ef">if</span> hi <span style="color:#f92672">></span> len(words):
<span style="color:#66d9ef">break</span>
x_max <span style="color:#f92672">=</span> max(range(lo, hi), key<span style="color:#f92672">=</span><span style="color:#66d9ef">lambda</span> x: len(words[x]))
kwargs <span style="color:#f92672">=</span> dict(horizontalalignment<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">right</span><span style="color:#e6db74">'</span>,
arrowprops<span style="color:#f92672">=</span>dict(shrink<span style="color:#f92672">=</span><span style="color:#ae81ff">0.05</span>,
width<span style="color:#f92672">=</span><span style="color:#ae81ff">2.0</span>,
headwidth<span style="color:#f92672">=</span><span style="color:#ae81ff">5.0</span>,
headlength<span style="color:#f92672">=</span><span style="color:#ae81ff">2.0</span>,
facecolor<span style="color:#f92672">=</span><span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">black</span><span style="color:#e6db74">'</span>))
ax<span style="color:#f92672">.</span>annotate(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">%d</span><span style="color:#e6db74">: </span><span style="color:#e6db74">"</span><span style="color:#e6db74">%s</span><span style="color:#e6db74">"</span><span style="color:#e6db74"> (</span><span style="color:#e6db74">%d</span><span style="color:#e6db74">)</span><span style="color:#e6db74">'</span> <span style="color:#f92672">%</span> (len(words[x_max]), words[x_max], x_max),
xy<span style="color:#f92672">=</span>(x_max, len(words[x_max])),
xytext<span style="color:#f92672">=</span>(x_max, len(words[x_max])<span style="color:#f92672">+</span><span style="color:#ae81ff">10</span>), <span style="color:#f92672">*</span><span style="color:#f92672">*</span>kwargs)
pyplot<span style="color:#f92672">.</span>xlim([<span style="color:#ae81ff">1</span>, len(words)])
pyplot<span style="color:#f92672">.</span>ylim([<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">100</span>])
pyplot<span style="color:#f92672">.</span>title(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Number of letters of the </span><span style="color:#e6db74">%s</span><span style="color:#e6db74"> word for each number</span><span style="color:#e6db74">'</span> <span style="color:#f92672">%</span> language)
pyplot<span style="color:#f92672">.</span>tight_layout()
pyplot<span style="color:#f92672">.</span>savefig(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">num-letters-</span><span style="color:#e6db74">%s</span><span style="color:#e6db74">.png</span><span style="color:#e6db74">'</span> <span style="color:#f92672">%</span> lang)
fig <span style="color:#f92672">=</span> pyplot<span style="color:#f92672">.</span>figure(dpi<span style="color:#f92672">=</span><span style="color:#ae81ff">288</span>, figsize<span style="color:#f92672">=</span>(<span style="color:#ae81ff">9</span>, <span style="color:#ae81ff">7</span>))
ax <span style="color:#f92672">=</span> fig<span style="color:#f92672">.</span>add_subplot(<span style="color:#ae81ff">111</span>)
<span style="color:#66d9ef">for</span> language, color, lens <span style="color:#f92672">in</span> data:
avgs <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>cumsum(lens[<span style="color:#ae81ff">1</span>:]) <span style="color:#f92672">/</span> (numpy<span style="color:#f92672">.</span>arange(<span style="color:#ae81ff">1</span>, len(lens)))
ax<span style="color:#f92672">.</span>semilogx(numpy<span style="color:#f92672">.</span>arange(<span style="color:#ae81ff">1</span>, len(lens)),
avgs,
color<span style="color:#f92672">=</span>color,
label<span style="color:#f92672">=</span>language)
pyplot<span style="color:#f92672">.</span>xlim([<span style="color:#ae81ff">1</span>, len(lens)])
pyplot<span style="color:#f92672">.</span>ylim([<span style="color:#ae81ff">0</span>, <span style="color:#ae81ff">100</span>])
pyplot<span style="color:#f92672">.</span>legend()
pyplot<span style="color:#f92672">.</span>title(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">Cumulative average number of letters of the word for each number</span><span style="color:#e6db74">'</span>)
pyplot<span style="color:#f92672">.</span>tight_layout()
pyplot<span style="color:#f92672">.</span>savefig(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">num-letters-avg.png</span><span style="color:#e6db74">'</span>)
</code></pre></div>The software engineering rule of 32017-08-29T00:00:00Zhttps://erikbern.com/2017/08/29/the-software-engineering-rule-of-3.html<p>Here's a <del>dumb</del> extremely accurate rule I'm postulating* for software engineering projects: <em>you need at least 3 examples before you solve the right problem</em>.</p>
<p>This is what I've noticed:</p>
<ol>
<li>Don't factor out shared code between two classes. Wait until you have at least three.</li>
<li>The first two attempts to solve a problem will fail because you misunderstood the problem. The third time it will work.</li>
<li>Any attempt at being smart earlier will end up overfitting to coincidental patterns.</li>
</ol>
<p>(Note that #1 and #2 are actually pretty different implications. But let's get back to that later.)</p>
<h2 id="whats-he-talking-about-example-plz">What's he talking about? Example plz</h2>
<p>Let's say you're implementing a class that scrapes data from banks. This is an extremely dumbed down version, but should illustrate the point:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ChaseScraper</span>:
<span style="color:#66d9ef">def</span> __init__(self, username, password):
self<span style="color:#f92672">.</span>_username <span style="color:#f92672">=</span> username
self<span style="color:#f92672">.</span>_password <span style="color:#f92672">=</span> password
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">scrape</span>(self):
session <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</span>Session()
sessions<span style="color:#f92672">.</span>get(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">https://chase.com/rest/login.aspx</span><span style="color:#e6db74">'</span>,
data<span style="color:#f92672">=</span>{<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">username</span><span style="color:#e6db74">'</span>: self<span style="color:#f92672">.</span>_username,
<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">password</span><span style="color:#e6db74">'</span>: self<span style="color:#f92672">.</span>_password})
sessions<span style="color:#f92672">.</span>get(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">https://chase.com/rest/download_current_statement.aspx</span><span style="color:#e6db74">'</span>)
</code></pre></div><p>Now, you want to add a second class <code>CitibankScraper</code> that implements the same interface but changes a few implementation details. In fact, let's say the only changes are that Citibank has different URLs and that its form elements have slightly different names. So we add a new scraper:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">CitibankScraper</span>:
<span style="color:#66d9ef">def</span> __init__(self, username, password):
self<span style="color:#f92672">.</span>_username <span style="color:#f92672">=</span> username
self<span style="color:#f92672">.</span>_password <span style="color:#f92672">=</span> password
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">scrape</span>(self):
session <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</span>Session()
sessions<span style="color:#f92672">.</span>get(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">https://citibank.com/cgi-bin/login.pl</span><span style="color:#e6db74">'</span>,
data<span style="color:#f92672">=</span>{<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">user</span><span style="color:#e6db74">'</span>: self<span style="color:#f92672">.</span>_username,
<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">pass</span><span style="color:#e6db74">'</span>: self<span style="color:#f92672">.</span>_password})
sessions<span style="color:#f92672">.</span>get(<span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">https://citibank.com/cgi-bin/download-stmt.pl</span><span style="color:#e6db74">'</span>)
</code></pre></div><p>At this point, after many years of being taught to keep it “DRY” (don't repeat yourself), we go <em>ermahgerd, cerd derplication!!!</em> and factor out everything into a base class. In this case it means <a href="https://en.wikipedia.org/wiki/Inversion_of_control">inverting the control</a> and letting the base class take over the control flow:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#66d9ef">class</span> <span style="color:#a6e22e">BaseScraper</span>:
<span style="color:#66d9ef">def</span> __init__(self, username, password):
self<span style="color:#f92672">.</span>_username <span style="color:#f92672">=</span> username
self<span style="color:#f92672">.</span>_password <span style="color:#f92672">=</span> password
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">scrape</span>(self):
session <span style="color:#f92672">=</span> requests<span style="color:#f92672">.</span>Session()
sessions<span style="color:#f92672">.</span>get(self<span style="color:#f92672">.</span>_LOGIN_URL,
data<span style="color:#f92672">=</span>{self<span style="color:#f92672">.</span>_USERNAME_FORM_KEY: self<span style="color:#f92672">.</span>_username,
self<span style="color:#f92672">.</span>_PASSWORD_FORM_KEY: self<span style="color:#f92672">.</span>_password})
sessions<span style="color:#f92672">.</span>get(self<span style="color:#f92672">.</span>_STATEMENT_URL)
<span style="color:#66d9ef">class</span> <span style="color:#a6e22e">ChaseScraper</span>(BaseScraper):
_LOGIN_URL <span style="color:#f92672">=</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">https://chase.com/rest/login.aspx</span><span style="color:#e6db74">'</span>
_STATEMENT_URL <span style="color:#f92672">=</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">https://chase.com/rest/download_current_statement.aspx</span><span style="color:#e6db74">'</span>
_USERNAME_FORM_KEY <span style="color:#f92672">=</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">username</span><span style="color:#e6db74">'</span>
_PASSWORD_FORM_KEY <span style="color:#f92672">=</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">password</span><span style="color:#e6db74">'</span>
<span style="color:#66d9ef">class</span> <span style="color:#a6e22e">CitibankScraper</span>(BaseScraper):
_LOGIN_URL <span style="color:#f92672">=</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">https://citibank.com/cgi-bin/login.pl</span><span style="color:#e6db74">'</span>
_STATEMENT_URL <span style="color:#f92672">=</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">https://citibank.com/cgi-bin/download-stmt.pl</span><span style="color:#e6db74">'</span>
_USERNAME_FORM_KEY <span style="color:#f92672">=</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">user</span><span style="color:#e6db74">'</span>
_PASSWORD_FORM_KEY <span style="color:#f92672">=</span> <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">pass</span><span style="color:#e6db74">'</span>
</code></pre></div><p>This would let us remove a lot of lines of code. It's one of the most compact ways we can implement these two bank statement providers here. So what's wrong with this code? (Apart from the general antipattern of implementation inheritance).</p>
<p>The problem is that we're <em>overfitting massively</em> to a pattern here! What do I mean by overfitting? We're seeing patterns that really don't generalize well.</p>
<p><img src="https://erikbern.com/assets/facepalm.jpg" alt="facepalm"></p>
<p>To see this, let's say we add a third provider that is slightly different. Maybe it's one or more of the following:</p>
<ul>
<li>It requires 2-factor authentication</li>
<li>Credentials are sent using JSON</li>
<li>Login is a POST rather than a GET</li>
<li>It requires visiting multiple pages in a row</li>
<li>The statement url is generated dynamically based on the current date</li>
</ul>
<p>… or whatever – there are another 1,000 ways this could break down. I hope you see the problem here. We thought we had a pattern after the first two scrapers! It turns out there really wasn't that much that generalized to the third provider (and more generally, to the <em>n</em>th). In other words, we <em>overfit</em>.</p>
<h2 id="what-does-erik-mean-by-overfitting">What does Erik mean by overfitting?</h2>
<p>So <a href="https://en.wikipedia.org/wiki/Overfitting">overfitting</a> is a term for when we see patterns in data and those patterns don't generalize. When coding, we're often so hyper-vigilant about code deduplication that we latch onto <em>incidental</em> patterns – ones that aren't representative of the full breadth of patterns we would see if we knew all the different applications. So after implementing two bank scrapers, we see a pattern that we think applies more generally, but really it doesn't.</p>
<p>Note that <em>code duplication isn't always such a bad thing</em>. Engineers often focus way too much on reducing duplicated code. But care has to be taken to distinguish between code duplication that's <em>incidental</em> versus code duplication that's <em>systemic</em>.</p>
<p>Thus, let me introduce the <em>first rule of 3</em>. Don't worry so much about code duplication if you only have two classes or two functions or whatever. When you see a pattern in <em>three</em> different places, it's worth thinking about how to factor it out.</p>
<h2 id="rule-of-3-as-applied-to-architecture">Rule of 3 as applied to architecture</h2>
<p>The same reasoning applies to system design, but with a very different conclusion. When you build a new system from scratch and you have no idea how it's eventually going to be used, <em>don't get too attached to assumptions</em>. The constraints you think you really need for the 1st and 2nd implementations will seem absolutely crucial, but you're going to realize you got it all wrong, and the 3rd implementation is really the one where most things are right. Ok, these are obviously all <em>extreme</em> blanket statements. Don't use my advice for brain surgery or nuclear fission.</p>
<p>As an example, <a href="https://github.com/spotify/luigi">Luigi</a> was the third attempt at solving the problem. The first two attempts solved the wrong problem or optimized for the wrong thing. For instance, the first iteration relied on specifying the dependency graph in XML, which turned out to be super annoying because you really want the ability to build the dependency graph programmatically. Conversely, a bunch of things in the first two attempts that seemed really useful, like decoupling outputs from tasks, ended up adding far more complexity only to support some obscure edge cases.</p>
<p>What seemed like obscure niche cases in the first iteration became very central in the final iteration, and <em>vice versa</em>.</p>
<p>I was reminded of this when we built an email ingestion system at <a href="https://better.com">Better</a>. The first attempt failed because we built it in a poor way (basically shoehorning it into a CRUD request). The second one had a solid microservice design but failed for usability reasons (we built a product that no one really asked for). We're halfway through the third attempt and I'm having a good feeling about it.</p>
<p>These stories illustrate the <em>second rule of 3</em> – you're not going to get the system design right until the third time you build it.</p>
<p>More importantly, if you are building the first implementation of some hairy unknown problem, don't assume you're going to nail it. Take shortcuts. Hack around nasty problems. You're probably not going to keep this system anyway – at some point it's going to break. And then the second version breaks most of the time. The <em>third</em> though – that's when you perfect it.</p>
<p><img src="https://erikbern.com/assets/three-cupcakes.jpg" alt="three cupcakes"></p>
<h2 id="notes">Notes</h2>
<ul>
<li><a href="https://news.ycombinator.com/item?id=15129689">Hacker news discussion</a></li>
<li><a href="https://www.reddit.com/r/programming/comments/6wws46/the_software_engineering_rule_of_3_you_need_at/">Reddit discussion</a> on /r/programming/</li>
<li>People on the internet pointed out that this rule already exists <a href="https://blog.codinghorror.com/rule-of-three/">[1]</a> <a href="http://wiki.c2.com/?RuleOfThree">[2]</a> <a href="https://en.wikipedia.org/wiki/Rule_of_three_(computer_programming)">[3]</a>. I wasn't aware of any of those, but it's highly likely I've read it at some point a long time ago. Not trying to misappropriate ideas that have been around for a long time!</li>
</ul>
Machine, Platform, Crowd2017-08-19T00:00:00Zhttps://erikbern.com/2017/08/19/machine-platform-crowd.html<p>I just bought <a href="https://www.amazon.com/dp/0393254291">Machine, Platform, Crowd: Harnessing Our Digital Future</a> and discovered that it mentions my blog – in particular the post <a href="https://erikbern.com/2016/08/05/when-machine-learning-matters.html">When machine learning matters</a>.</p>
<p><img src="https://erikbern.com/assets/machine_platform_crowd.jpeg" alt="machine, platform, crowd p. 146"></p>
<p>Ok, I lied a little bit. I didn't discover it serendipitously. Someone actually emailed me saying I was mentioned, and so I ordered the book for same-day delivery. But I was seriously planning to read the book anyway – having read both <a href="https://www.amazon.com/Second-Machine-Age-Prosperity-Technologies/dp/0393350649">The Second Machine Age</a> and <a href="https://www.amazon.com/Race-Against-Machine-Accelerating-Productivity/dp/0984725113">Race Against the Machine</a> – they are great books <em>and I'm not being biased</em>.</p>
<p>What's next? Hoping my blog will be mentioned in a Broadway musical in a few years.</p>
<p>As a complete side note, Erik Brynjolfsson has a conspicuously similar name to mine. Just for fun, I computed the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein distance</a> of my name against every Wikipedia article. <a href="https://en.wikipedia.org/wiki/Erik_Brynjolfsson">Erik Brynjolfsson</a> has distance 6, and as it turns out is the closest one <em>out of all Wikipedia articles.</em> Now you know!</p>
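<p>If you want to try this at home, the edit distance itself is a short dynamic program. Here's a minimal sketch (not the script I actually used):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">def levenshtein(a, b):
    # classic Wagner-Fischer dynamic programming: prev[j] holds the
    # edit distance between the first i-1 chars of a and first j chars of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("Erik Bernhardsson", "Erik Brynjolfsson"))  # 6
</code></pre></div>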
Google diversity memo, global warming, Pascal's wager, and other stuff2017-08-14T00:00:00Zhttps://erikbern.com/2017/08/14/google-diversity-memo-global-warming-pascals-wager.html<p>There's about 765 million blog posts about the diversity “memo” that leaked out of Google a couple of weeks ago. I think the case for any biological difference is pretty weak, and it bothers me when people refer to an “interest gap” as anything else than caused by the environment. Maybe because I have a daughter, maybe because I have too many female friends who told me stories how they were held back or discriminated against.</p>
<p>But disregarding my own opinion here, something else kept me annoyed for days. It seems like all the arguments and counterarguments are very hung up on <em>science</em> and <em>proof</em>, and it struck me as a very binary view of the world. It's great that we have research, but as long as people can cite studies showing almost anything, I'm not sure it really settles the debate. Anyway, I think there's a weird meta-argument that becomes interesting when you think of it in terms of probabilities instead. I think it lays out the case for action almost no matter what you believe causes the gender imbalance.</p>
<p>Let me explain what I mean. Let's say the gender imbalance can be explained x% by biology and y% by the environment (nature vs nurture), so obviously x and y add up to 100%. x and y could even be negative (e.g. women have a higher ability than men, but peer pressure and discrimination and whatever makes y larger than 1). Or maybe you think it's the other way around… I welcome you to my blog either way.</p>
<p>Now, an ultra hardcore conservative might say that it's 100% explained by nature and put the whole mass of their probability distribution at x = 1, and a super progressive liberal would do the opposite. But come on… if you really had to bet money on it, would you bet your entire fortune that x is exactly 0? Let's say the odds are that you make one dollar if you're right, and lose <em>all your money</em> if you're wrong. So in general, everyone's belief is a probability distribution, something like this:</p>
<p><img src="https://erikbern.com/assets/normal_distributions.gif" alt="normal distributions"></p>
<p>Of course, we're never going to figure out the true value of x, but let's assume some alien is able to replicate Earth inside a simulator and keeps tweaking various parameters so they can figure out x to 9 decimals of precision. And they come to Earth one day and offer to sell a contract that pays $x. What would you pay for that contract? I would probably buy it at -$0.3 and sell it at $0.3 personally – some old fashioned person might buy at $0.3 and sell at $0.8 or something. I don't know. Most people would assign some probability mass across a wide interval, reflecting some kind of uncertainty.</p>
<p>This hypothetical setup reminds me of my feelings when I read arguments trying to disprove or prove global warming. It's all fine, and I'm a big supporter of research. But at the end of the day we're still going to end up with some probability distribution. Sometimes I wonder if the focus on the “truthiness” prevents action. Instead of getting together across the spectrum and saying that <em>x has some uncertainty, let's act accordingly</em>, we get stuck trying to debate if x is exactly 0.0 (<em>no</em> human contribution to the Earth's temperature) or exactly 1.0 (<em>all</em> of the temperature increase in the last 100 years is caused by humans).</p>
<p>It gets more interesting when you weigh the uncertainty against the cost of action/inaction. So for every course of action, integrate over the probability distribution multiplied by the impact of the action, minus the cost of the action.</p>
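<p>In symbols, the rough decision rule would be something like: act if $$ \int_0^1 f(x) \, v(x) \, dx - c > 0 $$, where $$ f(x) $$ is your probability density over x, $$ v(x) $$ is the value of acting if the true value turns out to be x, and $$ c $$ is the cost of acting. (The notation here is mine, not anything standard.)</p>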
<p>In the case of climate change, let's say we can prevent human extinction from happening with 1% probability. That's worth spending a lot of money on! Similarly, regardless of your thoughts on biological determinism, diversity efforts seem like a pretty good thing to focus on. Worst case it's an insurance, best case it's an investment.</p>
<h2 id="notes">Notes</h2>
<ul>
<li>I updated this blog post to incorporate my own values since I realized I feel too strongly to write a neutral one. I also updated the choice of probability distribution (Beta is not a good choice).</li>
<li>A funny thing when you do these cost analyses is that it's basically some weird form of <a href="https://en.wikipedia.org/wiki/Pascal%27s_Wager">Pascal's wager</a> in disguise. Pascal's “bet” was that believing in God had a pretty limited downside, while not believing had a potentially infinite downside (at least in my layman recollection).</li>
<li>We actually <em>do</em> discuss this uncertainty sometimes – for instance spending money preventing some uncertain number of terrorist victims. It's not that we talk about the actual probabilities, but no one is hung up about trying to <em>prove</em> or <em>disprove</em> that there will be a terrorist attack. There's a debate about exactly <em>how much money</em> we should spend, but I don't think anyone is suggesting it should be $0.</li>
<li>Of course it gets a lot more complicated if you actually try to do the math, since you can't really assign probabilities. It's some kind of <a href="https://en.wikipedia.org/wiki/Knightian_uncertainty">Knightian uncertainty</a>, which is sort of what Donald Rumsfeld referred to as <a href="https://en.wikipedia.org/wiki/There_are_known_knowns">“Unknown unknowns”</a>.</li>
<li>The argument about global warming isn't exactly a novel idea, but surprisingly I haven't heard it many times. Maybe I have lame friends.</li>
</ul>
Fun with trigonometry: the world's most twisted coastline2017-07-12T00:00:00Zhttps://erikbern.com/2017/07/12/the-most-twisted-coastline.html<p>I just spent a few days in Italy, on the Ligurian coast. Even though we were on the west side of Italy, the Mediterranean sea was to the east, because the house was situated on a long bay. But zooming in even more, there were parts of the coast that were even more twisted – to the point where the coast had turned a full 360 degrees, so you ended up having the sea to the west again.</p>
<p><img src="https://erikbern.com/assets/italy-1-thumbnail.png" alt="italy 1">
<img src="https://erikbern.com/assets/italy-2-thumbnail.png" alt="italy 2"></p>
<p>Anyway, that made me curious – what's the world's most twisted coastline? If you trace the coastline along the Eurasian landmass and keep track of its direction, does it ever turn more than 360 degrees? 720 degrees? 1080 degrees? Or, in radians, $$ 2\pi, 4\pi, 6\pi, \ldots $$?</p>
<h2 id="the-data">The data</h2>
<p>You can download <a href="http://openstreetmapdata.com/data/coastlines">coastline data</a> from OpenStreetMap. It turns out it's not perfectly joined, so I ended up using the <a href="http://openstreetmapdata.com/data/land-polygons">land polygon data</a> instead. The slight drawback is that larger islands/continents are broken down into many polygons, but that should not make an enormous difference. All in all there are 587,205 distinct land polygons, each with a few hundred to a few thousand vertices.</p>
<h2 id="the-math">The math</h2>
<p>I have something weird to admit. I actually kind of enjoy trigonometry. Let's review some basic facts. Each land polygon is closed, and the sum of all the exterior angles adds up to <em>roughly</em> $$ 2\pi $$ radians. This is basic geometry:</p>
<p><img src="https://erikbern.com/assets/polygon.gif" alt="polygon"></p>
<p>Why not exactly $$ 2\pi $$? We're on a sphere, i.e. in a <em>non-Euclidean</em> geometry, where the classic theorems of plane geometry no longer hold. Luckily, the curvature of the Earth is negligible at small scales, so we don't have to worry much about it.</p>
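<p>(For the curious: by the Gauss–Bonnet theorem, a geodesic polygon on a sphere of radius R satisfies $$ \sum \theta_i = 2\pi - A / R^2 $$, where the $$ \theta_i $$ are the exterior angles and A is the enclosed area – so the deviation from $$ 2\pi $$ only becomes noticeable for continent-sized polygons.)</p>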
<p>I'm using <a href="https://github.com/GeospatialPython/pyshp">pyshp</a> to read the data. First step is to convert lon/lat to unit vectors in 3D, which I find far easier to work with:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">ll_to_3d</span>(lat, lon):
lat <span style="color:#f92672">*</span><span style="color:#f92672">=</span> math<span style="color:#f92672">.</span>pi <span style="color:#f92672">/</span> <span style="color:#ae81ff">180</span>
lon <span style="color:#f92672">*</span><span style="color:#f92672">=</span> math<span style="color:#f92672">.</span>pi <span style="color:#f92672">/</span> <span style="color:#ae81ff">180</span>
x <span style="color:#f92672">=</span> math<span style="color:#f92672">.</span>cos(lat) <span style="color:#f92672">*</span> math<span style="color:#f92672">.</span>cos(lon)
z <span style="color:#f92672">=</span> math<span style="color:#f92672">.</span>cos(lat) <span style="color:#f92672">*</span> math<span style="color:#f92672">.</span>sin(lon)
y <span style="color:#f92672">=</span> math<span style="color:#f92672">.</span>sin(lat)
<span style="color:#66d9ef">return</span> numpy<span style="color:#f92672">.</span>array([x, y, z])
</code></pre></div><p>I did something similar when I computed a <a href="/2015/04/26/ping-the-world.html">world map of ping latencies</a>.</p>
<p>The only other magic sauce is that we need to compute the exterior angle or how much we “turn” when we go from vector $$ \mathbf{a} $$ to vector $$ \mathbf{b} $$ and then turn towards vector $$ \mathbf{c} $$. When $$ \mathbf{a, b, c} $$ are close to each other on the surface, you can ignore the curvature of the earth and think of them as just sitting on a plane. We want to know the exterior angle between $$ \mathbf{b-a} $$ and $$ \mathbf{c-b} $$. Turns out we can exploit the property of the cross product.</p>
<p>$$ \mathbf{u} \times \mathbf{v} = \left| \mathbf{u} \right| \left| \mathbf{v} \right| \sin \theta \, \mathbf{n} $$</p>
<p>where $$ \theta $$ is the angle and $$ \mathbf{n} $$ is a unit vector pointing <em>out of</em> the earth if the turn is clockwise, and <em>into</em> the earth if it's counterclockwise. We can figure out which by taking the dot product with $$ \mathbf{b} $$ (which is a unit vector and should be essentially parallel to the cross product). Not quite done yet: $$\sin^{-1}$$ only returns values within $$ \left[ -\pi/2, \pi/2 \right] $$, so we need a separate case for when the turn is so sharp that it's going “backwards”. See the code:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#66d9ef">def</span> <span style="color:#a6e22e">mag</span>(v):
<span style="color:#66d9ef">return</span> numpy<span style="color:#f92672">.</span>dot(v, v)<span style="color:#f92672">*</span><span style="color:#f92672">*</span><span style="color:#ae81ff">0.5</span>
<span style="color:#66d9ef">def</span> <span style="color:#a6e22e">spherical_angle</span>(a, b, c):
n_sin_theta <span style="color:#f92672">=</span> numpy<span style="color:#f92672">.</span>cross(b<span style="color:#f92672">-</span>a, c<span style="color:#f92672">-</span>b) <span style="color:#f92672">/</span> (mag(b<span style="color:#f92672">-</span>a) <span style="color:#f92672">*</span> mag(c<span style="color:#f92672">-</span>b))
alpha <span style="color:#f92672">=</span> math<span style="color:#f92672">.</span>asin(numpy<span style="color:#f92672">.</span>dot(n_sin_theta, b))
<span style="color:#66d9ef">if</span> numpy<span style="color:#f92672">.</span>dot(b<span style="color:#f92672">-</span>a, c<span style="color:#f92672">-</span>b) <span style="color:#f92672">></span><span style="color:#f92672">=</span> <span style="color:#ae81ff">0</span>:
<span style="color:#66d9ef">return</span> alpha
<span style="color:#66d9ef">else</span>:
<span style="color:#66d9ef">return</span> numpy<span style="color:#f92672">.</span>fmod(<span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>math<span style="color:#f92672">.</span>pi <span style="color:#f92672">-</span> alpha, <span style="color:#ae81ff">2</span><span style="color:#f92672">*</span>math<span style="color:#f92672">.</span>pi) <span style="color:#f92672">-</span> math<span style="color:#f92672">.</span>pi
</code></pre></div><p>It was easy to verify that it works – for polygons with thousands of edges it still returns an exterior angle sum very close to $$ 2\pi $$.</p>
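<p>The verification loop itself is short. A sketch, assuming each polygon comes as a list of 3D unit vectors (the helper name is mine):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">def total_turning(points):
    # sum the exterior angles walking around a closed polygon;
    # for a simple polygon this should come out very close to 2*pi
    n = len(points)
    return sum(spherical_angle(points[i],
                               points[(i + 1) % n],
                               points[(i + 2) % n])
               for i in range(n))
</code></pre></div>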
<p>There's a bit more work to take the series of cumulative angles and normalize it so that we can compute deviations. The <a href="https://github.com/erikbern/coastlines/blob/master/read.py">whole script</a> ended up being less than 100 lines, so it's another example of a blog post being longer than the underlying script (e.g. see the <a href="https://erikbern.com/2017/02/01/language-pitch.html">Language pitch</a> post).</p>
<h2 id="the-results">The results</h2>
<p>I applied a bit of discretion when reviewing the results. The top 2 most twisted coastlines are some <a href="https://www.openstreetmap.org/search?query=51.8363%2C%200.9888#map=16/51.8363/0.9888">swamp in the UK</a>. Google Maps doesn't line up with the OpenStreetMap data there, so I disqualified these entries and a few more.</p>
<p>The most twisted coastline is <a href="https://www.google.com/maps/place/37%C2%B041'38.4%22S+176%C2%B012'31.3%22E/@-37.6939957,176.2065113,17z/data=!3m1!4b1!4m5!3m4!1s0x0:0x0!8m2!3d-37.694!4d176.2087">just outside Tauranga, New Zealand</a>:</p>
<p><img src="https://erikbern.com/assets/tauranga-nz-thumbnail.png" alt="tauranga nz"></p>
<p>The second most is in South Australia, seemingly <a href="https://www.google.com/maps/place/34%C2%B038'26.5%22S+135%C2%B022'21.7%22E/@-34.6406955,135.3713353,466m/data=!3m1!1e3!4m5!3m4!1s0x0:0x0!8m2!3d-34.6407!4d135.3727">in the middle of nowhere</a>.</p>
<p>The third most is <a href="https://www.google.com/maps/place/42%C2%B001'28.6%22N+70%C2%B011'03.8%22W/@42.024604,-70.1865887,17z/data=!3m1!4b1!4m5!3m4!1s0x0:0x0!8m2!3d42.0246!4d-70.1844">on Cape Cod, MA</a>, which is amazing because I sort of expected Cape Cod to rank pretty high. Although OpenStreetMap and Google have pretty different coastlines here, so honestly the exact location seems a bit unclear:</p>
<p><img src="https://erikbern.com/assets/cape-cod-thumbnail.png" alt="cape-cod"></p>
<p>#4 is some <a href="https://www.google.com/maps/place/45%C2%B056'57.5%22N+60%C2%B034'36.1%22W/@45.947025,-60.5942939,4998m/data=!3m1!1e3!4m5!3m4!1s0x0:0x0!8m2!3d45.9493!4d-60.5767">random place in Nova Scotia, Canada</a>.</p>
<h2 id="top-20">Top 20</h2>
<p>I removed a whole bunch of these entries due to ambiguous coastlines – basically whenever OpenStreetMap didn't align with Google:</p>
<table>
<thead>
<tr>
<th>GM</th>
<th>OSM</th>
<th>Lat/Long</th>
<th>Where</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://www.google.com/maps/search/'-37.6940,176.2087'/@-37.6940,176.2087,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=-37.6940,176.2087">OSM</a></td>
<td>-37.6940, 176.2087</td>
<td>Tauranga, New Zealand</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'-34.6407,135.3727'/@-34.6407,135.3727,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=-34.6407,135.3727">OSM</a></td>
<td>-34.6407, 135.3727</td>
<td>Australia</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'42.0246,-70.1844'/@42.0246,-70.1844,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=42.0246,-70.1844">OSM</a></td>
<td>42.0246, -70.1844</td>
<td>Cape Cod, USA</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'45.9493,-60.5767'/@45.9493,-60.5767,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=45.9493,-60.5767">OSM</a></td>
<td>45.9493, -60.5767</td>
<td>Nova Scotia, Canada</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'1.9901,-157.4740'/@1.9901,-157.4740,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=1.9901,-157.4740">OSM</a></td>
<td>1.9901, -157.4740</td>
<td>Kiribati</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'32.9330,129.7944'/@32.9330,129.7944,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=32.9330,129.7944">OSM</a></td>
<td>32.9330, 129.7944</td>
<td>Nagasaki, Japan</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'43.5846,145.3271'/@43.5846,145.3271,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=43.5846,145.3271">OSM</a></td>
<td>43.5846, 145.3271</td>
<td>Hokkaido, Japan</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'54.2867,13.6907'/@54.2867,13.6907,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=54.2867,13.6907">OSM</a></td>
<td>54.2867, 13.6907</td>
<td>Rügen, Germany</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'34.5192,10.5364'/@34.5192,10.5364,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=34.5192,10.5364">OSM</a></td>
<td>34.5192, 10.5364</td>
<td>Tunisia</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'26.4628,-82.0632'/@26.4628,-82.0632,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=26.4628,-82.0632">OSM</a></td>
<td>26.4628, -82.0632</td>
<td>Cape Coral, USA</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'34.6861,137.2857'/@34.6861,137.2857,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=34.6861,137.2857">OSM</a></td>
<td>34.6861, 137.2857</td>
<td>Tokyo, Japan</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'47.2325,-53.9598'/@47.2325,-53.9598,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=47.2325,-53.9598">OSM</a></td>
<td>47.2325, -53.9598</td>
<td>Newfoundland, Canada</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'55.1082,10.0945'/@55.1082,10.0945,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=55.1082,10.0945">OSM</a></td>
<td>55.1082, 10.0945</td>
<td>Funen, Denmark</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'35.2210,-75.6807'/@35.2210,-75.6807,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=35.2210,-75.6807">OSM</a></td>
<td>35.2210, -75.6807</td>
<td>North Carolina, USA</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'46.8320,-64.0313'/@46.8320,-64.0313,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=46.8320,-64.0313">OSM</a></td>
<td>46.8320, -64.0313</td>
<td>Prince Edward Island, Canada</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'64.9621,-51.5111'/@64.9621,-51.5111,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=64.9621,-51.5111">OSM</a></td>
<td>64.9621, -51.5111</td>
<td>Nuuk, Greenland</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'55.2200,-7.7219'/@55.2200,-7.7219,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=55.2200,-7.7219">OSM</a></td>
<td>55.2200, -7.7219</td>
<td>County Donegal, Ireland</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'66.0704,-23.1252'/@66.0704,-23.1252,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=66.0704,-23.1252">OSM</a></td>
<td>66.0704, -23.1252</td>
<td>Ísafjörður, Iceland</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'-43.8452,-176.4251'/@-43.8452,-176.4251,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=-43.8452,-176.4251">OSM</a></td>
<td>-43.8452, -176.4251</td>
<td>Chatham Islands, New Zealand</td>
</tr>
<tr>
<td><a href="https://www.google.com/maps/search/'52.2898,-174.3173'/@52.2898,-174.3173,14z">GM</a></td>
<td><a href="https://www.openstreetmap.org/search?query=52.2898,-174.3173">OSM</a></td>
<td>52.2898, -174.3173</td>
<td>Atka, Alaska, USA</td>
</tr>
</tbody>
</table>
<h2 id="notes">Notes</h2>
<ul>
<li>Obviously the polygon resolution matters – coastlines are fractal and the higher resolution, the more twists you get.</li>
<li>I was surprised that the twistedness was so small, even for the most extreme points. The top one was about $$ 4\pi $$, i.e. two full turns.</li>
<li>I actually suspect the largest twistedness is still bounded. Even if we could measure with infinite precision, it might be an infinite series with a sum that converges.</li>
<li><a href="https://github.com/erikbern/coastlines">All code is on Github</a>, as usual.</li>
</ul>
Optimizing for iteration speed2017-07-06T00:00:00Zhttps://erikbern.com/2017/07/06/optimizing-for-iteration-speed.html<p><img src="https://erikbern.com/assets/burger_buns.jpg" alt="burgers"></p>
<p>I've written before about <a href="/2016/03/02/iterate-or-die.html">the importance of iterating quickly</a>, but I didn't talk much about concrete things you can do. In building up the tech team at <a href="https://better.com">Better</a>, I've intentionally optimized for fast iteration speed above almost everything else. What are some ways we did that?</p>
<h2 id="continuous-deployment">Continuous deployment</h2>
<p>My dubious claim is that we might be the only financial institution in the world to deploy continuously. I actually ended up getting <a href="https://www.economist.com/news/special-report/21721505-relationship-between-banks-and-technology-companies-becoming-increasingly">quoted in the Economist</a> about this specifically. We deploy to production probably 50-100 times every day. Once a pull request is merged into master, we run a fairly extensive test suite of a few thousand unit tests and a few hundred <a href="http://www.seleniumhq.org/">Selenium</a> tests. We have spent a lot of time optimizing how long these tests take to run, so it's down to about 15 minutes. If all tests pass, we deploy to production.</p>
<p>We use <a href="https://buildkite.com">Buildkite</a> for CI and run all our services on top of <a href="https://kubernetes.io/">Kubernetes</a>, which (among a million other things) supports blue/green deployments so that there is no downtime during deployments.</p>
<h2 id="testing">Testing</h2>
<p>Continuous deployment is freedom under responsibility and it isn't possible without rigorous testing. We have about 85% unit test coverage (I think the sweet spot is about 90%. 100% is unrealistic). Manual testing is only done by the product manager, generally when a feature has already been live in production for a while, to make sure that it's according to the spec.</p>
<p>Do we ever release bugs to production? Of course. But mean time to recovery is usually more important than mean time between failures. If we deploy something that's broken, we can often roll back within minutes. And since we ship very incremental changes, the average bug is often limited in impact. Bugs in production are often related to code that was written in the last few days, so it's fresh in mind and can be fixed quickly.</p>
<h2 id="no-sprints">No “sprints”</h2>
<p>Two-week or three-week sprints are mini waterfalls and sacrifice a lot of flexibility for the purpose of giving external stakeholders a bit more predictability. But if you work on a customer-facing product, users have no expectation that you're going to update the product at any particular point in time. (Even with external stakeholders, I think predictability is overrated. It's just a way to keep sales people from overselling.)</p>
<p>A continuous flow of tasks means we can launch a v1, v2, and v3 all on the same day, where v2 incorporates what we learned from users of v1, and v3 builds on user feedback from v2.</p>
<h2 id="small-tasks">Small tasks</h2>
<p>Excuse me for geeking out, but an interesting result from <a href="https://en.wikipedia.org/wiki/Random_matrix">random matrix theory</a> is that in high dimensional spaces, local minima are rare (the reason is that most points where the derivative is zero are really <a href="https://en.wikipedia.org/wiki/Saddle_point">saddle points</a>). I think software engineering mostly takes place in a very high dimensional world where hill climbing by splitting up tasks into <em>small, incremental pieces</em> and shipping each of them separately is the fastest way to deliver value.</p>
<p>In contrast, one of the scariest things in software engineering is “inventory”: code that builds up without going into production. It represents deployment risk, but also the risk of building something users don't want. Not to mention the lost user value from not shipping parts of the feature earlier (user value should be thought of as feature value <em>integrated over time</em>, not the feature value at the end state).</p>
<p>Feature flagging is a last resort, and we use it sparingly. Even worse is having feature branches. They are the devil's work and should be abolished. <a href="http://nvie.com/posts/a-successful-git-branching-model/">Git-flow</a> is a terrible invention, and when we tried it at Spotify, people spent something like 50% of their time just rebasing code.</p>
<p>Long-lived pull requests are frowned upon for this reason. A pull request should ideally be merged within a few hours and should be at most a few hundred lines. We have built <a href="https://github.com/imsky/hubot-pull-review">our own system</a> to assign reviewers to pull requests and notify the Slack channel. The results are clear from the stats below – this shows the time from when a PR is created to when it's merged, taken from our monorepo:</p>
<p><img src="https://erikbern.com/assets/prs_minutes.png" alt="prs">
<img src="https://erikbern.com/assets/prs_days.png" alt="prs"></p>
<h2 id="cross-functional-people-and-teams">Cross-functional people and teams</h2>
<p>Some companies have separate backend and frontend teams. Or, even worse, I once talked to a company that had a “machine learning theory team” in a different city than the “machine learning production team”. Don't do this. It slows down iteration speed and adds coordination overhead.</p>
<p>If you want to optimize for a tight feedback loop, <em>cross-functional teams</em> make a lot more sense than <em>teams split up by skills</em>.</p>
<p>This applies to individual engineers as well. Every engineer at Better is a full-stack engineer that can take any feature in the backlog and ship it. Most of the time, the complexity is really in the backend, and so the vast majority of our team skews towards backend developers. But no one has any issue writing CSS and pushing pixels when needed. A typical task is 80-90% backend and 10-20% frontend. By having a single engineer working on a feature, we can ship a lot quicker. Most engineers in the team are not the Simone Biles of CSS, but they can do it and get the job done, and it's usually not a big part of the work of shipping a task.</p>
<p>At a fast-moving consumer-facing startup, you can't afford specialization. Not only do full-stack engineers iterate faster, but there's also more flexibility built in: you don't know where in the stack the team is going to spend the next week.</p>
<p>Before I paint a dogmatic picture, I want to point out that we <em>have</em> hired a few specialized roles. We do have a test automation engineer, an operations person, and a few dedicated frontend engineers. We needed a bit more “expertise” for a few particular areas. It did take a while to get there, and even these engineers still spend some time across the whole stack.</p>
<h2 id="what-else">What else?</h2>
<p>There's a long tail of smaller things that definitely make a huge difference:</p>
<ul>
<li>How can we keep the scope small and design the product process around the learning process? Much ink has been spilled on the topic of MVP, a “Minimum Viable Product”.</li>
<li>Data is obviously super important. How do we actually <em>learn</em> from the incremental features we ship? I'm talking both hard metrics here and soft qualitative stuff.</li>
<li>What about the trade-off between product quality and shipping time?</li>
</ul>
<p>I could write about this all day long. Instead, I wanted to wrap up with some notes on why cycle time matters so much.</p>
<h2 id="iterate-or-die">Iterate or die</h2>
<p>First, let me point out that optimizing for <em>fast iteration speed</em> is not the same thing as <em>throughput</em>. In <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little's law</a>, throughput is $$ \lambda $$, and iteration speed is the inverse of $$ W $$. The relationship between $$ \lambda $$ and $$ W $$ is complex and sadly I haven't found any good resources on it. On a Google journey you can find some decent <a href="http://www.fabtime.com/ctcapac.shtml">resources from chip manufacturing</a> among other things: <img src="https://erikbern.com/assets/cycle_time.gif" alt="cycle time"></p>
<p>Looking at the chart, it's clear that you can lower throughput just a bit below the theoretical capacity and get orders-of-magnitude lower cycle time (i.e. higher iteration speed). But chip manufacturing is a large-scale manufacturing process where there's not even a learning process to speak of. Once you want to learn fast on top of having high throughput, it's a no-brainer to operate slightly below theoretical throughput capacity.</p>
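<p>To put numbers on Little's law: since $$ L = \lambda W $$, a team with $$ L = 10 $$ tasks in flight that completes $$ \lambda = 2 $$ tasks per day has an average cycle time of $$ W = 5 $$ days. Cut the work in progress to 4 tasks at the same throughput and the cycle time drops to 2 days – same output, much faster feedback.</p>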
<p>Sorry for getting a bit theoretical, so let's rephrase it. Imagine you're a fast food chain that needs to make one thousand hamburgers an hour. You need to start baking the bread at some point, grill the patties, cut the lettuce, etc. Everything can be done in huge batches and planned in advance. Certain software projects may resemble this – for instance, rewriting a big application from C++ to Java.</p>
<p>But far more often, a software project is like trying to find a completely new hamburger recipe. In that case, keeping the batches small and learning from feedback continuously is key. You can make 500 or even 800 burgers an hour and make the batch size and the cycle time 10x smaller. Keeping the <em>inventory</em> low is a whole obsession of <a href="https://en.wikipedia.org/wiki/Lean_manufacturing">lean manufacturing</a>, mostly because it lets you respond to customer demand much faster (the other reason was that inventory was a substantial cost in Japan in the 1950s. But I digress).</p>
<p>Anyway. In terms of organization – you can keep inventory much lower if people are responsible for making whole burgers, rather than one person chopping the lettuce, one person making the buns, etc. And by keeping the feedback loop tight, you keep changing the combination of spices and learning from the feedback you get. Your <em>recipe</em> can evolve 10x or 100x faster. This is ultimately how you outcompete everyone else.</p>
<p>🍔</p>
Blogroll2017-06-09T00:00:00Zhttps://erikbern.com/2017/06/09/blogroll.html<p>Remember when everyone had a really ugly blog with a <em>blogroll</em>? Anyway, just think the word is funny.</p>
<p>I follow a few hundred blogs using <a href="https://feedly.com">Feedly</a> and <a href="http://reederapp.com/">Reeder</a> and have been reading a few hundred thousand blog posts over the last 10 years. Here's some stuff I think everyone should follow. Not going to share a million blogs, just a few top ones. That way you don't have to think about it, just subscribe to all of it:</p>
<h2 id="software-engineering">Software engineering</h2>
<p>Lots of company blogs are good:</p>
<ul>
<li><a href="https://medium.com/airbnb-engineering">Airbnb</a></li>
<li><a href="http://research.baidu.com/">Baidu Research</a></li>
<li><a href="https://tech.instacart.com/">Instacart</a></li>
<li><a href="https://research.fb.com/">Facebook research</a></li>
<li><a href="https://engineering.pinterest.com/">Pinterest</a></li>
<li><a href="http://multithreaded.stitchfix.com/">Stitch Fix</a></li>
<li><a href="https://codeascraft.com/">Etsy</a></li>
<li><a href="https://blog.twitter.com/engineering/en_us.html">Twitter</a></li>
<li><a href="https://blogs.dropbox.com/tech/">Dropbox</a></li>
<li><a href="http://blog.kaggle.com/">Kaggle</a></li>
</ul>
<p>Some personal blogs worth mentioning:</p>
<ul>
<li><a href="https://jvns.ca/">Julia Evans</a> – networking and operating system stuff</li>
<li><a href="https://danluu.com/">Dan Luu</a> – company culture mostly</li>
<li><a href="https://peadarcoyle.wordpress.com/">Peadar Coyle</a> – interviews data scientists</li>
<li><a href="https://blog.acolyer.org/">Adrian Colyer</a> – reviews a technical paper <em>every day</em></li>
<li><a href="http://www.elidedbranches.com/">Camille Fournier</a> – technical management</li>
<li><a href="https://www.chrisstucchio.com/">Chris Stucchio</a> – statistics and business</li>
<li><a href="http://www.inference.vc/">Ferenc Huszár</a> – theoretical machine learning</li>
<li><a href="http://larahogan.me/blog">Lara Callendar Hogan</a> – technical management</li>
<li><a href="https://martinfowler.com/">Martin Fowler</a> – architecture</li>
</ul>
<h2 id="math">Math</h2>
<ul>
<li><a href="https://www.johndcook.com/blog/">John Cook</a></li>
<li><a href="http://andrewgelman.com/">Andrew Gelman</a> – Bayesian statistics</li>
</ul>
<h2 id="startups--business">Startups / business</h2>
<ul>
<li><a href="http://firstround.com/review/">First Round Review</a></li>
<li><a href="https://stratechery.com/">Ben Thompson</a> – business strategy</li>
<li><a href="https://25iq.com/">Tren Griffin</a> – moats, supply chains, etc</li>
<li><a href="http://blog.elizabethyin.com/">Elizabeth Yin</a></li>
<li><a href="http://tomtunguz.com/">Tomasz Tunguz</a> – B2B straregy</li>
<li><a href="https://mattermark.com/blog/">Mattermark</a></li>
<li><a href="http://avc.com/">Fred Wilson</a></li>
<li><a href="http://blog.interviewing.io/">interviewing.io</a> – great data on tech recruiting</li>
</ul>
<h2 id="economics--politics">Economics / politics</h2>
<ul>
<li><a href="https://ashokarao.com/">Ashok Rao</a></li>
<li><a href="http://noahpinionblog.blogspot.com/">Noah Smith</a></li>
<li><a href="http://marginalrevolution.com/">Tyler Cowen</a> – biggest question: how does the guy read so much?</li>
<li><a href="http://aswathdamodaran.blogspot.com/">Aswath Damodaran</a></li>
</ul>
<h2 id="other-stuff">Other stuff</h2>
<ul>
<li><a href="http://slatestarcodex.com/">Star Slate Codex</a></li>
</ul>
<h2 id="thats-it">That's it!</h2>
<p>I probably forgot a ton of <em>fantastic</em> blogs here!</p>
<p>And remember – <em>on the internet, no one knows you're a dog</em>.</p>
<p><img src="https://erikbern.com/assets/dog_typing.gif" alt="dog"></p>
Conversion rates – you are (most likely) computing them wrong2017-05-23T00:00:00Zhttps://erikbern.com/2017/05/23/conversion-rates-you-are-most-likely-computing-them-wrong.html<p>How hard can it be to compute conversion rate? Take the total number of users that converted and divide by the total number of users. <em>Done.</em> Except… it's a lot more complicated when there's any sort of significant time lag.</p>
<h2 id="prelude-ndash-a-story">Prelude – a story</h2>
<p>Fresh out of school I joined Spotify as the first data analyst. One of my first projects was to understand conversion rates. Conversion rate from the free service to Premium is tricky because there's a huge time lag. At that time, labels were highly skeptical that we would be able to convert many users, and this was a contentious source of disagreement. We had converted a really small fraction of our users, and we kept growing our free users like crazy. The conversion rate was standing still, if not going down.</p>
<p>The “insight” I had came when I started breaking it up into cohorts. For instance, look at all users that joined on May 1 and track their conversion rate over time. The beautiful thing that happened was that <em>the conversion rate keeps growing and growing over time</em>. People converted at an almost uniform rate over the first few years. It was amazing to see. Some of the old cohorts that had used the service for 2+ years had some crazy high conversion rates, like 40-50%. This implied that conversion wasn't a big problem – it was only because <em>we were growing exponentially</em> that the <em>current conversion rate</em> looked “artificially” low.</p>
<p><img src="https://erikbern.com/assets/spotify_conversion_rate.png" alt="spotify conversion rate"></p>
<p>My lesson here is that conversion rates are sometimes pointless to try to quantify as a single number. Sometimes it's a useful metric, but in many cases it's not. Spotify's conversion rate is not that useful to know in itself, since the user base is not in equilibrium. As long as the user base keeps growing, and as long as there's a substantial lag until conversion, you really can't say anything by trying to quantify it into a single number.</p>
<h2 id="an-example-ndash-exit-rate-for-startups-2008-2015">An example – exit rate for startups 2008-2015</h2>
<p>Let's go through an example of all the bad ways to look at conversion and then arrive at what I think of as the “best way” (in my <del>humble</del> correct opinion). Just for fun, I scraped a bunch of startup data from a well-known database of startups. I won't get into all the gory details of the scraping, except that scraping is fun and if you do it too much your scraper gets banned… I probably could have used some lame boring data set, but data analysis is 38.1% more fun if you can relate to the data.</p>
<p>I have 1,836 companies in the data set that raised investment at some point, and 243 (13%) that exited at some point (either IPO or acquisition). So let's ask ourselves: how is the conversion rate going? Are newer companies exiting at a higher rate than older companies? Or is it getting harder over time to exit? The naïve way is to break it up by year founded and compute the conversion rates:</p>
<p><img src="https://erikbern.com/assets/conversion_by_year.png" alt="conversion by year"></p>
<p>Except for 2008, it looks like the conversion rate is <em>going down</em>. Why? Is it harder and harder for companies to exit? Let's look at the <em>time to exit as well</em>.</p>
<p><img src="https://erikbern.com/assets/time_to_conversion_by_year.png" alt="time to conversion by year"></p>
<p>Here we see the opposite trend of what we saw in the previous chart – it seems like it's getting easier and easier for companies to exit!</p>
<p>So what's going on? If you think about it for a bit, it's pretty clear – we are mixing data from 2008 (where companies have had 9 years to convert) with data from 2016 (where companies have had a year or less). We don't know which companies from the 2016 group will convert <em>in the future</em>.</p>
<p>Both of these charts underline that there's often no such thing as a single “conversion rate” and no such thing as a “time to conversion”. In cases where conversion has some clear upper time limit, you might get away with talking about those metrics. For instance, it's probably fine to measure landing page conversion rate by looking at how many people clicked a link within an hour. But in many cases, including startup exits as well as Spotify free to Premium, “conversion rate” and “time to conversion” are nonsensical and cannot be defined.</p>
<h2 id="the-right-way-to-look-at-conversion-rates-ndash-cohort-plots">The right way to look at conversion rates – cohort plots</h2>
<p>To compare conversion rates, it makes a lot more sense to compare the conversion rate <em>at time T</em>, where T is some fixed time lag such as 7 days, 30 days, or 1 year. For instance, to compare the conversion rates of the 2012 and 2014 cohorts, compare what percentage of each had converted within 24 months.</p>
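<p>In code, the quantity for a single cohort and a single T would look something like this – a sketch, assuming a dataframe with one row per company, a <code>founded</code> timestamp, and an <code>exited</code> timestamp that is NaT for companies that haven't exited (the names are mine):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">import pandas as pd

def conversion_at(cohort_df, T_days):
    # only count companies that are at least T_days old -- younger ones
    # haven't had a full window to convert yet
    cutoff = pd.Timestamp.now() - pd.Timedelta(days=T_days)
    mature = cohort_df[cohort_df.founded <= cutoff]
    # NaT exits produce NaN here, which compares as False (not converted)
    converted = (mature.exited - mature.founded).dt.days <= T_days
    return converted.mean()
</code></pre></div>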
<p>We extend this to <em>all</em> times <em>T</em>, and plot it as a function of T. Does it take longer to exit for startups that started in 2014 compared to 2008? Let's take a look:</p>
<p><img src="https://erikbern.com/assets/cohort_plot_all_years.png" alt="cohort plot"></p>
<p>I'm not sure what the “official name” of a plot like this is, but generally people refer to it as a <em>cohort plot</em>. For every cohort, we look at the conversion rate <em>at time T</em>. Note that since we don't know anything about the future, we can't say much about the 2016 cohort beyond ~5 months – it includes some companies started in Dec 2016, so we simply don't have full data after 5 months. This is all great and checks a lot of boxes:</p>
<ul>
<li>We can compare conversion rates for different cohorts and understand if it's getting better or worse ✅</li>
<li>We can see if certain cohorts convert faster ✅</li>
</ul>
<p>I generally think of this approach as “as good as it gets” in most situations. My only issue with it is that measuring “conversion at time T” means we can't use much of the recent data. For instance, it would be much better if we could look at 2017 data and see how well it's converting. I like metrics to have fast feedback loops, so surely we can do better? Turns out we can.</p>
<h2 id="the--way-to-look-at-conversion-rates-ndash-kaplan-meier">The 😎 way to look at conversion rates – Kaplan-Meier</h2>
<p><a href="https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator">Kaplan-Meier</a> is a <em>non-parametric</em> estimator originally used to estimate the survival function. Turns out the survival function is 1 minus the conversion rate, so it's the exact same thing essentially. Non-parameteric is good if you have no idea what the underlying distribution you are modeling is.</p>
<p>The best part of the Kaplan-Meier is that it lets us include data for which we simply haven't observed anything past a certain point. This is best illustrated if we broaden each cohort a bit so that they contain a larger span. Let's say we're trying to understand the conversion rate of the 2008-2011 cohort vs the conversion rate of the 2012-2015 cohort. The hypothesis would be that we want to understand how the exit rate has changed over time:</p>
<p><img src="https://erikbern.com/assets/cohort_plot_grouped.png" alt="cohort plot"></p>
<p>This is a super simple plot to draw except for the confidence interval. Just literally divide the number of exited companies by the total number of companies to get a rate.</p>
<p>The problem here is that we can't say anything about the second cohort beyond ~1.5 years because <em>that would require saying something about future data</em>. This cohort contains companies up to and including Dec 31, 2015, which have had slightly less than 18 months of history to convert. On the other hand, the oldest companies in this cohort are from Jan 2012, so they have had a lot of time to convert. Surely we should be able to plot something more for this cohort. The Kaplan-Meier estimator lets us work with this issue by being smart with how it treats “future” data for different observations (“censored” observations, as it's called in survival analysis lingo):</p>
<p><img src="https://erikbern.com/assets/kaplan_meier_grouped.png" alt="kaplan meier grouped"></p>
<p>The implementation is absolutely trivial, although I used the <a href="https://github.com/CamDavidsonPilon/lifelines">lifelines</a> package in Python, and we also get a snazzy confidence interval for free (that part is a bit harder to do yourself). So the conclusion here is that yes – it seems like newer companies don't convert at the same rate as older ones.</p>
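<p>For reference, the lifelines call is about this short (variable names are mine; <code>durations</code> is the time from founding to exit, or to the present for companies that haven't exited, and <code>observed</code> flags actual exits):</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed, label="2012-2015")
# survival_function_ is 1 minus the conversion rate;
# plot() draws the curve with a confidence interval
kmf.plot()
</code></pre></div>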
<p>If you want to implement Kaplan-Meier yourself, the idea is basically to compute a running “survival rate”. If we start out with 100 items and one of them converts at time 1, the survival rate is 99%. We keep computing those rates and multiplying them together. When an observation is “censored”, we just remove it from the denominator going forward:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python">n, k <span style="color:#f92672">=</span> len(te), <span style="color:#ae81ff">0</span>
ts, ys <span style="color:#f92672">=</span> [], []
p <span style="color:#f92672">=</span> <span style="color:#ae81ff">1.0</span>
<span style="color:#66d9ef">for</span> t, e <span style="color:#f92672">in</span> te:
<span style="color:#66d9ef">if</span> e:
<span style="color:#75715e"># whether the event was "observed" (converted)</span>
<span style="color:#75715e"># not observed means they may still convert in the future</span>
p <span style="color:#f92672">*</span><span style="color:#f92672">=</span> (n<span style="color:#f92672">-</span><span style="color:#ae81ff">1</span>) <span style="color:#f92672">/</span> n
n <span style="color:#f92672">-</span><span style="color:#f92672">=</span> <span style="color:#ae81ff">1</span>
ts<span style="color:#f92672">.</span>append(t)
ys<span style="color:#f92672">.</span>append(<span style="color:#ae81ff">100.</span> <span style="color:#f92672">*</span> (<span style="color:#ae81ff">1</span><span style="color:#f92672">-</span>p))
pyplot<span style="color:#f92672">.</span>plot(ts, ys, <span style="color:#e6db74"></span><span style="color:#e6db74">'</span><span style="color:#e6db74">b</span><span style="color:#e6db74">'</span>)
</code></pre></div><p>Kaplan-Meier lets us get a bit more out of each cohort. Look at what happens if we plot one cohort for each year:</p>
<p><img src="https://erikbern.com/assets/cohort_vs_kaplan_meier.gif" alt="kaplan meier all years"></p>
<h2 id="epilogue">Epilogue</h2>
<p>In <a href="/2016/12/05/the-half-life-of-code.html">a previous post</a>, I built a tool to analyze the survival of code. Given that it's obviously survival analysis, I went back and <a href="https://github.com/erikbern/git-of-theseus/pull/32">updated the tool</a> to plot survival rates (of code) using Kaplan-Meier, since it was such a tiny diff. Here's the survival curve of individual lines of code of <a href="https://github.com/git/git">Git</a> itself:</p>
<p><img src="https://erikbern.com/assets/kaplan_meier_git.png" alt="git survival"></p>
<p>Cool to see that a few lines of code are still present after 12 years!</p>
<h2 id="conclusion">Conclusion</h2>
<p>When people talk about conversion, and if there's time lag involved, remember: <em>it's complicated!</em></p>
<p>EDIT(2019-09-26): check out a kind-of v2 to this blog post: how to use <a href="https://better.engineering/2019/07/29/modeling-conversion-rates-and-saving-millions-of-dollars-using-kaplan-meier-and-gamma-distributions/">Weibull and gamma distributions to model conversion rates</a>.</p>
<h2 id="notes">Notes</h2>
<ul>
<li>A <a href="https://ragulpr.github.io/2016/12/22/WTTE-RNN-Hackless-churn-modeling/">fantastic blog post</a> talks about churn prediction from a machine learning perspective. Much more math focused than this post. From the post: <em>The hacky way that i bet 99.9% of all churn-models use is to do a binary workaround using fixed windows</em> (referring to “conversion at time T” as the target variable).</li>
<li>In the plots that are not using Kaplan-Meier, we can compute the confidence intervals using <code>scipy.stats.beta.ppf([0.05, 0.95], k+1, n-k+1)</code>. Generally good to visualize the uncertainty.</li>
<li>I mentioned that non-parametric methods are generally good. To be clear, they can be bad because they don't let you impose priors and other things that can sometimes let you regularize the model.</li>
<li>On that topic, I actually found an interesting <a href="https://pymc-devs.github.io/pymc3/notebooks/survival_analysis.html">Bayesian survival analysis using PyMC3</a> that looks cool. Haven't had the energy/time to fully comprehend it.</li>
<li>I also wanted to point out there are situations where Kaplan-Meier doesn't work. As soon as we're dealing with anything more complicated than a conversion rate (from state X to state Y) then it breaks down.</li>
<li>For instance, let's analyze the <a href="http://www.freddiemac.com/news/finance/sf_loanlevel_dataset.html">Freddie loan level dataset</a> to understand the state of mortgages. At some point in time, a borrower can prepay or default. And for a lot of the more recent observations we don't have enough history to determine the final outcome. Since we have <em>two</em> different end states (defaulting or prepayment) we have to resort to something else. The simplest way is to just compute the normalized share over all the observations that are still active at time T: <img src="https://erikbern.com/assets/freddie.png" alt="freddie"></li>
<li>As usual, the code is <a href="https://github.com/erikbern/conversion">on Github</a></li>
</ul>
The mathematical principles of management2017-04-09T00:00:00Zhttps://erikbern.com/2017/04/09/the-mathematical-principles-of-management.html<p>I've read about 100 management books by now, but if there's one thing that has always bothered me, it's the lack of first-principles thinking. Basically it's a ton of heuristics. And heuristics are great, but when you present heuristics as true objectives, it clouds the underlying objectives (and you end up with weird proxy cults like the Agile movement 👹 – not that I disagree with it, I just wish they could derive it from a more systematic understanding of project management).</p>
<p>The other thing you need is a model of reality. I have an almost dogmatic belief that there is a mathematical model describing everything. Doesn't mean that <em>your</em> model is correct of course. And I'm not super interested in the actual <em>math</em> here – more the dynamics. But there's a set of models, each more and more complex, that describe reality more and more accurately. And I don't mean it in a naïve, everything-is-math kind of way. I totally believe that humans are irrational, and all that stuff. But there's some stuff that can be predicted, and the uncertainty can usually be predicted too, as well as human irrationality/psychology, market behavior, and many other things.</p>
<p>Anyway, my book would be structured roughly in order of these models, adding more and more detail to how reality functions and how to make optimal business decisions. I think of it as layers of an onion – every layer is an extension of the previous model where we add more and more complexity.</p>
<p><img src="https://erikbern.com/assets/management.jpg" alt="cover"></p>
<p><em>I even designed a cover! This is going to sell like god knows what.</em></p>
<p>I'm planning to publish it around 2040, once I've mastered all the pieces. No, but seriously, I would love to read a book like this. I'm still fairly new to the game, so here's just a very rough sketch of what I want the topics to be. If a book could cover management in this way, I'd pre-order 100 copies in a heartbeat.</p>
<ol>
<li>Decision making with perfect information
<ul>
<li>This is the most basic case of decision making and easiest to model</li>
<li>An example here: what's the breakeven time if we upgrade our widget making machine for $100,000 so that it can make 1,000 more widgets per day?</li>
<li>Some of the key concepts in this chapter are:
<ul>
<li>ROI (return on investment) and how to prioritize across different projects</li>
<li>Diminishing return as a function of investment (concavity)</li>
<li>Marginal ROI vs average</li>
<li>Price sensitivity</li>
<li>How to allocate time across several different projects, thinking about it as a constrained optimization problem (Lagrange multipliers and the principle of striving for equal marginal ROI)</li>
<li>Opportunity cost (example: why almost all ideas are bad once you factor in the opportunity cost)</li>
<li>Pipeline/constraint thinking (all the <a href="https://en.wikipedia.org/wiki/Lean_manufacturing">Toyota stuff</a> & <a href="https://praxis.fortelabs.co/theory-of-constraints-101-table-of-contents-8bbb6627915b">Theory of Constraints</a> goes here, as well as the “Lean” and “Agile” movements).</li>
</ul>
</li>
</ul>
</li>
<li>Decision making given uncertainty
<ul>
<li>This is our first extension of the most basic model and it's already getting a bit trickier</li>
<li>Topics:
<ul>
<li>Prior beliefs and Bayes’ rule (example: why common sense and experience is a great prior)</li>
<li>Explore vs exploit (side topic: Thompson sampling)</li>
<li>Rapid iteration vs long term planning (case study: hardware vs software development)</li>
<li>Proxy metrics (optimizing for shareholder value is hard, so let's pick some metric that's easier to measure/move but still has a high correlation)</li>
<li>A/B testing</li>
<li>“Known unknowns vs unknown unknowns”</li>
</ul>
</li>
</ul>
</li>
<li>People management – agency problems
<ul>
<li>This is the first part of managing people – how do you deal with the fact that their needs are not always aligned with the company? I have a feeling a look through history could be quite useful</li>
<li>Topics (by no means exhaustive!):
<ul>
<li>Marxism and the theory that history is a struggle between employers and employees</li>
<li>Taylorism and the first wave of <a href="https://en.wikipedia.org/wiki/Scientific_management">“Scientific Management”</a>.</li>
<li>Why do startups have free lunches and ping pong tables and hedge funds don't? (This is a theory I have – has to do with the fact that performance is a lot harder to measure at startups.)</li>
<li>Why do mediocre managers prefer long term projects?</li>
<li>Why does a culture of instilling risk aversion hurt company performance? (Punishing managers disproportionately for making mistakes means risks are not taken, even when the expected value is positive.)</li>
<li>How performance bonus incentivizes risk taking? (This is sort of the opposite of the above point)</li>
<li>Why it's so hard for companies to change? (Because managers have a vested interest in status quo (and are only looking for Pareto improvements))</li>
</ul>
</li>
</ul>
</li>
<li>People management – information asymmetry
<ul>
<li>The second part of managing people analyzes another reason why people make poor decisions – it's because they didn't have full information. This happens more at big companies</li>
<li>There are probably some really interesting stories from military tactics here. Leading troops in battle pushes decision making to its extremes and forces incredible decentralization of power. I suspect this is one of the better analogies for how to think about information asymmetry from <em>the bottom to the top</em> – i.e. you need to trust the grass roots to run autonomously.</li>
<li>A much more boring (but still important) topic is how to run an efficient meeting culture. Meetings (and emails etc) are the way humans transfer information to each other, and they have terrible bandwidth. Think of it as a 1,000 person company where each person is a modem that can do like 30 baud. How do you organize to propagate information the fastest from top to bottom (and then back up)?</li>
<li>Topics: TBA (because this is an area where my skills lack. I told you this book would be published in 2040!)</li>
</ul>
</li>
<li>People management – bounded rationality
<ul>
<li>Here we're starting to get into behavioral economics. This is fun. This is the remaining piece that explains why managers make suboptimal decisions – they are not acting like rational economic agents.</li>
<li>It's also psychology, of course. How to <del>trick</del> inspire people into doing things. Why people disagree. How to get people to change. And all that stuff.</li>
<li>Topics: TBA. I also haven't structured my thoughts on this – will get back to you in 10-20 years with some more ideas. But basically it's all the Kahneman/Tversky/Thaler stuff – Wikipedia's <a href="https://en.wikipedia.org/wiki/List_of_cognitive_biases">List of cognitive biases</a> but more fun.</li>
</ul>
</li>
<li>Operating in a market
<ul>
<li>The competitive advantage angle is basically Michael Porter stuff but I find his books excruciatingly boring and so let's talk about it in other terms. I think there are really good stories to illustrate how this works (side note but this book is pretty good: <a href="https://www.amazon.com/gp/product/0385479506/ref=oh_aui_detailpage_o05_s00?ie=UTF8&psc=1">Co-opetition</a>).</li>
<li>I also think an understated market is the <em>market of people</em>, aka recruiting. So some of the more nuanced points of hiring top performers would end up here.</li>
<li>Topics:
<ul>
<li>Moats: network effects (Metcalfe's law), scale advantages, proprietary technology, regulatory capture, etc.</li>
<li>Collusion and defection</li>
<li>Other competitive advantages: brand</li>
<li>Suppliers and wholesale transfer pricing</li>
<li>Induced demand</li>
<li>First mover advantage</li>
<li>Case studies: TBA (but I want LOTS of them here! fun war stories! YAY!)</li>
</ul>
</li>
</ul>
</li>
</ol>
<p>Please pre-order my book! It will be delivered straight into your brain around the year 2040 by Amazon's brain implant chip.</p>
<p><img src="https://erikbern.com/assets/book.gif" alt="book"></p>
<p>PENCIL NOT INCLUDED!</p>
The eigenvector of "Why we moved from language X to language Y"2017-03-15T00:00:00Zhttps://erikbern.com/2017/03/15/the-eigenvector-of-why-we-moved-from-language-x-to-language-y.html<p>I was reading yet another blog post titled “Why our team moved from <language X> to <language Y>” (I forgot which one) and I started wondering if you can generalize it a bit. Is it possible to generate a N * N contingency table of moving from language X to language Y?</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Someone should make a N*N contingency table of all engineering blog posts titled "Why we moved from <language X> to <language Y>".</p>— Erik Bernhardsson (@bernhardsson) <a href="https://twitter.com/bernhardsson/status/824278026603073537?ref_src=twsrc%5Etfw">January 25, 2017</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>So I wrote a script for it. You can query Google from a script and get the number of search results using a tiny snippet. I tried a few different query strings, like <em>move from <language 1> to <language 2></em>, <em>switch to <language 2> from <language 1></em> and a few more. In the end you get a nice N * N contingency table of all languages:</p>
<p><img src="https://erikbern.com/assets/prog_lang_matrix.png" alt="contingency table"></p>
<p>Here's where the cool part begins. We can actually treat this as probabilities of switching between languages and say something about what the <em>future</em> language popularities will be. One key property is that the <em>stationary distribution</em> of this process does not depend on the initial distribution – it turns out to be just the first (left) eigenvector of the transition matrix. So you really don't have to make any assumptions about what's popular right now – the hypothetical future stationary state is <em>independent</em> of this.</p>
<p>We need to make this into a <a href="https://en.wikipedia.org/wiki/Stochastic_matrix">stochastic matrix</a> that describes the probabilities of going from state $$ i $$ to state $$ j $$. This is easy – we can interpret the contingency matrix as transition probabilities by just normalizing across each row – this should give a rough approximation of the probability of switching from language $$ i $$ to language $$ j $$.</p>
<p>Finding the first eigenvector is trivial: multiply a vector by the matrix many times and it converges toward the first eigenvector. By the way, see the notes at the end for a bunch more discussion on how I did this.</p>
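<p>In numpy, the whole computation is just a few lines – a minimal sketch, assuming you already have the matrix of search-result counts (the names here are mine, not from the scraping script):</p>
<pre><code>import numpy as np

def stationary_distribution(counts, iters=1000):
    """counts[i, j] = number of search results for moving from language i to j."""
    P = counts / counts.sum(axis=1, keepdims=True)  # row-normalize: stochastic matrix
    pi = np.full(P.shape[0], 1.0 / P.shape[0])      # any starting distribution works
    for _ in range(iters):
        pi = pi @ P                                  # power iteration (left eigenvector)
    return pi / pi.sum()
</code></pre>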
<h2 id="go-is-the-future-of-programming-">Go is the future of programming (?)</h2>
<p>Without further ado, here are the top few languages in the stationary distribution:</p>
<ul>
<li>16.41%: Go</li>
<li>14.26%: C</li>
<li>13.21%: Java</li>
<li>11.51%: C++</li>
<li>9.45%: Python</li>
</ul>
<p>I took the stochastic matrix and sorted it by the <em>future popularity</em> of the language (as predicted by the first eigenvector).</p>
<p><img src="https://erikbern.com/assets/prog_lang_matrix_eig.png" alt="contingency table sorted by eigenvector"></p>
<p>Surprisingly (to me, at least), <em>Go</em> is the big winner here. There's a ton of search results for people moving from X to Go. I'm not even sure how I feel about it (I have mixed feelings about Go) but I guess my infallible analysis points to the inevitable conclusion that Go is something worth watching.</p>
<p>The C language, which turned 45 years old this year, is doing really well here. I did a bunch of manual searches and in many cases a lot of the results are really people writing about how they optimize certain tight loops by moving code from X to C etc. But is that incorrect? I don't think so. C is the lingua franca of how computers work, and if people are still actively moving pieces of code to C then it's here to stay. I seriously think C will be going strong by its 100th birthday in 2072. With my endorsements for C on LinkedIn, I expect recruiters to reach out to me about C opportunities well into the 2050's (actually taking that back – hopefully C will outlive LinkedIn).</p>
<p>Other than that, the analysis pretty much predicts what I would expect. Java is here to stay, Perl is dead, Rust is doing pretty well.</p>
<p>Btw, this analysis reminds me of this old tweet</p>
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Very interesting graphing showing rate of switch between R and Python for data analysis <a href="http://t.co/moYFgrHCBJ">pic.twitter.com/moYFgrHCBJ</a></p>— Big Data Borat (@BigDataBorat) <a href="https://twitter.com/BigDataBorat/status/459137528990167040?ref_src=twsrc%5Etfw">April 24, 2014</a></blockquote>
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<h2 id="javascript-frameworks">Javascript frameworks</h2>
<p>I did the same analysis for frontend frameworks:</p>
<p><img src="https://erikbern.com/assets/js_framework_matrix_eig.png" alt="contingency table sorted by eigenvector"></p>
<p>I expected React to come out on top here, but interestingly Vue is doing really well. I'm also surprised how well Angular stacks up – anecdotally there seems to be a mass exodus away from it.</p>
<h2 id="databases">Databases</h2>
<p><img src="https://erikbern.com/assets/database_matrix_eig.png" alt="contingency table sorted by eigenvector"></p>
<p>I started looking at ride sharing apps, deep learning frameworks, and other things, but the data is far more sparse and less reliable. Will keep you posted!</p>
<h2 id="notescaveats">Notes/caveats</h2>
<ul>
<li>See <a href="https://news.ycombinator.com/item?id=13882601">discussion on Hacker News</a> and <a href="https://www.reddit.com/r/programming/comments/5zonf2/the_eigenvector_of_why_we_moved_from_language_x/">/r/programming</a></li>
<li>This blog post was another inspiration: <a href="https://dev.to/tra/why-i-switch-from-language1-to-language2">Why I switched from language 1 to language 2</a>.</li>
<li><a href="https://gist.github.com/erikbern/7e10efd98e93af94445e250c6d8d2bd0">Here's how to scrape Google and get the number of search results</a></li>
<li>Unfortunately Google rate limits queries by IP, but I ended up using <a href="https://proxymesh.com">Proxymesh</a> to scrape it for all the N * N combinations 🤓</li>
<li>Note that I searched for <em>exact</em> queries by putting it in quotation marks: eg <em>“switch from go to c++”</em></li>
<li>The attentive reader might ask why Javascript wasn't included in the analysis. The reason is (a) if you are doing it on the frontend, you are kind of stuck with it anyway, so there's no <em>moving</em> involved (except if you do crazy stuff like transpiling, but that's really not super common) (b) everyone refers to Javascript on the backend as “Node”</li>
<li>What about the diagonal elements? There is of course a really big probability that people just <em>stay</em> with a certain programming language. But I'm ignoring this because (a) turns out search results for things like <em>stay with Swift</em> is 99% related to Taylor Swift (b) the stationary distribution is actually independent of adding a constant diagonal (identity) matrix (c) it's my blog post and I can do whatever I want 🙉</li>
<li>On (b), it is true that $$ e(\alpha S + (1-\alpha)I) = e(S) $$ where $$ e(\ldots) $$ denotes the first eigenvector and $$ I $$ is the identity matrix (a one-line derivation follows after these notes). This doesn't exactly map to reality – the probability that you stay with a certain language may be different across languages.</li>
<li>The method of repeated multiplications to get the first eigenvector is called <a href="https://en.wikipedia.org/wiki/Power_iteration">power iteration</a>.</li>
<li>Is this model with eigenvectors a super-accurate description of reality? Probably not. The old quote by George Box comes to mind: <em>All models are wrong, some are useful.</em></li>
<li>I also know the chain has to be ergodic and a bunch of other things, but in reality that's basically always the case.</li>
<li>Code is available <a href="https://github.com/erikbern/eigenstuff">on Github</a>.</li>
</ul>
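<p>For completeness, here's the one-line derivation behind point (b) above: if $$ Sv = \lambda v $$, then $$ (\alpha S + (1-\alpha)I)v = \alpha \lambda v + (1-\alpha)v = (\alpha \lambda + 1 - \alpha)v $$. So the blended matrix has exactly the same eigenvectors as $$ S $$, and since $$ |\alpha \lambda + 1 - \alpha| \le 1 $$ whenever $$ |\lambda| \le 1 $$ and $$ 0 < \alpha < 1 $$, with equality only at $$ \lambda = 1 $$, the first eigenvector (the stationary distribution) is unchanged.</p>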
Why I went into the mortgage industry2017-02-17T00:00:00Zhttps://erikbern.com/2017/02/17/why-i-went-into-the-mortgage-industry.html<p>I just realized last Thursday that I have spent two full years at <a href="https://better.com">Better</a>, incidentally on the same day as we announced a <a href="https://www.wsj.com/articles/lender-better-mortgage-gets-new-kleiner-perkins-funding-valuing-firm-at-220-million-1486643386">$15M round</a> led by Kleiner Perkins. So it was a good point to reflect a bit and think back – what the F led me to abandon my role managing the machine learning team at Spotify? To join some random startup in the world's most boring industry? So here's my justification why I love being where I am:</p>
<h2 id="follow-the-smart-people-with-the-good-ideas">Follow the smart people with the good ideas</h2>
<p>Back in 2008, I joined a then unknown company called <em>Spotify</em>. At the time I had an offer from Google and a couple of other places. Obviously all my friends and family thought I was really dumb. <em>Google is the coolest place in the world to work at! Ball pits!!</em> But what made me join Spotify was (a) I loved their beta product (b) all the smart people from school (<a href="https://kth.se">KTH</a>) that I looked up to had joined. So I thought, whatever, I'll try it. I want to work with these guys.</p>
<p>I ended up staying there for about six years during which this obscure company that started in Stockholm transformed the music industry. Anyway, the lesson I learned was: <em>go work for the smartest people you can find, and with a business idea you believe in</em>. Everything else is secondary. Even the business idea is probably secondary, since smart people often end up making the business work (Spotify is arguably an example of this).</p>
<h2 id="the-mortgage-industry-is-pretty-terrible">The mortgage industry is pretty terrible</h2>
<p>… and that's why I love it. It takes about 60 days to get a mortgage in the US. The average loan file is 800 pages. Basically mortgage lending is pdf manufacturing, and almost everything is manual.</p>
<p>Travel agents and stock brokers were obviated by the internet when people realized it's both more convenient and cheaper to disintermediate the human, and connect the users directly to the system. But mortgage lending is still done by a human loan officer pushing around your loan in a CRM and making roughly a percentage point of the loan amount in commission. This is a tax on the information asymmetry between borrowers and the “system” – an extremely complex industry with more regulatory bodies than anyone could keep track of.</p>
<p><img src="https://erikbern.com/assets/big-short.jpeg" alt="brokers"></p>
<p><em>Mortgage brokers from the movie The Big Short</em></p>
<p>I think it's funny when people talk about AI automating a bunch of human processes. I don't disagree, but huge parts of the mortgage industry can be automated by a bunch of web scraping scripts. This is a process that's supposed to be an <em>objective</em> decision based on factual data points about a user. Let humans do what humans are good at, use computers for everything else.</p>
<p>Why hasn't it been automated? It wasn't always hard to get a mortgage, and you might have heard that there was a <a href="https://en.wikipedia.org/wiki/Subprime_mortgage_crisis">minor problem with that</a>. After lots of added regulation (most of it pretty sane), it's very hard to get a mortgage. But the word “hard” is overloaded in English – it can refer to both “high standards” and “tedious”. We focus only on very prime customers, and there's no reason why they should have to jump through a million hoops.</p>
<h2 id="sometimes-its-good-to-be-a-bit-contrarian">Sometimes it's good to be a bit contrarian</h2>
<p>One theory I have is that you should take advantage of “talent arbitrage”. Smart people over the last 20 years have all moved into startups and hedge funds, but indirectly that also means that all other industries will be disproportionately underserved by talent. There is a separate, bigger altruistic/utilitarian argument to be made that smart people increasingly went into industries with small or no externalities – no bigger benefit to other people. I don't think high frequency trading or hyper-optimizing ad CTR is necessarily terrible in itself – what's bad is the <em>opportunity cost</em> of sucking smart people away from jobs with larger externalities.</p>
<p>But the silver lining is I think this will change as the opportunities pop up for smart people to go into these “underdeveloped” fields like education, healthcare, and banking. Being the smartest team in a mediocre industry means you have a huge competitive advantage. Interestingly I see a lot of action in New York targeting these industries, but I don't know if it's a larger trend (yet).</p>
<h2 id="mortgages-are-nearly-untouched-by-technology">Mortgages are nearly untouched by technology</h2>
<p>Tech and tech startups have changed our lives in a lot of amazing ways. Industries like media, communication, and entertainment have gone through enormous changes as a result of distribution costs going to zero and formerly fragmented markets being dominated by winner-takes-all effects. But looking around in the world, there are <em>huge</em> industries waiting for the same transformation. The industries I just mentioned, education, health care, and financial services, are enormous and have barely changed in the last 50 years.</p>
<p>We're finally starting to see some really interesting transformations of financial services – active fund managers are being replaced by index funds, financial advisors are being replaced by robo-advisors, and hedge funds across the board are cutting their management fees. Maybe we're past peak financial intermediation? I think it's highly likely. It's an extremely fragmented market in the US and still heavily based on the physical branch network. Is the industry's “Amazon moment” just around the corner? Hopefully.</p>
<p><em>Case 1: One of Chase Mortgage's <a href="https://secure-dx.com">tech providers</a> is bragging about only 3 days of downtime per year</em></p>
<p><img src="https://erikbern.com/assets/secure-dx.png" alt="secure-dx"></p>
<p><em>Case 2: Some closing agent provider <a href="https://www.closingagent.esignmortgage.com//CATutorial/index.html#0">bragging</a> that they can IMPRESS CUSTOMERS by delivering PHYSICALLY SIGNED documents on a CD</em></p>
<p><img src="https://erikbern.com/assets/paperless.png" alt="paperless"></p>
<h2 id="we-have-a-head-start">We have a head start</h2>
<p>I really think there's a new way of thinking about building consumer products. It's a code that tech startups cracked in the last ten years. Launch early, launch often, iterate, and learn from your consumers. <em>Big banks are absolutely terrible at this</em>. A typical release schedule is six months, with <em>binders</em> describing the steps. In the same amount of time, my team has deployed new incremental changes to production 5,000 times. No wonder banks are completely incapable of building consumer products.</p>
<p>On the other side of the spectrum, mortgage lending is not something you can start doing in a garage. We basically ended up going out and taking over an existing lender, in order to get all the licenses and other things we needed. Sadly, regulation massively favors incumbents, but I guess the flip side is regulatory capture can be good if you're on the right side of it.</p>
<h2 id="consumers-hate-the-current-system">Consumers hate the current system</h2>
<p>The weighted average NPS of the top banks is <em>negative eight</em>. Getting a mortgage is roughly as fun as getting a root canal. Because we've automated almost the entire process, we can manufacture a mortgage far faster, and with a lot less hassle, than other lenders. No more physical documents, and no more uploads. We try to fetch everything from APIs and plug into a fully automated rules engine. The most beautiful part is that we actually <em>save</em> money by doing all these things, since we're cutting out huge chunks of manual labor, as well as the risk of issues.</p>
<p>The mortgage process is the biggest financial transaction that most people go through in their lifetimes. The current system basically leverages the information asymmetry in order to justify the exorbitant commissions. Bait-and-switching is rampant, as are under-the-table kickbacks for referrals (outlawed by <a href="https://en.wikipedia.org/wiki/Real_Estate_Settlement_Procedures_Act">RESPA</a>). We want to get rid of the sketchiness, but also reduce this information asymmetry through <a href="https://labs.better.com/refinance-calculator">tools</a> and product simplifications to make sure consumers make better-informed choices (robo-advisors are very much a source of inspiration here).</p>
<h2 id="were-actually-making-money">We're actually making money</h2>
<p>We aren't building a mobile game or a social network for dogs. We make several thousand dollars in revenue per mortgage, with great unit economics. So we've made substantial money from day 1.</p>
<h2 id="this-is-a-big-industry">This is a big industry</h2>
<p>There are about 7M mortgages every year in the US. The mortgage industry is several orders of magnitude larger than the industry I used to work in, the music industry.</p>
<p>American residential mortgage backed securities (roughly $16T total value) are a larger asset class than US equities. Yet the pricing mechanism is extremely crude – package a bunch of mortgages into a bundle, and price them by a couple of variables like zip, credit score, loan-to-value, and debt-to-income. What if we could take this $16T asset class and make it into a liquid market where prices are determined by models and updated in real time? What if we could auction off mortgages (or slices of mortgages) on a market as they are originated, like the adtech space works? The capital markets side has a lot of opportunities.</p>
<h2 id="were-just-getting-started">We're just getting started</h2>
<p>The cost of terrible mortgage lending isn't just the outrageous commissions. There's also a huge “tax” on mortgage financed real estate transactions because of the risks. Basically cash offers command a premium of somewhere between 5% and 15% of the amount <a href="https://www.researchgate.net/profile/Paul_Asabere/publication/259642061_The_Discounts_Associated_with_Cash_Deals_in_the_Foreclosed_Home_Submarket/links/5758530308aec913749f0283.pdf">(1)</a> <a href="https://www.researchgate.net/profile/Diane_Hite/publication/282963928_Sample_Selection_Approaches_to_Estimating_House_Price_Cash_Differentials/links/5643747a08ae54697fb2e915.pdf">(2)</a>. This is the cost consumers have to take on because the seller (and the seller's realtor) doesn't want to deal with the risk of a transaction falling through. This “tax” is roughly $50-100B per year that American consumers have to pay (on top of already outrageous realtor commissions). So going beyond just the speed and ease of the transaction, our next goal is to reduce the risk to essentially zero.</p>
<h2 id="whats-next">What's next?</h2>
<p>First of all, you should <a href="mailto:erik@nospam.better.com">let me know</a> if you're interested in joining! It's been a great trip over the last two years, and I can't wait to share some of the stuff we're working on.</p>
Language pitch2017-02-01T00:00:00Zhttps://erikbern.com/2017/02/01/language-pitch.html<p>Here's a fun analysis that I did of the <em>pitch</em> (aka. frequency) of various languages. Certain languages are simply pronounced with lower or higher pitch. Whether this is a feature of the language or more a cultural thing is a good question, but there are some substantial differences between languages.</p>
<p><a href="https://en.wikipedia.org/wiki/Hertz">Hertz</a> (or Hz, or $$ s^{-1} $$), is the standard way to measure audio frequency. Typical human speech ranges between 50 Hz and 300 Hz. Most men typically range between 85-180Hz, and most women between 165-255Hz. If we look at a spectrum of audio for English speakers, we get a pretty substantial difference between the genders. This is of course not super surprising to anyone with a pair of ears.</p>
<p><img src="https://erikbern.com/assets/lang_en_male_vs_female.png" alt="male vs female"></p>
<h3 id="comparing-languages">Comparing languages</h3>
<p>Let's look at the top three languages in the data set:</p>
<p><img src="https://erikbern.com/assets/lang_en_es_ru.png" alt="en es ru"></p>
<p>Estimating the peak frequency by language lets us compare all languages:</p>
<p><img src="https://erikbern.com/assets/lang_languages_comparison.png" alt="language comparison"></p>
<p>Cool! Basically Hungarian has a very low pitch and Chinese has a very high one. The difference isn't actually <em>super</em> big – it's about half an octave on a piano. Still, it's something that's very noticeable to a human. Just to give you an idea of how much half an octave is, I altered the pitch of Ronald Reagan's famous “tear down this wall” speech by $$ 2^{\pm 1 / 4} $$ so that the difference between the two variants is half an octave. (One octave is always a doubling of frequency, so half an octave is a factor of $$ 2^{1 / 2} $$.)</p>
<p><audio controls preload="none" style="width: 100%;"><source src="/assets/reagan_lo.wav" type="audio/wav"></audio>
<audio controls preload="none" style="width: 100%;"><source src="/assets/reagan_hi.wav" type="audio/wav"></audio></p>
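<p>(If you want to reproduce this kind of shift yourself, here's a minimal sketch using librosa – not necessarily the tool I used. A quarter octave is 3 semitones, since $$ 2^{3/12} = 2^{1/4} $$; the filenames are placeholders.)</p>
<pre><code>import librosa
import soundfile as sf

# Shift a clip up and down by a quarter octave (3 semitones each way),
# so the two output variants end up half an octave apart.
y, sr = librosa.load('speech.wav', sr=None)
for steps, out in [(3, 'speech_hi.wav'), (-3, 'speech_lo.wav')]:
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
    sf.write(out, shifted, sr)
</code></pre>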
<p>Just to clarify the graph you are seeing above: each circle represents the peak frequency from a <a href="https://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrapped</a> aggregate. The dark circles represent the median of those – arguably the most representative frequency. The purpose is to get an idea of the <em>uncertainty</em> in these estimates, and I stole this idea from <a href="https://fivethirtyeight.com/features/how-unconscious-sexism-could-help-explain-trumps-win/">538</a>. Note that the individual colored circles <em>do not</em> represent individual speakers – it's really just various bootstrap estimates of the peak frequency.</p>
<h3 id="within-languages">Within languages</h3>
<p>We can also look at the <em>origin</em> of the speaker and see how it differs. For English speakers, it turns out that the difference is quite small:</p>
<p><img src="https://erikbern.com/assets/lang_en_origins_comparison.png" alt="en comparison"></p>
<p>However <em>Spanish</em> speakers exhibit pretty substantial differences depending on the location, with Spain having a very low pitch (this seems right to me) and Peruvian women the highest (I have no idea whether this confirms any stereotype). Interestingly, the difference is tiny for males and huge for females.</p>
<p><img src="https://erikbern.com/assets/lang_sp_origins_comparison.png" alt="sp comparison"></p>
<p>There wasn't a ton of audio featuring the Francophones unfortunately:</p>
<p><img src="https://erikbern.com/assets/lang_fr_origins_comparison.png" alt="fr comparison"></p>
<p>There are some other global languages that would have been interesting to look at – Arabic comes to mind. Unfortunately the data I had was a bit limited, and there just wasn't enough data to compare those languages across countries.</p>
<h3 id="the-scandie-situation">The scandie situation</h3>
<p>As a “scandie”, I'm not surprised by some of these. Finnish is known for a very low pitch (making fun of F1 driver <a href="https://www.youtube.com/watch?v=QkxPZXgvlCQ">Mika Häkkinen's English accent</a> was a staple of Swedish comedy for a couple of years), and Norwegian is known for its high-pitched voice (going up and down, like the mountains). Danish is in the middle (but more notably, has basically degenerated into a series of <a href="https://www.youtube.com/watch?v=s-mOy8VUEBk">guttural sounds</a>). Here are the Scandinavian languages (I included Finnish even though it's not a Scandinavian language):</p>
<p><img src="https://erikbern.com/assets/lang_sv_dk_no_fi.png" alt="scandies"></p>
<h3 id="asian-pronunciation">Asian pronunciation</h3>
<p>Let's check out Chinese vs Japanese vs Korean:</p>
<p><img src="https://erikbern.com/assets/lang_zh_ja_ko.png" alt="cjk"></p>
<p>Chinese and Japanese stand out, both being very high-pitched languages.</p>
<p>There's a really weird thing going on here, that I can't explain. Chinese has a really different distribution for <em>males</em>. I suspect this has to do with the fact that it's a tonal language, but I'm not sure why it applies to males only. Let's plot all the Chinese variants to investigate this further.</p>
<p><img src="https://erikbern.com/assets/lang_zh_yue_wuu_nan.png" alt="language comparison"></p>
<p>(By the way, these are languages that ignorant people like me have barely heard of, yet even the smallest of them (Min) <a href="https://upload.wikimedia.org/wikipedia/commons/a/a9/Map_of_sinitic_languages_full-en.svg">has 60M speakers</a>.)</p>
<p>It's quite interesting to see how the Chinese variants seem to have far greater differences in their spectra than English/Spanish/Russian do (see above). I suspect it has something to do with how different variants of Chinese feature different numbers of tones – Mandarin has 4, Wu has 7, etc. The stark difference between male and female speakers (not just different pitch, also a different <em>shape</em>) is still a mystery.</p>
<h3 id="one-weird-thing">One weird thing</h3>
<p>If you're really paying attention to the graphs, you'll notice some weird peaks in the lower end of the spectrum (around 50-60Hz). I didn't really think of it but Andreas Öman <a href="https://twitter.com/andoma/status/824718817628749824">pointed out</a> that these are actually a really interesting artifact. Can you think of it? 🤔</p>
<p>Ok – I'll give you one more hint. Look at this chart of female English speakers and the difference between the UK and the US. Pay attention to the weird spikes on the far left of the curve.</p>
<p><img src="https://erikbern.com/assets/lang_us_vs_uk.png" alt="uk vs en"></p>
<p>These are actually AC frequencies ⚡ which occur at <a href="https://en.wikipedia.org/wiki/Mains_electricity_by_country#Table_of_mains_voltages_and_frequencies">50Hz in Europe vs 60Hz in the US</a>! Turns out you can detect it in the audio spectrum of sound clips pretty well.</p>
<h2 id="ok-but-whats-the-point">Ok, but what's the point?</h2>
<p><speculation></p>
<p>Here's a hypothesis: there might be a “natural” pitch of each language that is optimized for pronunciating words with the least amount of effort. Maybe the most convenient way to speak Chinese is to do it with a high pitch? Maybe Hungarian phonology really favors a low pitch? I don't know.</p>
<p>Another hypothesis: English is my second language, but it's the language I use <em>by far</em> the most (since I live in New York). There was a very definite point when I realized that I had to change my voice to get to the next level with my accent. Oddly enough it was actually while studying German (my third language). It felt awkward at first to alter my voice to the point where I didn't feel like it was <em>myself</em> talking. But on the other hand I could hear myself sounding so much more <em>German</em> (if you know what I mean). Having been through this transformation I decided to change my “English voice” as well.</p>
<p><img src="https://erikbern.com/assets/lang_en_de_sv.png" alt="en vs de vs sv"></p>
<p>This is quite speculative, and a linguist may disagree, but I suspect that one reason it's hard to acquire a native language accent is that it's hard to fully change the pitch of the voice. For instance, Mandarin speakers who picked up English as a second language seem to speak English with a slightly higher pitch. Finnish speakers seem to speak English with an unusually low pitch. Etc.</p>
<p>The pitch is of course not the only thing that constitutes a flawless accent – mastering all the nuances of a new language takes years, if it ever happens. But I suspect carrying over the pitch from another language makes it harder to get all the finer points of the phonology.</p>
<p></speculation></p>
<h2 id="how-did-i-perform-this-analysis">How did I perform this analysis?</h2>
<p>I built a very tiny scraper to download audio clips from a web site that shall not be named (I probably violated their ToS severely). Letting it loose for a few hours got me ~100k audio clips in different languages/genders/origins. I restricted it to 10 clips per speaker in order to get more variety.</p>
<p>After that, I converted all the clips to .wav and started playing around with pitch detection. I hadn't expected this to be such a sinkhole of effort, but it turns out pitch detection is a <em>hard</em> problem.</p>
<p>As a preprocessing step, I found the 2.0s in the clip with the most intensity, and I added a Butterworth bandpass filter with a [50, 300] Hz passband. A somewhat frustrating bug is that scipy.signal.butter <a href="https://github.com/scipy/scipy/issues/2980">is not numerically robust</a>, so I basically just perform a low-pass filter followed by a high-pass filter instead. From the little I remember from signal processing in school, this should be equivalent. The bandpass filter seems a bit arbitrary, but the way I rationalize it is that it's a form of Bayesian prior distribution over where you expect the fundamental frequency to pop up.</p>
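<p>Here's a minimal sketch of that workaround (the function and its defaults are mine):</p>
<pre><code>import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(signal, fs, lo=50.0, hi=300.0, order=4):
    """Low-pass at `hi` followed by high-pass at `lo`, instead of one bandpass."""
    nyq = fs / 2.0
    b_lo, a_lo = butter(order, hi / nyq, btype='low')
    b_hi, a_hi = butter(order, lo / nyq, btype='high')
    return filtfilt(b_hi, a_hi, filtfilt(b_lo, a_lo, signal))
</code></pre>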
<p>With the help of some <a href="https://gist.github.com/endolith/255291/fb8794f0bc5d4ae98890fcbaa0af58e75f781993">online resources</a>, I decided to go for a pure <a href="https://en.wikipedia.org/wiki/Fast_Fourier_transform">FFT</a> approach. Basically just look at the spectrum and find the peak frequency. The problem with FFT is that various harmonic frequencies pop up in the resulting spectrum, so if the clip has a fundamental frequency of 100 Hz then you will see peaks at 200 Hz, 300 Hz, etc. For this reason, an autocorrelation approach seemed marginally better on <em>individual clips</em>, but I realized quite quickly that individual clips are too noisy to be meaningful. Here's a sample of audio clips featuring 10 female English speakers and their FFT spectra:</p>
<p><img src="https://erikbern.com/assets/lang_ffts.png" alt="10 individual ffts"></p>
<p>Instead, I decided to aggregate the frequency spectra from all clips and perform stats on the aggregated data. Again from what I remember from school, FFT is a linear operation, $$ f(x+y) = f(x) + f(y) $$, so it seems fine to compute the FFT spectrum individually for each clip, then add them together and try to look for the peak frequency. This disregards the fact that FFT returns <em>complex numbers</em>, but I <em>think</em> you can ignore the phase shift in the spectrum (the angle of the complex outputs) and just add up the $$ \mbox{abs}(f(x)) $$ components. I could be wrong (I'm swimming in deep DSP water here).</p>
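<p>Concretely, the aggregation looks something like this – a sketch, assuming each clip is already a mono numpy array at a common sample rate:</p>
<pre><code>import numpy as np

def peak_frequency(clips, fs):
    """Sum magnitude spectra across clips and return the peak frequency."""
    n = min(len(c) for c in clips)                 # common length so bins line up
    spectrum = np.zeros(n // 2 + 1)
    for clip in clips:
        spectrum += np.abs(np.fft.rfft(clip[:n]))  # drop phase, keep magnitude
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    return freqs[np.argmax(spectrum)]
</code></pre>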
<p>This seemed to do the trick, and then on top of it I wasted a bunch of hours trying to learn <a href="http://pandas.pydata.org/">Pandas</a>. I also decided to use bootstrapping to compute some uncertainty estimate of the frequency estimation, so that instead of summing up all the spectra, I sum up a bootstrapped aggregate and look at the peak frequency. This helps me understand the uncertainty of the peak frequency estimate, which turns out to be pretty substantial. (There's probably a bunch of statisticians that would spin in their graves reading this blog post.)</p>
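<p>The bootstrap itself is just resampling clips with replacement and recomputing the peak each time – a sketch, reusing the peak_frequency helper above:</p>
<pre><code>def bootstrap_peaks(clips, fs, n_boot=100, seed=0):
    """The spread of the returned peaks approximates the estimate's uncertainty."""
    rng = np.random.default_rng(seed)
    peaks = []
    for _ in range(n_boot):
        sample = [clips[i] for i in rng.integers(0, len(clips), size=len(clips))]
        peaks.append(peak_frequency(sample, fs))
    return peaks
</code></pre>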
<p>Anyway, <a href="https://github.com/erikbern/lang-pitch">the code is on Github</a> in case you are interested! The whole thing is ~250 lines so somewhat ironically this blog post actually ended up being far longer!</p>
Functional programming is the libertarianism of software engineering2017-01-10T00:00:00Zhttps://erikbern.com/2017/01/10/functional-programming-is-the-libertarianism-of-sw-eng.html<p>This is a pretty dumb post, in which I argue that functional programming has a lot of the bad parts of libertarianism and a lot of the good parts:</p>
<ul>
<li>Both ideologies strive to eliminate [the] state. <em>(ok, dumb dad joke)</em></li>
<li>Both ideologies are driven by a set of dogmatic axioms rather than a practical goal:</li>
<li>Libertarianism wants to reduce the government because any involvement distorts free markets. I always struggled to see what the underlying objective function is (it doesn't seem to be maximization of people's utility). 🤔</li>
<li>Functional programming wants to reduce side effects and make everything pure, often by enforcing onerous type systems. But why? Again I don't see an ultimate objective here. IMO it should start from the principle that the goal of a programming language should be to <em>make the programmers as productive as possible.</em> For instance, the little research that exists has shown that most bugs have little to do <a href="https://vimeo.com/74354480">with typing</a> and I'd expect something similar to apply to mutable state. In fact the largest class seems to be <a href="https://blog.acolyer.org/2016/10/06/simple-testing-can-prevent-most-critical-failures/">poor error handling</a> <em>(ok, typing isn't necessarily related to FP, but in practice I find that strong typing and FP have highly overlapping fan clubs).</em></li>
<li>Both camps invoke obscure cases in history as proof of success: libertarians (more so anarchists, I guess) often talk about Spain during the civil war, <a href="https://mises.org/library/stateless-somalia-and-loving-it">Somalia</a>, or sometimes Singapore. Haskell acolytes are very eager to bring up <a href="https://code.facebook.com/posts/745068642270222/fighting-spam-with-haskell/">Facebook's spam filtering</a>.</li>
<li>YET – and this is the surprising part imho – both ideologies are ~90% correct (source: my opinion). Which really surprises me given that they start from an (imo) arbitrary set of axioms.</li>
<li>Even if you are a die hard bolshevik, you benefit from an understanding of how interventions distort markets, how incentives matter, and how entrepreneurship is the driver of progress.</li>
<li>Even if you are coding in Visual Basic, you can level up your skills by learning FP: making functions pure when needed, avoiding state, avoiding reassigning variables, avoiding mutable data structures, writing pipelines of data transformations, and all that jazz that FP has taught us to cherish.</li>
</ul>
<p><img src="https://erikbern.com/assets/mind_blown.gif" alt="mind blown"></p>
<p>End of stupid post.</p>
The half-life of code & the ship of Theseus2016-12-05T00:00:00Zhttps://erikbern.com/2016/12/05/the-half-life-of-code.html<p><img src="https://erikbern.com/assets/trireme.jpg" alt="trireme"></p>
<p>As a project evolves, does the new code just add on top of the old code? Or does it replace the old code slowly over time? In order to understand this, I built a <a href="https://github.com/erikbern/git-of-theseus">little thing</a> to analyze Git projects, with help from the formidable <a href="https://gitpython.readthedocs.io/en/stable/">GitPython</a> project. The idea is to go back in history and run a <code>git blame</code> (making this somewhat fast was a bit nontrivial, as it turns out, but I'll spare you the details, which involve opportunistically caching files, picking historical points spread out in time, using <code>git diff</code> to invalidate changed files, etc.).</p>
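<p>The core computation boils down to something like this – a heavily simplified sketch using GitPython (the real project linked above adds the caching, sampling, and invalidation tricks):</p>
<pre><code>import git  # GitPython
from collections import Counter

def cohort_sizes(repo_path, rev='HEAD'):
    """At one revision: count surviving lines by the year they were written."""
    repo = git.Repo(repo_path)
    counts = Counter()
    for blob in repo.commit(rev).tree.traverse():
        if blob.type != 'blob':
            continue
        try:
            for commit, lines in repo.blame(rev, blob.path):
                counts[commit.committed_datetime.year] += len(lines)
        except git.GitCommandError:
            pass  # binary files etc. make git blame unhappy
    return counts
</code></pre>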
<p>In a moment of clarity, I named it “Git of Theseus”, a terrible pun on the <em>ship of Theseus</em>. I'm a dad now, so I can make terrible puns. It refers to a philosophical paradox, where the pieces of a ship are replaced over hundreds of years. If all pieces are replaced, is it still the same ship?</p>
<p><em>The ship wherein Theseus and the youth of Athens returned from Crete had thirty oars, and was preserved by the Athenians down even to the time of Demetrius Phalereus, for they took away the old planks as they decayed, putting in new and stronger timber in their places, in so much that this ship became a standing example among the philosophers, for the logical question of things that grow; one side holding that the ship remained the same, and the other contending that it was not the same.</em></p>
<p>It turns out that code doesn't exactly evolve the way I expected. There <em>is</em> a “ship of Theseus” effect, but there's also a compounding effect where codebases keep growing over time (maybe I should call it the “Second Avenue Subway” effect, after the construction project in NYC that's been going on <a href="https://en.wikipedia.org/wiki/Second_Avenue_Subway#Initial_attempts">since 1919</a>).</p>
<p>Let's start by analyzing Git itself. Git became <a href="https://en.wikipedia.org/wiki/Self-hosting">self-hosting</a> early on, and it's one of the most popular and oldest Git projects:</p>
<p><img src="https://erikbern.com/assets/git-git.png" alt="git"></p>
<p>This plots the aggregate number of lines of code over time, broken down into cohorts by the year added. I would have expected more of a decay here, and I'm surprised to see that so much code written back in 2006 is still alive in the code base – interesting!</p>
<p>We can compute the decay for individual commits too. If we align all commits at x=0, we can look at the aggregate decay for code in a certain repo. This analysis is somewhat harder to implement than it sounds (mostly because newer commits have had less time to decay, so the right end of the curve represents an aggregate of fewer commits).</p>
<p>For Git, this plot looks like this:</p>
<p><img src="https://erikbern.com/assets/git-git-survival.png" alt="git git survival"></p>
<p>Even after 10 years, 40% of the lines of code are still present! Let's look at a broader range of (somewhat randomly selected) open source projects:</p>
<p><img src="https://erikbern.com/assets/git-projects-survival.png" alt="git projects survival"></p>
<p>It looks like Git is somewhat of an outlier here. Fitting an exponential decay to Git and solving for the half-life gives roughly 6 years.</p>
<p><img src="https://erikbern.com/assets/git-git-survival-exp-fit.png" alt="git git survival exp fit"></p>
<p>Hmm… not convinced this is necessarily a perfect fit, but as the famous quote goes: <em>All models are wrong, some models are useful.</em> I like the explanatory power of an exponential decay – code has an expected life time and a constant risk of being replaced.</p>
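<p>The fit itself is a few lines with scipy – a sketch, assuming you've extracted arrays of commit ages and the fraction of their lines still surviving:</p>
<pre><code>import numpy as np
from scipy.optimize import curve_fit

def fit_half_life(ages_years, fraction_surviving):
    """Fit S(t) = exp(-lam * t), then half-life = ln(2) / lam."""
    decay = lambda t, lam: np.exp(-lam * t)
    (lam,), _ = curve_fit(decay, ages_years, fraction_surviving, p0=[0.2])
    return np.log(2) / lam
</code></pre>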
<p>I suspect a slightly better model would be to fit a <em>sum of exponentials</em>. This would work for a repo with some code that changes fast and some code that changes slowly. But before going down a rabbit hole of curve fitting, I reminded myself of von Neumann's quote: <em>With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.</em> There's probably some way to make it work, but I'll revisit some other time.</p>
<p>Let's look at a lot of projects in aggregate (also sampled somewhat arbitrarily):</p>
<p><img src="https://erikbern.com/assets/git-projects-survival-exp-fit.png" alt="git projects survival exp fit"></p>
<p>In aggregate, the half-life is roughly 3.33 years. I like that, it's an easy number to remember. But the spread is <em>big</em> between different projects. The aggregate model doesn't necessarily have super strong predictive power – it's hard to point to an arbitrary open source project and expect half of it to be gone 3.33 years later.</p>
<h2 id="moar-repos">Moar repos</h2>
<p>Apache (aka <a href="https://github.com/apache/httpd">HTTPD</a>) is another repo that goes way back:</p>
<p><img src="https://erikbern.com/assets/git-httpd.png" alt="apache">
<img src="https://erikbern.com/assets/git-httpd-survival.png" alt="apache"></p>
<p><a href="https://github.com/rails/rails">Rails</a>:</p>
<p><img src="https://erikbern.com/assets/git-rails.png" alt="rails">
<img src="https://erikbern.com/assets/git-rails-survival.png" alt="rails"></p>
<p><em>Beautiful</em> exponential fit!</p>
<p><a href="https://github.com/nodejs/node">Node</a></p>
<p><img src="https://erikbern.com/assets/git-node.png" alt="node">
<img src="https://erikbern.com/assets/git-node-survival.png" alt="node"></p>
<p>Wanna run it for your own repo? Again, code is <a href="https://github.com/erikbern/git-of-theseus">available here</a>.</p>
<h2 id="the-monster-repo-of-them-all">The monster repo of them all</h2>
<p>Note that most of these repos took at most a few minutes to analyze, using my script. As a final test I decided to run it over the <a href="https://github.com/torvalds/linux">Linux kernel</a> which is <em>HUGE</em> – 635,229 commits as of today. This is 16 times larger than the second biggest repo I looked at (<a href="https://github.com/rails/rails">rails</a>) and took multiple days to analyze on my shitty computer. To make it faster I ended up computing the full <code>git blame</code> only for commits spread out by at least 3 weeks, and also limited it to <code>.c</code> files:</p>
<p><img src="https://erikbern.com/assets/git-linux.png" alt="linux"></p>
<p>The squiggly lines are probably from the sampling mechanism. But look at this beauty – a whopping 16M lines! The code contribution from each year's cohort is extremely smooth at this scale. Individual commits have absolutely no meaning at this scale – the cumulative sum of them is very predictable. It's like <a href="https://en.wikipedia.org/wiki/Kinetic_theory_of_gases">going from Newton's laws to thermodynamics</a>.</p>
<p><img src="https://erikbern.com/assets/git-linux-survival.png" alt="linux"></p>
<p>Linux also clearly exhibits more of a linear growth pattern. I'm speculating that this has to do with its high modularity. The <code>drivers</code> directory has by far the most files (22,091), followed by <code>arch</code> (17,967), which contains support for various architectures. This is exactly the kind of thing you would expect to scale very well with complexity, since these components have well defined interfaces.</p>
<p>Somewhat off topic, but I like the notion of how well a project scales with complexity. Linear scaling is the ultimate goal, where each marginal feature takes roughly the same amount of code. Bad projects scale superlinearly, and every marginal feature takes more and more code.</p>
<p>It's interesting to go back and contrast Linux to something like Angular, which basically exhibits the opposite behavior:</p>
<p><img src="https://erikbern.com/assets/git-angular.png" alt="linux">
<img src="https://erikbern.com/assets/git-angular-survival.png" alt="linux"></p>
<p>The half-life of a randomly selected line in Angular is about 0.32 years. Does this reflect on Angular? Is the architecture basically not as “linear” and consistent? You might say the comparison is unfair, because Angular is new. That's a fair point. But I wouldn't be surprised if it does reflect on some questionable design. Don't mean to be shitting on Angular here, but it's an interesting contrast.</p>
<h2 id="half-life-by-repository">Half-life by repository</h2>
<p>A somewhat arbitrary sample of projects and their half-lifes:</p>
<table>
<thead>
<tr>
<th>project</th>
<th>half-life (years)</th>
<th>first commit</th>
</tr>
</thead>
<tbody>
<tr>
<td>angular</td>
<td>0.32</td>
<td>2014</td>
</tr>
<tr>
<td>bluebird</td>
<td>0.56</td>
<td>2013</td>
</tr>
<tr>
<td>kubernetes</td>
<td>0.59</td>
<td>2014</td>
</tr>
<tr>
<td>keras</td>
<td>0.69</td>
<td>2015</td>
</tr>
<tr>
<td>tensorflow</td>
<td>1.08</td>
<td>2015</td>
</tr>
<tr>
<td>express</td>
<td>1.23</td>
<td>2009</td>
</tr>
<tr>
<td>scikit-learn</td>
<td>1.29</td>
<td>2011</td>
</tr>
<tr>
<td>luigi</td>
<td>1.30</td>
<td>2012</td>
</tr>
<tr>
<td>backbone</td>
<td>1.48</td>
<td>2010</td>
</tr>
<tr>
<td>ansible</td>
<td>1.52</td>
<td>2012</td>
</tr>
<tr>
<td>react</td>
<td>1.66</td>
<td>2013</td>
</tr>
<tr>
<td>node</td>
<td>1.76</td>
<td>2009</td>
</tr>
<tr>
<td>underscore</td>
<td>1.97</td>
<td>2009</td>
</tr>
<tr>
<td>requests</td>
<td>2.10</td>
<td>2011</td>
</tr>
<tr>
<td>rails</td>
<td>2.43</td>
<td>2004</td>
</tr>
<tr>
<td>django</td>
<td>3.38</td>
<td>2005</td>
</tr>
<tr>
<td>theano</td>
<td>3.71</td>
<td>2008</td>
</tr>
<tr>
<td>numpy</td>
<td>4.15</td>
<td>2006</td>
</tr>
<tr>
<td>moment</td>
<td>4.54</td>
<td>2015</td>
</tr>
<tr>
<td>scipy</td>
<td>4.62</td>
<td>2007</td>
</tr>
<tr>
<td>tornado</td>
<td>4.80</td>
<td>2009</td>
</tr>
<tr>
<td>redis</td>
<td>5.20</td>
<td>2010</td>
</tr>
<tr>
<td>flask</td>
<td>5.22</td>
<td>2010</td>
</tr>
<tr>
<td>httpd</td>
<td>5.38</td>
<td>1999</td>
</tr>
<tr>
<td>git</td>
<td>6.04</td>
<td>2005</td>
</tr>
<tr>
<td>chef</td>
<td>6.18</td>
<td>2008</td>
</tr>
<tr>
<td>linux</td>
<td>6.60</td>
<td>2005</td>
</tr>
</tbody>
</table>
<p>It's interesting that <a href="https://github.com/moment/moment">moment</a> has such a high half-life, but the reason is that so much of the code is locale-specific. This creates more linear scaling, with a stable core of code and linear additions over time. <a href="https://github.com/expressjs/express">express</a> is an outlier in the other direction. It's 7 years old but code changes extremely quickly. I'm guessing this is partly because (a) lack of linear scalability in code (b) it's probably one of the first major Javascript open source projects to hit mainstream/popularity, surfing on the Node.js wave. Possibly the code base also sucks, but I have no idea 😊</p>
<h2 id="has-coding-changed">Has coding changed?</h2>
<p>I can think of three reasons why there's such a strong relationship between the year the project was initiated, and the half-life</p>
<ol>
<li>Code churns more early on in projects, and becomes more stable a while in</li>
<li>Coding has changed from 2006 to 2016, and modern projects evolve faster</li>
<li>There's some kind of selection bias where the only projects that survive are the scalable, stable ones</li>
</ol>
<p>Interestingly, I don't find any clear evidence of #1 in the data. The half-life for code written early in old projects is as high as for code written late. I'm skeptical about #3 as well because I don't see why there would be a relation between survival and code structure (but maybe there is). My conclusion is that <strong>writing code has fundamentally changed in the last 10 years.</strong> Code really seems to change at a much faster rate in modern projects.</p>
<p>By the way, see discussion <a href="https://news.ycombinator.com/item?id=13112449">on Hacker News</a> and <a href="https://www.reddit.com/r/programming/comments/5gqurc/the_halflife_of_code_the_ship_of_theseus/">on Reddit</a>!</p>
Are data sets the new server rooms?2016-11-01T00:00:00Zhttps://erikbern.com/2016/11/01/are-data-sets-the-new-server-rooms.html<p>This blog post <a href="https://medium.com/@josh_nussbaum/data-sets-are-the-new-server-rooms-40fdb5aed6b0">Data sets are the new server rooms</a> makes the point that a bunch of companies raise a ton of money to go get really proprietary awesome data as a competitive moat. Because once you have the data, you can build a better product, and no one can copy it (at least not very cheaply). Ideally you hit a virtuous cycle as well, where usage of your system once it takes off gives even more data, which makes the system even better, which attracts more users…</p>
<p>The behavior of machine learning models with increasing amounts of data is interesting. If you're building a machine learning based company, first of all you want to make sure that <em>more data gives you better algorithms.</em></p>
<p>But that's a necessary, not sufficient condition. You also need to find a sweet spot where</p>
<ul>
<li>It's not <em>too easy</em> to collect enough data, because then the value of your data is small</li>
<li>It's not <em>too hard</em> to collect enough data, because then you're going to spend way too much money to solve the problem (if ever)</li>
<li>The value of the data keeps growing the more data you get</li>
</ul>
<p>In the recommender system world (where I spent 5 years) it's not uncommon for algorithms to basically converge after say 100M or 1B data points. It depends on how many items you have, of course. Some class of models converge before they are even useful, in which case obviously there's no value in more data. Xavier Amatriain has an <a href="https://www.quora.com/In-machine-learning-is-more-data-always-better-than-better-algorithms">excellent Quora answer</a> that I urge you to check out if you want to learn more.</p>
<p>Anyway let's simplify the problem. Let's consider the behavior of some algos:</p>
<p><img src="https://erikbern.com/assets/ml-algo-data-size.png" alt="algo perf"></p>
<ul>
<li>The blue model represents problems where it's really easy to get good data pretty cheaply. For instance, a cat vs dog classifier is not a useful piece of tech because the value of getting that training data is roughly $0. I would worry about this for any company building a general purpose image classifier, for instance. Or if you're building a recommender system with 10k items, 10M ratings might already be good enough. Having 100B ratings isn't necessarily more valuable.</li>
<li>The red model can happen in cases where your data comes from a different distribution or your loss function isn't close to what the product requires. In those cases more data is useless at some point. If you're building a movie recommender system by scraping web text it might just converge to a decent but not good enough model. (Here's another hypothesis: maybe collecting <em>passive</em> data from driving a car isn't enough to learn how to <em>actively</em> drive a car?)</li>
<li>The green model is when your problem may require such a ridiculous amount of data that it's not practical. For instance building a general purpose question and answer service that can solve <em>all the questions in the world</em> isn't that hard from a ML perspective if you have an infinite amount of data of questions and answers. But it's probably going to be useless with less than terabytes or petabytes of input data. If I tried to build a virtual assistant, this would be my biggest concern.</li>
</ul>
<p>Here are some sweet spots where I think you <em>can</em> build up a data set, but it's hard. Hard is good because it means once you did it, you have a moat:</p>
<ul>
<li>Detect fraud in transaction data</li>
<li>Predict which loans are going to default</li>
<li>Detect crimes from security footage</li>
</ul>
<p>Hard to remember? Here's a handy table I made</p>
<p><img src="https://erikbern.com/assets/data-size-sweet-spot.png" alt="sweet spot"></p>
<p>I think the general idea is pretty valid. But is it 100% correct? Probably not. Is it oversimplified? Oh yeah, to the extreme.</p>
Pareto efficiency2016-10-25T00:00:00Zhttps://erikbern.com/2016/10/25/pareto-efficiency.html<p><a href="https://en.wikipedia.org/wiki/Pareto_efficiency">Pareto efficiency</a> is a useful concept I like to think about. It often comes up when you compare items on multiple dimensions. Say you want to buy a new TV. To simplify it let's assume you only care about two factors: price and quality. We don't know what you are willing to pay for quality – but we know that <em>everything else equal</em>:</p>
<ol>
<li>The cheaper the better.</li>
<li>The higher quality the better.</li>
</ol>
<p>This means we can rule out some TV's immediately. If TV number 1 is both cheaper and better quality than TV number 2, then there's no point buying number 2. Other TV's we can't compare because they excel at different things (price and quality). We obtain a <a href="https://en.wikipedia.org/wiki/Partially_ordered_set">partial ordering</a>. If you keep removing TV's that are dominated this way you end up with a set of TV's on the <em>Pareto frontier</em>. The really nice thing is we don't have to worry about what scales we're using. We also don't need to know what the consumer's tradeoff function looks like. All we need is a way to rank quality and a way to rank price.</p>
<p>Wikipedia features the typical definitional math that's extremely hard to get any intuition from:</p>
<p><em>The Pareto frontier, $$ P(Y) $$, may be more formally described as follows. Consider a system with function $$ f: \mathbb{R}^n \rightarrow \mathbb{R}^m $$, where $$ X $$ is a compact set of feasible decisions in the metric space $$ \mathbb{R}^n $$ and $$ Y $$ is the feasible set of criterion vectors in $$ \mathbb{R}^m $$, such that $$ Y = \{ y \in \mathbb{R}^m : \; y = f(x), \; x \in X \} $$. We assume that the preferred directions of criteria values are known. A point $$ y^{\prime\prime} \in \mathbb{R}^m $$ is preferred to (strictly dominates) another point $$ y^{\prime} \in \mathbb{R}^m $$, written as $$ y^{\prime\prime} \succ y^{\prime} $$. The Pareto frontier is thus written as</em></p>
<p>$$ P(Y) = \{ y^{\prime} \in Y : \; \{ y^{\prime\prime} \in Y : \; y^{\prime\prime} \succ y^{\prime}, \; y^{\prime\prime} \neq y^{\prime} \} = \emptyset \}. $$</p>
<p>This is pretty useless to me. I think it's much more important to develop a visual intuition for how it works. You can define it in high dimensional spaces but let's look at a two dimensional space first. So think of it as a list of TV's and the axes are “quality” and “price”. The only important thing is that one TV <em>dominates</em> another TV if it's better quality <em>and</em> better price. That's pretty intuitive to me. If you remove all the TV's that are dominated by another TV, you end up with the <em>Pareto frontier</em>. These are the TV's that are the only meaningful choices.</p>
<p>One note about the TV example: lower prices are of course better than higher prices. So you can think of the two factors as (-price, quality) just so that <em>right</em> and <em>up</em> is always better.</p>
<p>Sometimes Pareto frontiers are drawn like this:</p>
<p><img src="https://erikbern.com/assets/pareto-frontier-dumb.png" alt="pareto frontier"></p>
<p>The red dots are the TV's that are on the <em>Pareto frontier</em>. You don't want to buy a TV that's not on the Pareto frontier since there's always another one that's both higher quality and better price. The lines between the dots are just for illustrative purposes, they are not really a part of the frontier. I actually think this makes a lot more sense to draw it this way:</p>
<p><img src="https://erikbern.com/assets/pareto-frontier.png" alt="pareto frontier"></p>
<p>In this case let's say Samsung has a TV that's on the Pareto frontier (one of the red dots) and they are launching a new generation. They can move it up and right (all consumers would prefer that!). This improvement would be <em>Pareto efficient</em> or a <em>Pareto improvement</em>. They can move it down and left (this would be worse for all customers!). Or they can move it up+left or down+right and depending on the consumer they may or may not like it.</p>
<p>Benchmarks are a good use case for computing the Pareto frontier. Often it's hard to break it down to a single number. I maintain <a href="https://github.com/erikbern/ann-benchmarks">benchmarks</a> for approximate nearest neighbor libraries, where there's a tradeoff between accuracy and speed. It's useful to plot all points on a 2D graph:</p>
<p><img src="https://erikbern.com/assets/ann-benchmarks-glove.png" alt="pic"></p>
<p>Another thing I used it for relatively recently was to present the best mortgages to a consumer. Basically there are only two attributes: <em>interest rate</em> and <em>points/credits</em>. We actually compute the Pareto frontier for our <a href="https://better.com/#/quick-rate">rate table</a>. In this case smaller numbers are better along both dimensions. This is really just the same thing, just flipping the comparisons.</p>
<p><img src="https://erikbern.com/assets/quick-rate.png" alt="quick rate"></p>
<p>I don't really know a whole lot about Pareto efficiency, but here are some notes on the technical aspect of it:</p>
<ul>
<li>If you're comparing items along two dimensions, you can do it in $$ \mathcal{O}(n\log n) $$. How? You sort the items along the x axis and traverse through them. Any time the y value of the current point is larger than the y value of the previous point, you remove the previous point – see the sketch after this list. It kind of resembles <a href="https://en.wikipedia.org/wiki/Graham_scan">Graham scan</a> for convex hull, which I guess isn't so surprising – finding the Pareto frontier is basically convex hull in some weird degenerate geometry. <img src="https://erikbern.com/assets/pareto-scan.gif" alt="pareto scan"></li>
<li>If it's more than two dimensions, you can do it in $$ \mathcal{O}(n^2) $$. Just compare each pair of items and remove any strongly dominated one. I doubt there's a faster algorithm but maybe? I'm sure there's some obscure Ph.D. dissertation improving the bound to $$ \mathcal{O} (n^{2 - 2/d} \log n \log \log n \log^{*} n) $$ with an algorithm that is useless for practical purposes unless $$ n > 10^{24} $$.</li>
<li>The Wikipedia article about <a href="https://en.wikipedia.org/wiki/Multi-objective_optimization">Multi-objective optimization</a> talks a lot about Pareto efficiency. Something worth reading up on later.</li>
<li>The probability that a point is on the frontier goes towards 1 as the number of dimensions grows. I don't know the exact relation, but I simulated it numerically (this assumes each dimension is independent): <img src="https://erikbern.com/assets/pareto-dims.png" alt="pareto dims"></li>
</ul>
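<p>To make the algorithms above concrete, here's a minimal sketch in Python – my own throwaway code, not from any of the projects mentioned in this post. It assumes bigger is better along every axis (so for the TV example, feed it (-price, quality) tuples), and the 2D version keeps non-dominated points while sorting descending rather than removing points while sorting ascending, which amounts to the same thing:</p>
<pre><code class="language-python">import random

def pareto_frontier_2d(points):
    # O(n log n): sort by x descending, keep any point whose y beats
    # everything seen so far (nothing with larger x can dominate it)
    frontier, best_y = [], float('-inf')
    for x, y in sorted(points, reverse=True):
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    return frontier

def pareto_frontier(points):
    # O(n^2), works in any number of dimensions
    def dominates(p, q):
        return p != q and all(a >= b for a, b in zip(p, q))
    return [p for p in points if not any(dominates(q, p) for q in points)]

def frontier_fraction(n, d, trials=10):
    # Monte Carlo estimate of the probability that a random point is on
    # the frontier, assuming independent uniform dimensions
    total = 0.0
    for _ in range(trials):
        pts = [tuple(random.random() for _ in range(d)) for _ in range(n)]
        total += len(pareto_frontier(pts)) / n
    return total / trials
</code></pre>
<p>The last function is the kind of simulation behind the dimensions chart; the quadratic helper inside it means you want to keep $$ n $$ modest when playing with it.</p>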
<p>Here are some various mental models I have that involve Pareto efficiency:</p>
<ul>
<li>I often think of Pareto efficiency in terms of decision making. The closer you get to the Pareto frontier (of the space of <em>possible</em> solutions), the harder it is to make a decision that's Pareto efficient. Meaning for almost all decisions, you're going to have to sacrifice <em>something</em>. For instance when you do a big refactoring of a system it's easy to get hung up on trying to preserve all features while adding a few new ones. In reality this is going to be extremely hard or impossible. If you can do it, it's possibly because you forgot to include some <em>other axis</em> in your analysis, like code complexity. It's like pushing a balloon into a box.</li>
<li>I think <a href="https://en.wikipedia.org/wiki/Loss_aversion">loss aversion</a> may sometimes be explained by people trying to make Pareto efficient decisions.</li>
<li>You could model why big companies are so slow to change course. Say <em>Conglomerated Fruits, Inc</em> is close to or on the Pareto frontier in terms of how they allocate resources to fruit production. Now, it turns out consumers are suddenly demanding goji berries a lot more than bananas. Rationally the company should fire the Chief Banana Officer, shut down the banana division and reinvest all their money in goji berries. But because the Chief Banana Officer sits on the management team, this becomes an extremely hard thing to do. My conclusion from this silly example is that you should really think twice before assigning the responsibility of a functional area to a single person.</li>
<li>Here's another example: <em>Amalgamated Travel Agency, Inc.</em> runs a very profitable business selling flight tickets over the phone. Some internet startups start selling flight tickets online. Of course, they think, <em>a flight ticket is a huge transaction, and people want to talk to another human</em>. Turns out buying flight tickets online is better along every dimension except that one. <em>Amalgamated Travel Agency, Inc.</em> dies because they don't realize customers don't need a Pareto efficient product improvement.</li>
<li>A simple model for <em>why buying decisions are so hard</em> is that they involve Pareto efficiency – the market will drive out all TV's that are dominated, leaving only the TV's on the Pareto frontier. That makes it a lot harder as a consumer because now every choice will be a trade-off.</li>
<li>I loved this blog post from 538 about <a href="http://fivethirtyeight.com/features/marco-rubio-and-the-pareto-frontier/">Marco Rubio and the Pareto Frontier</a>.</li>
<li>When a deal is Pareto efficient and improves the outcome for everyone, it's called a <a href="https://en.wikipedia.org/wiki/Coase_theorem">Coasean bargain</a>.</li>
<li>I think it's helpful to think of competitive advantage in terms of Pareto efficiency. In a commodity industry like oil drilling, the number of dimensions to compete on is very small (e.g. one). So you will end up with a few very big companies. Whereas in something like clothing there are a lot of dimensions, so you should expect a more fragmented market.</li>
<li>For some moderate number of dimensions, such as $$ d = 20 $$, if you have about $$ 7 \cdot 10^9 $$ items, then <em>most</em> of them will be Pareto efficient (assuming independence between factors). So let's think about it for a while. It means that if we pick 7 billion people (the world's population) and 20 somewhat independent factors (IQ, height, vision, income, etc.), then <em>most of the world's population</em> is Pareto efficient. So it's hard/impossible to find another person who is better than you at <em>everything</em> :sunglasses:</li>
</ul>
State drift2016-09-08T00:00:00Zhttps://erikbern.com/2016/09/08/state-drift.html<p>I generally haven't written much about software architecture. People make heuristics into religion. But here is something I've thought about: <em>how to build self-correction into systems</em>. This has been something just vaguely sitting in my head, lacking a clear conceptual definition, until a whole slew of things popped up today that all had the exact same issue at their core. I'm going to refer to it as <em>state drift</em>, lacking a better term.</p>
<h2 id="what-is-state-drift">What is state drift?</h2>
<p><img src="https://erikbern.com/assets/factory.gif" alt="factory"></p>
<p>State drift is when there are two components that synchronize state. Actually, it doesn't even have to be two, and the “state” may be somewhat virtual rather than explicit. But to make it easy, let's say we have a producer who maintains a state, and the producer sends deltas to an observer, which updates its own state. If care isn't taken, the state in the observer will start to drift away from what it's supposed to be. Without a self-correction mechanism, you are forever screwed.</p>
<p>These three real world cases came up independently of each other today:</p>
<ul>
<li><strong>Service updates over email</strong> – sadly some of our vendors have suboptimal API's (this is the mortgage industry) and we have to resort to email parsing to get certain updates. But emails get lost and formats change and suddenly you realize your view of the outstanding service order doesn't match the vendor's view.</li>
<li><strong>Webhooks</strong> – have a lot of other annoying issues (more on that later) but in particular there are rarely any redelivery guarantees. A lot of API's call back using webhooks when some data is ready (for operations that take longer than a second), but what if that callback gets lost in the ether?</li>
<li><strong>Websockets</strong> to push state to clients – we use a <a href="https://en.wikipedia.org/wiki/Single-page_application">single page app</a> and some of the state is pushed from the backend to the frontend, meaning the frontend has its own view of what the backend state looks like.</li>
</ul>
<h2 id="why-cant-you-just-write-code-with-100-uptime-trollface">Why can't you just write code with 100% uptime? :trollface:</h2>
<p>Ok, great, thanks! I should have thought of that.</p>
<p>But yeah, the problem is that if you lose a single delta and there's no way to self-correct, it's <em>game over</em>. Sadly it doesn't matter if you have 99.99% uptime. Conversely, a system with a level of self-correction built in can afford quite a lot of errors and it will recover very nicely. In a lot of scenarios, you don't even have a choice. TCP connections randomly drop, and messages will get lost. Ideally you want both solid redelivery guarantees <em>and</em> some way to recover from corrupted state.</p>
<h2 id="how-to-solve-it">How to solve it?</h2>
<p>Just briefly wanted to mention that I'm not a super big fan of webhooks, and I will elaborate a bit later.</p>
<p>I think the best solution when you have webhooks (or any delta pushing mechanism) is to treat the updates <em>opportunistically</em>, and use a separate mechanism to reconcile state every once in a while. This can often be done in the simplest possible way – every once in a while, copy the entire state from the producer to the observer.</p>
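<p>As a sketch of what that looks like (hypothetical names – a real version needs versioning and locking, but the shape is the point):</p>
<pre><code class="language-python">import time

class Observer:
    def __init__(self, producer, reconcile_interval=3600):
        self.producer = producer        # assumed to expose full_state()
        self.state = {}
        self.reconcile_interval = reconcile_interval
        self.last_reconciled = 0.0

    def on_delta(self, key, value):
        # opportunistic: fine if some of these get lost
        self.state[key] = value

    def maybe_reconcile(self):
        # self-correction: copy the entire state every once in a while,
        # which bounds how far the drift can go
        if time.time() - self.last_reconciled > self.reconcile_interval:
            self.state = dict(self.producer.full_state())
            self.last_reconciled = time.time()
</code></pre>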
<p>Of course, you can make this more advanced – afaik Git uses <a href="https://en.wikipedia.org/wiki/Merkle_tree">Merkle trees</a> to synchronize state. Another example is <a href="https://en.wikipedia.org/wiki/Rsync#Algorithm">rsync</a>, which has an algorithm that compares two directory listings before it copies any files. These are state synchronization mechanisms that are designed to detect and fix any discrepancies.</p>
<p><a href="https://github.com/spotify/luigi">Luigi</a> (mentioning it mostly because I'm the author) has a particularly simple reconciliation mechanism – the file system. Every time a workflow starts, it checks what's been done and what's not been done by seeing what exists (although this can be any user-provided checkpoint, in practice it's local files, HDFS, S3 or something similar). It pushes everything to the server during scheduling. During the work phase, it pushes deltas to the server, and the server tells the worker what to work on next. Occasionally either of those things fail, but we can ignore without much concern. The fact that a full state synchronization happens every scheduling means the state drift is limited.</p>
<p>Now that I think about it, the same principle applies to <a href="https://en.wikipedia.org/wiki/Video_compression_picture_types">video compression</a>. If there were no key frames, the state drift would increase slowly over time. Key frames bring the state drift back to 0.</p>
<p>Full state reconciliation might be way too expensive, so another complementary strategy is to make sure there are very strong delivery guarantees for the deltas. Systems like <a href="http://kafka.apache.org/">Kafka</a> use a distributed durable commit log to store a message queue. If a consumer dies, it can re-consume all messages from a certain point in time. It would be awesome if websockets could work the same way, so that if the TCP connection is lost, we know where to resume from.</p>
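<p>The idea reduced to a toy – this is not the Kafka API, just an illustration of consuming from a durable log with offsets:</p>
<pre><code class="language-python">class DurableLog:
    def __init__(self):
        self.messages = []                # in reality: a replicated commit log

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1     # the message's offset

    def read_from(self, offset):
        return list(enumerate(self.messages[offset:], start=offset))

log = DurableLog()
for msg in ['a', 'b', 'c', 'd']:
    log.append(msg)

# a consumer that crashed can resume from its last committed offset
committed_offset = 1                      # persisted before the crash
for offset, msg in log.read_from(committed_offset + 1):
    print(offset, msg)                    # re-consumes only what was missed
</code></pre>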
<h2 id="speaking-of-webhooks">Speaking of webhooks…</h2>
<p>Not a super big fan. Here are some reasons:</p>
<ul>
<li>I have a simple script to call some remote server, but suddenly I need to run an HTTP server inside it</li>
<li>… and I also need to poke holes through a bunch of firewalls</li>
<li>A huge problem is the lack of redelivery guarantees. If you don't have 100% uptime, which is basically impossible, then you will lose data. If that data is not possible to recover, you have a problem.</li>
<li>In practice the URL of the webhook is often hardcoded somewhere, meaning you can't decouple the caller/callee. Ideally you want multiple clients to be able to call and receive updates from the same API. The easiest solution to this is to make the webhook URL part of the request payload.</li>
<li>Debugging issues is much harder because you don't know where something got lost. If something is always pulling on one end, you know immediately when it breaks. With push, that doesn't happen.</li>
</ul>
<p>… but webhooks are here to stay, so let's accept it. Here's a free open source project idea (I'm actually somewhat tempted to build this): Provide a simple service that lets you expose a webhook to the world, but store all incoming requests in a durable queue. Support long polling (think <code>tail -f</code>) for updates as well. I think it would be almost trivial to build something like this, using a Lambda worker in AWS together with SQS, but I have little experience so I'll defer it for now. But feel free to steal :)</p>
When machine learning matters2016-08-05T00:00:00Zhttps://erikbern.com/2016/08/05/when-machine-learning-matters.html<p>I joined Spotify in 2008 to focus on machine learning and music recommendations. It's easy to forget, but Spotify's key differentiator back then was the low-latency playback. People would say that it felt like they had the music on their own hard drive. (The other key differentiator was licensing – until early 2009 Spotify basically just had all kinds of weird stuff that employees had uploaded. In 2009 after a crazy amount of negotiation the music labels agreed to try it out as an experiment. But I'm getting off topic now.)</p>
<p>Music distribution is a trivial problem now. Put everything on a CDN and you're done. The cost of bandwidth and storage has gone down by an order of magnitude, not to mention the labor cost needed to build and maintain it.</p>
<p>Anyway, at some point in 2009 we realized that we had far bigger challenges at Spotify than building a music recommendation system. So instead, I switched gears and ran the “Analytics team” for 2 years. We did the first A/B tests, ad delivery optimizations, provided data points crucial to bizdev deals, etc.</p>
<p>Not until 2013 did we feel like it was time to focus on music recs. So I switched back and built up a team around that. The feeling was that we had already solved the “tablestakes” problems around music distribution and music management. Those problems had become easy to solve for anyone. The next differentiator would be more advanced features that deliver user value and are harder for competitors to copy. So we focused a lot on ML again.</p>
<p>Which brings me to <a href="https://twitter.com/bernhardsson/status/754784884082745344">this conclusion</a></p>
<p><img src="https://erikbern.com/assets/scarface.png" alt="scarface"></p>
<p>In the majority of all products, machine learning will not be a key differentiator in the first five years.</p>
<h2 id="most-machine-learning-is-sprinkles-on-the-top">Most machine learning is sprinkles on the top</h2>
<p>The first few years of product iteration are about getting the “tablestakes” out of the way. The ROI of those is just vastly bigger. I lead the tech team at a startup and we are nowhere near using any kind of sophisticated machine learning, two years into the process. There are a few promising opportunities where we want to use it. I absolutely think it's going to be a huge competitive advantage for us. But right now far simpler things matter. Spending a few days working on the conversion funnel is guaranteed to deliver far more business value.</p>
<p>Rarely is machine learning the fundamental <em>enabler</em> of a product. It's often an <em>enhancer</em>. This unfortunately means that the machine learning team isn't a team that creates the core business value and has a crucial strategic role. It will be the team that comes in after 5-10 years once the “basic” features have been built and then squeezes out another 10% MAU by A/B testing the crap out of the product. Despite the current AI hype, most of the big shops focus on relatively mundane things. Google is trying to get you to click on more ads, Facebook to use the newsfeed more. It's all incremental improvements on top of a product that already existed for 10 years.</p>
<p><img src="https://erikbern.com/assets/enhance.gif" alt="enhance">{:width="1024px”}</p>
<p><em>Obviously the image above has nothing to do with this post. I just thought it was funny. Sorry.</em></p>
<h2 id="pick-your-competitive-advantage">Pick your competitive advantage</h2>
<p>How can we get around this? How can we build a company that's founded based on machine learning first? I suspect ML in itself is very rarely a competitive advantage. <em>Any machine learning company needs to find a sustainable non-ML advantage.</em> Do you have a fantastic set of image filters? Great, use that tiny head start, launch an app and build a social network. Do you have a really good fraud detection system? Go out and sign up enterprise customers that feed you data back.</p>
<p>Machine learning can be a first mover advantage. But there's a high likelihood whatever insight you have will be independently discovered and published at the next NIPS/KDD/ICML. You need to turn it into something sustainable – having data, or lots of users, or very sticky enterprise contracts, or something else.</p>
<p>Besides the core machine learning, <em>other technology</em> can definitely be a competitive advantage. Building super nasty integrations with vendors, or figuring out the control engineering of the suspension system of a self driving car. Those are proprietary assets where there's little open research. For the pure machine learning, I think we'll see a separate force of commoditization, where the technological differential between companies converges towards zero. Knowing how to build a convolutional neural network will not be a valuable asset. Hooking it up to a surveillance system and building a video distribution system could be a really key piece of technology.</p>
<p>Don't underestimate the power of data. Scraping the web doesn't create a valuable asset. But if you can obtain highly valuable unique data, then that's a huge competitive advantage. Another type of data I think people underestimate is in people's heads – <em>learnings</em> from real production usage. E.g. Netflix has iterated on movie recommendations for 10 years. They know their shit. It's hard to build a better recommender system even if you magically had ten times the data that Netflix has.</p>
<p>What seems to happen in reality is that the human capital becomes the real asset. <a href="https://www.cbinsights.com/blog/top-acquirers-ai-startups-ma-timeline/">Here's a list of some acquisitions</a>. It's clear to me these acquisitions were 90% acqui-hire – about human capital being redeployed to something else. Google and other big players have shown that they are willing to pay a huge premium for smart teams (throwing out a fun conspiracy theory just for the sake of it: Google is going to acqui-hire any team with smart people just to create a talent monopoly). These companies all had built some cool tech, but the price paid really represented the scarcity of skills. I expect that scarcity to vanish gradually.</p>
Subway waiting math2016-07-09T00:00:00Zhttps://erikbern.com/2016/07/09/waiting-time-math.html<p>Why does it suck to wait for things? In a <a href="/2016/04/04/nyc-subway-math.html">previous post I analyzed a NYC subway dataset</a> and found that at some point, quite early, it's worth just giving up.</p>
<p>This isn't a proof that the subway doesn't run on time – in fact it might actually <em>prove that the subway runs really well</em>. The numbers indicate that it's not worth waiting after 10 minutes, but that's a rare event and usually involves something extraordinary like a multi-hour delay. You should roughly give up after some point related to the normal train frequency, and 10 minutes is not a lot at all. Conversely, if the trains ran hourly, it probably would have been worth waiting an hour or more. My analysis gave me a lot of respect for the job MTA is doing.</p>
<p>But there's another effect that greatly impacts waiting time: the <em>variance</em>. It turns out that the statistics of waiting make it very sensitive to variance.</p>
<h2 id="how-variance-destroys-your-waiting-time">How variance destroys your waiting time</h2>
<p>Let's consider this toy model. Let's say the time between subways is drawn from the set of elements $$ T_1, \ldots, T_n $$. So the average time between subways is just $$ \frac{1}{n} \left( T_1 + T_2 + \ldots + T_n \right) $$.</p>
<p>Now let's say we arrive at a random point in time. With probability $$ T_j / \left( T_1 + T_2 + \ldots + T_n \right) $$, the next time to the subway is $$ T_j $$ from the previous one. I think of it as the following schematic:</p>
<p><img src="https://erikbern.com/assets/T_1_T_2.png" alt="waiting time"></p>
<p>Let's think about how long you have to wait given that you arrive at some random point in time. If the next train is train $$ j $$ then you end up having to wait some random amount of time distributed uniformly between $$ 0 \ldots T_j $$, the average of which is $$ T_j / 2 $$.</p>
<p>Multiplying the probabilities with the averages, you can see that the average time will be</p>
<p>$$ \frac{1}{2} \frac{T_1^2 + T_2^2 + \ldots + T_n^2}{T_1 + T_2 + \ldots + T_n} $$</p>
<p>It's basically the average value of the following graph:</p>
<p><img src="https://erikbern.com/assets/T_1_T_2_waiting_time.png" alt="waiting time"></p>
<p>Digging into the statistics we can actually rewrite the quantity as:</p>
<p>$$ \frac{1}{2} \frac{E[T^2]}{E[T]} = \frac{1}{2} \frac{E[T]^2 + Var(T)}{E[T]} = \frac{1}{2}E[T] \left(1 + c_v^2\right)$$</p>
<p>Here $$ c_v $$ denotes the <a href="https://en.wikipedia.org/wiki/Coefficient_of_variation">relative standard deviation</a> – the standard deviation divided by the average. This makes it a bit more clear what happens. The average waiting time scales with the <em>variance</em> of the time between subway trains. If there is absolutely no dispersion, then all subways arrive at the exact same interval, and the average waiting time would be exactly half the time between subway trains.</p>
<p>The key thing here is to note that we have <em>squares</em> in the numerator. A single large value of $$ T_j $$ can have an outsized impact on the average time you have to wait. For instance if $$ n = 2 $$ then the average waiting time behaves almost as $$ \frac{1}{2} \max(T_1, T_2) $$:</p>
<p><a href="https://gist.github.com/erikbern/a7c502aabb2892c3e86f9c58261d3da3"><img src="https://erikbern.com/assets/waiting_times_n_2.png" alt="waiting times"></a></p>
<p>If $$ n = 10 $$ and all $$ T_i = 1 $$ minute, then the average waiting time is 30 seconds. But let's say we have a single $$ T_{10} = 10 $$ minutes – then the average waiting time is 2 minutes 52 seconds. So by adding a single 10x train delay with 10% probability, the <em>average</em> waiting time goes up by a factor of 5.7x! This illustrates how a single bad egg can destroy the average performance of the system and create a kind of “bottleneck” effect.</p>
<p><img src="https://erikbern.com/assets/T_1_T_10.png" alt="waiting times"></p>
<p>Even a 1% risk of a 10x delay pushes up the average waiting time by a whopping 1.83x – this is still while running 99% of the trains on time.</p>
<p><a href="https://gist.github.com/erikbern/a9dfef6ff1834ccfac8dc58a9d80d019"><img src="https://erikbern.com/assets/train_problems.png" alt="waiting times"></a></p>
<p>These charts above were artificial toy datasets with only two different outcomes, but we can generalize it to any probability distribution.</p>
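<p>You can sanity-check the formula with a quick simulation – my own toy script, assuming you arrive at a uniformly random point in time, so you land in a gap with probability proportional to its length:</p>
<pre><code class="language-python">import random

def avg_wait(gaps, samples=200000):
    total = float(sum(gaps))
    weights = [g / total for g in gaps]
    # pick a gap weighted by its length, then arrive uniformly within it
    simulated = sum(
        random.uniform(0, random.choices(gaps, weights)[0])
        for _ in range(samples)
    ) / samples
    closed_form = 0.5 * sum(g * g for g in gaps) / total
    return simulated, closed_form

# nine 1-minute gaps plus one 10-minute gap:
print(avg_wait([1] * 9 + [10]))   # both come out around 2.87 min = 2 min 52 s
</code></pre>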
<h2 id="how-much-does-the-variance-matter-in-reality">How much does the variance matter in reality?</h2>
<p>I crunched the MTA dataset and ended up with the following numbers:</p>
<p><img src="https://erikbern.com/assets/avg_time_waiting.png" alt="avg waiting time"></p>
<p>In most cases it makes the average waiting time longer by about 2x – however, poor Staten Islanders are experiencing some really bad waiting times caused by the SI line's variance. This data is only during the day. The interesting conclusion here is that the average waiting time could be <em>cut in half</em> by just making the subway trains run more evenly. You wouldn't have to add a single train to the system or drive them faster. Just make sure they are more evenly spaced out.</p>
<p>I was hoping maybe L would exhibit terrible stats here, finally proving the popular belief that <a href="http://www.istheltrainfucked.com/">L sucks</a>. Unfortunately I'm still not able to see any quantitative evidence supporting this. Does L really deserve its terrible reputation? I'm starting to think it's really one of the best run subway lines in NYC.</p>
<ul>
<li><em>Note 1: I published a version of this blog post earlier this week but it contained a few errors that have been fixed.</em></li>
<li><em>Note 2: This post contains a bunch of math that isn't rendered properly in RSS. Sorry about that!</em></li>
</ul>
Approximate nearest news2016-06-02T00:00:00Zhttps://erikbern.com/2016/06/02/approximate-nearest-news.html<p><img src="https://erikbern.com/assets/tree-full-K.png" alt="pic"></p>
<p>As you may know, one of my (very geeky) interests is <a href="https://en.wikipedia.org/wiki/Nearest_neighbor_search">Approximate nearest neighbor</a> methods, and I'm the author of a Python package called <a href="https://github.com/spotify/annoy">Annoy</a>.</p>
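<p>For context, here's roughly what using Annoy looks like – a minimal sketch along the lines of the README, with arbitrary parameters:</p>
<pre><code class="language-python">import random
from annoy import AnnoyIndex

f = 40                               # dimensionality of the vectors
index = AnnoyIndex(f, 'angular')     # cosine-style distance
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)                      # more trees = better precision, slower build
print(index.get_nns_by_item(0, 10))  # the 10 approximate nearest neighbors of item 0
</code></pre>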
<p>I've also built a benchmark suite called <a href="https://github.com/erikbern/ann-benchmarks">ann-benchmarks</a> to compare different packages. Annoy was the world's fastest package for a few months, but two things happened.</p>
<ul>
<li><a href="https://falconn-lib.org/">FALCONN</a> (FAst Lookups of Cosine and Other Nearest Neighbors) is a new library based on Locality Sensitive Hashing</li>
<li><a href="https://github.com/searchivarius/nmslib">NMSLIB</a> authors came up with an impressive set of improvements to their algorithms.</li>
</ul>
<p>The quest for the fastest nearest neighbor algorithm intensifies. (Except, I expect Google has already found a method that's 10x faster and works at a 1,000,000x larger scale).</p>
<p>What makes me excited is that ann-benchmarks has become the standard benchmark for approximate nearest neighbor algorithms. I'm glad, because open objective benchmarks drive progress. Both FALCONN and NMSLIB have been using ann-benchmarks for their tests, and the authors contributed code back to ann-benchmarks to support their libraries.</p>
<p>The situation looks like this right now:</p>
<p><img src="https://erikbern.com/assets/ann-benchmarks-glove.png" alt="glove"></p>
<p>If you look at the higher precision results (0.8 and up), Annoy is now the fourth fastest library. FALCONN and NMSLIB (both SW-graph and hnsw) are better. In fact, hnsw is faster by almost an order of magnitude, which is very impressive. This is a new algorithm that was recently published in a paper: <a href="http://arxiv.org/abs/1603.09320">Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs</a> (the paper also features a benchmark against Annoy).</p>
<h2 id="using-annoy-to-visualize-large-datasets">Using Annoy to visualize large datasets</h2>
<p>A cool thing to do when you have a high dimensional data set is to embed it in 2D or 3D for visualization. There are algorithms that preserve similarity during the embedding, the most widely used one being <a href="https://lvdmaaten.github.io/tsne/">t-SNE</a>.</p>
<p>A great paper came out that improves on t-SNE for 2D and 3D embedding of large high dimensional data sets: <a href="https://arxiv.org/abs/1602.00370">Visualizing Large-scale and High-dimensional Data</a>. The authors of the paper propose a new method that they implement using Annoy – it also mentions ann-benchmarks. Unfortunately I haven't seen any code yet (I've been tempted to build something) but there's an <a href="https://github.com/elbamos/largeVis">R implementation</a> that doesn't use Annoy.</p>
<p>For any readers of this blog who are (a) interested in approximate nearest neighbors (b) based in Philadelphia (I expect the intersection of those sets to be $$ \approx \emptyset $$): I'm going to talk about Annoy at the <a href="http://www.meetup.com/DataPhilly/events/231326003/">DataPhilly meetup</a> on June 15. It's going to be quite similar to <a href="http://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup">my talk at the NYC Machine Learning</a> meetup.</p>
What is your motivation?2016-05-24T00:00:00Zhttps://erikbern.com/2016/05/24/what-is-your-motivation.html<p>I've been trying to learn Clojure. I keep telling people I meet that I really want to learn Clojure, but still every night I can't get myself to spend time with it. It's unclear if I really want to learn Clojure or just want to <em>have learned</em> Clojure?</p>
<p>Which makes me think about my teenage years. I really wanted to make music on my computer. I would spend a couple of hours per week messing around with various tools. This was in the heyday of DSL and I downloaded gigabytes of music on Napster. I just really liked the thought of making something that was as good as what I heard. Fast forward fifteen years and this guy roughly sums up why I never succeeded:</p>
<div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
<iframe src="https://www.youtube.com/embed/Gm_PzInGTCA" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" allowfullscreen title="YouTube Video"></iframe>
</div>
<p><a href="https://en.wikipedia.org/wiki/Moodymann">Moodymann</a> isn't just a character, he's also a producer from Detroit who explains his motivation:</p>
<p><em>I'm not into this to press up a mass amount of records. I'm not into this to travel the motherfucking world. I'm not into this to impress anybody. I'm into this for my own heart and soul. A lot of people, after work, you got to go home, you take a bath. A lot of people go home, you fuck your wife. A lot of people go home, you cut your grass. I go home, and I fuck that motherfucking <a href="https://en.wikipedia.org/wiki/Music_Production_Center">MPC</a> all fucking night.</em></p>
<p>The key points here are that (a) Moodymann derives intrinsic joy from making music, and (b) it's a prime example of the kind of <a href="https://en.wikipedia.org/wiki/Protestant_work_ethic">Lutheran work ethic</a> that lets you excel at what you're doing. Of course, regardless of my intent to learn Clojure, someone else will actually enjoy doing it all fucking night (paraphrasing Moodymann) and eventually they will be 100x better than me.</p>
<p>I've met roughly 100 people who say they want to learn machine learning. But do they? Or do they just want to <em>have learned machine learning</em>? I don't know. But I know that I've never learned anything nontrivial without having fun learning it. So it seems like the trick to learn anything is really to find a way to <em>enjoy learning it</em>. Not try to find any shortcuts. If you want to learn Spanish, try to look for a way to enjoy it. Then do it all night long.</p>
<p><strong>The other extreme</strong></p>
<p>Turns out there's another extreme of this. When the joy from doing something is so strong that it overshadows the whole utility of it. I've met about 1000 developers who just <em>love</em> to work with cool things. Building algorithms, messing around with machine learning, whatever. People roughly fall on a spectrum in terms of motivation: are they motivated by <em>delivering value</em> or are they motivated by the <em>tool</em>?</p>
<p>I've learned to avoid hiring people in the latter category. I love ridiculously smart people, and I don't mind at all if they like to geek out and study Haskell in their spare time. But if they turn their whole career into a quest to advance their knowledge of functional programming, then there's some fundamental misalignment between the company's interests and their interests.</p>
<p>Another example of this phenomenon is a trap I fell into myself. I was coding for many many years and built up this innate sense of satisfaction every time I implemented something. Being a new manager I really struggled with the lack of satisfaction from actually <em>doing management</em>. As a result I would fall back into the zone writing code and ignore everything that really mattered. It took me <em>years</em> to build up the sense of satisfaction of doing something that mattered at a higher level.</p>
<p><strong>What is your motivation?</strong></p>
<p>My mental model is something like this: there are all these long term goals, like becoming successful. But planning for that would take an incredible amount of mental effort, so we end up establishing some proxies for it. In a day to day setting these proxies manifest themselves as a sense of accomplishment. You get 2 units of satisfaction when you write code for an hour and you get 1 unit of satisfaction being in a meeting for 1 hour. Or whatever. <em>But those numbers can be completely out of whack with reality.</em> Sometimes they're grossly understated and sometimes grossly overstated.</p>
Dollar cost averaging2016-04-26T00:00:00Zhttps://erikbern.com/2016/04/26/dollar-cost-averaging.html<p>(I accidentally published an unfinished draft of this post a few days ago – sorry about that).</p>
<p>There's a lot of sources preaching the benefits of <em>dollar cost averaging</em>, or the practice of investing a fixed amount of money regularly. The alleged benefit is that when the price goes up, well, then your stake is worth more, but if the price goes down, then you get more shares for the same amount of money. <a href="https://en.wikipedia.org/wiki/Dollar_cost_averaging">According to</a> Wikipedia, it “minimises downside risk”, about.com <a href="http://beginnersinvest.about.com/cs/newinvestors/a/041901a.htm">says</a> it “drastically reduces market risk”, and an article on Nasdaq.com <a href="http://www.nasdaq.com/article/why-dollar-cost-averaging-is-a-smart-investment-strategy-cm354240">claims that</a> it's a “smart investment strategy”.</p>
<p>This is nonsense if you think about it. You never benefit from a price going down. The stock market is mostly efficient and if it goes down by 1% then it's as likely to go down further as it is to come back up.</p>
<p>Separately, it's generally a false dichotomy to talk about dollar cost averaging as an investment strategy, since it makes it seem like there are other equally viable strategies available. For the average person who wants to save money by investing it into the stock market, there's really no choice between strategies. Given that the stock market is expected to go up, the earlier you invest, the better. So the best strategy is always to invest your money as soon as you can.</p>
<p>So if you invest say $100 every week, is that inherently a lot less risky than going all-in with a $1M investment and holding it? Are we more protected from market downturns? I downloaded S&P 500 data going back to 1870 and wanted to take a look.</p>
<p><img src="https://erikbern.com/assets/sp500_return.png" alt="s&p 500 return"></p>
<p>The key thing here is to use the <em>total return</em>, i.e. with dividends reinvested, or you will understate the yield. You can get the data <a href="http://data.okfn.org/data/core/s-and-p-500">here</a>.</p>
<p>Comparing lump sum vs dollar cost averaging is a bit weird since the cash flows are so different. One way you can do it is to compute the <a href="https://en.wikipedia.org/wiki/Internal_rate_of_return">internal rate of return</a>. Let's look at five-year investment horizons and what the value of a dollar cost averaging strategy would give us, versus a lump sum investment:</p>
<p><img src="https://erikbern.com/assets/lump_vs_dcav.png" alt="lump vs DCAV"></p>
<p>We see that there's an improvement in using DCA although it's quite small – an additional return of about 25 basis points every year. However, the simulation shows that dollar cost averaging actually is <em>more</em> likely to be in the red five years later: <em>12.9%</em> compared to <em>11.0%</em> for a lump sum investment.</p>
<p>If you look at it a bit closer, it turns out none of these differences are statistically significant. The conclusion here is that the difference, if it exists, must be very small.</p>
<h2 id="how-to-compute-irr">How to compute IRR</h2>
<p>Computing the internal rate of return (also called annual percentage rate) is a fun little numerical problem. You want to find the rate $$ r $$ such that the net present value of the payment stream is equal to zero:</p>
<p>$$ \sum_i c_i (1 + r)^{-i} = 0 $$</p>
<p>where $$ c_i $$ is the payment/income at time $$ i $$. Or if you use <a href="https://en.wikipedia.org/wiki/Compound_interest#Continuous_compounding">continuously compounding rates</a> you have the equivalent relation $$ \sum_i c_i e^{-ri} = 0 $$. Either way, it's the same problem as finding the roots of a polynomial. <a href="http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.irr.html"><code>numpy.irr</code></a> has a pretty terrible implementation of this that's extremely slow. I ended up open sourcing a <a href="https://github.com/better/irr">very simple implementation</a> that uses binary search.</p>
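<p>The binary search version is only a few lines. Here's a sketch (not the exact open sourced code) that assumes the NPV is monotonically decreasing in the rate, which holds for the usual invest-first-collect-later cash flows:</p>
<pre><code class="language-python">def irr(cashflows, lo=-0.99, hi=10.0, tol=1e-9):
    def npv(rate):
        return sum(c / (1.0 + rate) ** i for i, c in enumerate(cashflows))
    # npv decreases as the rate grows, so binary search for the root
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if npv(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

print(irr([-100, 30, 30, 30, 30]))  # ≈ 0.077, i.e. about 7.7% per period
</code></pre>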
<h2 id="notes">Notes</h2>
<ul>
<li>Vanguard put together <a href="https://pressroom.vanguard.com/content/nonindexed/7.23.2012_Dollar-cost_Averaging.pdf">a presentation</a> getting to the same conclusion. <a href="https://www.betterment.com/resources/investment-strategy/should-i-dollar-cost-average-into-the-market/">Another article</a> from Betterment is also quite good.</li>
<li>All code is available <a href="https://gist.github.com/erikbern/a31f9bf7ea896b5d5c286ad13155e8c1">as a gist</a>.</li>
</ul>
Why organizations fail2016-04-18T00:00:00Zhttps://erikbern.com/2016/04/18/why-organizations-fail.html<p><img src="https://erikbern.com/assets/simpsons_enron_1.gif" alt=""></p>
<p>One of my favorite business hobbies is to reduce some nasty decision down to its absolute core objective, decide the most basic strategy, and then add more and more modifications as you have to confront the complexity of reality (yes I have very lame hobbies thanks I know).</p>
<p>Software engineers want to deliver features quickly without adding too much tech debt. But what's the point of delivering features? Business value. And what's the point of not adding too much tech debt? It's so that we don't add a bunch of crap that slows down future iteration speed, i.e. we don't want to slow down the future rate of delivering business value (modulo some kind of discounting factor). There is no intrinsic value of beautiful code and no intrinsic value of tech debt. I probably sound like a nihilist, but everything really just boils down to: what's the most efficient way to deliver business value?</p>
<p>So let's start with the assumption that it's everyone's job to deliver business value. Why can't we just make everyone's job to do this? Strip all titles and all management structure – just tell everyone to act in the company's best interest. My dream is that one day I'll have made myself useless. Everyone in the team just comes in every morning and asks themselves: <em>what is the highest ROI thing I can do today?</em> And they just do it. Nothing else is needed. Why can't that just work?</p>
<p>It's of course an incredibly naïve idea – but where does it fail, exactly? It turns out it fails in exactly two places. Nothing more, nothing less:</p>
<ol>
<li>Incentive problems (agents do not want to act in the organization's interests)</li>
<li>Bounded rationality problems (agents do not have the necessary information to do so)</li>
</ol>
<p>I can't stress enough that <em>there is no other reason</em>. I had this exact theory for many years but wasn't able to express it in the same eloquent way as above, which is something I copied verbatim from a fantastic paper I just encountered: <a href="https://dl.dropboxusercontent.com/u/2021568/GRPublishedJELFinal.pdf">Why Organizations Fail: Models and Cases</a>.</p>
<p>With the two bullets above our model is basically complete. The beauty of thinking about it this way is that it breaks down management recursively. As a manager the best way to get value out of other people is to:</p>
<ol>
<li>Make sure people's interests are aligned with the company's</li>
<li>Make sure everyone in the team has the necessary information they need</li>
</ol>
<p>Everything else follows. Anything other than 1. and 2. is useless to spend time on as a manager. For instance it explains why micromanaging is useless: all you need to do is give people the information they need to make the right decisions, and reward good behavior. Similarly the model also highlights why it's important to delegate projects with their full context: without enough information people will make the wrong decisions. And it points out that you should reward people for delivering business value: reward people for anything else and you align their interests with something other than the company's.</p>
<p>The paper above is pure game theory and I love its mathematical clarity. Of course it's just a model of reality and like any game theoretical model it assumes (incorrectly) that humans are 100% rational. But I think it's a good first order approximation of what to do and it's a good reality check.</p>
<p><img src="https://erikbern.com/assets/simpsons_enron_2.gif" alt=""></p>
NYC subway math2016-04-04T00:00:00Zhttps://erikbern.com/2016/04/04/nyc-subway-math.html<p>Apparently <a href="http://www.mta.info/">MTA</a> (the company running the NYC subway) has a <a href="http://datamine.mta.info/">real-time API</a>. My fascination for the subway takes on autistic proportions and so obviously I had to analyze some of the data. The documentation is somewhat terrible, but here's some relevant code for how to use the API:</p>
<div class="highlight"><pre style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4"><code class="language-python" data-lang="python"><span style="color:#f92672">from</span> google.transit <span style="color:#f92672">import</span> gtfs_realtime_pb2
<span style="color:#f92672">import</span> urllib
<span style="color:#f92672">import</span> os
<span style="color:#66d9ef">for</span> feed_id <span style="color:#f92672">in</span> [<span style="color:#ae81ff">1</span>, <span style="color:#ae81ff">2</span>, <span style="color:#ae81ff">11</span>]:
    feed <span style="color:#f92672">=</span> gtfs_realtime_pb2<span style="color:#f92672">.</span>FeedMessage()
    response <span style="color:#f92672">=</span> urllib<span style="color:#f92672">.</span>urlopen(<span style="color:#e6db74">'</span><span style="color:#e6db74">http://datamine.mta.info/mta_esi.php?key=</span><span style="color:#e6db74">%s</span><span style="color:#e6db74">&feed_id=</span><span style="color:#e6db74">%d</span><span style="color:#e6db74">'</span> <span style="color:#f92672">%</span> (os<span style="color:#f92672">.</span>environ[<span style="color:#e6db74">'</span><span style="color:#e6db74">MTA_KEY</span><span style="color:#e6db74">'</span>], feed_id))
    feed<span style="color:#f92672">.</span>ParseFromString(response<span style="color:#f92672">.</span>read())
    <span style="color:#66d9ef">print</span> feed
</code></pre></div><p>I started tracking all subway trains one day and completely forgot about it. Several weeks later I had a 3GB data dump full of all the arrivals for 1, 2, 3, 4, 5, 6, L, SI and GC (the latter two being the Staten Island railway and the Grand Central Shuttle).</p>
<h2 id="lets-do-some-cool-stuff-with-this-data">Let's do some cool stuff with this data!</h2>
<p>For instance, here are a bunch of subway trains for a while on the 1 line:</p>
<p><a href="/assets/1_trips.png"><img src="https://erikbern.com/assets/1_trips.png" alt="1 trips"></a></p>
<p>The reason I started looking at this data was to understand to what extent waiting for a subway is “sunk cost” vs. an investment. In particular, what is the optimal strategy if you're waiting for the subway? My intuition told me that there's a T such that the expected <em>additional time you have to wait</em> goes down as you approach T, but then goes up afterwards. Until T, every second you wait gets you closer to the next subway. After T, there's most likely some random issue with the subway and you should just give up.</p>
<p>Turns out there is such a thing. But let's start by just looking at a plot of subway delays. The distribution of time between two trains $$ P(t) $$ looks like this:</p>
<p><img src="https://erikbern.com/assets/time_between_arrivals.png" alt="delays"></p>
<p>This is a probability distribution made from the <a href="https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.distplot.html#seaborn.distplot">distplot</a> function in <a href="https://stanford.edu/~mwaskom/software/seaborn/index.html">Seaborn</a>. It's a histogram (with 1 minute bins) combined with a <a href="https://en.wikipedia.org/wiki/Kernel_density_estimation">kernel density estimation</a> of the probability distribution.</p>
<p>An interesting thing is that the distribution is <em>multimodal</em>, with the biggest peak around 5 minutes and another around 20 minutes. I suspect this reflects rush hour vs night traffic. There's also a peak just after 0, which I suspect is just what happens during rush hour traffic when subways end up clustering.</p>
<p>Note that this is <em>not</em> the distribution of waiting times, which is a bit different. If you assume that you are equally likely to arrive at any subway stop at any time of day, then the <em>waiting time until the next subway</em> looks like the distribution below. This represents a probability distribution where at any time of day, you pick a subway line, go to a random subway station, and wait for the next train.</p>
<p><img src="https://erikbern.com/assets/time_to_next_arrival.png" alt="delays"></p>
<p>This distribution is a bit more regular. The most likely time you have to wait (the <a href="https://en.wikipedia.org/wiki/Mode_(statistics)">mode</a>) is actually about 1 minute, although the mean and the median are much larger.</p>
<h2 id="erik-please-digress-and-talk-about-the-relationship-between-the-two-curves">Erik please digress and talk about the relationship between the two curves</h2>
<p>Complete side note, but I realized in general you can take the distribution of <em>time between events</em> $$ P(t) $$ and convert it to the distribution of <em>time to the next event</em> using the relation</p>
<p>$$ Q(t) = \frac{ \int_t^\infty P(s) ds }{ \int_0^\infty sP(s) ds } $$</p>
<p>My math is a bit rusty so please don't use this for heart surgery. But it seems to work – if you plug in a <a href="https://en.wikipedia.org/wiki/Dirac_delta_function">Dirac delta</a> $$ P(t) = \delta(t-d) $$ then you get the uniform distribution back: $$ Q(t) = 1/d, 0 \le t \le d$$. In the data above I just implemented it in a dumb way by sampling.</p>
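<p>For what it's worth, here's how I'd compute the relation numerically – a sketch using discrete one-minute bins, where the input is a vector of probabilities (the endpoint bin is slightly off, but close enough for a blog post):</p>
<pre><code class="language-python">import numpy as np

def waiting_time_distribution(p):
    # p[t]: probability that the gap between trains is t minutes
    # returns q[t]: density of your wait if you arrive at a random time
    survival = np.cumsum(p[::-1])[::-1]        # P(gap >= t)
    mean_gap = np.dot(np.arange(len(p)), p)    # E[gap]
    return survival / mean_gap

# the Dirac check from above: all gaps exactly 10 minutes
p = np.zeros(100)
p[10] = 1.0
print(waiting_time_distribution(p)[:12])       # ~uniform 0.1 on [0, 10], then 0
</code></pre>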
<h2 id="waiting-time-by-line">Waiting time by line</h2>
<p>Let's plot the average time to arrival by line. This is limited to the lines in the API. Let's switch to a <a href="https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.violinplot.html">violin plot</a> using Seaborn.</p>
<p><img src="https://erikbern.com/assets/time_to_arrival_by_line.png" alt="delays"></p>
<p>Interestingly, L stacks up pretty well against the other subway lines, despite its notorious delays (and websites such as <a href="http://www.istheltrainfucked.com/">is the L train fucked</a>). The median waiting time is the smallest out of all the lines, and even the extreme case compares favorably.</p>
<p>(Btw, the key data set for this chart is <a href="http://web.mta.info/developers/resources/line_colors.htm">MTA's official color scheme</a>. Did you know that the color of L is not a perfect gray but actually #A7A9AC – marginally more blue? Amazing)</p>
<h2 id="waiting-time-by-time-of-day">Waiting time by time of day</h2>
<p>Obviously time of day is an extremely important factor here so let's look at the waiting time by time of day. Each point in time gives us a probability distribution over waiting time, so let's plot some of the percentiles and how they change over the day!</p>
<p><img src="https://erikbern.com/assets/time_to_arrival_by_time_of_day.png" alt="delays"></p>
<p>The 50 percentile line (blue) describes the median time you have to wait based on the time of day. The 90 percentile line (yellow) describes how long you have to wait if you are unlucky and a 90% event happens. It depends on your <a href="https://en.wikipedia.org/wiki/Risk_aversion">risk averseness</a> which line you pick – if you <em>have to</em> make it to a flight you should probably pick the 90th percentile, but if it doesn't matter whether you are late, pick the 50th.</p>
<p>Not shockingly, the waiting times peak in the wee hours – in particular the 90th percentile shoots up around 4AM. The 7AM-7PM window is very stable, and then it shoots up again.</p>
<h2 id="waiting-for-subway-and-sunk-cost">Waiting for subway and sunk cost</h2>
<p>Let's say you wait for the subway for 10 minutes and it hasn't arrived yet. Should you give up? Probably not. But if you have waited for the subway for an hour, there's probably no point. <em>Up to a certain point</em> waiting for the subway is an investment in getting home sooner.</p>
<p>It also depends on your risk averseness again – if you need to make it to a flight, you might just give up and get a cab at some point. So given that you spent $$ t $$ minutes so far waiting for the subway, what's the <em>additional</em> time you're going to have to wait?</p>
<p><img src="https://erikbern.com/assets/time_to_arrival_percentiles.png" alt="delays"></p>
<p>There's a tricky bias here, because the times where you waited longer tends to skew towards nights. This would be a <a href="https://en.wikipedia.org/wiki/Confounding">confounding factor</a>. So I limited the data set to 7AM-7PM above.</p>
<p>The interesting conclusion is that <strong>after about five minutes, the longer you wait, the longer you will have to wait.</strong> If you waited for 15 min, the median <em>additional waiting time</em> is another 8 minutes. But 8 minutes later if the train still hasn't come, the median <em>additional waiting time is now another 12 minutes.</em></p>
<p>So when should you give up waiting? One way to think about it is how much time you think it's worth waiting. The time you already waited is “sunk cost” so it doesn't really matter. What matters is how much <em>additional time</em> you are willing to wait. Let's assume you want to optimize for a wait time that's less than 30 min in 90% of the cases. Then the max time you should wait is about <strong>11 minutes</strong> until giving up (this is at the point where the yellow line cuts the 30 min mark).</p>
<p>This reminds me a bit of project management. The longer a project has been going on, the longer the expected <em>additional time</em> to completion is. Whatever resources you spent are sunk cost, but what matters is the most likely estimate of project completion going forward. But of course, the more overdue a project is, the longer that estimate gets.</p>
<p>Of course, there's nothing “magic” about these kinds of distributions. There are certain probability distributions where waiting is an “investment” – the expected time until the next event goes down for every second you wait. There is exactly one type of probability distribution where waiting doesn't affect the time until the next event at all. This is the <a href="https://en.wikipedia.org/wiki/Exponential_distribution">exponential distribution</a> and the particular property is referred to as <a href="https://en.wikipedia.org/wiki/Memorylessness#Continuous_memorylessness">memorylessness</a>. Then, there are “fat-tailed” distributions where the expected time to the next event goes <em>up</em> for every second you wait. The NYC subway distribution exhibits all those behaviors in different parts of the curve.</p>
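<p>You can see the regimes in a couple of lines of numpy – a toy demonstration of the distributional property, not of the subway data (the Pareto parameters here are arbitrary):</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)

def mean_additional_wait(gaps, t):
    # average remaining wait, given the train hasn't come after t minutes
    remaining = gaps[gaps > t] - t
    return remaining.mean()

exponential = rng.exponential(scale=5.0, size=1000000)
fat_tailed = 2.0 * (rng.pareto(1.5, size=1000000) + 1)

for t in [0, 5, 10, 20]:
    print(t,
          mean_additional_wait(exponential, t),   # flat: memoryless
          mean_additional_wait(fat_tailed, t))    # grows the longer you wait
</code></pre>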
<p>All code is <a href="https://github.com/erikbern/mta">available here</a> if you are curious!</p>
<p>(addendum: this post got some traction – see <a href="https://www.reddit.com/r/nyc/comments/4ds036/nyc_subway_math/">Reddit thread</a> and <a href="https://news.ycombinator.com/item?id=11447535">Hacker News discussion</a>)</p>
Exploding offers are bullshit2016-03-16T00:00:00Zhttps://erikbern.com/2016/03/16/exploding-offers-are-bullshit.html<p><img src="https://erikbern.com/assets/time_bomb.gif" alt="Time bomb"></p>
<p>I do a lot of recruiting and have given maybe 50 offers in my career. Although many companies do, I <em>never</em> put a deadline on any of them. Unfortunately, I've often ended up competing with other companies who do, and I feel really bad that this usually tricks younger developers into signing offers. On numerous occasions, I've gotten an email halfway through the interview process</p>
<pre><code>Erik,
I'm very sorry, but I'm not going to move forward with the interview process.
Another company gave me an offer and they need a decision by Thursday.
Best regards,
XYZ
</code></pre><p>Every time, I have to explain to candidates that <em>exploding offers are bullshit</em>. I don't even know where to start:</p>
<ol>
<li>It's clearly just a bluff. Companies have <em>no</em> leverage in the situation. They would never lose a candidate who wants a few more days to think. At this point, they already spent thousands of dollars on interviews and sourcing, and there is no way they have any sort of leverage whatsoever. In a setting where you have one job and many applicants, an exploding offer might make sense, but this is not the Great Depression and the power balance works the other way around.</li>
<li>It's taking advantage of people who are the least confident about the recruiting process and are the least comfortable asking for more time. You could argue it's a form of price discrimination where companies end up paying <em>less</em> than they otherwise would for people who are less confident about themselves. Imagine that exploding offers were illegal – some companies would then have to pay more to get the people that they would otherwise get.</li>
<li>It creates a race to the bottom that no one benefits from. Other companies have to start using exploding offers too. If a candidate feels pressured into accepting an offer that they would not have accepted otherwise, that's real value that gets destroyed. The candidate loses because they end up accepting an inferior offer. For the companies it nets out zero or negative – company A gains a role that company B lost, at a slightly lower cost, but with a candidate that's slightly less motivated.</li>
<li>(More speculative) Shady behavior is rewarded. Companies who try to trick candidates are more likely to be shady in other ways – this is another reason how candidates lose out.</li>
</ol>
<p>This is corroborated by some experiments (although who knows, I have my doubts about any behavioral studies):</p>
<ul>
<li><em>“Across multiple studies, we find that a large portion of proposers issue exploding offers even though this results in substantially lower payoffs to themselves."</em> – from <a href="http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1934128">Exploding offers can blow up in more than one way</a></li>
<li><em>“At the market level of analysis, the results suggest that exploding offers lower the quality of matching outcomes."</em> – from <a href="http://www.sciencedirect.com/science/article/pii/074959789190031N">Power balance and the rationality of outcomes in matching markets</a></li>
</ul>
<p>That's why I never give exploding offers.</p>
<p>Every time I get an email like the one above, I tell the candidate that they should just hold off and ask for a few more days. There's only upside in having one more option on the table. Often it works out fine, sometimes it doesn't. Of course, in a lot of cases my company wasn't the top choice anyway, and the candidate ends up joining whatever company they want. My conversion funnel is as leaky as most companies'. But every time you remove one more choice, you remove the possibility that that choice was actually the best one.</p>
<p>I remember once more than ten years ago when I had never had a “real” job and a company had scheduled an interview. The only problem was that I realized it was on my last day of a trip. I ended up spending about $100 extra to get back one day earlier (and of course the interviewer was sick that day and cancelled on me last minute). I clearly misunderstood who really had the leverage here.</p>
<p><a href="http://www.nationallawjournal.com/id=1202751423003/Firms-Bucking-the-System-to-Recruit-Top-Students?mcode=0&curindex=0&curpage=ALL">Law firms recruiting</a> at law schools had the same problem that reated a race to the bottom. In the end they ended up self-regulating, realizing that an uncoordinated competition would benefit no one – certainly not the students, who would be pressured into accepting roles without considering all options, and for the companies, that would hire a long time before graduation where data was less reliable.</p>
<p><em>“There's also concern that students will focus too early — the summer after their first year — on where they might want to work once they graduate instead of exploring their options before getting locked in. NALP's voluntary recruiting guidelines allow firms to interview and extend summer-associate job offers early; however, firms must leave those offers open until 28 days after on-campus interviews begin at the student's law school. The rule is intended to prevent so-called exploding offers that expire before students have time to weigh all their options. Firms have largely adhered to the rule and have not pressured students to accept early offers before the traditional recruiting season begins, career-services administrators said."</em></p>
<p>Even though self-regulation worked out well for law firms, I doubt that it will ever work in technology. But junior developers should be aware that exploding offers are really just bullshit. Everyone would be better off if exploding offers disappeared. Please call the bluff when you see it.</p>
<p><img src="https://erikbern.com/assets/explosion.gif" alt="Explosion"></p>
<p>(Edit: See the <a href="https://news.ycombinator.com/item?id=11449274">Hacker News discussion</a>)</p>
Meta-blogging2016-03-12T00:00:00Zhttps://erikbern.com/2016/03/12/meta-blogging.html<p>(This is not a very relevant/useful post for regular readers – feel free to skip. I thought I would share it so people can find it on Google.)</p>
<p>My blog blew up twice in a week earlier this year when I landed on Hacker News. The first time I was asleep so I didn't notice that the site went down. The second time I did notice, and scrambled to reconfigure Apache & MySQL to handle the load.</p>
<p>I decided it was time to move off Wordpress.</p>
<p>This is something I had been thinking about for a long time – I hated the layout of my old site, especially the typography. I probably could have spent a bunch of time tweaking it, but it also feels philosophically wrong to run a LAMP stack in 2016 for something as simple as a blog.</p>
<p>After some research I chose Jekyll. I liked the idea of static content and it seemed very lightweight. To do this I had to</p>
<ol>
<li>Clone <a href="https://github.com/poole/poole">Poole</a></li>
<li>Install the <a href="https://wordpress.org/plugins/jekyll-exporter/">Wordpress to Jekyll exporter</a></li>
<li>Run this <a href="https://gist.github.com/erikbern/5e81bf7d68e9deab9c55">simple postprocessing script</a> I had to build for some cleanup</li>
<li>Fix up a couple of things with Poole (in particular add <code>height: auto;</code> to the <code>img</code> tag in <code>_sass/_base.scss</code> – everything else was superficial cosmetic stuff, mostly to get the nice serif fonts that seem to be the shit recently).</li>
</ol>
<p>Additionally I reconfigured some stuff around related pages (<code>brew install gsl</code>, <code>gem install rb-gsl</code>, set <code>lsi: true</code> in <code>_config.yml</code>). LSI is a pretty crappy algorithm for NLP but unfortunately there's nothing better out there.</p>
<p>Since the entire state is stored in the file system, I also use Github to host a repo with the blog in it. I'm still hosting the blog on a Digitalocean droplet, but might kill it and just switch to <a href="https://pages.github.com/">Github Pages</a> at some point.</p>
<p>Page load time went down drastically after deploying it:</p>
<p><img src="https://erikbern.com/assets/pingdom_stats.png" alt="Pingdom stats"></p>
<p>This was all good and I was running happily on my new blog until I checked <a href="https://feedly.com">Feedly</a>. First it claimed that I was no longer subscribed to my own blog; then, when I re-added it, my subscriber count had dropped from 300 to 9. WTF! I realized it was because my old RSS URLs were broken. I ended up setting up some 301 redirects using Apache. First of all, in <code>/feed/.htaccess</code>:</p>
<pre><code>RedirectMatch 301 /feed/(.*) http://erikbern.com/atom.xml
</code></pre><p>This will redirect <code>/feed/</code> to <code>atom.xml</code>. Then I moved <code>index.php</code> to <code>wordpress.php</code> and set this up in <code>/.htaccess</code>:</p>
<pre><code>RewriteEngine On
# Redirect any feed query string (/?feed=xyz) to the new atom.xml;
# the trailing ? strips the query string from the target URL
RewriteCond %{QUERY_STRING} feed=
RewriteRule (.*) /atom.xml? [R=301,L]
# Redirect old Wordpress permalinks (/?p=123) to wordpress.php
RewriteCond %{REQUEST_URI} ^/$
RewriteCond %{QUERY_STRING} p=
RewriteRule (.*) /wordpress.php [R=301,L]
</code></pre><p>The first pattern will redirect <code>/?feed=xyz</code> to <code>atom.xml</code>. The last one is for old links to my blog – it will do a double 301, from <code>/?p=841</code> to <code>/wordpress.php?p=841</code> to <code>/2014/11/29/deep-learning-for-chess/</code>.</p>
<p>I'm now up to 230 readers on Feedly – not sure wtf is going on but seems better than 9. Please resubscribe if you lost it!</p>
Iterate or die2016-03-02T00:00:00Zhttps://erikbern.com/2016/03/02/iterate-or-die.html<p>Here's a conclusion I've come to after building consumer products for many years: <strong>the speed at which a company innovates is limited by its iteration speed.</strong></p>
<p>I don't even mean throughput here, I just mean the cycle time. Invoking <a href="https://en.wikipedia.org/wiki/Little%27s_law">Little's law</a>, this is also related to the <em>total inventory of features not yet deployed</em>.</p>
<p>Consider a hypothetical scenario: clone two teams of identical engineers and split them into two groups, A and B. Actually, clone B another nine times so that it's 10x larger. Give them the exact same tools and the same problems to solve, but team B can only deploy code every 3 months whereas team A deploys multiple times per day. I bet team A will outperform B in terms of delivering business value, even though it is 10 times smaller.</p>
<p>I don't have a proof of this, it's really just a grumpy coder speculating. The way I visualize it is as applying <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">stochastic gradient descent</a> to a function that keeps on changing. If you've used SGD, you know that sometimes a lower learning rate can help you converge faster, especially if you can evaluate the function a lot more often.</p>
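<p>Here's a toy simulation of that intuition – all numbers made up, and this is just an illustration, not real SGD: two trackers chase a moving target, one taking small steps often and one taking big steps rarely:</p>
<pre><code>import numpy as np

# Toy illustration: chase a moving optimum with frequent small updates
# vs. infrequent large ones. All parameters here are made up.
def track(step_every, lr, horizon=1000):
    x, total_err = 0.0, 0.0
    for t in range(horizon):
        target = np.sin(t / 100.0)       # the "loss landscape" keeps moving
        if t % step_every == 0:
            x += lr * (target - x)       # take a (noisy-gradient-ish) step
        total_err += abs(target - x)
    return total_err / horizon

print(track(step_every=1, lr=0.1))   # deploys every day
print(track(step_every=90, lr=0.9))  # deploys "every 3 months"
</code></pre>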
<p><img src="https://erikbern.com/assets/2016/02/rosenbrock_animated.gif" alt="image"></p>
<p><strong>Iterate or die</strong></p>
<p>A large set of companies do not have iteration in their DNA. I'm particularly excited about this because I'm trying to compete in the banking space.</p>
<p>Let's consider how a big bank builds a new product. First, they conduct customer research. Then they come up with a set of requirements. Then they draw up a big-ass Gantt chart and put 100s of engineers on it. 1-2 years after the project is started, they launch it, together with a huge marketing spend. During the year or more that has passed, <em>nothing has been learned that wasn't known from the start</em>. Every assumption made at the outset will take a year to validate.</p>
<p>It's like looking at a map and then trying to run through a forest for ten hours. Inevitably you are going to be very far from your target.</p>
<p>A modern day software shop is not a manufacturing plant. It probably resembles applied research a lot more. You can't plan research. You need to embrace uncertainty and iterate based on new information all the time. Forget about the six month plan you have.</p>
<p>Sometimes it's good to stop and look at the map. Your total speed is lower, but you can adjust your directions all the time.</p>
<p><img src="https://erikbern.com/assets/2015/10/giphy.gif" alt="image"></p>
<p>Apparently the army realized this and has invented a bunch of methods like <a href="http://www.army.gov.au/Our-future/LWSC/Our-publications/~/media/Files/Our%20future/LWSC%20Publications/AAJ/2013Autumn/Ferry_F3EA%20Targeting%20Paradigm.pdf">F3EA</a>: <em>Contemporary warfare challenges the practice of targeting and the philosophy of its purpose, promoting a <strong>shift from targeting for effect to targeting to learn</strong>.</em> It's interesting that the US army has learned this the hard way: a small group of insurgents changing tactics very quickly is a serious threat to the 10x larger American war machine.</p>
<p><strong>Building modern software</strong></p>
<p>What kind of things do you learn? There are a lot of schools of thought here. Highly anecdotal qualitative feedback can be great directional evidence. Once you get closer to the optimum, you need to shift to A/B tests. Some people preach that the product and tech team should do customer support. Other people think the CEO's job is to email the first 1000 customers. At my company, we spend a lot of time looking at user sessions in <a href="https://www.fullstory.com/">FullStory</a> and it's been a fantastic tool for spotting issues. I also talk to customers over our chat feature. Sometimes quantitative methods work, sometimes qualitative.</p>
<p>The other day a customer told me he had run into a bug, and within an hour I had deployed a fix. This is literally a 5000x faster iteration cycle than a big bank's (5000 hours is a bit more than 6 months). It's the equivalent of walking through the forest, constantly checking the surroundings and making sure they match the map. We might walk a bit slower (because our tech team is 10x smaller than most banks'), but we learn new things at a much higher rate.</p>
<p><img src="https://erikbern.com/assets/2016/02/right_image_projectManagement.jpg" alt="image"><em>Seriously if I had more time I would start a Tumblr featuring people drawing horizontally flipped content on glass</em></p>
<p>I think most failed IT projects can be traced back to this. It reminds me of <a href="http://yourstory.com/2014/09/webvan-e-tailer/">Webvan's epic failure</a>. Instead of iterating quickly and learning, the company embarked on a “scale at any cost” strategy. Or the <a href="http://computersweden.idg.se/2.2683/1.547944/haveriet-inifran--sa-gick-pust-fran-succ%E9-till-fiasko">new system for the Swedish police</a> (in Swedish) that had to be shut down after wasting $1B of taxpayer money. Or the epic <a href="http://www.computerweekly.com/opinion/Six-reasons-why-the-NHS-National-Programme-for-IT-failed">IT project to reform UK health records</a> that spent about £12B before being shut down.</p>
<p>What I'm saying isn't exactly <a href="https://en.wikipedia.org/wiki/Inter-universal_Teichm%C3%BCller_theory">inter-universal Teichmüller theory</a>, but interestingly even modern companies struggle with this. Spotify generally iterates pretty well but has had its fair share of big-bang projects that would go on for a year without learning anything from real users. The product team would keep hypothesizing and developing an ever more elaborate model for the bet.</p>
<p><strong>But Erik. What about Agile?</strong></p>
<p>A process where you deploy code at best every 3 weeks is really mini-waterfall. I like the basic tenets of Agile, but I find that it encourages cargo-cult behavior and distracts from the focus on the end result – creating a tight feedback loop and building a learning machine. Many companies use Agile as a way to deliver software often, but not as a way to learn quickly. If you're not constantly monitoring usage and adjusting your path, then you are not learning.</p>
<p><strong>Why is my team moving slowly?</strong></p>
<p>One antipattern is to compensate for lack of productivity with hiring. This is likely to get throughput up marginally, but it slows the cycle time down by a lot. I have seen companies with broken organizational models, like having an “ML theory” team in NYC and an “ML implementation” team in another state. Good luck iterating quickly and learning fast. An organization has to be designed to minimize information latency.</p>
<p>Another antipattern is to attribute project failure to lack of planning, and add even more planning. This 2005 article, <a href="http://spectrum.ieee.org/computing/software/why-software-fails">Why software fails</a>, is a great example of the wrong conclusion. The authors should obviously have asked me – projects rarely fail because of bad planning; they fail because planning was futile in the first place. There's only so much you can prepare for.</p>
<p><img src="https://erikbern.com/assets/2016/02/malfunction.gif" alt="image"></p>
<p><strong>Erik? Why so many strong feelings?</strong></p>
<p>I think a lot about this because it's the only way my company can win. We're a bank and I'm pretty sure we are the only bank in the world that does continuous deployment, deploying code 20-30 times every day.</p>
<p>The word “paradigm” has an annoying ring to it, but I actually think there is a new one for how to build a consumer product. Tech companies spent the last 10 years figuring out how to build a learning machine. If your company is based on scale advantages, that's great, but unless you learn how to iterate quickly, you're going to be dead.</p>
My issue with GPU-accelerated deep learning2016-02-03T00:00:00Zhttps://erikbern.com/2016/02/03/my-issue-with-gpu-accelerated-deep-learning.html<p>I've been spending several hundred bucks renting GPU instances on AWS over the last year. The speedup from a GPU is awesome and hard to deny. GPUs have taken over the field. Maybe following in the footsteps of Bitcoin mining, there's some research on <a href="https://gigaom.com/2015/02/23/microsoft-is-building-fast-low-power-neural-networks-with-fpgas/">using FPGAs</a> (I know very little about this).</p>
<p>I don't think it's a coincidence that GPUs, which are built for graphics, turn out to be great for image classification using convolutional neural networks. When you are dealing with pixel data packed into 2D arrays, it's possible to parallelize all operations very efficiently.</p>
<p>My issue is that the complexity of each minibatch is $$ \mathcal{O}(n) $$ where $$ n $$ is the number of parameters. The larger the models you are dealing with, the bigger this issue becomes.</p>
<p>Word2vec uses a clever technique called hierarchical softmax to achieve $$ \mathcal{O}(\log n) $$ (<a href="http://www-personal.umich.edu/~ronxin/pdf/w2vexp.pdf">more details here</a>). I have no idea how to implement this on a GPU and I suspect it's impossible. Here's where the CPU shows its strength – traversing a logarithmic data structure takes a lot of branching and can't be expressed as a batch operation.</p>
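<p>To see why this favors the CPU, here's a rough sketch of a hierarchical softmax probability lookup (the names <code>word_path</code> and <code>node_vectors</code> are made up for illustration):</p>
<pre><code>import numpy as np

# Rough sketch of hierarchical softmax: the probability of a word is a
# product of sigmoids along its root-to-leaf path in a binary tree, so
# one lookup touches only O(log n) node vectors - but the branchy walk
# is hard to express as one big batched GPU operation.
def word_probability(hidden, word_path, node_vectors):
    # word_path: [(node_id, direction), ...] where direction is 0 or 1
    p = 1.0
    for node_id, direction in word_path:
        sigma = 1.0 / (1.0 + np.exp(-np.dot(hidden, node_vectors[node_id])))
        p *= sigma if direction == 1 else 1.0 - sigma
    return p
</code></pre>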
<p>Logarithmic data structures happen to be a field I'm pretty excited about, particularly for vector models and multi-class prediction problems. I'm the author of <a href="https://github.com/spotify/annoy">Annoy</a>, which is a library for high-dimensional nearest neighbor queries, so it's something I've spent some time thinking about.</p>
<p>For collaborative filtering and natural language processing, GPU architectures are highly constraining. I suspect that once you hit a billion parameters or so, more specialized networks that use logarithmic data structures will outperform for NLP and CF. The speedup from the brute-force GPU approach will be offset by the smarter data structures that a CPU can handle. I haven't seen any research on this, but it seems to me like a huge opportunity. In particular, I would love to see hybrid architectures that can use a GPU for the “dense” networks and a CPU for the “sparse” networks.</p>
Some more font links2016-01-25T00:00:00Zhttps://erikbern.com/2016/01/25/some-more-font-links.html<p>My blog post about fonts generated lots of traffic – it landed on Hacker News, took down my site while I was sleeping, and then obviously vanished from HN before I woke up. But it also got retweeted by a ton of people.</p>
<p>This clearly constitutes another proof of how effective animated gifs are. There's some stuff out there on the internet that I think is about 10x cooler than my blog post:</p>
<p><a href="http://blog.otoro.net/2015/12/28/recurrent-net-dreams-up-fake-chinese-characters-in-vector-format-with-tensorflow/">Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow</a> – blog post from a few weeks ago, doing something similar but modeling the strokes of Chinese characters as vector paths.</p>
<p><img src="https://erikbern.com/assets/2016/01/CXTebdZUQAEMe01-300x188.png" alt="image"></p>
<p><a href="http://vecg.cs.ucl.ac.uk/Projects/projects_fonts/projects_fonts.html">Learning a Manifold of Fonts</a> – something I found the other day</p>
<p><img src="https://erikbern.com/assets/2016/01/Screen-Shot-2016-01-24-at-11.40.13-PM-300x189.png" alt="image"></p>
<p><a href="http://www.cs.toronto.edu/~graves/handwriting.html">Recurrent neural network handwriting generation demo</a> – very cool RNN approach, similar to the Chinese character experiment above.</p>
<p><img src="https://erikbern.com/assets/2016/01/handwritten-300x30.png" alt="image"></p>
<p>Some more <a href="http://www.genekogan.com/works/a-book-from-the-sky.html">Chinese characters generated using a neural network</a> – in this case it's a DCGAN (deep convolutional generative adversarial network) which is probably a better architecture than what I was using.</p>
<p><img src="https://erikbern.com/assets/2016/01/tumblr_o03o2hiSRm1qav3uso3_r1_500.gif" alt="image"></p>
<p><a href="http://iotic.com/averia/">Avería</a> – several people sent me this link about generating an “average” font</p>
<p><img src="https://erikbern.com/assets/2016/01/spec04-300x168.png" alt="image"></p>
Analyzing 50k fonts using deep neural networks2016-01-21T00:00:00Zhttps://erikbern.com/2016/01/21/analyzing-50k-fonts-using-deep-neural-networks.html<p>For some reason I decided one night I wanted to get a bunch of fonts. A lot of them. An hour later I had a bunch of <a href="http://scrapy.org/">scrapy</a> scripts pulling down fonts and a few days later I had more than 50k fonts on my computer.</p>
<p>I then decided to convert them to bitmaps. It turns out this is a bit trickier than it might seem. You need to crop in such a way that the characters of a font are vertically aligned with each other, and scale everything to fit the bitmap. I started with 512 * 512 bitmaps of all characters. For every font you find the max y and min y of the bounding box, and the same thing for each individual letter. After some more number juggling I was able to scale all characters down to 64 * 64.</p>
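<p>For the curious, here's a minimal sketch of the rendering step using Pillow – not my actual pipeline, and it skips the cross-character vertical alignment described above:</p>
<pre><code>from PIL import Image, ImageDraw, ImageFont

# Render one character of one font into a big bitmap, then downscale.
# Sizes and offsets are made up for illustration.
def render_char(font_path, char, big=512, small=64):
    font = ImageFont.truetype(font_path, size=int(big * 0.75))
    img = Image.new('L', (big, big), color=0)
    ImageDraw.Draw(img).text((big // 8, big // 8), char, fill=255, font=font)
    return img.resize((small, small), Image.LANCZOS)
</code></pre>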
<p>The result is a tensor of size 56443 * 62 * 64 * 64. Exercise for the reader: where does the number 62 come from? I stored it as a tiny little (13GB) HDF5 file that you can download here: <a href="https://drive.google.com/file/d/0B0GtwTQ6IF9AU3NOdzFzUWZ0aDQ/view?usp=sharing&resourcekey=0-hJ4N66Y4_LeYPpnuLSvugw">fonts.hdf5</a>.</p>
<p>If you take the average of all fonts, here's what you get:</p>
<p><img src="https://erikbern.com/assets/2016/01/avg.png" alt="image"></p>
<p>Hopefully by now it should be clear where the number 62 came from.</p>
<p>The median is a lot less blurry than the average:</p>
<p><img src="https://erikbern.com/assets/2016/01/median.png" alt="image"></p>
<p>Both mean and median are well-formed and legible! However individual fonts are all over the place:</p>
<p><img src="https://erikbern.com/assets/2016/01/alphabet.png" alt="image"></p>
<p>I guess I practically begged for it, stealing fonts from various sketchy places all over the web. In particular, most of the fonts don't even have lower case versions of the letters. A minority of fonts are missing certain characters and will just output rectangles instead. And look at the ridiculous Power Ranger figure for the lower case “c”!</p>
<p><strong>Training a neural network</strong></p>
<p>Now, let's train a neural network that generates characters! Specifically what I wanted to do is to create a “font vector” that is a vector in latent space that “defines” a certain font. That way we embed all fonts in a space where similar fonts have similar vectors.</p>
<p>I built a simple neural network using Lasagne/Theano – <a href="https://github.com/erikbern/deep-fonts/blob/master/model.py">check out the code here</a>. It took an insane amount of time to converge, probably because there are so many parameters and so much data. After <em>weeks</em> of running, the model converged to something that looks decent.</p>
<p>Some notes on the model (see the sketch after this list):</p>
<ul>
<li>4 fully connected hidden layers of width 1024.</li>
<li>The final layer has 4096 units (64 * 64) with a sigmoid nonlinearity so that the output is between 0 (white) and 1 (black).</li>
<li>L1 loss between predictions and target. This works much better than L2, which generates very “gray” images – as you can see qualitatively in the pictures above.</li>
<li>Pretty strong L2 regularization of all parameters.</li>
<li>Leaky rectified linear units (alpha=0.01) as the nonlinearity on each layer.</li>
<li>The input layer is 102D – each example is a 40D font vector concatenated with a 62D binary one-hot vector indicating which character it is.</li>
<li>Learning rate is 1.0, which is shockingly high – it seemed to work well. Decrease by 3x when no improvement on the 10% test set is achieved in an epoch.</li>
<li>Minibatch size is 512 – seemed like larger minibatches gave faster convergence for some weird reason.</li>
<li>No dropout, didn't seem to help. I did add some moderate Gaussian noise (of sigma 0.03) to the font vector and qualitatively it seemed to help a bit.</li>
<li>Very simple data augmentation by blurring the input randomly with sigma sampled from [0, 1]. My theory was that this would help fitting characters that have thin lines.</li>
</ul>
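<p>Put together, the architecture is roughly this – a hedged sketch in Lasagne, not the exact code from the repo (see the link below for the real thing):</p>
<pre><code>from lasagne.layers import InputLayer, DenseLayer
from lasagne.nonlinearities import LeakyRectify, sigmoid

def build_model(font_dim=40, n_chars=62, depth=4, width=1024):
    # Input: 40D font vector concatenated with a 62D one-hot character vector
    network = InputLayer(shape=(None, font_dim + n_chars))
    for _ in range(depth):
        network = DenseLayer(network, num_units=width,
                             nonlinearity=LeakyRectify(0.01))
    # Output: 64 * 64 = 4096 pixels; sigmoid keeps each pixel in [0, 1]
    return DenseLayer(network, num_units=64 * 64, nonlinearity=sigmoid)
</code></pre>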
<p>All of the code is available in the <a href="https://github.com/erikbern/deep-fonts">erikbern/deep-fonts</a> repo on Github.</p>
<p>After convergence, we end up having a nice 40D embedding of all 50k fonts. Looks like it ends up being roughly a multivariate normal – here's the distribution of each of the 40 dimensions:</p>
<p><img src="https://erikbern.com/assets/2016/01/pairplot_cropped.png" alt="image"></p>
<p><strong>Playing around with the model</strong></p>
<p>To start with, let's recreate real font characters with characters generated from the network. Let's plot the real character together with the model outputs. For each pair below, the real character is on the left, the model output on the right.</p>
<p><img src="https://erikbern.com/assets/2016/01/real_vs_pred.png" alt="image"></p>
<p>These are all characters drawn from the <em>test set</em>, so the network hasn't seen any of them during training. All we're telling the network is (a) what font it is and (b) what character it is. The model has seen other characters of the same font during training, so what it does is generalize from those training examples to the unseen test examples.</p>
<p>The network does a decent job on most of the characters, but gives up on some of the more difficult ones. For instance, characters with thin black lines are very hard for the model to predict, since if it renders the line just a few pixels to the side, that's twice the loss of just rendering whitespace.</p>
<p>We can also interpolate between different fonts in continuous space. Since every font is a vector, we can create arbitrary font vectors and generate output from them. Let's sample four fonts and put them in the corners of a square, then interpolate between them!</p>
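<p>The interpolation itself is just bilinear mixing of the four corner vectors – something like this sketch, where <code>corners</code> is a hypothetical 2 * 2 * 40 array of font vectors:</p>
<pre><code>import numpy as np

# Bilinearly interpolate between four font vectors placed at the
# corners of a grid; feed each resulting vector through the network.
def grid_interpolate(corners, steps=10):
    out = np.zeros((steps, steps, corners.shape[-1]))
    for i, v in enumerate(np.linspace(0, 1, steps)):
        for j, u in enumerate(np.linspace(0, 1, steps)):
            top = (1 - u) * corners[0, 0] + u * corners[0, 1]
            bottom = (1 - u) * corners[1, 0] + u * corners[1, 1]
            out[i, j] = (1 - v) * top + v * bottom
    return out
</code></pre>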
<p><img src="https://erikbern.com/assets/2016/01/grid2.png" alt="image"></p>
<p>Certain characters have multiple forms that we can interpolate between, e.g. lowercase <em>g</em>:</p>
<p><img src="https://erikbern.com/assets/2016/01/grid3.png" alt="image"></p>
<p>We can also pick a font vector and generate new fonts from random perturbations:</p>
<p><img src="https://erikbern.com/assets/2016/01/noisy_font_2.gif" alt="image"></p>
<p>(btw internet god – please forgive me for wasting bandwidth on all the animated gifs in this blog post!)</p>
<p>We can also generate completely new fonts. If we model the distribution of font vectors as a multivariate normal, we can sample random vectors from it and look at the fonts they generate. I'm interpolating between a few of those vectors in the picture below:</p>
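<p>In code, that sampling step could look something like this sketch (assuming <code>font_vectors</code> is the n_fonts * 40 matrix of learned embeddings):</p>
<pre><code>import numpy as np

# Fit a multivariate normal to the learned font vectors and sample
# brand new font vectors from it. font_vectors: shape (n_fonts, 40).
def sample_new_fonts(font_vectors, n=10):
    mean = font_vectors.mean(axis=0)
    cov = np.cov(font_vectors, rowvar=False)
    return np.random.multivariate_normal(mean, cov, size=n)
</code></pre>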
<p><img src="https://erikbern.com/assets/animated_font.gif" alt="image"></p>
<p>An interesting thing here is that the model has learned that many fonts use upper case characters for the lower case range – the network interpolates between Q and q seamlessly. Here's an example of the network interpolating very slowly between two fonts where this is the main difference:</p>
<p><img src="https://erikbern.com/assets/font_pair.gif" alt="image"></p>
<p>Another cool thing we can do since we have all fonts in a continuous space is to run t-SNE on them and embed all fonts into the 2D plane. Here's a small excerpt of such an embedding:</p>
<p><img src="https://erikbern.com/assets/tsne_cropped.png" alt="image"></p>
<p><strong>Final remarks</strong></p>
<p>There are many other fun things you can do. It's clear that there's some room for improvement here. In particular, if I had more time, I would definitely explore generative adversarial models, which seem better at generating pictures. A few other things should be relatively easy to implement, such as batch normalization and parametric leaky rectifications. And finally, the network architecture itself could probably benefit from using deconvolutions instead of fully connected layers.</p>
<p>Feel free to <a href="https://drive.google.com/file/d/0B0GtwTQ6IF9AU3NOdzFzUWZ0aDQ/view?usp=sharing&resourcekey=0-hJ4N66Y4_LeYPpnuLSvugw">download the data</a> and play around with it if you're interested!</p>
I believe in the 10x engineer, but...2016-01-08T00:00:00Zhttps://erikbern.com/2016/01/08/i-believe-in-the-10x-engineer-but.html<ul>
<li>The easiest way to be a 10x engineer is to make 10 other engineers 2x more efficient. Someone can be a 10x engineer if they do nothing for 364 days and then convince the team to switch to a 2x more productive programming language.</li>
<li>A motivated 10x engineer in one team could be a demotivated 0.5x engineer in another team (and vice versa).</li>
<li>An average 1x engineer could easily become a 5x engineer if surrounded by 10x engineers. Engagement and work ethic are contagious.</li>
<li>The cynical reason why 10x engineers aren't paid 10x more salary is that there is no way for the new employer to know. There is no “10x badge”.</li>
<li>…but also, a 10x engineer can go to a new company and become a 1x engineer because of bad focus / bad engagement / tech stack mismatch.</li>
<li>So unfortunately there's less economic rationale for companies to pay 10x salaries to 10x engineers (contrary to what <a href="http://www.businessinsider.com/google-policy-to-pay-unfairly-2015-4">Google</a> or <a href="http://www.slideshare.net/reed2001/culture-1798664/98-Takes_Great_Judgment_Goal_is">Netflix</a> say).</li>
<li>There's no such thing as a 10x engineer spending time on something that never ends up delivering business value. If something doesn't deliver business value, it's 0x.</li>
<li>If you build something that the average engineer <em>would not have been able to build, no matter how much time they had</em>, that can make you 100x or 1000x, or ∞x. <a href="http://slatestarcodex.com/2015/12/27/things-that-are-not-superintelligences/">Quoting Scott Alexander</a>: <em>There is no number of ordinary eight-year-olds who, when organized into a team, will become smart enough to beat a grandmaster in chess.</em></li>
<li>Most of the 10x factor is most likely explained by team and company factors (process, tech stack, etc) and applies to everyone in the team/company. Intra-team variation is thus much smaller than 10x (even controlling for the fact that companies tend to attract people of equal caliber). Nature vs nurture…</li>
<li>I've never met the legendary “10x jerk”. Anecdotally the outperforming engineers are generally nice and humble.</li>
<li>Don't get hung up on the exact numbers here, they're just for illustration purposes. E.g. someone introduced a <a href="http://pythonsweetness.tumblr.com/post/64740079543/how-to-lose-172222-a-second-for-45-minutes">bug in the trading system</a> of Knight Capital that made them lose $465M in 30 minutes. Did that make them a -1,000,000x engineer? (And btw, it had more to do with company culture.) The numbers aren't meant to be taken literally.</li>
</ul>
<p><img src="https://erikbern.com/assets/2016/01/business_meeting_3-1024x440.jpg" alt="image"></p>
<p><em>I got a unique photo opportunity of this small group of 10x engineers until they suddenly vanished. All I managed to hear was “Merkle trees” and “Kappa architecture”. What are the meanings of those expressions? We will never know.</em></p>
Books I read in 20152016-01-01T00:00:00Zhttps://erikbern.com/2016/01/01/books-i-read-in-2015.html<p>Early last year when I left Spotify I decided to do more reading. I was planning to read at least one book per week and in particular I wanted to brush up on management, economics, and technology. 2015 was also a year of exclusively non-fiction, which is a pretty drastic shift, since I grew up reading fiction compulsively for 20 years.</p>
<p>My goal for 2015 failed – I ended up reading about 40 books last year. Here is a small selection of the best ones. Not all of them were published in 2015.</p>
<p><img src="https://erikbern.com/assets/2016/01/Zero-to-One-Notes-on-Startups-or-How-to-Build-the-Future-Peter-Thiel-Small-Business-205x300.jpg" alt="image"></p>
<p><img src="https://erikbern.com/assets/2016/01/Hard-Thing-cover-199x300.jpg" alt="image"></p>
<p><strong>Ben Horowitz – The Hard Thing about Hard Things</strong></p>
<p><strong>Peter Thiel – Zero to One</strong></p>
<p>Both books are highly entertaining. Ben Horowitz has a very narrative style and takes the reader through the ups and downs of Opsware. Peter Thiel tries to provoke and goes on long rants about how to position yourself. He stays true to his character – a nihilist libertarian who knows something about the future mere mortals do not. I like to think of the book as a set of mental models or factors that are all very important, but maybe not as crucial as Thiel wants them to be. Ben, in contrast, stays close to earth and discusses his approach to management. It's more like an HBR case study, except a lot more interesting and fun.</p>
<p><img src="https://erikbern.com/assets/2016/01/SUPERFORECASTING-194x300.jpg" alt="image"></p>
<p><img src="https://erikbern.com/assets/2016/01/future-babble-196x300.jpg" alt="image"></p>
<p><strong>Philip Tetlock and Dan Gardner – Superforecasting</strong></p>
<p><strong>Dan Gardner – Future Babble</strong></p>
<p>I have a soft spot for cognitive biases, predictions, and making things quantitative. Both books build heavily on the classic Expert Political Judgment by Philip Tetlock himself. Future Babble is by Dan Gardner, whereas Superforecasting is by the two of them together.</p>
<p>The books are about predictions but apply to any decision making with limited information. The fascinating TL;DR is that most people can become better decision makers by incorporating multiple factors and viewpoints, constantly calibrating their decisions, and staying away from single “big idea” explanations. If I had to pick one of these, read Superforecasting.</p>
<p><strong><img src="https://erikbern.com/assets/2016/01/lean-in_custom-575cb1cc7e2e0e704abfffbc2a0ce498dafad0f8-s6-c30-182x300.jpg" alt="image">Sheryl Sandberg – Lean in</strong></p>
<p>You could read this book from several perspectives, each useful. The intended perspective is that of a woman who wants to make it in business. I'm not a woman, but there are at least two other perspectives that make the book a worthwhile read. Most of it is really a great guide on how to make it in business in America. Coming from another culture (growing up with a humble Scandinavian mindset), I found that it generalized quite well beyond the original target audience. The other reason to read it as a man is that it describes the double standards and uphill struggles of women in business. Everyone's biased, the only difference is whether they deny it or admit it. If you're managing people, I think it's your duty to be aware of your biases and of how other people's biases affect women. Whichever perspective you choose, this book is an easy read, and worth spending a few hours on.</p>
<p><strong><img src="https://erikbern.com/assets/2016/01/high-output-mgmt-195x300.jpg" alt="image"> <img src="https://erikbern.com/assets/2016/01/team-of-teams-199x300.jpg" alt="image">Andy Grove – High Output Management</strong></p>
<p><strong>Stanley McChrystal – Team of Teams</strong></p>
<p>These books aren't very similar, but in one way they represent opposite ends of how to manage efficiently. Stanley McChrystal spends a lot of time discussing the <a href="https://en.wikipedia.org/wiki/Frederick_Winslow_Taylor#Work">Taylor approach</a> to management and how it breaks down in wars, especially when fighting insurgents. Say what you want about the military, but it represents the ultimate case of management under uncertainty. Interestingly, Superforecasting borrows a chapter out of this book, spending a lot of time on Taylor and modern warfare, starting with <a href="https://en.wikipedia.org/wiki/Helmuth_von_Moltke_the_Elder#Moltke.27s_theory_of_war">von Moltke</a>.</p>
<p>I think of management as a series of steps building on each other, ranging from maximum certainty (we know what we want to build and how to get there) to total uncertainty (we don't know what to build and have no idea how to find out). On that ladder, High Output Management is a great introduction to the lower levels, and Team of Teams a series of essays on what breaks down at the higher levels. Neither book is exhaustive, but both are definitely worth reading if you are a manager or interested in management.</p>
<p><strong>Jessica Livingston – Founders at Work</strong></p>
<p>Jessica Livingston (of YCombinator) interviews a set of founders. It's a bit dated (2007) but still a great collection of stories about how a bunch of companies got built.</p>
<p class="p1">
<strong>David Ogilvy – Ogilvy on advertising</strong>
</p>
<p class="p1">
This books roughly comes off as Peter Thiel talking about advertising. Highly opinionated with lots of concrete advice for how to do advertising well. The book is severely dated – It's from 1983, but really talks more of the Mad Men era of 1960's. I still find it somewhat relevant, but more importantly it's an easy fun read.
</p>More MCMC – Analyzing a small dataset with 1-5 ratings2015-12-05T00:00:00Zhttps://erikbern.com/2015/12/05/more-mcmc-analyzing-a-small-dataset-with-1-5-ratings.html<p>I've been obsessed with how to iterate quickly based on small scale feedback lately. One awesome website I encountered is <a href="https://usabilityhub.com">Usability Hub</a> which lets you run 5 second tests. Users see your site for 5 seconds and you can ask them free-form questions afterwards. The nice thing is you don't even have to build the site – just upload a static png/jpg and collect data.</p>
<p>We are redesigning our website, so I ran a bunch of experiments where I asked users how trustworthy they think the website looks, on a scale from 1 to 5. Let's say you do that for several variants. How do you estimate the uncertainty of the average score?</p>
<p>You could compute the mean and the variance and use those to estimate the uncertainty. But let's pause for a second. We know this distribution is <em>not</em> a normal distribution, because it's constrained to integers between 1 and 5.</p>
<p>Instead, let's use a <a href="https://en.wikipedia.org/wiki/Multinomial_distribution">multinomial distribution</a> for the distribution over the five possible ratings. Furthermore, let's say the prior is a <a href="https://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet distribution</a>. Now we can compute the weighted average score using the posterior of that distribution. Much cooler!</p>
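<p>Since the Dirichlet is conjugate to the multinomial, the posterior is available in closed form and you can sample it directly – a minimal sketch with made-up counts (not my actual data):</p>
<pre><code>import numpy as np

# Dirichlet-multinomial posterior for 1-5 ratings, with made-up counts.
counts = np.array([3, 7, 18, 15, 7])  # how many 1s, 2s, ..., 5s we observed
alpha = np.ones(5)                    # flat Dirichlet prior
samples = np.random.dirichlet(alpha + counts, size=100000)
avg_score = samples.dot(np.arange(1, 6))  # posterior of the average rating
print(np.percentile(avg_score, [2.5, 50, 97.5]))
</code></pre>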
<p>I also discovered <a href="https://pymc-devs.github.io/pymc3/">PyMC3</a> and <a href="http://stanford.edu/~mwaskom/software/seaborn/">Seaborn</a>, which turn out to be two pretty cool tools. Relevant code:</p>
<div class="oembed-gist">
<noscript>
View the code on <a href="https://gist.github.com/erikbern/6a79c41384b217ddc097">Gist</a>.
</noscript>
</div>
<p>Output:</p>
<p><img src="https://erikbern.com/assets/2015/12/ratings_mcmc.png" alt="image">Beautiful stuff! But how does it compare to the normal approximation? I'm glad you asked! Here are both on the same plot:</p>
<p><img src="https://erikbern.com/assets/2015/12/ratings_mcmc1.png" alt="image"></p>
<p>You can see that there <em>is</em> a substantial difference. This is caused by two things: (a) our sample is not drawn from a normal distribution (b) the sample size is small.</p>
<p>For large sample sizes, the average of non-normal distributions converges to have a normal distribution (this is the <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Central limit theorem</a>), but our sample size is very small (only 50 ratings in each set).</p>
<p>Dealing with these small datasets reminds me of the discussion between Karl Pearson and William Sealy Gosset (aka <em>Student</em>). Gosset, working for the Guinness brewery in Dublin, developed a lot of modern statistics working with beer samples, in particular small batch sizes. When Gosset talked to Pearson about this, Pearson remarked that <em>only naughty brewers deal in small samples!</em> The t-test (due to Gosset) is a great example of something coming out of the necessity of working with small sample sizes. For larger samples, normal approximations work out very well.</p>
<p>(Side note: I found <a href="http://andrewgelman.com/2009/04/29/conjugate_prior/">a discussion by Andrew Gelman</a> suggesting modeling this as a softmax instead – another option worth trying if you're interested.)</p>
There is no magic trick2015-11-28T00:00:00Zhttps://erikbern.com/2015/11/28/there-is-no-magic-trick.html<p>(Warning: super speculative, feel free to ignore)</p>
<p>As Yogi Berra said, “It's tough to make predictions, especially about the future”. Unfortunately predicting is hard, and unsurprisingly people look for the Magic Trick™ that can resolve all the uncertainty. Whether it's recruiting, investing, system design, finding your soulmate, or anything else, there's always an alleged shortcut.</p>
<p>In the famous book <a href="http://www.amazon.com/Expert-Political-Judgment-Good-Know/dp/0691128715">Expert Political Judgment</a>, a huge number of forecasts about the future are tracked over a long time. The conclusion is: people suck at forecasting. The only characteristic that seems somewhat predictive is what the author calls being a <em>hedgehog</em> vs being a <em>fox</em>. Hedgehogs (bad) are people who have one mental model they apply to everything. Foxes (good) apply a huge number of different models and combine them to arrive at a conclusion. (Confusingly, hedgehogs do not <em>hedge</em> their bets.)</p>
<p>This is quite a profound conclusion that goes beyond prediction. In fact I see it in almost any hard decision I have to make.</p>
<p>Let's think about recruiting, for instance. So many people claim to have found the ultimate interview question. It ranges from “how old were you when you started coding?” to “what are your open source contributions” to “please spend ten hours on this take home assignment”.</p>
<p>After probably 500 tech interviews I've realized one thing: there is no trick. Empirically the correlation between who I <em>thought would be good</em> and who actually <em>turned out to be good</em> is very small. The <a href="https://en.wikipedia.org/wiki/Overconfidence_effect">overconfidence effect</a> definitely is a real thing and I've become more skeptical about my abilities. The one thing I've learned is: <em>try to collect as many independent metrics</em> as you can. The other day I actually came across an <a href="http://lab4.psico.unimib.it/nettuno/forum/free_download/articolo_114.pdf">old paper</a> saying something similar.</p>
<p>The same thing applies to investing. You might follow Peter Thiel's advice and <a href="http://www.businessinsider.com/peter-thiel-ama-2014-9">never invest</a> in companies where they wear suits. Or you might have an extremely strong conviction that self-driving cars will take over so you go out and short GM's stock. But remember you're up against professional portfolio managers who stare at their screens for 14 hours per day. Did they miss something you are seeing? No. They know that what you are seeing is a small fraction of their valuation of a company.</p>
<p><img src="https://erikbern.com/assets/2015/11/giphy-1.gif" alt="image"></p>
<p>What makes it even worse is that both investing and recruiting are activities that take place in a <em>market</em>. You are fighting with n other actors to find mispricings and arbitrage opportunities. Just like buying stocks based on a single model is bad, recruiting based on a single model will give you bad candidates. What happens is basically <a href="https://en.wikipedia.org/wiki/Adverse_selection">adverse selection</a>, and it will cause you to overpay for underperformance.</p>
<p>See below for a very silly market model where two companies, X (blue) and Y (red), bid on employees but X knows <em>something</em> that Y doesn't. I model this by assuming each employee breaks down into a set of random factors, and X knows a few more of those factors than Y does. Click the graph to see the code.</p>
<p><img src="https://erikbern.com/assets/2015/11/there_is_no_trick4.png" alt="image"></p>
<p>It's better to be further to the left (lower cost) and higher up (more value) for a company.</p>
<p>It's a bit hard to see from the model, but what happens is that the total surplus (<em>value – cost</em>) for company X is some positive number while for Y it's approximately zero. This holds true any time company X knows <em>just a bit more</em> about employees than company Y. Warren Buffett once said: “If you've been playing poker for half an hour and you still don't know who the patsy is, you're the patsy.”</p>
<p>Speaking of models, one of the most useful insights from machine learning is how much value you get from combining many models. This has been the central dogma in the machine learning community for a long time, whether it's <a href="http://blog.kaggle.com/category/dojo/">Kaggle</a>, or the <a href="http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf">Netflix Prize</a>, or <a href="http://www.slideshare.net/xamat/10-more-lessons-learned-from-building-machine-learning-systems">industry applications</a>. All models are wrong, but <a href="https://en.wikiquote.org/wiki/George_E._P._Box">some are useful</a> – combining a bunch of those models will always outperform.</p>
<p>I think the best meta-model for how to think about any complex system, whether it's recruiting or investing or anything else, is something like <a href="https://en.wikipedia.org/wiki/Boosting_(machine_learning)">boosting</a>. You start with nothing, then you find the best model that explains what you see. Then you increase the weight of the misclassified examples and fit another model (a <em>weak learner</em>, in boosting lingo). And so on. Eventually you have built up a set of simple models that you can combine for a final prediction. (As a side note, I think the reason humans can do this so well is that they can use priors very efficiently.)</p>
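<p>For concreteness, the loop described above is roughly AdaBoost. Here's a sketch, with <code>fit_weak_learner</code> left abstract and labels assumed to be in {-1, +1}:</p>
<pre><code>import numpy as np

# AdaBoost-style sketch of the meta-model above: fit a weak learner,
# upweight the examples it got wrong, repeat, then combine everything.
def boost(X, y, fit_weak_learner, rounds=100):
    n = len(y)
    w = np.ones(n) / n
    models = []
    for _ in range(rounds):
        model = fit_weak_learner(X, y, sample_weight=w)
        pred = model.predict(X)                 # predictions in {-1, +1}
        err = w[pred != y].sum()
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))
        w *= np.exp(-alpha * y * pred)          # upweight the mistakes
        w /= w.sum()
        models.append((alpha, model))
    return models                               # predict with sign(sum)
</code></pre>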
<p>Peter Thiel's advice is a set of models that are wrong, but still marginally useful, so why not include them? Does a company have network effect? Sure, marginally helpful. Does a candidate <a href="http://www.businessinsider.com/peter-thiel-ama-2014-9">have an MBA</a>? Etc. These all sound like weak learners to me. Using just one of them is pretty bad. Add up 100 of them and you have a pretty good prediction.</p>
Installing TensorFlow on AWS2015-11-12T00:00:00Zhttps://erikbern.com/2015/11/12/installing-tensorflow-on-aws.html<p>Curious about Google's newly released <a href="https://tensorflow.org">TensorFlow</a>? I don't have a beefy GPU machine, so I spent some time getting it to run on EC2. The <a href="https://gist.github.com/erikbern/78ba519b97b440e10640">steps on how to reproduce it</a> are pretty brutal and I wouldn't recommend going through them unless you want to waste five hours of your life.</p>
<p>Instead, I recommend just <a href="https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-cf5028a5">getting the AMI that I built</a> (ami-cf5028a5). Choose g2.2xlarge and you should have a box with TensorFlow running in a minute or two! Note that it's only available in us-east-1 (Virginia) so far.</p>
<p>If you haven't used AWS, here's a <a href="https://www.kaggle.com/c/facial-keypoints-detection/details/deep-learning-tutorial">tutorial</a> on how to set up an instance from an AMI. I usually use spot instances since they are <em>much</em> cheaper, but they come with some risk of getting killed unexpectedly (interestingly, that seems rarer now – I wonder if it's because the Bitcoin price is so much lower).</p>
<p>There are some known issues with TensorFlow on AWS. In particular, I wasn't able to get better performance from g2.8xlarge compared to g2.2xlarge, which sucks, since one of the cool features of TensorFlow is that it should distribute work across GPUs. See <a href="https://github.com/tensorflow/tensorflow/issues/25">this thread</a> for some more info. Looking forward to seeing these issues get resolved.</p>
<p><strong>What is TensorFlow?</strong></p>
<p>It seems like there's a lot of misunderstanding about TensorFlow. It's not some crazy flow-based graphical tool for building neural nets. It's kind of boring, really. It's just a marginally better version of <a href="http://www.deeplearning.net/software/theano/">Theano</a> with much faster compilation times and the capability to distribute work over multiple GPUs/machines. Theano completely blew my mind when I first discovered it. Its approach was super innovative, but it's pretty rough around the edges, and I think in open source the pioneers die with arrows in their backs.</p>
<p>I expect TensorFlow (or maybe <a href="http://rll.berkeley.edu/cgt/">CGT</a> or something else) to grow more popular. But in practice I don't think people will use any of those straight up for machine learning – higher level libraries like <a href="https://keras.io">Keras</a> will be the preferred way to do most deep learning tasks.</p>
Looking for smart people2015-11-04T00:00:00Zhttps://erikbern.com/2015/11/04/looking-for-smart-people.html<p>I haven't mentioned what I'm currently up to. Earlier this year I left Spotify to join a small startup called <a href="https://better.com/">Better</a>. We're going after one of the biggest industries in the world that also turns out to be completely broken. The mortgage industry might not be the #1 industry you pictured yourself in, but it's an enormous opportunity to fix a series of real consumer problems and join a company that I predict will be huge.</p>
<p><img src="https://erikbern.com/assets/2015/10/Wl6dJCp.gif" alt="image"></p>
<p>We're 6 engineers at the moment, mostly focused on backend stuff, but with a bit of frontend and machine learning as well. We have also raised a pretty substantial amount of money. At this point we're just a few weeks from launching, so I will definitely keep you posted. If you are interested in hearing more, drop me an email at <a href="mailto:erik@better.com">erik@better.com</a></p>
MCMC for marketing data2015-10-31T00:00:00Zhttps://erikbern.com/2015/10/31/mcmc-for-marketing-data.html<p>The other day I was looking at marketing spend broken down by channel and wanted to compute some simple uncertainty estimates. I have data like this:</p>
<table>
<tr>
<th></th>
<th>Total spend</th>
<th>Transactions</th>
</tr>
<tr>
<th>Channel A</th>
<td>2292.04</td>
<td>9</td>
</tr>
<tr>
<th>Channel B</th>
<td>1276.85</td>
<td>2</td>
</tr>
<tr>
<th>Channel C</th>
<td>139.59</td>
<td>3</td>
</tr>
<tr>
<th>Channel D</th>
<td>954.98</td>
<td>5</td>
</tr>
</table>
<p>Of course, it's easy to compute the cost per transaction, but how do you produce uncertainty estimates? Turns out to be somewhat nontrivial. I don't even think it's possible to do a <a href="https://en.wikipedia.org/wiki/Student%27s_t-test">t-test</a>, which is kind of interesting in itself.</p>
<p>Let's make some assumptions about the model:</p>
<ol>
<li>The cost per transaction is an unknown with some prior (I just picked uniform)</li>
<li>The expected number of transactions is the total budget divided by the (unknown) cost per transaction</li>
<li>The actual observed number of transactions is a <a href="https://en.wikipedia.org/wiki/Poisson_distribution">Poisson</a> of the expected number of transactions</li>
</ol>
<p>I always wanted to try using <a href="https://pymc-devs.github.io/pymc/">pymc</a> and now I had an excuse. See <a href="https://gist.github.com/erikbern/65660dc8aa5df99b5f84">gist</a> below:</p>
<div class="oembed-gist">
<noscript>
View the code on <a href="https://gist.github.com/erikbern/65660dc8aa5df99b5f84">Gist</a>.
</noscript>
</div>
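<p>In case the embedded gist doesn't render, here's a minimal sketch of the model above for a single channel, in PyMC2-style syntax (variable names and the prior's upper bound are my own choices):</p>
<pre><code>import pymc

# Channel A: $2292.04 spent, 9 transactions observed.
cost = pymc.Uniform('cost_per_transaction', lower=1.0, upper=5000.0)

@pymc.deterministic
def expected_transactions(cost=cost, spend=2292.04):
    # Assumption 2: expected transactions = budget / cost per transaction
    return spend / cost

# Assumption 3: observed transactions are Poisson distributed
obs = pymc.Poisson('transactions', mu=expected_transactions,
                   value=9, observed=True)

mcmc = pymc.MCMC([cost, expected_transactions, obs])
mcmc.sample(iter=20000, burn=5000)
print(mcmc.trace('cost_per_transaction')[:].mean())
</code></pre>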
<p>The result, in the form of an animated GIF (unfortunately, animated gifs were never widely accepted as a homework format back in school):</p>
<p><img src="https://erikbern.com/assets/2015/10/marketing_mc2.gif" alt="image"></p>
<p>You even get a useless graph for free!</p>
<p><img src="https://erikbern.com/assets/2015/10/graph.png" alt="image"></p>
<p>Of course, we could have computed this exactly, but I know myself and I'm very unlikely to get the expressions right without some serious effort. The conjugate prior of a Poisson is a <a href="https://en.wikipedia.org/wiki/Gamma_distribution">Gamma distribution</a> and we have to account for the parameterization of the cost per conversion as the budget divided by the total conversions, which will be another factor. How fun is that? I don't have access to any windows to write on, so unfortunately not so fun.</p>
<p><img src="https://erikbern.com/assets/2015/10/fc96a0834fad9bb68b00fc864f444e.jpg" alt="image"></p>
<p><em>From A Beautiful Mind</em></p>
<p>Anyway – this particular example might not have been the most useful example of using PyMC, but I do quite like the idea of it. Especially applied to conversion analyses, since it translates directly into a generative model. I will definitely use it for some further funnel analysis – in particular when the number of data points is very small and the model is very complex.</p>
Interview with a Data Scientist: Erik Bernhardsson2015-10-28T00:00:00Zhttps://erikbern.com/2015/10/28/interview-with-a-data-scientist-erik-bernhardsson.html<p><em><a href="https://peadarcoyle.wordpress.com/2015/10/03/interview-with-a-data-scientist-erik-bernhardsson/">I was featured</a> in Peadar Coyle's <a href="https://peadarcoyle.wordpress.com">interview series</a> interviewing various “data scientists” – which is kind of arguable since (a) all the other ppl in that series are much cooler than me (b) I'm not really a data scientist. Anyway, reposting the full interview:</em></p>
<p>As part of my interviews with Data Scientists I recently caught up with Erik Bernhardsson who is famous in the world of ‘Big Data’ for his open source contributions, his leading of teams at Spotify, and his various talks at various conferences.</p>
<p><strong>1. What project have you worked on do you wish you could go back to, and do better?</strong></p>
<p>Like… everything I ever built. But I think that’s part of the learning experience. Especially working with real users, you never know what’s going to happen. There’s no clear problem formulation, no clear loss function, lots of various data sets to use. Of course you’re going to waste too much time on something that turns out to be nothing. But research is that way. Learning stuff is what matters, and kind of by definition you have to do stupid shit before you’ve learned it. Sorry for a super unclear answer :)</p>
<p>The main thing I did wrong for many years was that I built all this cool stuff but never really turned it into prototypes that other people could play around with. So I learned something very useful about communication and promoting your ideas.</p>
<p><strong>2. What advice do you have to younger analytics professionals and in particular PhD students in the Sciences?</strong></p>
<p>Write a ton of code. Don’t watch TV :)</p>
<p>I really think showcasing cool stuff on Github and helping out other projects is a great way to learn and also to demonstrate market validation of your code.</p>
<p>Seriously, I think everyone can kick ass at almost anything as long as you spend a ridiculous amount of time on it. As long as you’re motivated by something, use that motivation by focusing on it 80% of the time you’re awake.</p>
<p>I think people generally get motivated by coming up with various proxies for success. So be very careful about choosing the right proxies. I think people in academia often validate themselves in terms of things people in the industry don’t care about, things that don’t necessarily correlate with a successful career. It’s easy to fall down a rabbit hole and become extremely good at, say, deep learning (or anything), but at a company that means you’re just some expert who will have a hard time getting impact beyond your field. Looking back on my own situation, I should have spent a lot more time figuring out how to get other people excited about my ideas instead of perfecting ML algorithms (maybe similar to the last question).</p>
<p><strong>3. What do you wish you knew earlier about being a data scientist?</strong></p>
<p>I don’t consider myself a data scientist so not sure :)</p>
<p>There’s a lot of definitions floating around about what a data scientist does. I have had this theory for a long time but just ran into <a href="https://medium.com/@rchang/my-two-year-journey-as-a-data-scientist-at-twitter-f0c13298aee6">a blog post</a> the other day. I think it summarizes my own impression pretty well. There’s two camps, one is the “business insights” side, one is the “production ML engineer” side. I managed teams at Spotify on both sides. It’s very different.</p>
<p>If you want to understand the business and generate actionable insights, then in my experience you need pretty much no knowledge of statistics and machine learning. It seems like people think with ML you can generate these super interesting insights about a business but in my experience it’s very rare. Sometimes we had people coming in writing a master’s thesis about churn prediction and you can get a really high AUC but it’s almost impossible to use that model for anything. So it really just boils down to doing lots of highly informed A/B tests. And above all, having deep empathy for user behavior. What I mean is you really need to understand how your users think in order to generate hypotheses to test.</p>
<p>For the other camp, in my experience understanding backend development is super important. I’ve seen companies where there’s a “ML research team” and a “implementation team” and there’s a “throw it over the fence” attitude, but it doesn’t work. Iteration cycles get 100x larger and incentives just get misaligned. So I think for anyone who wants to build cool ML algos, they should also learn backend and data engineering.</p>
<p><strong>4. How do you respond when you hear the phrase ‘big data’?</strong></p>
<p>Love it. Seriously, there’s this weird anti-trend of people bashing big data. I throw up every time I see another tweet like “You can get a machine with 1TB of RAM for $xyz. You don’t have big data”. I almost definitely had big data at Spotify. We trained models with 10B parameters on 10TB data sets all the time. There are a lot of those problems in the industry for sure. Unfortunately sampling doesn’t always work.</p>
<p>The other thing I think those people get wrong is the production aspect of it. Things like Hadoop force your computation into fungible units, which means you don’t have to worry about computers breaking down. It might be 10x slower than if you had specialized hardware, but that’s fine because you can have 100 teams running 10000 daily jobs and things rarely crash – especially if you use Luigi :) But I’m sure there’s a fair amount of snake-oil Hadoop consultants who convince innocent teams they need it.</p>
<p>The other part of “big data” is that it’s at the far right of the hype cycle. Have you been to a Hadoop conference? It’s full of people in oversized suits talking about compliance now. At some point we’ll see deep learning or flux architecture or whatever going down the same route.</p>
<p><strong>5. What is the most exciting thing about your field?</strong></p>
<p>Boring answer, but I do think the progress in deep learning has been extremely exciting. It seems like every week there are new cool applications.</p>
<p>I think even more useful is how tools and platforms are maturing. A few years ago every company wrote their own dashboards, A/B test infrastructure, log synchronization, workflow management, etc. It’s great that there’s more open source projects and that more useful tools are emerging.</p>
<p><strong>6. How do you go about framing a data problem – in particular, how do you avoid spending too long, how do you manage expectations etc. How do you know what is good enough?</strong></p>
<p>Ideally you can iterate on it with real users and see what the impact is. If not, you need to introduce some proxy metrics. That’s a whole art form in itself.</p>
<p>It’s good enough when the opportunity cost outweighs the benefit :) I.e., the marginal return of time invested is lower than for something else. I think it’s good to keep a backlog full of hundreds of ideas so that you can prioritize based on expected ROI at any time. I don’t know if that’s a helpful answer, but prioritization is probably the hardest problem to solve and it really just boils down to having some rules of thumb.</p>
<p><em>How Erik describes himself: I like to work with smart people and deliver great software. After 5+ years at Spotify, I just left for a new exciting startup in NYC where I am leading the engineering team.</em></p>
<p><em>At Spotify, I built up and led the team responsible for music recommendations and machine learning. We designed and built many large scale machine learning algorithms that we used to power the recommendation features: the radio feature, the “Discover” page, “Related Artists”, and much more. I also authored Luigi, a workflow manager in Python with 3,000+ stars on Github – used by Foursquare, Quora, Stripe, Asana, etc.</em></p>
<p><em>When I was younger I participated in lots of programming competitions. My team was five times Nordic champions in programming (2003-2010) and I have an IOI gold medal (2003).</em></p>
Nearest neighbors and vector models – epilogue – curse of dimensionality2015-10-20T00:00:00Zhttps://erikbern.com/2015/10/20/nearest-neighbors-and-vector-models-epilogue-curse-of-dimensionality.html<p>This is another post based on my talk at <a href="http://www.meetup.com/NYC-Machine-Learning/events/225265016/">NYC Machine Learning</a>. The previous two parts covered most of the interesting material, but there are still some topics left to discuss. To go back and read the meaty stuff, check out</p>
<ul>
<li>Part 1: <a href="/2015/09/24/nearest-neighbor-methods-vector-models-part-1/">What are vector models useful for?</a></li>
<li>Part 2: <a href="/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces/">How to search in high dimensional spaces – algorithms and data structures</a></li>
</ul>
<p>You should also check out <a href="http://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup">the slides</a> and <a href="https://www.youtube.com/watch?v=QkCCyLW0ehU">the video</a> if you're interested. Anyway, let's talk about the curse of dimensionality today.</p>
<p><img src="https://erikbern.com/assets/2015/10/curse-of-dimensionality.png" alt="image"><em>This pic was obviously worth spending 20 minutes on</em></p>
<p> </p>
<p><strong>Curse of dimensionality</strong> refers to a set of phenomena that arise when you are dealing with items in high dimensional spaces – in particular, what happens to distances and neighborhoods – such that finding the nearest neighbors gets tricky.</p>
<p>Consider a map of the world. Most countries have a handful of neighboring countries. New York is also pretty close to Philadelphia but far from Beijing – distances vary a lot.</p>
<p>What happens when we go to higher dimensions is that <em>everything starts being close to everything.</em> All cities end up having almost the same distance to each other and all countries have borders to all other countries (Trump would have a lot of walls to build). This is highly nonintuitive (as is anything with more than 3 dimensions) but let's try to quantify this.</p>
<p>Let's look at how distances behave as we go to higher dimensions. Let's sample 10,000 points from a normal distribution, then pick a random point in the distribution and compute the distance to the furthest and closest point:</p>
<p><img src="https://erikbern.com/assets/2015/10/knn_avg_dist_synt.png" alt="image"></p>
<p>(For code, check out <a href="https://github.com/erikbern/ann-presentation/blob/master/knn_avg_dist.py">knn_avg_dist.py</a> on Github)</p>
<p>As the number of dimensions increases, the distances to the <em>closest</em> and the <em>furthest</em> point become almost the same. <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">Wikipedia's article</a> is actually quite enlightening and states that as we go to higher dimensions, we have the relationship:</p>
<p>$$ \huge \lim_{d \to \infty} E\left(\frac{\text{dist}_{\max} (d) - \text{dist}_{\min} (d)}{\text{dist}_{\min} (d)}\right) = 0 $$</p>
<p> </p>
<p>For example, if we are in a high dimensional version of New York, then the nearest city is 1,000 miles away and the furthest city is 1,001 miles away. The ratio above is then 0.001. Let's get back to this ratio shortly.</p>
<p>This weird behavior makes nearest neighbors in high dimensional spaces tricky. It's still an <a href="https://rjlipton.wordpress.com/2009/06/18/high-dimensional-search-and-the-nn-problem/">open problem</a> whether <em>exact</em> k-NN is solvable in polynomial time.</p>
<p><strong>Saving the day</strong></p>
<p>The above relation applies to distributions with little structure – in particular, the example I generated was just data points from a normal distribution. In real world data sets we usually have a <em>lot</em> of structure in our data. Consider all the cities and towns of the world. This is a set of points in 3D space, but all cities lie on a 2D sphere, so the point set will actually behave more like a 2D one.</p>
<p>The same thing happens to word embeddings or any other set of vectors. Even if we're dealing with thousands of dimensions, the underlying structure is really much lower dimensional. This is exactly why dimensionality reduction works so well!</p>
<p>Let's actually compute the quantity mentioned above and map out some real datasets:</p>
<p><img src="https://erikbern.com/assets/2015/10/knn-distances.png" alt="image"></p>
<p>(Again for the code, check out <a href="https://github.com/erikbern/ann-presentation/blob/master/knn_avg_dist.py">knn_avg_dist.py</a> on Github)</p>
<p>The graph above shows something pretty interesting: some of these “real” high dimensional data sets have a ratio similar to that of a normal distribution in far fewer dimensions.</p>
<p>Look at the Freebase vectors (which you can <a href="https://code.google.com/p/word2vec/">download here</a>) for instance – they are 1000D, but in terms of the ratio they behave like a normal distribution in 16D. This holds true for a whole set of different vectors. The 784-dimensional <a href="http://yann.lecun.com/exdb/mnist/">MNIST digits data set</a> behaves like an 8D normal distribution. The 128-dimensional embedding of food pictures shows the same thing.</p>
<p>I think this is why approximate nearest neighbor methods work so well up to thousands of dimensions. The key thing is that the algorithms need to learn the distribution from the data. I am generally bearish on <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">LSH</a> for this reason.</p>
<p>Enough about approximate nearest neighbors for a while! Hope you liked this series of posts!</p>
Nearest neighbors and vector models – part 2 – algorithms and data structures2015-10-01T00:00:00Zhttps://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html<p><em>This is a blog post rewritten from a presentation at <a href="http://www.meetup.com/NYC-Machine-Learning/events/225265016/">NYC Machine Learning</a> on Sep 17. It covers a library called <a href="https://github.com/spotify/annoy">Annoy</a> that I have built that helps you do nearest neighbor queries in high dimensional spaces. In the <a href="/2015/09/24/nearest-neighbor-methods-vector-models-part-1/">first part</a>, I went through some examples of why vector models are useful. In the second part I will be explaining the data structures and algorithms that Annoy uses to do approximate nearest neighbor queries.</em></p>
<p>Let's start by going back to our point set. The goal is to find nearest neighbors in this space. Again, I am showing a 2 dimensional point set because computer screens are 2D, but in reality most vector models have much higher dimensionality.</p>
<p><img src="https://erikbern.com/assets/2015/09/scatter-1024x793.png" alt="image"></p>
<p>Our goal is to build a data structure that lets us find the nearest points to any query point in sublinear time.</p>
<p>We are going to build a tree that lets us do queries in $$ \mathcal{O}(\log n) $$ . This is how Annoy works. In fact, it's a binary tree where each node is a random split. Let's start by splitting the space once:</p>
<p><img src="https://erikbern.com/assets/2015/09/tree-1-1024x793.png" alt="image"></p>
<p>Annoy does this by picking two points randomly and then splitting by the hyperplane equidistant from those two points. The two points are indicated by the gray line and the hyperplane is the thick black line.</p>
<p>Let's keep splitting each subspace recursively!</p>
<p><img src="https://erikbern.com/assets/2015/09/tree-2-1024x793.png" alt="image"></p>
<p>A very tiny binary tree is starting to take shape:</p>
<p><img src="https://erikbern.com/assets/2015/09/tree-2-graphviz1-300x203.png" alt="image"></p>
<p>We keep splitting again:</p>
<p><img src="https://erikbern.com/assets/2015/09/tree-3-1024x793.png" alt="image"></p>
<p>… and so on. We keep doing this until there's at most K items left in each node. At that point it looks something like this (for K=10):</p>
<p><img src="https://erikbern.com/assets/2015/09/tree-full-K-1024x793.png" alt="image"></p>
<p>With the corresponding binary tree:</p>
<p><img src="https://erikbern.com/assets/2015/09/tree-full-K-graphviz1-1024x404.png" alt="image"></p>
<p> </p>
<p>Nice! We end up with a binary tree that partitions the space. The nice thing is that <em>points that are close to each other in the space are more likely to be close to each other in the tree.</em> In other words, if two points are close to each other in the space, it's unlikely that any hyperplane will cut them apart.</p>
<p>To search for any point in this space, we can traverse the binary tree from the root. Every intermediate node (the small squares in the tree above) defines a hyperplane, so we can figure out what side of the hyperplane we need to go on and that defines if we go down to the left or right child node. Searching for a point can be done in logarithmic time since that is the height of the tree.</p>
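<p>In code, the descent could look something like this – a simplified sketch where the <code>Node</code> layout is made up for illustration, not Annoy's actual memory layout:</p>
<pre><code>import numpy as np

class Node:
    def __init__(self, normal=None, offset=0.0, left=None, right=None, items=None):
        # Internal nodes store a hyperplane (normal . x + offset = 0),
        # leaf nodes store the items that ended up in that region
        self.normal, self.offset = normal, offset
        self.left, self.right = left, right
        self.items = items

def find_leaf(node, query):
    # Walk down from the root, picking a side of each hyperplane
    while node.items is None:
        margin = np.dot(node.normal, query) + node.offset
        node = node.left if margin < 0 else node.right
    return node.items
</code></pre>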
<p>Let's search for the point denoted by the red X in the plot below:</p>
<p><img src="https://erikbern.com/assets/2015/09/heap-pos-1024x793.png" alt="image"></p>
<p> </p>
<p>The path down the binary tree looks like this:</p>
<p><img src="https://erikbern.com/assets/2015/09/heap-pos-graphviz-1024x404.png" alt="image"></p>
<p>We end up with 7 nearest neighbors. Very cool, but this is not great, for at least two reasons:</p>
<ol>
<li>What if we want more than 7 neighbors?</li>
<li>Some of the nearest neighbors are actually outside of this leaf polygon</li>
</ol>
<p><strong>Trick 1 – use a priority queue</strong></p>
<p>The trick we're going to use is to go down on <em>both sides of a split</em> if we are “close enough” (which I will quantify in a second). So instead of just going down one path of the binary tree, we will go down a few more:</p>
<p><img src="https://erikbern.com/assets/2015/09/heap-1024x793.png" alt="image"></p>
<p>With the corresponding binary tree:</p>
<p><img src="https://erikbern.com/assets/2015/09/heap-graphviz-1024x404.png" alt="image"></p>
<p>We can configure the threshold of how far we are willing to go into the “wrong” side of the split. If the threshold is 0, then we will always go on the “correct” side of the split. However, if we set the threshold to 0.5, we get the search path above.</p>
<p>The trick here is that we can actually use a <em>priority queue</em> to explore nodes sorted by the max distance into the “wrong” side. The nice part is that we can search with increasingly large thresholds, starting from 0.</p>
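<p>Here's a sketch of that idea, reusing the <code>Node</code> layout from the sketch above (again, a simplification – see annoylib.h for the real thing):</p>
<pre><code>import heapq
import itertools
import numpy as np

def get_candidates(root, query, search_k):
    # Each entry is keyed by the smallest margin seen along the path from
    # the root; heapq is a min-heap, so priorities are negated
    tiebreak = itertools.count()  # avoids comparing Node objects on ties
    pq = [(-np.inf, next(tiebreak), root)]
    candidates = []
    while pq and len(candidates) < search_k:
        neg_priority, _, node = heapq.heappop(pq)
        priority = -neg_priority
        if node.items is not None:  # a leaf: collect its points
            candidates.extend(node.items)
        else:
            margin = np.dot(node.normal, query) + node.offset
            heapq.heappush(pq, (-min(priority, +margin), next(tiebreak), node.right))
            heapq.heappush(pq, (-min(priority, -margin), next(tiebreak), node.left))
    return candidates
</code></pre>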
<p><strong>Trick 2 – build a forest of trees</strong></p>
<p>The second trick we are going to use is to construct <em>many trees</em>, aka a <em>forest</em>. Each tree is constructed using a random set of splits. We are going to search down all those trees at the same time:</p>
<p><img src="https://erikbern.com/assets/2015/09/animated.gif" alt="image"></p>
<p>We can search all trees at the same time using one single priority queue. This has the additional benefit that the search will focus on the trees that are “best” for each query – the ones where the splits are the <em>furthest away from the query point</em>.</p>
<p>Every tree contains all points so when we search many trees we will find some points in multiple trees. If we look at the union of the leaf nodes we get a pretty good neighborhood:</p>
<p><img src="https://erikbern.com/assets/2015/09/candidates.png" alt="image"></p>
<p>At this point we have narrowed it down to a small set of points. Notice that so far we have not even computed the distance to a single point. The next step is to compute all distances and rank the points:</p>
<p><img src="https://erikbern.com/assets/2015/09/candidates-dist.png" alt="image"></p>
<p>We then sort all candidates by distance and return the top K nearest neighbors. Nice! And that is how the search algorithm works in Annoy.</p>
<p>Except one thing. In this case it turns out we actually did miss a couple of points outside:</p>
<p><img src="https://erikbern.com/assets/2015/09/candidates-top.png" alt="image"></p>
<p>But the A in Annoy stands for <em>approximate</em> and missing a few points is acceptable. Annoy actually has a knob you can tweak (<em>search_k</em>) that lets you trade off performance (time) for accuracy (quality).</p>
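<p>In the Python bindings, using that knob looks roughly like this (a sketch – check the README for the exact signatures in your version):</p>
<pre><code>from annoy import AnnoyIndex
import random

f = 40                 # dimensionality of the vectors
index = AnnoyIndex(f)  # angular distance by default
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)        # 10 trees

# Larger search_k means more nodes are inspected: better accuracy, slower queries
print(index.get_nns_by_item(0, 10, search_k=10000))
</code></pre>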
<p>The whole idea behind approximate algorithms is that sacrificing a little bit of accuracy can give you enormous performance gains (orders of magnitude). For instance we could return a decent solution where we really only computed the distance for 1% of the points – this is a 100x improvement over exhaustive search.</p>
<p>More trees always help. By adding more trees, you give Annoy more chances to find favorable splits. You generally want to bump it up as high as you can go without running out of memory.</p>
<p><strong>Summary: Annoy's algorithm</strong></p>
<p>Preprocessing time:</p>
<ol>
<li>Build up a bunch of binary trees. For each tree, split all points recursively by random hyperplanes.</li>
</ol>
<p>Query time:</p>
<ol>
<li>Insert the root of each tree into the priority queue</li>
<li>Until we have <em>search_k</em> candidates, search all the trees using the priority queue</li>
<li>Remove duplicate candidates</li>
<li>Compute distances to candidates</li>
<li>Sort candidates by distance</li>
<li>Return the top ones</li>
</ol>
<p>Feel free to check out <a href="https://github.com/spotify/annoy/blob/master/src/annoylib.h#L421">_make_tree</a> and <a href="https://github.com/spotify/annoy/blob/master/src/annoylib.h#L513">_get_all_nns</a> in annoylib.h</p>
<p>That's it for this post! More from the presentation is coming shortly. Btw, take a look at <a href="http://www.slideshare.net/erikbern/approximate-nearest-neighbor-methods-and-vector-models-nyc-ml-meetup">the slides</a>, and check out the <a href="https://github.com/erikbern/ann-presentation">code to generate all graphs</a> in this post.</p>
Nearest neighbor methods and vector models – part 12015-09-24T00:00:00Zhttps://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1.html<p>This is a blog post rewritten from a presentation at <a href="http://www.meetup.com/NYC-Machine-Learning/events/225265016/">NYC Machine Learning</a> last week. It covers a library called <a href="https://github.com/spotify/annoy">Annoy</a> that I have built that helps you do (approximate) nearest neighbor queries in high dimensional spaces. I will be splitting it into several parts. This first talks about vector models, how to measure similarity, and why nearest neighbor queries are useful.</p>
<p><em>Nearest neighbors</em> refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest <em>k</em> neighbors quickly.</p>
<p><img class=" size-large wp-image-1404 aligncenter" src="/assets/2015/09/scatter-nns-20-1024x793.png" alt="scatter-nns-20" width="660" height="511" /></p>
<p> </p>
<p>This turns out to be quite useful for a bunch of different applications. Before we get started on exactly how nearest neighbor methods work, let's talk a bit about vector models.</p>
<p><strong>Vector models and why nearest neighbors are useful</strong></p>
<p>Vector models are increasingly popular in various applications. They have been used in natural language processing for a long time using things like LDA and PLSA (and even earlier using TF-IDF in raw space). Recently there has been a new generation of models: <a href="https://code.google.com/p/word2vec/">word2vec</a>, RNN's, etc.</p>
<p>In collaborative filtering, vector models have been among the most popular methods going back to the Netflix Prize – the <a href="http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf">winning entry</a> featured a huge ensemble where vector models made up a big part.</p>
<p>The basic idea is to represent objects in a space where proximity means two items are similar. If we're using something like word2vec it could look something like this:</p>
<p><img class=" wp-image-1427 aligncenter" src="/assets/2015/09/vector-model1.png" alt="vector model" width="459" height="323" /></p>
<p> </p>
<p>In this case similarity between words is determined by the angle between them. <em>apple</em> and <em>banana</em> are close to each other, whereas <em>boat</em> is further.</p>
<p>(As a side note: much has been written about word2vec's ability to do word analogies in vector space. This is a powerful demonstration of the structure of these vector spaces, but the idea of using vector spaces is old and similarity is arguably much more useful).</p>
<p>In the most basic form, data is already represented as vectors. For an example of this, let's look at one of the most canonical data sets in machine learning – the <a href="http://yann.lecun.com/exdb/mnist/">MNIST handwritten digits</a> dataset.</p>
<p><strong>Building an image search engine for handwritten digits</strong></p>
<p>The MNIST dataset features 60,000 images of size 28×28. They each feature a handwritten digit in grayscale. One of the most basic ways we can play around with this data set is to smash each 28×28 array into a 784-dimensional vector. There is absolutely no machine learning involved in doing this, but we will come back and introduce cool stuff like neural networks and word2vec later.</p>
<p>Let's define a distance function in this space. Let's say the distance between two digits is the sum of squared pixel differences. This is basically the squared Euclidean distance (i.e. the good old Pythagorean theorem):</p>
<p><img class=" size-medium wp-image-1406 aligncenter" src="/assets/2015/09/pixel-distance-300x57.png" alt="pixel distance" width="300" height="57" /></p>
<p>This is nice because we can compute the distance of arbitrary digits in the dataset:</p>
<p><img class=" size-medium wp-image-1407 aligncenter" src="/assets/2015/09/digit-distance-300x41.png" alt="digit distance" width="300" height="41" /></p>
<p>This now lets us search for neighbors in this 784-dimensional space. Check out some samples below – the leftmost digit is the seed digit and to the right of it are the ten most similar images using the pixel distance.</p>
<p><img class=" wp-image-1410 aligncenter" src="/assets/2015/09/MNIST-neighbors-300x164.png" alt="MNIST neighbors" width="569" height="311" /></p>
<p>You can see that it sort of works. The digits are visually quite similar, although it's obvious to a human that some of the nearest neighbors are the wrong digit.</p>
<p>This was pretty nice and easy, but it's also an approach that doesn't scale very well. What about larger images? What about color images? And how do we determine similarity not just visually but in terms of what a human would think of as similar? This simple definition of “distance” leaves a lot of room for improvement.</p>
<p><strong>Dimensionality reduction</strong></p>
<p>A powerful method that works across a wide range of domains is to take high dimensional complex items and project the items down to a compact vector representation:</p>
<ol>
<li>Do a dimensionality reduction from a large dimensional space to a small dimensional space (10-1000 dimensions)</li>
<li>Use similarity in this space instead</li>
</ol>
<p>Dimensionality reduction is an extremely powerful technique because it lets us take almost any object and translate it to a small convenient vector representation in a space. This space is generally referred to as <em>latent</em> because we don't necessarily have any prior notion of what the axes are. What we care about is that <em>objects that are similar end up being close to each other.</em> What do we mean by similarity? In a lot of cases we can actually discover that from our data.</p>
<p>So let's talk about one approach for dimensionality reduction on images: deep convolutional neural networks. I had a side project about a year ago to classify food. It's a pretty silly application, but the eventual goal was to see if you could predict calorie content from pictures, and a side goal was to learn how to use convolutional neural networks. I never ended up using this for anything and wasted way too much money renting GPU instances on AWS, but it was fun.</p>
<p>To train the model, I downloaded 6M pics from Yelp and Foursquare and trained a network quite similar to the one described in <a href="http://arxiv.org/pdf/1409.1556.pdf">this paper</a> using Theano.</p>
<p><img class=" size-large wp-image-1429 aligncenter" src="/assets/2015/09/Foodnet-1024x448.png" alt="Foodnet" width="660" height="289" />](/assets/2015/09/foodnet.png)</p>
<p>The final layer in this model is a 1244-way multi-classification output using softmax, so we're training this in a supervised way. The classes are words that occurred in the description text, e.g. “spicy ramen” for the one above. However, the nice thing is we have a “bottleneck” layer just before the final layer – a 128-dimensional vector that gives us exactly what we want.</p>
<p><img class=" size-medium wp-image-1414 aligncenter" src="/assets/2015/09/CNN-burger-300x103.png" alt="CNN-burger" width="300" height="103" /></p>
<p>Using the neural network as an embedding function and cosine similarity as a metric (this is basically Euclidean distance, but with the vectors normalized first), we get some quite cool nearest neighbors:</p>
<p><img class=" size-large wp-image-1416 aligncenter" src="/assets/2015/09/nearest-food-pics-1024x538.jpg" alt="nearest food pics" width="660" height="347" /></p>
<p>These similars look pretty reasonable! The top left picture is similar to a bunch of other fries. The second row shows a bunch of different white bowls with Asian food – more impressively, they are all at different scales and angles, and pixel by pixel similarity is quite low. The last row shows a bunch of desserts with similar patterns of chocolate sprinkled over them. We're dealing with a space that can express object features quite well.</p>
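<p>As an aside, the cosine metric mentioned above takes just a few lines if you want to play with it yourself (a sketch):</p>
<pre><code>import numpy as np

def cosine_similarity(u, v):
    # Normalize first; after that, ranking by Euclidean distance
    # and ranking by cosine similarity agree
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return np.dot(u, v)
</code></pre>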
<p>So how do we find similar items? I'm not going to describe dimensionality reduction in great detail – there are a million different methods you can read about. What I have spent more time thinking about is <em>how to search for neighbors in vector spaces</em>. In fact, finding the neighbors above takes only a few milliseconds per picture, because Annoy is very fast. This is why dimensionality reduction is so extremely useful: at the same time as it's discovering high level structure in data, it also computes a compact representation of items. This representation makes it easy to compute similarity and search for nearest neighbors.</p>
<p><strong>Vector methods in collaborative filtering</strong></p>
<p>Reducing dimensionality isn't just useful in computer vision, of course. As mentioned, it's incredibly useful in natural language processing. At Spotify, we use vector models extensively for collaborative filtering. The idea is to project artists, users, tracks, and other objects into a low dimensional space where similarity can be computed easily and recommendations can be made. This is in fact what powers almost all of the Spotify recommendations – in particular Discover Weekly that was launched recently.</p>
<p><img class=" wp-image-1447 aligncenter" src="/assets/2015/09/2D-embedding-of-artists.png" alt="2D embedding of artists" width="353" height="342" /></p>
<p><img class=" size-medium wp-image-1448 aligncenter" src="/assets/2015/09/CF-cosines-300x161.png" alt="CF cosines" width="300" height="161" /></p>
<p>I have already put together several presentations about this so if you're interested, you should check out some of them:</p>
<ul>
<li><a href="http://www.slideshare.net/erikbern/music-recommendations-mlconf-2014">Music Recommendations @ MLConf</a> (2014)</li>
<li><a href="http://www.slideshare.net/erikbern/mlhadoop-nyc-predictive-analytics-2">ML+Hadoop @ NYC Predictive Analytics (2013)</a></li>
</ul>
<p><strong>Exhaustive search as a baseline</strong></p>
<p>So how do we find similar items? Before we go into detail about how Annoy works, it's worth looking at the baseline of doing a <em>brute force exhaustive search</em>. This means iterating over all possible items and computing the distance for each one of them to our query point.</p>
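<p>In numpy, the baseline is just a few lines (a sketch assuming the vectors are unit-normalized, so that the dot product equals cosine similarity):</p>
<pre><code>import numpy as np

def exhaustive_nns(vectors, query, k):
    # vectors: (n, f) matrix with unit-normalized rows; query: unit-normalized (f,)
    sims = vectors.dot(query)     # cosine similarity to every item at once
    return np.argsort(-sims)[:k]  # indices of the k most similar items
</code></pre>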
<p>word2vec actually comes with a tool to do exhaustive search. Let's see how it compares! Using the GoogleNews-vectors-negative300.bin dataset and querying for “chinese river”, it takes about <strong>2 minutes 34 seconds</strong> to output this:</p>
<ul>
<li>Qiantang_River</li>
<li>Yangtse</li>
<li>Yangtze_River</li>
<li>lake</li>
<li>rivers</li>
<li>creek</li>
<li>Mekong_river</li>
<li>Xiangjiang_River</li>
<li>Beas_river</li>
<li>Minjiang_River</li>
</ul>
<p>I wrote a similar tool that uses Annoy (<a href="https://github.com/erikbern/ann-presentation/blob/master/nearest_neighbors.py">available on Github here</a>). The first time you run it, it will precompute a bunch of stuff and can take a lot of time to run. However, the second time it runs, it will load (mmap) an Annoy index directly from disk into memory. Relying on the magic of the page cache, this will be very fast. Let's take it for a spin and search for “chinese river”:</p>
<ul>
<li>Yangtse</li>
<li>Yangtze_River</li>
<li>rivers</li>
<li>creek</li>
<li>Mekong_river</li>
<li>Huangpu_River</li>
<li>Ganges</li>
<li>Thu_Bon</li>
<li>Yangtze</li>
<li>Yangtze_river</li>
</ul>
<p>Amazingly, this ran in <strong>470 milliseconds</strong>, probably some of it overhead for loading the Python interpreter etc. This is roughly <strong>300x faster</strong> than the exhaustive search provided by word2vec.</p>
<p>Now – some of you probably noticed that the results are marginally different. That's because the A in Annoy stands for <em>approximate</em>. We are deliberately trading off some accuracy in return for a huge speed improvement. It turns out you can actually control this knob explicitly. Telling Annoy we want to search through 100k nodes (we'll get back to that later), we get this result in about <strong>2 seconds</strong>:</p>
<ul>
<li>Qiantang_River</li>
<li>Yangtse</li>
<li>Yangtze_River</li>
<li>lake</li>
<li>rivers</li>
<li>creek</li>
<li>Mekong_river</li>
<li>Xiangjiang_River</li>
<li>Beas_river</li>
<li>Minjiang_River</li>
</ul>
<p>It turns out this is exactly the same as the exhaustive search – and still about <strong>50x faster</strong>.</p>
<p><strong>Other uses of nearest neighbors</strong></p>
<p>Finally just as a fun example of another use, nearest neighbors is useful when you're dealing with physical spaces too. In an earlier blog post, I was showing this world map of how long it takes to ping IP addresses from my apartment in NYC:</p>
<p><img class="aligncenter" src="/assets/world.png" alt="" width="1635" height="589" /></p>
<p>This is a <a href="https://github.com/erikbern/ping">simple application</a> of k-NN (k-nearest neighbors) regression that I've <a href="/2015/04/26/ping-the-world/">written earlier about on this blog</a>. There is no dimensionality reduction involved here – we just deal with 3D coordinates (lat/long projected to the unit sphere).</p>
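<p>The projection onto the unit sphere is straightforward (a sketch):</p>
<pre><code>import numpy as np

def latlng_to_xyz(lat, lng):
    # Degrees -> radians -> a point on the unit sphere
    lat, lng = np.radians(lat), np.radians(lng)
    return np.array([np.cos(lat) * np.cos(lng),
                     np.cos(lat) * np.sin(lng),
                     np.sin(lat)])
</code></pre>
<p>Since the straight-line distance between two such points is monotone in the great-circle distance, k-NN on the 3D coordinates returns the same neighbors as k-NN on the sphere itself.</p>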
<p>In the next series, I will go in depth about how Annoy works. Stay tuned!</p>
Presentations about Spotify music recommendations2015-09-22T00:00:00Zhttps://erikbern.com/2015/09/22/presentations-about-spotify-music-recommendations.html<p>A couple of people in my old team have been around talking about how Spotify does music recommendations and put together some quite good presentations.</p>
<p>First one is Neville Li's presentation about <a href="http://www.slideshare.net/sinisalyh/scala-data-pipelines-spotify">Scala Data Pipelines @ Spotify</a>:</p>
<p>The second one is Chris Johnson's <a href="http://www.slideshare.net/MrChrisJohnson/interactive-recommender-systems-with-netflix-and-spotify">presentation</a> from <a href="http://recsys.acm.org/recsys15/">RecSys 2015</a> about Interactive Recommender Systems:</p>
Antipodes2015-09-08T00:00:00Zhttps://erikbern.com/2015/09/08/antipodes.html<p>I was playing around with D3 last night and built a silly visualization of antipodes and how our intuitive understanding of the world sometimes doesn't make sense. Check out the <a href="http://bl.ocks.org/erikbern/1ff88b70b70e10f81822">visualization at bl.ocks.org</a>!</p>
<p>Basically the idea is if you fly from Beijing to Buenos Aires then you can have a layover at <em>any point of the Earth's surface</em> and it won't make the trip longer.</p>
<p><img src="https://erikbern.com/assets/2015/09/Screen-Shot-2015-09-07-at-11.33.33-PM.png" alt="image"></p>
Software Engineers and Automation2015-08-16T00:00:00Zhttps://erikbern.com/2015/08/16/software-engineers-and-automation.html<p>Every once in a while when talking to smart people the topic of automation comes up. Technology has made lots of occupations redundant, so what's next?</p>
<p><img src="https://erikbern.com/assets/2015/08/switchboard-operator.jpg" alt="image"></p>
<p><em>Switchboard operator, a long time ago</em></p>
<p>What about software engineers? Every year technology replaces parts of what they do. Eventually surely everything must be replaced? I just ran into another one of these arguments: <a href="https://medium.com/@dtauerbach/software-engineers-will-be-obsolete-by-2060-2a214fdf9737">Software Engineers will be obsolete by 2060</a>.</p>
<p>This might be a <a href="https://en.wikipedia.org/wiki/Lump_of_labour_fallacy">Lump of Labor Fallacy</a>. Think about how much around us is currently powered by software and how much <em>could</em> be powered by software. The opportunity to apply software is probably 100x larger than what's currently being used. So why aren't we using software 100x more? <em>Because software engineers are expensive.</em></p>
<p>It's easy to see this if you look back ten years. Say you wanted to build a web shop ten years ago. This was before the cloud, before API's, good web frameworks, etc. Building a web shop was probably 100x more expensive back then. As a result, there were <em>a lot fewer</em> web shops available. Of course, it's harder to know what latent demand will be unlocked in the next ten years, but there are always new things coming out that you didn't realize you needed.</p>
<p>Somewhat counterintuitively, for many goods the latent demand is so big that what happens when the price drops is that the <em>total demand increases.</em> This is called <a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons Paradox</a>, after an economist who noticed in the 1800s that increased efficiency of coal use led to an <em>increase</em> in consumption of coal.</p>
<p><img src="https://erikbern.com/assets/2015/08/enhanced-buzz-3882-1369090195-6.jpg" alt="image"></p>
<p><em>Vin Diesel as a stock broker in the movie “Boiler Room”</em></p>
<p>The key here is whether technology <em>replaces</em> a job or whether it <em>increases the efficiency</em> of a job. Technology did not increase the output of switchboard operators, so they were replaced. Similarly, technology is not going to make truck drivers 100x as efficient, so they will be replaced by self driving trucks at some point. But technology actually has the opportunity to increase the output of software engineers by another few orders of magnitude. This will unlock a lot of latent demand, and we will need <em>more</em> software engineers, not less.</p>
<p>The other key is of course whether demand is bounded. So if you want to identify which occupations will be automated, I would look for (a) limited latent demand (b) little technical leverage.</p>
<p>Is this rationalization? Maybe!</p>
<p>Also for a good quick read, check out <a href="http://www.amazon.com/Race-Against-The-Machine-Accelerating-ebook/dp/B005WTR4ZI">Race Against the Machine</a> by Erik Brynjolfsson and Andrew McAfee.</p>
coin2dice2015-07-24T00:00:00Zhttps://erikbern.com/2015/07/24/math-problem.html<p>Here's a problem that I used to give to candidates. I stopped using it seriously a long time ago since I don't believe in puzzles, but I think it's kind of fun.</p>
<ol>
<li>Let's say you have a function that simulates a random coin flip. It returns “H” or “T”. This is the <em>only random generator available</em>. How can you write a new function that simulates a random dice roll (1…6)?</li>
<li>Is there any method that guarantees that the second function returns in finite time?</li>
<li>Let's say you want to do this $$ n $$ times where $$ n \to \infty $$ . What's the most efficient way to do it? Efficient in terms of <em>using the fewest coin flips</em>.</li>
</ol>
<p>The first part is old, I think. The second and third part are follow up questions that I came up with.</p>
<p>I'll give you some time to think about it!</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>Don't peek!</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p>Did you figure it out?</p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
<p><strong>Solutions</strong></p>
<ol>
<li>There's a multitude of ways to do this. The easiest is probably to flip the coin three times and map the outcomes like this: (HHH, 0), (HHT, 1), (HTH, 2), (HTT, 3), (THH, 4), (THT, 5), (TTH, 6), (TTT, 7). If you end up getting HHH or TTT, you flip again (a short code sketch follows this list).</li>
<li>Impossible! If you flip the coin $$ n $$ times then there are exactly $$ 2^n $$ outcomes. But we can't partition this space evenly into 6 buckets, since $$ 3\nmid 2^n $$.</li>
<li>This one is trickier. Think about it in terms of the amount of information you extract. Every coin flip extracts 1 bit, but every dice roll consumes $$ \log_2 6\approx 2.585 $$ bits. This is a lower bound – you need <em>at least</em> that many coin flips per dice roll.</li>
</ol>
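<p>A minimal sketch of the first solution (with <code>coin_flip</code> standing in for the given random generator):</p>
<pre><code>import random

def coin_flip():
    return random.choice("HT")  # stand-in for the given coin

def dice_roll():
    while True:
        # Three flips encode a number 0..7; HHH (0) and TTT (7) are rejected
        bits = [coin_flip() == "T" for _ in range(3)]
        value = 4 * bits[0] + 2 * bits[1] + bits[2]
        if 1 <= value <= 6:
            return value
</code></pre>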
<p>Is there a way to achieve that lower bound? Turns out there is: basically, encode a long series of coin flips as the binary representation of a number between 0 and 1, then convert it to base 6. This idea resembles <a href="https://en.wikipedia.org/wiki/Arithmetic_coding">Arithmetic coding</a>. Sample code:</p>
<p>{% gist erikbern/22c48c622dd9e160419c %}</p>
<p>Hope you enjoyed geeking out with a math problem this time!</p>
Benchmark of Approximate Nearest Neighbor libraries2015-07-04T00:00:00Zhttps://erikbern.com/2015/07/04/benchmark-of-approximate-nearest-neighbor-libraries.html<p><a href="https://github.com/spotify/annoy">Annoy</a> is a library written by me that supports fast approximate nearest neighbor queries. Say you have a high (1-1000) dimensional space with points in it, and you want to find the nearest neighbors to some point. Annoy gives you a way to do this very quickly. It could be points on a map, but also word vectors in a latent semantic representation or latent item vectors in collaborative filtering.</p>
<p>I've made a few optimizations to Annoy lately and I was curious to see how it stacks up against other libraries out there, so I wrote a benchmark suite: <a href="https://github.com/erikbern/ann-benchmarks">ann-benchmarks</a>. It supports any library with a Python interface and I added a bunch of them. It even has Travis integration!</p>
<p><img src="https://erikbern.com/assets/ann-benchmarks-glove.png" alt="image"></p>
<p><em>Cosine 100D results</em></p>
<p><img src="https://erikbern.com/assets/ann-benchmarks-sift.png" alt="image"></p>
<p><em>Results for 128D Euclidean</em></p>
<p>The results so far for Annoy are <em>pretty great</em>. The only method that consistently beats Annoy is SW-graph from <a href="https://github.com/searchivarius/NonMetricSpaceLib">nmslib</a> which is about 2-3x faster at the same precision. But Annoy beats both <a href="http://www.cs.ubc.ca/research/flann/">FLANN</a> and <a href="https://github.com/aaalgo/kgraph">KGraph</a> at high precisions (>95%). At lower precisions (<95%) and cosine distance, Annoy is not quite as fast as FLANN and KGraph.</p>
<p>A surprising result was that <a href="https://github.com/ryanrhymes/panns">Panns</a>, <a href="https://nearpy.io">Nearpy</a>, and <a href="http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LSHForest.html">LSHForest</a> all perform very poorly. They are roughly 1,000 times slower than the other ones, and even worse, Panns and LSHForest don't even produce high precision scores. I created <a href="https://github.com/scikit-learn/scikit-learn/issues/4917">an issue</a> in scikit-learn about LSHForest's performance. Of course, it might be that I did something wrong in the benchmark.</p>
<p>Annoy was actually never built with performance in mind. The killer feature was always being able to load/unload indexes quickly using mmap – which no other package supports – but it's fun to see that it's actually very competitive on performance too. One thing that made a difference is that I recently changed the interface for Annoy slightly (don't worry, it's backwards compatible). There is now a query-time tradeoff knob that lets you vary how many nodes are inspected.</p>
<p>The other factor not covered by the graph above is how hard most of these libraries are to install. Annoy should compile and run almost anywhere (Linux, Windows, OS X) very easily, but the other libraries can be challenging to install. For instance, both KGraph and nmslib depend on GCC 4.8, so they require custom installations. There are <a href="https://github.com/erikbern/ann-benchmarks/tree/master/install">scripts</a> to install all libraries in the repo, tested with Ubuntu 12.04 and 14.04. For other platforms – good luck!</p>
<p>I might do a talk in September at the <a href="http://www.meetup.com/NYC-Machine-Learning">NYC Machine Learning meetup</a> about this, we'll see! Until then, I found <a href="http://stefansavev.com/pdf/RandomProjectionsForSearchAndMachineLearning.pdf">this really great presentation</a> by Stefan Savev (with <a href="http://stefansavev.com/randomtrees/">corresponding web site</a>). He claims <a href="https://github.com/stefansavev/random-projections-at-berlinbuzzwords">his own implementation</a> is a bit faster than Annoy! It's in Scala so I haven't tested it yet.</p>
<p><img src="https://erikbern.com/assets/2015/07/Screen-Shot-2015-07-05-at-10.51.40-AM.png" alt="image"></p>
<p><em>Slides from Stefan Savev's presentation</em></p>
More Luigi alternatives2015-07-02T00:00:00Zhttps://erikbern.com/2015/07/02/more-luigi-alternatives.html<p>The workflow engine battle has intensified with some more interesting entries lately! Here are a couple I encountered in the last few days. I love that at least two of them are direct references to Luigi!</p>
<p><strong>Airflow</strong> <a href="http://nerds.airbnb.com/airflow/">(Blog Post)</a> (<a href="https://github.com/airbnb/airflow">GitHub</a>)</p>
<p>Airflow from Airbnb is probably the most interesting one. I've only glanced at it, but here are some very superficial notes:</p>
<ul>
<li>Seems mostly targeted to run raw UNIX commands using a pretty simple syntax with Jinja templates</li>
<li>Tasks don't have support for parameters but it seems like you can build tasks dynamically by just putting them in a function</li>
<li>It seems to be built around daily jobs, meaning dates and backfill by dates is a foundational concept of the package (whereas in Luigi dates are just one out of many different parameters)</li>
<li>There is a database of task history which is great</li>
<li>The visualization seems great</li>
<li>It also supports farming out jobs to other machines. This is something Luigi definitely needs</li>
<li>It also comes with built-in support for HDFS, S3, MySQL and Postgres, similar to Luigi</li>
<li>There's a built-in triggering mechanism (also something Luigi needs)</li>
</ul>
<p><img src="https://erikbern.com/assets/2015/07/Screen-Shot-2015-05-28-at-11.13.01-AM-300x196.png" alt="image"><em>Screen shot of Airflow (from the blog post)</em></p>
<p><strong>Mario</strong> (<a href="http://intentmedia.com/one-up-building-machine-learning-pipelines-with-mario/">Blog post</a>) <a href="https://github.com/intentmedia/mario">(GitHub)</a></p>
<p>Mario seems to be a “Luigi in Scala”. It seems extremely simplistic, so it's probably more conceptual at this point than meant to be full-fledged.</p>
<p>The choice of Scala is interesting. First of all, I think it's fair to say that the JVM has taken over the data stack. A lot of Luigi's core concepts are really functional in nature, and a language like Scala might be a better choice. A cool thing is that the RPC between Luigi clients and the scheduler is actually just a simple REST interface, so Luigi's scheduler could in fact support clients written in other languages. Mario doesn't do this, but it's something I've been meaning to explore for a long time.</p>
<p><strong>Ruigi</strong> <a href="https://github.com/kirillseva/ruigi">(GitHub)</a></p>
<p>Ruigi is “Luigi in R”. It follows the same set of conventions but seems to be pretty simple in that it runs everything locally.</p>
<p><strong>Makeflow</strong> <a href="http://ccl.cse.nd.edu/software/makeflow/">(Web site)</a></p>
<p>Seems to be an academic project mostly for publishing papers. What's up with not using GitHub in 2015? And having a “download” section with tarballs!</p>
<p>The benefit of Makeflow seems to be support for a bunch of batch systems commonly used in HPC in academia. The dependencies are specified using their own DSL with some simple Makefile-like notation.</p>
<p><strong>Conclusions</strong></p>
<p>So what's the state of workflow engines at the moment? Allow me to say something provocative just for the sake of it: <em>they all kind of suck. Including Luigi.</em> There is basically so much trial and error in building a workflow engine, and every time I encounter one I just see a bunch of bad design decisions. Same when I look at Luigi. Luigi was the result of <em>many</em> iterations and it avoids roughly 100 pitfalls we encountered in earlier attempts, but there are still some parts of the design that could be addressed.</p>
<p>I don't mean to bash my own open source project here. Open sourcing Luigi has helped a lot of people building awesome stuff and I think it's better than anything else out there.</p>
<p>I hope that with the explosion of workflow engines lately, we will see a convergence of ideas into a second generation of much better, much more scalable, and much easier ones. My dream is that someone combines every god damned workflow engine in the world and writes a new one, preferably in some JVM based language like Scala. I sincerely have no idea what that would look like, but I would love to do that some day as a Luigi 2.0!</p>
<p><img src="https://erikbern.com/assets/2015/06/Luigi-Lance-781x1024.png" alt="image"><em>Me and one of my cats (who is not too happy). Long story!</em></p>
3D in D32015-06-21T00:00:00Zhttps://erikbern.com/2015/06/21/3d-in-d3.html<p>I have spent some time lately with <a href="http://d3js.org/">D3</a>. It's a lot of fun to build interactive graphs. See for instance this <a href="https://rawgit.com/bettermg/crossfader/master/demo.html#wine">demo</a> (will provide a longer writeup soon).</p>
<p>D3 doesn't have support for 3D but you can do projections into 2D pretty easily. It's just old school computer graphics. I ended up adding an animated background to this blog based on <a href="https://github.com/erikbern/d3-3d">an experiment</a>. The math is simple.</p>
<p>First, there's the rotation. Given a bunch of 3D coordinates, how do you rotate them in 3D space? The cleanest way is to define angles $$ \alpha, \beta, \gamma $$ and use them for rotation in the yz-plane, xz-plane, and xy-plane, respectively. Each of them define a rotation matrix. For the xy-plane, we get the rotation matrix $$ R_{xy} $$ (<a href="https://github.com/erikbern/d3-3d/blob/master/d3-3d.js#L5">see code</a>):</p>
<p>$$ R_{xy} = \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ 0 & 0 & 1 \end{pmatrix} $$</p>
<p>We get three of these matrices in total: $$ R_{yz}, R_{xz}, R_{xy} $$ .</p>
<p>The rotation of any vector $$ \mathbf{v} $$ can now be described as $$ R_{yz} R_{xz} R_{xy}\mathbf{v} $$ . The nice thing is we can <em>precompute</em> the product of these matrices $$ R $$ (<a href="https://github.com/erikbern/d3-3d/blob/master/d3-3d.js#L14">see code</a>). Math porn:</p>
<p>$$ R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\alpha) & -\sin(\alpha) \\ 0 & \sin(\alpha) & \cos(\alpha) \end{pmatrix} \begin{pmatrix} \cos(\beta) & 0 & \sin(\beta) \\ 0 & 1 & 0 \\ -\sin(\beta) & 0 & \cos(\beta) \end{pmatrix} \begin{pmatrix} \cos(\gamma) & -\sin(\gamma) & 0 \\ \sin(\gamma) & \cos(\gamma) & 0 \\ 0 & 0 & 1 \end{pmatrix} $$</p>
<p> </p>
<p>Now going forward we can use the matrix $$ R $$ to rotate any vector $$ \mathbf{v} $$ (<a href="https://github.com/erikbern/d3-3d/blob/master/d3-3d.js#L25">see code</a>).</p>
<p>The other thing you want to do is to make distant objects look further away. Thinking through proportionalities you can derive a pretty simple equation (<a href="https://github.com/erikbern/d3-3d/blob/master/d3-3d.js#L31">see code</a>): $$ x' = x / (z/d + 1) $$ , and same for $$ y $$ . The constant $$ d $$ is just a hack to scale down the $$ z $$ values.</p>
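<p>Here's a numpy sketch of the whole thing (the real implementation is in JavaScript – this is just to illustrate the math):</p>
<pre><code>import numpy as np

def rotation_matrix(alpha, beta, gamma):
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    R_yz = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    R_xz = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    R_xy = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return R_yz.dot(R_xz).dot(R_xy)  # precompute the product once

def project(v, R, d=2.0):
    # Rotate, then shrink distant points toward the center
    x, y, z = R.dot(v)
    return x / (z / d + 1), y / (z / d + 1)
</code></pre>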
<p>Not sure whether the 3D animation is cool or just annoying, but I'll prob keep it for a bit – enjoy!</p>
The hardest challenge about becoming a manager2015-06-05T00:00:00Zhttps://erikbern.com/2015/06/05/the-hardest-challenge-about-becoming-a-manager.html<p>Note: this post is full of pseudo-psychology and highly speculative content. Like most fun stuff!</p>
<p>I became a manager back in 2009. Being a developer is fun. You have this very tangible way to measure yourself. Did I deploy something today? How much code did I write today? Did I solve some really cool machine learning problem on paper?</p>
<p>But as 1:1's and emails and architecture discussions started filling up my day I often walked home with this gnawing feeling of having accomplished nothing. I saw my team build and deploy some really cool stuff, but I had this sort of guilt as if I was pretty useless.</p>
<p>To feel better about myself, I started coding more. But I noticed when I started coding it was like smoking crack. I couldn't stop doing it. I would come in at 9am thinking about some fun problem, then get completely sucked into it. I would find myself at 9pm drinking Red Bull deep in some highly optimized C++ latent factor model. That felt great until I realized I had missed a bunch of 1:1's and had 43 unread emails in my inbox.</p>
<p><img src="https://erikbern.com/assets/2015/06/tumblr_n44tmpqcn31sv7laoo1_500.gif" alt="image"></p>
<p>I think what happens in your brain is that you create all these proxies for accomplishment that take years to retrain. It would be incredibly costly if every action were judged in terms of its benefits amortized over your entire lifetime. Instead, we humans have an ability to invent pretty arbitrary proxies, such as getting high scores on exams, or in my case, writing shitloads of code.</p>
<p>Proxies are great because they let you make decisions much quicker. If I have decided that it's good for me to write code, then I will start doing it, and eventually feel great about it. After a few years of doing something very consciously (<em>programming is good because I can build this cool game and show it to my friends</em>) you build up this great rewards system (kind of like Pavlov's dogs?) that makes you feel good about it in itself (<em>programming is cool because I feel good when I do it</em>).</p>
<p>The problem is when your ultimate goal changes and your old proxies are still in effect. Your <em>rational</em> side might tell you: hey, look at you, your team is really happy, they are learning new stuff, and delivering lots of stuff. But still, you have this really weird feeling that you are not getting <em>anything useful</em> done.</p>
<p>This took me literally <em>years</em> to retrain. I remember at some point I saw someone in my team do something unexpectedly impressive and I got really excited. I got excited because I realized this person had grown tremendously since joining, and presumably some small fraction of it was due to me. With enough exposure, this finally starts to become the new proxy for delivering value – something the <em>irrational</em> side immediately detects and makes you feel a sense of accomplishment about.</p>
<p>Anyway… once an addict, always an addict. I still have relapses and fall back into programming sometimes. In general I've noticed it's <em>extremely</em> hard to balance and try to do 50-50 development and management. Basically, one of the sides takes over until it's 90-10. Either you start coding and fall back into your old crack habits, or you manage to break out of it and go into this opposite mode where you just don't have the energy to get into the zone.</p>
<p>I don't know what my conclusion is. Programmers are the most irrational people I know, and I think they are really driven by an irrational, <a href="http://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow">System 1</a> kind of thinking – the pattern matching brain. That's why I think so many super smart people get stuck in habits (<em>I really want to just solve super cool graph algorithm problems!</em>). The most powerful way to make progress is to put your rational System 2 on and constantly remind the other side what's really making an impact and what's really beneficial in the long term. It takes a few years, but with enough persistence, your rational self can really train the primitive pattern matching brain to feel a sense of accomplishment in doing almost anything.</p>
<p>Sorry for offending anyone with my drug references!</p>
The lane next to you is more likely to be slower than yours2015-05-28T00:00:00Zhttps://erikbern.com/2015/05/28/the-lane-next-to-you-is-more-likely-to-be-slower-than-yours.html<p>Saw this link on Hacker News the other day: <a href="http://www.citylab.com/commute/2015/05/the-highway-lane-next-to-yours-isnt-really-moving-any-faster/394079/">The Highway Lane Next to Yours Isn’t Really Moving Any Faster</a></p>
<p>The article describes a phenomenon unique to traffic where cars spread out when they go fast and get more compact when they go slow. That's supposedly the explanation.</p>
<p>There's a much simpler explanation that works for <em>any</em> queue. Let's consider a supermarket checkout with two lines. One of them has a slow worker and will take 10 minutes. The other one has a fast worker and will take 5 minutes. You don't know which one is which so you pick one at random.</p>
<p>With $$ p=1/2 $$ you will pick the slow one, of course. But let's say you go to this supermarket every day for a year. Here's the interesting thing: on average you will spend $$ 2/3 $$ of your time <em>in the slow queue</em>, since half of your visits cost you 10 minutes in the slow line and the other half only 5 minutes in the fast line. So if you sample uniformly any point in time where you are standing in line, with $$ p=2/3 $$ the other line will be faster.</p>
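<p>You can convince yourself with a tiny simulation (a sketch):</p>
<pre><code>import random

slow, fast = 0, 0
for _ in range(365):          # a year of supermarket visits
    if random.random() < 0.5:
        slow += 10            # picked the slow line: 10 minutes
    else:
        fast += 5             # picked the fast line: 5 minutes
print(slow / (slow + fast))   # hovers around 2/3
</code></pre>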
<p> </p>
<p> </p>
Better precision and faster index building in Annoy2015-05-26T00:00:00Zhttps://erikbern.com/2015/05/26/40-better-precision-and-4x-faster-index-building-in-annoy.html<p>Sometimes you have these awesome insights. A few days ago I got an <a href="https://github.com/spotify/annoy/issues/64">idea</a> for how to improve index building in <a href="https://github.com/spotify/annoy">Annoy</a>.</p>
<p>For anyone who isn't acquainted with Annoy – it's a C++ library with Python bindings that provides fast high-dimensional nearest neighbor search.</p>
<p>Annoy recursively builds up a tree given a set of points. The algorithm so far was: at every level, pick a random hyperplane out of all possible hyperplanes that intersect the convex hull given by the point set. The hyperplane defines a way to split the set of points into two subsets. Recursively apply the same algorithm on each subset until there's only a small set of points left.</p>
<p>A much smarter way is this: <strong>sample two points from the set of points</strong>, compute the hyperplane equidistant to those points, and use this hyperplane to split the point set.</p>
<p>(I just described what happens for Euclidean distance. Angular is almost the same, just slightly simpler).</p>
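<p>For the Euclidean case, a sketch of the split (see the repo for the real implementation):</p>
<pre><code>import numpy as np

def two_point_split(points):
    # Sample two points and split by the hyperplane equidistant to them
    i, j = np.random.choice(len(points), size=2, replace=False)
    normal = points[j] - points[i]
    midpoint = (points[i] + points[j]) / 2
    offset = -np.dot(normal, midpoint)  # hyperplane: normal . x + offset = 0
    side = points.dot(normal) + offset
    return points[side < 0], points[side >= 0]
</code></pre>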
<p>Implementing this turns out to make index building <strong>4x faster</strong> for Euclidean distance. But more importantly, the <strong>search quality is substantially better</strong>, both for angular and Euclidean distance. The difference is particularly large for high dimensional spaces.</p>
<p>I put together <a href="https://github.com/spotify/annoy/blob/master/test/accuracy_test.py">a test</a> that measures precision for nearest neighbor search on the <a href="http://nlp.stanford.edu/projects/glove/">GloVe pretrained vectors</a> using some hardcoded values for various parameters (10 trees, 10 nearest neighbors). See below:</p>
<p><img src="https://erikbern.com/assets/2015/05/annoy-64-angular.png" alt="image"></p>
<p><img src="https://erikbern.com/assets/2015/05/annoy-64-euclidean.png" alt="image"></p>
<p> </p>
<p>This is pretty cool given that the <a href="https://github.com/spotify/annoy/commit/47267bf02cfca40e716956d858857bf6f822615d">commit is actually more red than green</a> – the new algorithm is a lot simpler and I could remove a lot of old stuff that was no longer needed.</p>
<p>The intuitive reason why this works so well is: consider what happens if you have 200 dimensions, but your data is really “mostly” located on a lower dimensional space of say 20 dimensions. Then Annoy will find splits that are more aligned with the distribution of your data. I suspect these cases are pretty common in high dimensional spaces.</p>
<p>I also fixed another <a href="https://github.com/spotify/annoy/commit/0c062cf1951f4539b49454fbeb8853e417a5d524">randomness issue</a> that looked pretty severe (although I think in practice it didn't cause any issues) and added a unit tests that runs the f=25 test shown above in the graphs.</p>
<p>There is a fresh 1.3.1 version out on <a href="https://pypi.python.org/pypi/annoy">PyPI</a> and <a href="https://github.com/spotify/annoy">Github</a> – get it while it's hot!</p>
Annoy – now without Boost dependencies and with Python 3 Support2015-05-03T00:00:00Zhttps://erikbern.com/2015/05/03/annoy-now-without-boost-dependencies-and-with-python-3-support.html<p><img src="https://erikbern.com/assets/2015/05/ann.png" alt="image"></p>
<p><a href="https://github.com/spotify/annoy">Annoy</a> is a C++/Python package I built for fast approximate nearest neighbor search in high dimensional spaces. Spotify uses it a lot to find similar items. First, matrix factorization gives a low dimensional representation of each item (artist/album/track/user) so that every item is a k-dimensional vector, where k is typically 40-100. This is then loaded into an Annoy index for a number of things: fast similar items, personal music recommendations, etc.</p>
<p>Annoy stands for <em>Approximate Nearest Neighbors something something</em> and was originally open sourced back in 2013, although it wasn't entirely well-supported until last year when I fixed a couple of <a href="https://github.com/spotify/annoy/issues/13">crucial bugs</a>. Subsequently, <a href="http://dirk.eddelbuettel.com/">Dirk Eddelbuettel</a> released <a href="https://github.com/eddelbuettel/rcppannoy">RCppAnnoy</a>, an R version of Annoy.</p>
<p>The key feature of Annoy is that it supports file-based indexes that can be <a href="http://en.wikipedia.org/wiki/Mmap">mmapped</a> very quickly. This makes it very easy to share indexes across multiple processes, load/save indexes, etc.</p>
<p>I built the original version of Annoy using <a href="http://www.boost.org/doc/libs/1_58_0/libs/python/doc/index.html">Boost Python</a> but a bunch of people have complained that it's pretty <a href="http://www.pyimagesearch.com/2015/04/27/installing-boost-and-boost-python-on-osx-with-homebrew/">hard to install</a>. Additionally, Boost Python doesn't support Python 3.</p>
<p>Last weekend I decided to fix it. I have something to confess. I've been meaning to address the Boost dependency for a long time, but never found the time to do it. Finally I just put up an ad on <a href="https://www.odesk.com/o/jobs/job/_~01db2e6798bfc36441/">Odesk</a> and outsourced the whole project. I found a <a href="https://github.com/dvenum">great developer</a> who built it all in a few hours.</p>
<p>It might seem ironic to outsource open source projects since I don't get paid for it. But I spend time working on open source projects because it gives me things back in many ways – networking, recognition, some sort of fuzzy altruistic feeling of having contributed. I don't mind spending a few bucks on it, the same way I don't mind spending time on it.</p>
<p>The result is that Annoy <a href="https://github.com/spotify/annoy/issues/43">doesn't depend on Boost</a> anymore, and now has <a href="https://github.com/spotify/annoy/issues/46">Python 3 support</a>. Grab the new version from Github/<a href="https://pypi.python.org/pypi/annoy">PyPI</a> and please let me know if you run into any issues!</p>
Ping the world2015-04-26T00:00:00Zhttps://erikbern.com/2015/04/26/ping-the-world.html<p>I just pinged a few million random IP addresses from my apartment in NYC. Here's the result:</p>
<p><img src="https://erikbern.com/assets/2015/04/nyc.png" alt="image"></p>
<p>Some notes:</p>
<ul>
<li>What's going on with Sweden? Too much torrenting?</li>
<li>Ireland is likewise super slow, but <em>not</em> Northern Ireland</li>
<li>Eastern Ukraine is also super slow, maybe not surprising given current events.</li>
<li>Toronto seems screwed too, as well as part of NH and western PA.</li>
<li>Russia has <em>fast</em> internet.</li>
</ul>
<p>The world:</p>
<p><img src="https://erikbern.com/assets/2015/04/world.png" alt="image"></p>
<p><strong>Some more notes on methodology</strong></p>
<ul>
<li>The source code <a href="https://github.com/erikbern/ping">is here</a>.</li>
<li>It's based on a 50 nearest neighbor average with the top 10% outliers removed.</li>
<li>Almost all random pings time out, so this is skewed towards the minority (<10%) of hosts that actually respond</li>
<li>Some gaps in the map are filled in too aggressively from neighbors, e.g. North Korea.</li>
<li>When computing nearest neighbors it's much easier if you convert everything to 3D vectors first (see the sketch after this list). I used <a href="https://github.com/spotify/annoy">Annoy</a> in 3D – it's a Python module (written by me) that does fast approximate nearest neighbor search using random projections.</li>
<li><a href="http://matplotlib.org/basemap/">Basemap</a> is kind of a pain in the ass to install and mess around with, but gives nice plots.</li>
<li>I was pinging using subprocess from Python and threads. Really want to give Go a shot on this as a way to learn it.</li>
</ul>
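<p>Here's a sketch of the 3D trick: project lat/long onto the unit sphere, so Euclidean nearest neighbors agree with great-circle neighbors. The coordinates below are made up; only the Annoy calls are real.</p>
<pre><code>import math
from annoy import AnnoyIndex

def to_3d(lat, lon):
    """Convert degrees lat/lon to a point on the unit sphere."""
    lat, lon = math.radians(lat), math.radians(lon)
    return [math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat)]

index = AnnoyIndex(3, "euclidean")
for i, (lat, lon) in enumerate([(59.3, 18.1), (40.7, -74.0), (51.5, -0.1)]):
    index.add_item(i, to_3d(lat, lon))
index.build(10)  # 10 trees
print(index.get_nns_by_vector(to_3d(40.8, -73.9), 2))  # 2 nearest hosts
</code></pre>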
Black Box Machine Learning in the Cloud2015-04-22T00:00:00Zhttps://erikbern.com/2015/04/22/black-box-machine-learning-in-the-cloud.html<p><img src="https://erikbern.com/assets/2015/04/black-cloud-4g9h.jpg" alt="image"></p>
<p>There's a bunch of companies working on machine learning as a service. Some old companies like <a href="https://cloud.google.com/prediction/docs">Google</a>, but now also <a href="http://aws.amazon.com/machine-learning/">Amazon</a> and <a href="http://azure.microsoft.com/en-us/services/machine-learning">Microsoft</a>.</p>
<p>Then there's a ton of startups: <a href="http://prediction.io/">PredictionIO</a> ($2.7M funding), <a href="https://bigml.com/">BigML</a> ($1.6M funding), <a href="http://www.clarifai.com/">Clarifai</a>, etc, etc. Here's a <a href="http://www.bloomberg.com/company/content/uploads/sites/2/2014/12/machine-learning-jpeg.jpg">nice map</a> from Bloomberg showing some of the landscape.</p>
<p>As much as I love ML, I'm not super bullish on these companies. I wrote a pretty cynical tweet the other day</p>
<p><img src="https://erikbern.com/assets/2015/04/Screen-Shot-2015-04-14-at-10.42.31-PM.png" alt="image"></p>
<p>Instead of dwelling on the negative, let's go through the ways I think a machine learning API can actually be useful (ok, full disclosure: I don't think there are very many).</p>
<p><strong>Does it solve an immediate business problem?</strong></p>
<p>In ML dreamtown, an engineer realizes one day: “Hey, I have all these feature vectors, and these target values, and a loss function, I just wish someone could approximate a function for me”.</p>
<p>In reality, ML problems are often super messy, and it can be pretty challenging to get from data into a regression/classification problem (or anything else). Model fitting isn't the issue, <em>getting to model fitting</em> is the hard part. Here's a bunch of real-world scenarios I worked on over the last few years:</p>
<ol>
<li>Ad targeting based on music consumption. This is a very ill-defined problem and we need to figure out what we really want to solve.</li>
<li>Predict churn. We can build this complex model that takes user features and predicts whether they are going to churn out. The resulting model is generally not that useful though – it doesn't give us any actionable insight.</li>
<li>Predict ad clickthrough rate. Yes – we can take historical data and train a binary classifier, but it suffers from a lot of issues (such as observation bias, feedback loops, etc).</li>
</ol>
<p><strong>Does it focus on a particular niche?</strong></p>
<p>It's a lot more likely you solve an immediate business need if you focus on a specific niche. Natural language processing for sales people? I don't know.</p>
<p>Focusing on a particular niche makes it easier to build something that works off the shelf. A general purpose outlier detector is not as useful as a model built to detect insurance fraud.</p>
<p><strong>Does it build on proprietary data sets?</strong></p>
<p>If you have amassed enormous troves of security footage, or aerial photography, or financial data, or whatever, then you can train a model that <em>no one else can train</em>. You can then sell this model for lots of money, because the cost of building up this data set is a huge barrier for anyone else.</p>
<p><strong>Is there secret sauce?</strong></p>
<p>Remember that in trying to build something secret you are up against about 10,000 machine learning researchers in academia who spend all their time trying to come up with new methods. It's true that lots of machine learning has a bit of a gap between academia and industry. But that gap is just things that are <em>hard and messy.</em> That's not a defensible asset in the long run.</p>
<p>Convolutional neural networks for instance. It's still pretty messy to get <a href="http://deeplearning.net/software/theano/">Theano</a> or <a href="http://torch.ch/">Torch</a> working – I know because I spent a lot of time reading papers and documentation to get a simple image classifier working. But the complexity of this is going to go down very quickly. In a year's time there will be open source libraries with pre-trained models for image classification that are on par with anything you can get through an API (probably better).</p>
<p><strong>Does it solve infrastructural issues?</strong></p>
<p>Scaling applications is still a hard problem. Similarly the use of GPU's in deep learning creates an artificial barrier for many companies who don't want to deal with Amazon instances etc – there is some value in abstracting this away from users.</p>
<p>The question is what companies have problems that require large scale machine learning that <em>don't</em> have problems that require scalability.</p>
<p><strong>Do you have smart people?</strong></p>
<p>I actually think the biggest upside in many of these companies is the possibility of acqui-hire. It's no secret that machine learning engineers are in high demand. So maybe the best strategy is to try to attack some super hard problem, ignore whether it's actually useful, and hire as many smart people as possible.</p>
It's called Berkson's paradox!2015-04-09T00:00:00Zhttps://erikbern.com/2015/04/09/its-called-berksons-paradox.html<p>As noted by <a href="https://twitter.com/davidandrzej/status/585940491927027712">multiple</a> <a href="https://twitter.com/JSEllenberg/status/585959375769972736">tweets</a>, my previous post describes a phenomenon denoted <a href="http://en.wikipedia.org/wiki/Berkson%27s_paradox">Berkson's paradox</a>.</p>
<p>Here's another example: <a href="http://www.slate.com/blogs/how_not_to_be_wrong/2014/06/03/berkson_s_fallacy_why_are_handsome_men_such_jerks.html">Why Are Handsome Men Such Jerks?</a></p>
Norvig's claim that programming competitions correlate negatively with being good on the job2015-04-07T00:00:00Zhttps://erikbern.com/2015/04/07/norvigs-claim-that-programming-competitions-correlate-negatively-with-being-good-on-the-job.html<p>I saw a bunch of tweets over the weekend about Peter Norvig <a href="http://www.catonmat.net/blog/programming-competitions-work-performance/">claiming there's a negative correlation</a> between being good at programming competitions and being good at the job. There were some decent <a href="https://news.ycombinator.com/item?id=9324209">Hacker News comments</a> on it.</p>
<p>Norvig's statement is obviously not true if we're drawing samples from the general population – most people can't code. It doesn't necessarily even have to do with time allocation as this commenter alluded to:</p>
<blockquote>
<p>Being a champion at something requires excruciating narrow focus on something for unusually long time. If you are getting GPA of 4.0 or play Rachmaninoff's Piano Concerto No 3 or deadlift 400 pounds or in top 1000 chess players – you probably have to work on it for hours a day for years while ignoring everything else (of course unless you are one of those 1 in million polymath).</p>
</blockquote>
<p>Here's the real reason: <em>Google is already selecting for the top 1% programmers using some criteria, leading to selection bias. Even if the two values are positively correlated, you might have a selection criterion that leads to a negative correlation.</em></p>
<p>But let's start with the ideal case. Let's say there's a slight positive correlation between “being good at programming competitions” and “what really matters”. Let's assume Google hires perfectly. Let's assume everyone is on a multivariate Gaussian:</p>
<p><img src="https://erikbern.com/assets/2015/04/recruiting_perfect.png" alt="image"></p>
<p>For all the people who were <em>hired</em>, I calculate the correlation between “Programming competition skills” and “What really matters”. The correlation for hired people is almost 0.2 and it's still positive!</p>
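<p>Here's a minimal sketch of this kind of simulation, with made-up numbers (population correlation 0.3, hiring the top 1% on a weighted interview score) rather than the exact parameters behind the plots:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
rho = 0.3  # assumed slight positive correlation in the population
skill, matters = rng.multivariate_normal(
    [0, 0], [[1, rho], [rho, 1]], size=100_000).T

def hired_corr(w):
    # hire the top 1% on a score that weights the two traits
    score = w * skill + (1 - w) * matters
    hired = score > np.quantile(score, 0.99)
    return np.corrcoef(skill[hired], matters[hired])[0, 1]

for w in [0.0, 0.5, 0.9]:
    print(w, hired_corr(w))
</code></pre>
<p>Even though the population correlation is positive, the correlation among the people you hired swings around completely depending on how the hiring score weights the two traits.</p>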
<p>However let's say Google for some reason puts <em>too much weight</em> on programming competitions during the interviews. We now get a negative correlation!</p>
<p><img src="https://erikbern.com/assets/2015/04/recruiting_bad.png" alt="image"></p>
<p>Does this mean it's bad to hire people who are good at programming competitions? No, it just means that we probably overweighted it during the hiring process. If we lower the weight a bit we get a positive correlation again:</p>
<p><img src="https://erikbern.com/assets/2015/04/recruiting_medium.png" alt="image"></p>
<p>But in general does it mean we should never look at programming competition skills? Actually the reality is a lot more complicated. Instead of observing what really matters, you observe some crappy proxy for it. And when all metrics are noisy, you should put some nonzero positive weight on any metric that correlates positively with your target. Just not too much!</p>
<p><img src="https://erikbern.com/assets/2015/04/recruiting_realistic.png" alt="image"></p>
<p>Sorry for spamming you with scatter plots, but it's in the name of statistics! My point here is that you can tweak these variables and <em>end up seeing correlations with pretty much any value</em>. So when you have these complex selection biases you need to be super careful about how to interpret the data. It's a great reminder that studies like <a href="http://www.thestreet.com/story/12328981/1/googles-project-oxygen-pumps-fresh-air-into-management.html">Project Oxygen</a> always need to be taken with a bucket of sea salt.</p>
<p>Are there other examples of selection biases leading to spurious correlations? Let me know!</p>
Pinterest open sources Pinball2015-03-14T00:00:00Zhttps://erikbern.com/2015/03/14/pinterest-open-sources-pinball.html<p><img src="https://erikbern.com/assets/2015/03/41Pz5ClQ46L._SY300_.jpg" alt="image"></p>
<p>Pinterest just open sourced <a href="https://github.com/pinterest/pinball">Pinball</a> which seems like an interesting <a href="https://github.com/spotify/luigi">Luigi</a> alternative. There's two blog posts: <a href="http://engineering.pinterest.com/post/74429563460/pinball-building-workflow-management">Pinball: Building workflow management</a> (from 2014) and <a href="http://engineering.pinterest.com/post/113376157699/open-sourcing-pinball">Open-sourcing Pinball</a> (from this week). The author has a comment in the <a href="https://news.ycombinator.com/item?id=9189196">comments thread</a> on Hacker News:</p>
<blockquote>
<p>Luigi was not available in public, when Pinball starts. So not sure the pros and cons between Pinball and Luigi.</p>
<p>When we build pinball, we aim to build a scalable and flexible workflow manager to satisfy the the following requirements (I just name a few here).</p>
<ol>
<li>easy system upgrade – when we fix bug or adding new features, there should be no interruption for current running workflow and jobs.</li>
<li>easy add/test workflow – end user can easily add new jobs and workflows into pinball system, without affecting other running jobs and workflows.</li>
<li>extensibility – a workflow manager should be easy to extended. As the company and business grows, there will be a lot new requirements and features needed. And also we love your contributions as well.</li>
<li>flexible workflow scheduling policy, easy failure handling.</li>
<li>We provide rich UI for you to easily manage your workflows – auto retry failed job, – you can retry failed job, can skip some job, can select a subset of jobs of a workflow to run (all from UI) – you can easily access all the running history of your job, and also get the stderr, stdout logs of your jobs – you can also explore the topology of your workflow, and also support easy search.</li>
<li>Pinball is very generic can support different kind platform, you can use different hadoop clusters,e.g., quoble cluster, emr cluster. You can write different kind of jobs, e.g., hadoop streaming, cascading, hive, pig, spark, python …</li>
</ol>
<p>There are a lot interesting things built in Pinball, and you probably want to have a try!</p>
</blockquote>
<p>Sounds pretty similar to Luigi! My initial impression is that</p>
<ul>
<li>The <a href="https://github.com/pinterest/pinball/blob/master/ARCHITECTURE.rst">architecture</a> is a bit more advanced than Luigi and has some features that Luigi lacks. From what I can tell, it comes with task storage out of the box (whereas Luigi's task history DB is still not entirely integrated), distributed execution, and a triggering mechanism. These are all areas where Luigi still needs some love</li>
<li>The workflow API seems <a href="https://github.com/pinterest/pinball/tree/master/tutorial/example_repo">very convoluted</a>. I don't really understand how the code works and there's a lot of boilerplate.</li>
</ul>
<p>Fun to have something to compare to. Not that I want to rationalize Luigi's missing features, but in general I would argue that the importance of good API design is underrated compared to good architecture. I still believe the key thing for a workflow manager is to reduce boilerplate and configuration at every point. It's slightly harder to create an easy-to-use API than to think hard about architecture and check all the boxes for every feature.</p>
<p><img src="https://erikbern.com/assets/2015/03/powering-interactive-data-analysis-at-pinterest-by-amazon-redshift-3-638.jpg" alt="image"></p>
<p>Hopefully we'll see more of these in the future. Obviously being Luigi's author, I think Luigi is an awesome tool. But I think it's 10% of what it could be, and diversity in this space is great for innovation. There's a lot of them now: <a href="http://oozie.apache.org/">Oozie</a>, <a href="http://azkaban.github.io/">Azkaban</a>, <a href="https://github.com/factual/drake">Drake</a>, <a href="https://github.com/pinterest/pinball">Pinball</a>, etc. Some people apparently use <a href="http://jenkins-ci.org/">Jenkins</a> for workflow management. A wildcard I encountered the other day is <a href="https://github.com/hammerlab/ketrew">Ketrew</a>. I wish I knew enough OCaml to understand what's going on!</p>
The relationship between commit size and commit message size2015-02-26T00:00:00Zhttps://erikbern.com/2015/02/26/the-relationship-between-commit-size-and-commit-message-size.html<p><img src="https://erikbern.com/assets/2015/02/Screen-Shot-2015-02-24-at-8.56.35-PM.png" alt="Screen Shot 2015-02-24 at 8.56.35 PM" width="585" height="241" class="alignnone size-full wp-image-1100" /></p>
<p>Wow I guess it was more than a year ago that I tweeted this. Crazy how time flies by. Anyway, here's my rationale:</p>
<ul>
<li>When I update one line of code I feel like I have to put in a long explanation about its side effects, why it's fully backwards compatible, and why it fixes some issue #xyz.</li>
<li>When I refactor 500 lines of code, I get too lazy to write anything sensible, so I just put “refactoring FooBarController”. Note: <em>don't try this at home!</em></li>
</ul>
<p>I decided to plot the relationship for <a href="https://github.com/spotify/luigi">Luigi</a>:
{% include 2015-02-26-the-relationship-between-commit-size-and-commit-message-size.html %}</p>
<p>The plot is clickable! Check it out! It's an old school <a href="http://en.wikipedia.org/wiki/Image_map#Pure_HTML">image map</a> which is pretty pathetic, since no one has used it since 1997, but it was just so much easier for this task. Hover over any point to see the commit message and click on it to jump to the commit on Github.</p>
<p>As you can see, there's essentially <em>no relationship</em> between the two values. Not as spectacular as I was hoping for, but still kind of weird/interesting.</p>
<p>Code <a href="https://gist.github.com/erikbern/0f347c8d789402a09f2e">is here</a> if you're curious!</p>
My favorite management failures2015-02-22T00:00:00Zhttps://erikbern.com/2015/02/22/my-favorite-management-failures.html<p>For most people straight out of school, work life is a bit of a culture shock. For me it was an awesome experience, but a lot of the constraints were different and I had to learn to optimize for different things. It wasn't necessarily the technology that I struggled with. The hardest part was how to manage my own projects and my time, as well as how to grow and make impact as an engineer. I've listed some of my biggest mistakes, which are also mistakes I see other (mostly junior) engineers make.</p>
<p><strong>Having the wrong scope</strong></p>
<p>How do you know what's the right amount of work to spend on a project? I had horrible intuition about this coming out of school. One thing I think is helpful is to think of the relationship between time spent and impact. For a given project, it looks something like this:</p>
<p><img src="https://erikbern.com/assets/2015/02/time-impact.png" alt="image"></p>
<p>It usually ends up being a <a href="http://en.wikipedia.org/wiki/Concave_function">concave</a> function.</p>
<p>How do you pick a point on this curve? If you only have one task then it's usually pretty easy because you have some constraint on total time or total impact. In school usually you work on some task until it hits a certain y value (problem is solved) or until it hits a certain x value (time to turn in what you have).</p>
<p>The problem is in real life you actually need to pick not just one point on one curve but a point on each of <em>many</em> curves. Actually an <em>infinite</em> number of curves. And you need to pick these points subject to the constraint that you get the maximum value per time invested.</p>
<p><img src="https://erikbern.com/assets/2015/02/time-impact-2-copy.png" alt="image"></p>
<p>This is a much harder problem! It means the amount of time we spend on task A is actually determined not just by how hard task A is but by how hard an <em>infinite</em> number of other tasks are.</p>
<p>Let's get mathematical here: for this concave optimization problem you can show that the <em>marginal impact of each task should be identical</em>. (I really want to write a book some day called <em>The Mathematics of Project Management</em>)</p>
<p><img src="https://erikbern.com/assets/2015/02/time-impact-31.png" alt="image"></p>
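<p>A sketch of why, under the usual assumptions: say you allocate time $$ t_i $$ to task $$ i $$ with concave impact $$ f_i(t_i) $$ , and you maximize $$ \sum_i f_i(t_i) $$ subject to a total time budget $$ \sum_i t_i = T $$ . Setting the derivative of the Lagrangian $$ \sum_i f_i(t_i) - \lambda \left( \sum_i t_i - T \right) $$ with respect to each $$ t_i $$ to zero gives $$ f_i'(t_i) = \lambda $$ for every task: at the optimum, the marginal impact per unit of time is the same everywhere.</p>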
<p>This means: recognize when the marginal impact of spending more time on a project starts to get low and you get more marginal impact elsewhere. Or just think: is this already good enough to deliver user value? Then take a break and look at the whole portfolio of possible tasks: ignoring what I have done so far, what's the highest impact next thing I can do?</p>
<p><strong>Focusing only on the things you are supposed to focus on</strong></p>
<p>This might sound weird. What are you supposed to do at work? Most of the time you should probably do what your team/manager told you to do. But guess what? Your team/manager is not an all-seeing all-knowing oracle. Sometimes you might actually have a better idea of what to do.</p>
<p>Your sole purpose of working somewhere is to <em>deliver value for the company</em>. Completing a task from the backlog is a great proxy for that. But it's still a proxy and as such has no intrinsic value. In many cases there might be even higher leverage things that no one will tell you to do. For instance, look around you. Is the team struggling with some old framework? Can you help someone get unblocked?</p>
<p>I like people to come in every morning and ask themselves: <em>what is the highest impact thing I can do for the company today?</em> And do that. If you think about it, a task backlog is a completely artificial construct needed because we don't have perfect information.</p>
<p>This gets especially important if you are interested in management roles. The higher up you get, the less people are going to tell you what to do.</p>
<p>Silly obligatory visualization:</p>
<p><img src="https://erikbern.com/assets/2015/02/things-to-do.png" alt="image"></p>
<p><strong>Focusing only on low-leverage activities</strong></p>
<p>There's only so much leverage you get by being an individual contributor, even if you're a 10x engineer. Look around you for things with a force multiplier built in. Usually that means applying something to the entire team. Are you using the wrong language for the job? Spend a few days investigating something else, introduce it to the team, and watch the whole team move twice as fast.</p>
<p><img src="https://erikbern.com/assets/2015/02/Screen-Shot-2015-02-22-at-1.32.23-PM.png" alt="image"></p>
<p>I used to work with <a href="https://twitter.com/sinisa_lyh">Neville Li</a> at Spotify who was a genius at finding these opportunities. He also never did what you told him to. Instead, he would spend days reading blogs and trying new technologies. Every few months he would find something that made the whole team twice as productive. Then he would organize a workshop, introduce it to the team, and move on.</p>
<p><strong>Not realizing technology isn't just a job</strong></p>
<p>This is probably my most cynical note, or optimistic, depending on how you look at it.</p>
<p>The truth is, software engineering isn't just a normal job. It's a lifestyle. It's also a field that keeps changing from year to year. If you want to be successful, you need to stay up to date. If you want to be above average, you need to do things like:</p>
<ul>
<li>Working on side projects</li>
<li>Reading tech blogs</li>
<li>Following influencers on Twitter</li>
<li>Going to meetups</li>
<li>Reading papers</li>
<li>Etc</li>
</ul>
<p>Being a software engineer is a fantastic career in many ways. With lots of freedom comes a lot of responsibility. If you want to stay fresh, you need to invest a fair amount of your spare time.</p>
<p><strong>Not drawing diagrams on glass walls</strong></p>
<p>This is a no-brainer. Everyone knows that solid software engineers draw everything on glass walls. And they also write everything flipped horizontally because it's cooler.</p>
<p><img src="https://erikbern.com/assets/2015/02/cloud.jpg" alt="image"></p>
<p><strong>Summary</strong></p>
<p>I love technology. Go write some kick ass code now.</p>
Leaving Spotify2015-02-11T00:00:00Zhttps://erikbern.com/2015/02/11/leaving-spotify.html<p>February 6 was my last day at Spotify. In total I spent more than six years at Spotify and it was an amazing experience.</p>
<p>I joined Spotify in Stockholm in 2008, mainly because a bunch of friends from programming competitions had joined already. Their goal to change music consumption seemed ridiculous at that point, but six years later I think it's safe to say they actually succeeded.</p>
<p>Back in the early days, my job was to do almost anything related to data. I think the range of tasks that I was responsible for is now spread across 100+ people at Spotify. My day to day work was all over the map: Hadoop maintenance, Powerpoint presentations, label reporting, A/B tests, ad delivery optimization and forecasting, music recommendations, and much more (for most of that time we were actually three people though, not just me).</p>
<p>It was an amazing learning experience to see a company grow this way. I think a company goes through different challenges at every stage, both technically and organizationally (honestly a lot more of the latter compared to the former).</p>
<p><img src="https://erikbern.com/assets/2015/02/2921097859_a593807084_b.jpg" alt="image"><em>Pushing the button, launching Spotify to the world (late 2008)</em></p>
<p><img src="https://erikbern.com/assets/2015/02/25763_335297926215_6279776_n.jpg" alt="image"><em>Figuring out the cable situation (2009)</em></p>
<p>I've been craving to go back and go through the same journey again, so I've joined a small startup in NYC as the head of engineering. I will share more details soon. Hopefully this time will be an opportunity to apply all those things I learned at Spotify.</p>
<p><a href="https://twitter.com/OskarStal">Oskar Stål</a>, the CTO of Spotify and a great mentor, would always tell me that I have to decide between machine learning and the “CTO ladder” at some point. I made a conscious decision right now to focus more on management and building teams. I think this might be the topic of some future blog post, but not now.</p>
<p>What's going to happen to my open source projects such as <a href="https://github.com/spotify/luigi">Luigi</a> and <a href="https://github.com/spotify/annoy">Annoy</a>? Nothing should change, except I will have a lot less time to spend on it.</p>
<p>Stay tuned for more updates!</p>
Scala Data Pipelines for Music Recommendations2015-01-13T00:00:00Zhttps://erikbern.com/2015/01/13/scala-data-pipelines-for-music-recommendations.html<p><a href="https://twitter.com/MrChrisJohnson">Chris Johnson</a>‘s presentation from <a href="http://datadaytexas.com/">Data Day Texas</a>:</p>
Everything I learned about technical debt2014-12-30T00:00:00Zhttps://erikbern.com/2014/12/30/everything-i-learned-about-technical-debt.html<p>I just made it to Sweden suffering from jet lag induced insomnia, but this blog post will not cover that. Instead, I will talk a little bit about <a href="http://en.wikipedia.org/wiki/Technical_debt">technical debt</a>.</p>
<p>The concept of technical debt always resonated with me, partly because I always like the analogy with “real” debt. If you take the analogy really far, there are some curious implications. I always like to think of the “interest rate” of software development. Debt is really just borrowing from the future, with some interest rate. You are getting a free lunch right now, but you need to pay back 1.2 free lunches in a few months. That's the interest rate. In a software project the equivalent could be to pick a database that will have scalability issues later, or to make all member variables of some class public. You are doing it because it makes it easier to do things <em>now</em> but you will have to pay the cost of that later.</p>
<p>A recent paper from Google stretches the analogy in its title: <a href="http://research.google.com/pubs/pub43146.html">Machine Learning: The High-Interest Credit Card of Technical Debt</a>. It focuses specifically on machine learning, but definitely read it if you are interested. A recent blog post challenges whether tech debt is really “debt” in the strict sense (you borrow a fixed amount and pay back slightly more) or if it has a more complicated structure: <a href="http://www.higherorderlogic.com/2010/07/bad-code-isnt-technical-debt-its-an-unhedged-call-option/">Bad code isn't Technical Debt, it's an Unhedged Call Option</a>.</p>
<p>I like the blog post because it brings up something I have noticed many times. A lot of developers have this intuitive aversion towards tech debt and always want to fix anything that's perceived as “hacky”. <em>FooBarController is a 1,000 line mayhem that no one understands, we need to refactor it!</em> But say FooBarController is a well separated component that you have no intention of ever modifying, then there's really no reason to fix it. It's almost always a waste of time to try to fix bad code or bad architecture unless you have at least some idea of why it helps you in the future.</p>
<p>So in some cases it makes sense not to fix technical debt. In other cases, it makes sense to take on tech debt deliberately. Back to the interest rate analogy: if the interest rate is lower than the return on investment, you <em>should borrow money from the bank.</em> It's fine to ship a product a year earlier with hacky code, if you make a lot of money, and hire a ton of developers to clean it up. The concept of interest rate applies both to financing and software engineering.</p>
<p>In my experience, the biggest issue isn't whether or not you take on technical debt. As long as you make a conscious decision to take on tech debt, and everyone agrees it's tech debt that you might need to fix later, you're in the clear. <strong>You will get problems if you build up technical debt without acknowledging it.</strong> I made a chart to make it clear:</p>
<table style="vertical-align: top;">
<tr>
<th></th>
<th></th>
<th colspan="2">Do you think you are taking on tech debt?</th>
</tr>
<tr>
<th></th>
<th></th>
<th>No</th>
<th>Yes</th>
</tr>
<tr>
<th rowspan="2">Are you taking on tech debt?</th>
<th>No</th>
<td>Ok, cool</td>
<td>Don't worry so much!</td>
</tr>
<tr>
<th>Yes</th>
<td><a href="/assets/2014/12/sopranos.png"><img src="/assets/2014/12/sopranos.png" alt="image"></a></td>
<td><a href="/assets/2014/12/bank-loan.png"><img src="/assets/2014/12/bank-loan.png" alt="image"></a></td>
</tr>
</table>
<p>The bottom left picture is <a href="http://en.wikipedia.org/wiki/Tony_Soprano">Tony Soprano</a> knocking on your door because he's here to collect the debt you owe him. What happened is, you saw this investment (real estate?) that you thought would appreciate 10% year on year. You borrowed money from Tony, but you never realized you might have to pay it back. It turns out the interest rate was a lot more hefty than you thought, and now he wants it back a year later with 50% interest.</p>
<p>The bottom right picture is you going to the bank because you want to buy real estate. You examine the interest rates and make a decision to get a mortgage.</p>
<p>These pictures might not illustrate the point super well, because the bottom right also covers this situation: <strong>borrowing at a high interest rate because the return on investment is even higher</strong>. Maybe you know of this boxing match that's already rigged, and it's 5:1 odds. You won't be able to borrow money from the bank, so you go to Tony Soprano and borrow it for a few days. Next week, you pay it back with some interest, but you still made a ton of money.</p>
<p>Back to software engineering. The example above is like shipping the v2.0 of your web shop on time, and it turns out to be much better for users. You sell twice as much now! But you also have a bunch of scripts you have to run manually every day. You clearly should automate those scripts later, and it might be <em>really</em> messy to do so, but it's also clear that <em>you can do that later</em>. You made a deliberate decision to borrow some resources from the future, because the return on your investment was really high.</p>
I already found the best gifs2014-12-28T00:00:00Zhttps://erikbern.com/2014/12/28/i-already-found-the-best-gifs.html<p>Just search for “<a href="https://www.google.com/search?q=hackers+gif&safe=off&espv=2&biw=1289&bih=706&source=lnms&tbm=isch&sa=X&ei=_FSfVJfWKYuVNvJK&ved=0CAgQ_AUoAQ">hackers gif</a>“.</p>
<p><img src="https://erikbern.com/assets/2014/12/hackers-the-plague-1.gif" alt="image"></p>
<p><img src="https://erikbern.com/assets/2014/12/hackers-gif-preparing-to-hack.gif" alt="image"></p>
<p><img src="https://erikbern.com/assets/2014/12/hackers-mathew-lillard.gif" alt="image"></p>
<p>There you go. Fun for your work emails for the next 500 years. From the awesome movie <a href="http://www.imdb.com/title/tt0113243/">Hackers</a>. That movie together with <a href="http://www.imdb.com/title/tt0080120/">The Warriors</a> convinced me that I wanted to live in NYC when I was like… 14 years old.</p>
A brief history of Hadoop at Spotify2014-12-20T00:00:00Zhttps://erikbern.com/2014/12/20/a-brief-history-of-hadoop-at-spotify-2008-2009.html<p>I was talking with some data engineers at Spotify and had a moment of nostalgia.</p>
<p><strong>2008</strong></p>
<p>I was writing my master's thesis at Spotify and had to run a Hadoop job to extract some data from the logs. Every time I started running the job, I kept hearing this subtle noise. I kept noticing the correlation for a few days but I was too intimidated to ask. Finally people started cursing that their machines had gotten really slow lately and I realized <em>we were running Hadoop on the developer's desktop machines</em>. No one had told me. I think back then we had only GB's of log data. I remember running <em>less</em> on the log and I would recognize half the usernames because they were my friends.</p>
<p><strong>2009</strong></p>
<p>We took a bunch of machines and put them on a pallet in the foosball room. It was a super hot Swedish summer and I kept running this matrix factorization job in Hadoop that would fail halfway through. The node on the top of the pile would crash and you had to reboot it. I suspected overheating. We had a fan running in the room but it wasn't helping. Finally I realized the problem was the sun was shining in through the window.</p>
<p><img src="https://erikbern.com/assets/2014/12/Old_Hadoop_cluster1.jpeg" alt="image"><em><a href="https://twitter.com/jooon">Jon Åslund</a> with our Hadoop cluster</em></p>
<p>I found a big sheet or blanket and some nails and a hammer and put it up over the window. I was finally able to run my matrix factorization job to completion after doing this. This is probably going to be my favorite bug fix until the day I die.</p>
<p>In the summer of 2009, we installed a 30-node Hadoop cluster in our data center in Stockholm. Finally a “real” cluster.</p>
<p><strong>2011</strong></p>
<p>More and more people started using Hadoop so we decided to move to Elastic Mapreduce. I uploaded all our logs to S3 and we put together some tooling so that you could run things on our own Hadoop cluster or on EC2 using the same source code. It was pretty beautiful but the performance wasn't super great compared to how much we were paying.</p>
<p>Later in 2011 we had grown even more. We decided to move back to our own data center. We installed 500 nodes in our data center in London, later upgrading it to 700 and then 900 nodes.</p>
<p><img src="https://erikbern.com/assets/2014/12/droppedImage1.png" alt="image"><em>Our fifth Hadoop cluster</em></p>
<p>I also implemented Luigi as a workflow engine with Mapreduce support in late 2011.</p>
<p><strong>2012</strong></p>
<p>There was this long-standing assumption (at least I had it) that Hadoop jobs were I/O bound and thus the language didn't matter. We were using Python for probably 95% of all jobs, with some stuff in Hive by the analytics team. During 2012 and forward, we started realizing Python isn't the ideal language, both from a performance and usability point of view. Eventually we would end up switching to <a href="https://crunch.apache.org/">Crunch</a> and <a href="https://github.com/twitter/scalding">Scalding</a>. We still use Luigi as the workflow engine to glue everything together.</p>
<p>This is a super simplified history of everything that took place. <a href="https://twitter.com/L_phant">Josh Baer</a> and Rafal Wojdyla are talking about the <a href="http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38595">Evolution of Hadoop at Spotify</a> at Strata in February for the rest of the story!</p>
Luigi Presentation @ NYC Data Science, Dec 16, 20142014-12-17T00:00:00Zhttps://erikbern.com/2014/12/17/luigi-presentation-nyc-data-science-dec-16-2014.html<p>More Luigi presentations!</p>
Luigi talk tomorrow2014-12-16T00:00:00Zhttps://erikbern.com/2014/12/16/luigi-talk-tomorrow.html<p>At <a href="http://www.meetup.com/NYC-Data-Science/events/218604422/">NYC Data Science meetup</a>! Unfortunately the space is full but the talk will be livestreamed – check out the meetup web page for a link tomorrow.</p>
Deep learning for… Go2014-12-11T00:00:00Zhttps://erikbern.com/2014/12/11/deep-learning-for-go.html<p>This is the last post about deep learning for chess/go/whatever. But <a href="http://arxiv.org/abs/1412.3409">this really cool paper</a> by Christopher Clark and Amos Storkey was forwarded to me by <a href="https://twitter.com/meickenberg">Michael Eickenberg</a>. It's about using convolutional neural networks to play Go. The authors of the paper do a much better job than I would ever have done of modeling move prediction in Go and show that their model beat certain Go engines.</p>
<p>The fascinating thing about this paper is that playing against other Go engines, they just plug in their move prediction function, with no deep search beyond one level. That means the total time it spends is a fraction of its opponents'. Still, the fact that it plays so well speaks for its strength.</p>
<p>So what would happen if we plugged this into a deep search framework? The authors suggest doing exactly that in the conclusion. State of the art Go engines actually use <a href="http://en.wikipedia.org/wiki/Monte_Carlo_tree_search">Monte Carlo tree search</a> rather than <a href="http://en.wikipedia.org/wiki/Minimax">minimax</a> but other than that, it's the same principle.</p>
<p>I talked a bit with the authors and the main thing that you have to change is to switch from move prediction to an evaluation function. For my chess experiments, I found a (hacky) way to train a function that does both at the same time. There's essentially two terms in my objective function: one is comparing the actual move with a random move, using a sigmoid:</p>
<p>$$ \frac{P(q)}{P(q) + P(r)} = \frac{\exp(f(q))}{\exp(f(q)) + \exp(f(r))} $$ .</p>
<p>If you extend that to <em>all possible random moves</em> you actually get a full probability distribution (a softmax) over all possible next moves.</p>
<p>$$ P(p \rightarrow q) = \frac{\exp(f(q))}{\sum_r \exp(f(r))} $$ .</p>
<p>Now, how do you “convert” that into an evaluation function? That's the second term, which tries to fit the negative parent score to the current score. We penalize the quantity $$ f(p) + f(q) $$ by throwing in two more sigmoids. It's a “soft constraint” that has absolutely no probabilistic interpretation. This is a hacky solution, but here's how I justify it:</p>
<ol>
<li>Note that the evaluation functions are unique up to a monotonic transform, so we can actually mangle it quite a lot.</li>
<li>The softmax distribution has one degree of freedom in how it chooses the quantities, so (I'm speculating) the artificial constraint does not change the probabilities.</li>
</ol>
<p>I think you could do the exact same thing with their Go engine. In fact I'm willing to bet a couple of hundred bucks that if you did that, you would end up with the best Go engine in the world.</p>
<p>Btw another fun thing was that they plot some of the filters and they seem as random as the ones I learned for chess. But a clever trick enforcing symmetry seems to help the model quite a lot.</p>
<p><img src="https://erikbern.com/assets/2014/12/Screen-Shot-2014-12-11-at-5.11.32-PM.png" alt="image"></p>
Deep learning for… chess (addendum)2014-12-08T00:00:00Zhttps://erikbern.com/2014/12/08/deep-learning-for-chess-addendum.html<p>My previous blog post about deep learning for chess blew up and made it to Hacker News and a couple of other places. One pretty amazing thing was that the <a href="https://github.com/erikbern/deep-pink">Github repo</a> got 150 stars overnight. There were also lots of <a href="https://news.ycombinator.com/item?id=8685840">comments</a> on the Hacker News post that I thought were really interesting. (See this skeptical <a href="https://news.ycombinator.com/item?id=8687273">comment</a> for instance).</p>
<p>A couple of things came up in several places. I actually fully agree with a lot of the skepticism my blog post got. Here's a bit of clarification, plus some other stuff.</p>
<p><strong>My assumption that amateur players make near-optimal moves</strong></p>
<p>Let me retract that statement a bit. But just a little bit. There's several ideas here. The first one is that if 1,000 amateur chess players could vote for the next move, that move is probably pretty strong. There's some anecdotal evidence suggesting that a large group of amateurs voting together actually plays quite well, e.g. <a href="http://en.wikipedia.org/wiki/Kasparov_versus_the_World">Kasparov vs the World</a>. The cool thing is that training this machine learning model, it will actually <em>learn to pick the move that corresponds to what “most” players would choose.</em> (You can actually see that the probability distribution over all next valid moves is given by a <a href="http://en.wikipedia.org/wiki/Softmax_function">softmax distribution</a> where the $$ z $$ values are given by the evaluation function).</p>
<p>The second idea is that a lot of moves are pretty obvious, because you are forced to do something. The third thing is that almost any move is good compared to a <em>random</em> move.</p>
<p>I think in hindsight it's probably not correct that most moves by amateur players are “near-optimal”, but I don't think it matters for the model.</p>
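<p>To make the softmax point above concrete, here's a minimal sketch (mine, not from the repo) of turning the evaluations of all legal next positions into a distribution over moves:</p>
<pre><code>import numpy as np

def move_distribution(scores):
    """Softmax over f(q) for every legal next position q."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()        # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

print(move_distribution([1.2, 0.3, -0.5]))  # made-up evaluation scores
</code></pre>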
<p><strong>What does each layer show if you look at it?</strong></p>
<p>I looked at it, but it's pretty much all over the place. Unlike convolutional neural networks, where the first layer often represents edges, there is nothing like that in this network. It seems like the logic is encoded throughout the whole network. Here's the first few coefficients of the first feature (out of the 2048 features in the first layer), ranked in decreasing order of magnitude:</p>
<table>
<tr><td>0.0856</td><td>q @ e7</td></tr>
<tr><td>-0.0686</td><td>P @ f7</td></tr>
<tr><td>0.0658</td><td>q @ f6</td></tr>
<tr><td>0.0657</td><td>P @ d3</td></tr>
<tr><td>-0.0655</td><td>r @ c6</td></tr>
<tr><td>0.0650</td><td>N @ e4</td></tr>
<tr><td>0.0648</td><td>P @ d6</td></tr>
<tr><td>-0.0625</td><td>r @ e6</td></tr>
<tr><td>0.0625</td><td>q @ d6</td></tr>
<tr><td>0.0588</td><td>p @ d7</td></tr>
</table>
<p>White pieces are upper case, black are lower case. I don't see much going on here.</p>
<p><strong>There is actually at least one paper about using deep neural networks for Go</strong></p>
<p>Ilya Sutskever and Vinod Nair wrote <a href="http://www.cs.utoronto.ca/~ilya/pubs/2008/go_paper.pdf">this paper</a> in 2008. It even uses convolutional neural networks. It only has about 10k parameters (compared to 10M in my model) but it does something very similar to what I did: it tries to predict the next move of an expert player. I'm not sure why they didn't evaluate playing with it though. I would guess it probably needs a lot more parameters to play well.</p>
Deep learning for... chess2014-11-29T00:00:00Zhttps://erikbern.com/2014/11/29/deep-learning-for-chess.html<p>I've been meaning to learn <a href="http://deeplearning.net/software/theano/">Theano</a> for a while and I've also wanted to build a chess AI at some point. So why not combine the two? That's what I thought, and I ended up spending way too much time on it. I actually built most of this back in September but not until Thanksgiving did I have the time to write a blog post about it.</p>
<p><strong>What's the theory?</strong></p>
<p>Chess is a game with a finite number of states, meaning if you had infinite computing capacity, you could actually <a href="http://en.wikipedia.org/wiki/Solving_chess">solve chess</a>. Every position in chess is either a win for white, a win for black, or a forced draw for both players. We can denote this by the function $$ f(\mbox{position}) $$ . If we had an infinitely fast machine we could compute this by</p>
<ol>
<li>Assign all the final positions the value $$ {-1, 0, 1} $$ depending on who wins.</li>
<li>Use the recursive rule</li>
</ol>
<p>$$ f(p) = \max_{p \rightarrow p'} -f(p') $$</p>
<p>where $$ p \rightarrow p' $$ denotes all the legal moves from position $$ p $$ . The minus sign is because the players alternate between positions, so if position $$ p $$ is white's turn, then position $$ p' $$ is black's turn (and vice versa). This is the same thing as <a href="http://en.wikipedia.org/wiki/Minimax">minimax</a>.</p>
<p>There's approximately <a href="http://en.wikipedia.org/wiki/Shannon_number">$$ 10^{43} $$ positions</a>, so there's no way we can compute this. We need to resort to approximations to $$ f(p) $$ .</p>
<p><strong>What's the point of using machine learning for this?</strong></p>
<p>What machine learning really boils down to is approximating functions given data. So assuming we can get a lot of data to learn this from, we can learn this function $$ f(p) $$ . Once we have a model, an objective, and training data, we can go knock ourselves out.</p>
<p>I downloaded 100M games from <a href="http://www.ficsgames.org/download.html">FICS Games Database</a> and began training a machine learning model. My function $$ f(p) $$ is learned from data by using two principles</p>
<ol>
<li>Players will choose an optimal or near-optimal move. This means that for two positions in succession $$ p \rightarrow q $$ observed in the game, we will have $$ f(p) = -f(q) $$ .</li>
<li>For the same reason above, going from $$ p $$ , not to $$ q $$ , but to a <em>random</em> position $$ p \rightarrow r $$ , we must have $$ f(r) > f(q) $$ because the random position is better for the next player and worse for the player that made the move.</li>
</ol>
<p><strong>The model</strong></p>
<p>We construct $$ f(p) $$ as a 3 layer deep 2048 units wide artificial neural network, with rectified linear units in each layer. The input is an 8 * 8 * 12 = 768 wide layer which indicates whether each piece (there are 12 types) is present in each square (there are 8 * 8 squares). After three matrix multiplications (each followed by a nonlinearity), there's a final dot product with a 2048-wide vector to condense it down to a single value.</p>
<p><img src="https://erikbern.com/assets/2014/11/chess-architecture-1.png" alt="image"></p>
<p>In total there's roughly 10M unknown parameters in the network.</p>
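<p>For concreteness, here's the forward pass as a plain numpy sketch (the real model was trained with Theano; the weight names are mine):</p>
<pre><code>import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def f(board, params):
    """`board` is the 768-dim 0/1 vector (12 piece types x 64 squares),
    `params` the learned weights."""
    (W1, b1), (W2, b2), (W3, b3), v = params
    h = relu(W1 @ board + b1)  # (2048, 768) @ (768,) -> (2048,)
    h = relu(W2 @ h + b2)      # (2048, 2048) @ (2048,) -> (2048,)
    h = relu(W3 @ h + b3)      # same again
    return float(v @ h)        # final dot product -> a single value
</code></pre>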
<p>To train the network, I present it with $$ (p, q, r) $$ triplets. I feed it through the network. Denoting by $$ S(x) = 1 / (1 + \exp(-x)) $$ the sigmoid function, the total objective is:</p>
<p>$$ \sum_{(p, q, r)} \log S(f(q) - f(r)) + \kappa \log S(f(p) + f(q)) + \kappa \log S(-f(q) - f(p)) $$</p>
<p>This is the log likelihood of the “soft” inequalities $$ f(r) > f(q) $$ , $$ f(p) > -f(q) $$ , and $$ f(p) < -f(q) $$ . The last two are just a way of expressing a “soft” equality $$ f(p) = -f(q) $$ . I also use $$ \kappa $$ to put more emphasis on getting the equality right. I set it to 10.0. I don't think the solution is super sensitive to the value of $$ \kappa $$ .</p>
<p><img src="https://erikbern.com/assets/2014/11/chess-architecture.png" alt="image"></p>
<p>Notice that the function we learn <em>has no idea about the rules of chess.</em> We're not even teaching it how each piece moves. We make sure the model has the expressiveness to work out legal moves, but we don't encode any information about the game itself. The model learns this information by observing tons of chess games.</p>
<p>Note that I'm also not trying to learn anything from <em>who won the game.</em> The reason is that the training data is full of games played by amateurs. If a grandmaster came into the middle of a game, s/he could probably completely turn it around. This means the final score is a pretty weak label. Still, even an amateur player probably makes near-optimal moves most of the time.</p>
<p><strong>Training the model</strong></p>
<p>I rented a GPU instance from AWS and trained it on 100M games for about four days using stochastic gradient descent with Nesterov momentum. I put all (p, q, r) triplets into a <a href="http://www.h5py.org/">HDF5 data file</a>. I messed around with learning rates for a while but eventually realized I just wanted something that would give me good results in a few days. So I ended up using a slightly unorthodox learning rate scheme: $$ 0.03 \cdot \exp(-\mbox{time in days}) $$ . Since I had so much training data, regularization wasn't necessary, so I wasn't using either dropout or L2 regularization.</p>
<p>A trick I did was to encode the boards as 64 bytes and then transform the board into a 768 units wide float vector on the GPU. This gave a pretty substantial performance boost since there's a lot less I/O.</p>
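<p>A sketch of that expansion (the byte encoding is my guess – 0 for an empty square, 1–12 for the piece types – and the real version ran on the GPU):</p>
<pre><code>import numpy as np

def expand(boards_u8):
    """boards_u8: (n, 64) uint8, one byte per square.
    Returns the (n, 768) float32 one-hot encoding."""
    n = boards_u8.shape[0]
    out = np.zeros((n, 64, 12), dtype=np.float32)
    rows, cols = np.nonzero(boards_u8)             # occupied squares
    out[rows, cols, boards_u8[rows, cols] - 1] = 1.0
    return out.reshape(n, 768)
</code></pre>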
<p><strong>How does a chess AI work?</strong></p>
<p>Every chess AI starts with some function $$ f(p) $$ that approximates the value of the position. This is known as <a href="http://en.wikipedia.org/wiki/Evaluation_function">evaluation function</a>.</p>
<p>This function is also combined with a deep search of many millions of positions down the game tree. It turns out that an approximation of $$ f(p) $$ is just a small part of the playing chess well. All chess AI's focus on smart search algorithms, but the number of positions explode exponentially down the search tree, so in practice you can't go deeper than say 5-10 positions ahead. What you do is you use some approximation to evaluate leaf nodes and then use some variety of <a href="http://en.wikipedia.org/wiki/Negamax">negamax</a> to evaluate a game tree of a bunch of possible next moves.</p>
<p>By applying some smart searching algorithm, we can take pretty much any approximation and make it better. Chess AI's typically start with some simple evaluation function like: every pawn is worth 1 point, every knight is worth 3 points, etc.</p>
<p>We're going to take the function we learned and use it to evaluate leaves in the game tree. Then try to search deep. So we're first going to learn the function $$ f(p) $$ from data, then we're going to plug it into a search algorithm.</p>
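<p>For reference, here's a minimal sketch of negamax with alpha-beta pruning (not Deep Pink's actual code – move generation and evaluation are left abstract):</p>
<pre><code>def negamax(pos, depth, alpha, beta, f, children):
    """`f` scores a position from the perspective of the player to move;
    `children(pos)` yields the positions reachable in one move."""
    moves = list(children(pos))
    if depth == 0 or not moves:
        return f(pos)
    best = float("-inf")
    for q in moves:
        best = max(best, -negamax(q, depth - 1, -beta, -alpha, f, children))
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # the opponent would never allow this line
    return best
</code></pre>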
<p><strong>Does it work?</strong></p>
<p>I named my chess engine <em>Deep Pink</em> as an homage to <a href="http://en.wikipedia.org/wiki/Deep_Blue_%28chess_computer%29">Deep Blue</a>. As it turns out, the function we learn can definitely play chess. It beats me, every time. But I'm a horrible chess player.</p>
<p>Does Deep Pink beat existing chess AI's? <strong>Sometimes</strong></p>
<p>I pit it against another chess engine: <a href="https://github.com/thomasahle/sunfish">Sunfish</a> by Thomas Dybdahl Ahle. Sunfish is written entirely in Python. The reason I chose to stick to the same language was that I didn't want this to be an endless exercise of making fast move generation. Deep Pink also relies heavily on quick move generation, and I didn't want to spend weeks working out edge cases with bitmaps in C++ to be able to compete with the state of the art engines. That would just be an arms race. So to be able to establish something useful, I picked a pure Python engine.</p>
<p>The obvious thing in hindsight is: the main thing you want out of any evaluation function $$ f(p) $$ isn't accuracy, it's <strong>accuracy per time unit</strong>. It doesn't matter that one evaluation function is slightly better than another if it's ten times slower, because you can take the fast (but slightly worse) evaluation function and search more nodes in the game tree. So you really want to take into account the time spent by the engine. Without further ado, here's some results of playing against the engine many times:</p>
<p><img src="https://erikbern.com/assets/2014/11/scatter-plot1.png" alt="image"></p>
<p>Notice the log-scale. The x-axis and y-axis aren't super relevant here, the main thing is the distance to the diagonal, because that tells us which engine spent more CPU time. Every game I randomized the parameters for each engine: the max depth for Deep Pink, and the max number of nodes for Sunfish. (I didn't include draws because both engines struggle with it).</p>
<p>Not surprisingly, the more time advantage either side has, the better it plays. <strong>Overall, Sunfish is better, winning the majority of the games, but Deep Pink probably wins 1/3 of the time.</strong> I'm actually pretty encouraged by this. I think with some optimizations, Deep Pink could actually play substantially better:</p>
<ul>
<li>Better search algorithm. I'm currently using <a href="http://en.wikipedia.org/wiki/Negamax#NegaMax_with_Alpha_Beta_Pruning">Negamax with alpha-beta pruning</a>, whereas Sunfish uses <a href="http://en.wikipedia.org/wiki/MTD-f">MTD-f</a></li>
<li>Better evaluation function. Deep Pink plays pretty aggressively, but makes a lot of dumb mistakes. By generating “harder” training examples (ideally fed from mistakes it made) it should learn a better model</li>
<li>Faster evaluation function: It might be possible to train a smaller (but maybe deeper) version of the same neural network</li>
<li>Faster evaluation function: I didn't use the GPU for playing, only for training.</li>
</ul>
<p>Obviously the real goal wouldn't be to beat Sunfish, but one of the “real” chess engines out there. But for that, I would have to write carefully tuned C++ code, and I'm not sure it's the best way to spend my time.</p>
<p><strong>Summary</strong></p>
<p>I'm encouraged by this. I think it's really cool that</p>
<ol>
<li>It's possible to learn an evaluation function directly from raw data, with no preprocessing</li>
<li>A fairly slow evaluation function (several orders of magnitude slower) can still play well if it's more accurate</li>
</ol>
<div>
I'm pretty curious to see if this could fare well for <a href="http://en.wikipedia.org/wiki/Computer_Go">Go</a> or other games where AI's still don't play well. Either way, the conclusions above come with a million caveats. The biggest one is obviously that I haven't challenged a “real” chess engine. I'm not sure if I have the time to start hacking on chess engines, but if anyone is interested, I've <a href="https://github.com/erikbern/deep-pink">put all the source code up on Github</a>.
</div>
Optimizing things: everything is a proxy for a proxy for a proxy2014-11-22T00:00:00Zhttps://erikbern.com/2014/11/22/optimizing-things-everything-is-a-proxy-for-a-proxy-for-a-proxy.html<p>Say you build a machine learning model, like a movie recommender system. You need to optimize for something. You have 1-5 stars as ratings so let's optimize for mean squared error. Great.</p>
<p>Then let's say you build a new model. It has an even lower mean squared error, so you deploy it. You roll it out to users, and the metrics are tanking. Crap! OK, so maybe mean squared error isn't the right thing to optimize for.</p>
<p>The way you solve this, of course, is you start A/B testing your changes. But what metric to choose? People often ask me why we use one or another metric. We typically look at numbers like <em>Daily active users</em>, <em>Day 2 retention</em>, etc. But what if optimizing too hard for one hurts the other? What if you're driving day 2 retention but screwing up month 2 retention?</p>
<p>What I like to remind myself is that <strong>everything is a proxy metric</strong>. We really want to <a href="http://en.wikipedia.org/wiki/Shareholder_value#Maximizing_shareholder_value">maximize shareholder value</a> or something similar (let's not get into a debate about this; for the purpose of this blog post I'm going to assume that's our goal).</p>
<p>The problem is, you can't take all your hyperparameters and calculate the gradient</p>
<p>$$ \frac{\partial}{\partial \Theta} \mbox{shareholder value}(\Theta) $$</p>
<p>That's silly for many reasons, but let's break it down. First of all, the functional relationship is highly stochastic, and depends on all kinds of external factors. Second of all, there's no way we can even evaluate multiple values of this function, and it won't be possible until we invent time machines. Third of all, there's no way we can extract gradient information <em>at all.</em></p>
<p>So what are we going to do? We're going to invent a new function that we think is <strong>highly correlated</strong> with shareholder value. We could even define multiple functions if we want to, just to cross check different ones. But we will <em>never be able to establish the correlation</em> because of the reasons I just mentioned.</p>
<p>So that's usually why metrics like daily active users are great. If you are trying to grow your company, it's reasonable to assume that ultimately user growth will lead to success.</p>
<p>But in general, what properties should such a function have? Ideally, as many of these as possible:</p>
<ol>
<li>Should be highly correlated with shareholder value. For a company focusing on growth, the number of active users is probably a good one.</li>
<li>Should be possible to measure separate independent outcomes, using A/B tests, blind tests, or something else. For instance, number of signups is tricky to test.</li>
<li>Should be fast to measure. We don't want to launch an A/B test and have to wait many months to get an answer.</li>
<li>Should have a high signal-to-noise ratio. You want to extract as much value from it as possible. If you're running an A/B test, you want to reach statistical significance quickly.</li>
</ol>
<p>One thing I've learned the hard way is that it's often useful to pick a more biased metric if that means you can get results faster, or get more of them. For instance, we can roll out an A/B test with two different recommendation algorithms. We probably won't see an impact on high-level metrics such as retention, so we can pick a feature-specific metric instead.</p>
<p>But we can go even further. Still, if we're A/B testing, we probably have to run that test for two weeks to get any meaningful numbers. At the end of the day we obtain <strong>1 bit of information</strong> (roughly) at best after two weeks, which tells us which test group won.</p>
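<p>To see why two weeks is realistic, here's a back-of-the-envelope sample size calculation using the standard normal-approximation formula for comparing two proportions (the conversion numbers here are made up for illustration):</p>
<pre>from scipy.stats import norm

def sample_size(p, lift, alpha=0.05, power=0.8):
    # approximate users needed per group to detect p -> p + lift
    q = p + lift
    z_a = norm.ppf(1 - alpha / 2)  # 1.96 for a two-sided 5% test
    z_b = norm.ppf(power)          # 0.84 for 80% power
    return (z_a + z_b) ** 2 * (p * (1 - p) + q * (1 - q)) / lift ** 2

# detecting a 5.0% -> 5.5% conversion lift takes ~31k users per group
print(sample_size(0.05, 0.005))</pre>
<p>Depending on your traffic, numbers like that can easily translate into a week or two of waiting.</p>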
<p>If we want to iterate quickly, sometimes it makes a lot more sense to just take the output of the recommendation algorithms and let people go through a blind test. This is slightly more biased because we're not using real users, but it's usually a lot quicker, and it also lets us extract a <em>lot more information</em>. Not only do we learn which algorithm is better, we often end up with lots of (noisy) anecdotal information, such as “algorithm A sucks for popular content” or “algorithm B is less diverse”.</p>
<p>At the lowest level, if you have a recommender algorithm, the model's objective (eg. mean squared error) is another great proxy. It turns out it's extremely easy to try multiple parameters and learn from them – all you need to do is retrain the model. It totally dominates points 2, 3 and 4, but doesn't do a great job on point 1.</p>
<p><strong>TL;DR any metric you're using is just a proxy. Pick the right one for your task.</strong></p>
Luigi conquering the world2014-11-15T00:00:00Zhttps://erikbern.com/2014/11/15/luigi-spreading-to-the-west-coast.html<p>I keep forgetting to buy a costume for Halloween every year, so this year I prepared and got myself a Luigi costume a month in advance. Only to realize I was going to be out of town the whole weekend. If anyone wants a Luigi costume, let me know!</p>
<p><img src="https://erikbern.com/assets/2014/11/Screen-Shot-2014-11-15-at-11.34.19-AM.png" alt="image"><em>(I'm not as big as the guy in the picture)</em></p>
<p>Anyway, that's not the Luigi this blog post is about. This is about the <a href="https://github.com/spotify/luigi">Python workflow manager</a> that I've open sourced. If you're putting together batch jobs into complex pipelines, you should check it out.</p>
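<p>For readers who haven't seen Luigi: a pipeline is just Python classes that declare what they depend on and what they produce. Here's a minimal, hypothetical two-task example (task names and file paths invented for illustration):</p>
<pre>import luigi

class DownloadLogs(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget('logs_%s.tsv' % self.date)

    def run(self):
        with self.output().open('w') as f:
            f.write('...')  # fetch the day's logs here

class CountEvents(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return DownloadLogs(self.date)

    def output(self):
        return luigi.LocalTarget('counts_%s.tsv' % self.date)

    def run(self):
        # self.input() is the output target of DownloadLogs
        with self.input().open('r') as fin, self.output().open('w') as fout:
            fout.write(str(sum(1 for _ in fin)))

if __name__ == '__main__':
    luigi.run()</pre>
<p>Luigi figures out which tasks are already done (their output exists) and runs only what's missing.</p>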
<p>There have been a couple of companies in NYC using it for a while, eg. Foursquare, Bitly, and Mortar Data. The latter even provides a <a href="https://help.mortardata.com/technologies/luigi">hosted Luigi solution</a>.</p>
<p>Lately, there's been a surge of contributors (at the time of writing there are 78 contributors, which is pretty cool!). People at <a href="https://www.houzz.com/">Houzz</a> have sent tons of pull requests, and they are not the only ones.</p>
<p>Two blog posts just came out, both describing their authors' use of Luigi (among other things):</p>
<ul>
<li><a href="https://overflow.bufferapp.com/2014/10/31/buffers-new-data-architecture/">Buffer's New Data Architecture</a></li>
<li>Asana: <a href="https://eng.asana.com/2014/11/stable-accessible-data-infrastructure-startup/">How to Build Stable, Accessible Data Infrastructure at a Startup</a></li>
</ul>
<p>In addition, I found an interesting deck from Stripe.</p>
<p>There have also been a couple of meetups, with more coming up. There was a dedicated <a href="https://www.eventbrite.com/e/luigi-users-meetup-tickets-12152567657">Luigi meetup</a> on July 31 at the Spotify NYC office, also sponsored by <a href="https://www.runads.com/">Run Ads</a>, who use Luigi a lot. Here are <a href="https://www.slideshare.net/erikbern/luigi-future">my slides about Luigi's future</a>, heavily delayed.</p>
<p>I'm also talking about Luigi at the <a href="https://www.meetup.com/NYC-Data-Science/events/218604422/">NYC Data Science meetup</a> on Dec 19 (also at Spotify's NYC office). Feel free to drop by and ask me some tough questions!</p>
Annoying blog post2014-11-11T00:00:00Zhttps://erikbern.com/2014/11/11/annoying-blog-post.html<p>I spent a couple of hours this weekend going through some pull requests and issues to <a href="https://github.com/spotify/annoy">Annoy</a>, which is an open source C++/Python library for <a href="http://en.wikipedia.org/wiki/Nearest_neighbor_search#Approximate_nearest_neighbor">Approximate Nearest Neighbor</a> search.</p>
<p>I set up Travis-CI integration and spent some time on <a href="https://github.com/spotify/annoy/issues/13">one of the issues</a> that multiple people had reported. At the end of the day, it turns out the issue was actually caused by a bug in GCC 4.8. Some crazy compiler optimization introduced between 4.6 and 4.8 caused this loop to be removed:</p>
<pre>if (indices.size() <= (size_t)_K) {
  for (size_t i = 0; i < indices.size(); i++)
    m->children[i] = indices[i];
  return item;
}</pre>
<p>Replacing it with std::copy turned out to do the trick:</p>
<pre>if (indices.size() <= (size_t)_K) {
  copy(indices.begin(), indices.end(), m->children);
  return item;
}</pre>
<p>It's still bizarre, but I probably deserved it, given how Annoy is abusing C++. The m->children array is declared as being only 2 elements long, but I deliberately overflow the array because I allocate extra space after it. I think this might cause GCC to unroll the loop to run twice.</p>
<p>I always feel a bit more comfortable when it turns out that the compiler is introducing bugs rather than my code. Made me think of the Jeff Dean joke: <em>Jeff Dean builds his code before committing it, but only to check for compiler and linker bugs.</em></p>
<p>Anyway, after fixing this in three separate places, it seems like it's finally working. <a href="http://dirk.eddelbuettel.com/">Dirk Eddelbuettel</a> is working on an <a href="https://github.com/eddelbuettel/rcppannoy">R implementation</a> of Annoy, which is fun to see.</p>
<p>I haven't spent much time with Annoy in a year or two, and looking around it seems like there are some new competitors on the block. <a href="https://github.com/ryanrhymes/panns">Panns</a> is one of them; another one is the <a href="https://t.co/yTLZCez225">LSHForest pull request</a> for scikit-learn. I haven't looked at them thoroughly, but they are both Python-only and claim some advantages over Annoy. Neither of them implements mmap as a method to load indexes, which imho is Annoy's killer feature.</p>
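<p>To illustrate the mmap point, here's roughly what using Annoy looks like (API as of the versions around this time; the filename and data are made up). Building the index is the slow part; loading it later just mmaps the file, which is close to free:</p>
<pre>from annoy import AnnoyIndex
import random

f = 40  # dimensionality of the vectors
index = AnnoyIndex(f)  # angular distance by default
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)  # 10 trees
index.save('test.ann')

# in another process: load() mmaps the file, so startup is near-instant
index2 = AnnoyIndex(f)
index2.load('test.ann')
print(index2.get_nns_by_item(0, 10))  # 10 nearest neighbors of item 0</pre>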
<p>There's a <a href="http://maheshakya.github.io/gsoc/2014/08/17/performance-comparison-among-lsh-forest-annoy-and-flann.html">performance benchmark</a> featuring Annoy, LSHForest, and <a href="http://www.cs.ubc.ca/research/flann/">FLANN</a>, written by the author of LSHForest. Annoy performs horribly in the benchmark, getting its ass severely kicked by the other two. After re-running the benchmark myself, I think what happened is that the bug I mentioned above was present for Annoy, and that's why it performed so badly. Re-running <a href="https://gist.github.com/maheshakya/b7bcf915c9d5bab89d8d">the benchmark</a> (thanks for making it easily reproducible!) yields very different results.</p>
<p>It's extremely hard to compare all trade-offs between index building time, index size, query performance, and accuracy. So please don't take this super seriously. The only things I changed in the benchmark were (1) I added Panns, for good measure, and (2) I reduced the number of trees for Annoy (and Panns) to 10 instead of using n_features. Without reducing the number of trees, Annoy gets pretty much 100% accuracy for all data sets, but takes several minutes to build each index. So to emphasize the approximate aspect of ANN, I decided to sacrifice some accuracy to gain performance.</p>
<p>Pardon the lousy graphics, but here's the result in all its glory:</p>
<p><img src="https://erikbern.com/assets/2014/11/Screen-Shot-2014-11-10-at-10.47.42-PM.png" alt="image"></p>
<p><img src="https://erikbern.com/assets/2014/11/Screen-Shot-2014-11-10-at-11.07.52-PM.png" alt="image"></p>
<p><img src="https://erikbern.com/assets/2014/11/Screen-Shot-2014-11-10-at-10.49.37-PM.png" alt="image"></p>
<p>In layman's terms:</p>
<ul>
<li>Annoy and Panns outperform LSHF and FLANN significantly on accuracy.</li>
<li>The index building process is fast for LSHF and FLANN. Annoy takes a lot more time, and Panns is 10x slower than Annoy.</li>
<li>FLANN is faster than Annoy for queries. Annoy is 10x faster than LSHF. Panns is super duper slow.</li>
</ul>
<p>And with my severely biased conclusions:</p>
<ul>
<li>If you want to use mmap for fast index loading, use Annoy</li>
<li>If you want to minimize file size at any cost, use Panns</li>
<li>If you want fast query times at any cost, use FLANN</li>
<li>If you want a pure Python solution, use LSHF</li>
<li>For anything else, use Annoy. Or am I going too far promoting my own projects now…?</li>
</ul>
<p>Btw, I would love it if someone could help me reimplement the algo used by Panns in Annoy, since it seems pretty good.</p>
<p>For another comparison, check out Radim Řehůřek's <a href="http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/">Performance Shootout of Nearest Neighbours</a>.</p>
<p>All metrics below:</p>
<table dir="ltr" border="1" cellspacing="0" cellpadding="0">
<colgroup> <col width="100" /> <col width="100" /> <col width="100" /> <col width="100" /></colgroup> <tr>
<td>
</td>
<pre><code><td data-sheets-value="[null,2,&quot;Time building index (s)&quot;]">
Time building index (s)
</td>
<td data-sheets-value="[null,2,&quot;Average query time (ms)&quot;]">
Average query time (ms)
</td>
<td data-sheets-value="[null,2,&quot;Average accuracy&quot;]">
Average accuracy
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"n_samples: 1000 n_features: 100"]">
n_samples: 1000 n_features: 100
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.000&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.000&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.000&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"LSHF"]">
LSHF
</td>
<pre><code><td data-sheets-value="[null,3,null,0.0223457813263]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.02
</td>
<td data-sheets-value="[null,3,null,5.461249351499999]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
5.46
</td>
<td data-sheets-value="[null,3,null,0.5856]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.59
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Annoy"]">
Annoy
</td>
<pre><code><td data-sheets-value="[null,3,null,0.146422863007]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.15
</td>
<td data-sheets-value="[null,3,null,0.26562213897700004]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
0.27
</td>
<td data-sheets-value="[null,3,null,0.9776]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.98
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Flann"]">
Flann
</td>
<pre><code><td data-sheets-value="[null,3,null,0.00321507453918]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.00
</td>
<td data-sheets-value="[null,3,null,0.17126083374]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
0.17
</td>
<td data-sheets-value="[null,3,null,0.5978]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.60
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Panns"]">
Panns
</td>
<pre><code><td data-sheets-value="[null,3,null,3.27013206482]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
3.27
</td>
<td data-sheets-value="[null,3,null,66.48426055910001]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
66.48
</td>
<td data-sheets-value="[null,3,null,0.9258]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.93
</td>
</code></pre>
</tr>
<tr>
<td>
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"n_samples: 1000 n_features: 500"]">
n_samples: 1000 n_features: 500
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"LSHF"]">
LSHF
</td>
<pre><code><td data-sheets-value="[null,3,null,0.0957989692688]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.10
</td>
<td data-sheets-value="[null,3,null,7.10513591766]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
7.11
</td>
<td data-sheets-value="[null,3,null,0.609]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.61
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Annoy"]">
Annoy
</td>
<pre><code><td data-sheets-value="[null,3,null,0.388671875]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.39
</td>
<td data-sheets-value="[null,3,null,0.7512760162350001]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
0.75
</td>
<td data-sheets-value="[null,3,null,0.9826]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.98
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Flann"]">
Flann
</td>
<pre><code><td data-sheets-value="[null,3,null,0.00826096534729]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.01
</td>
<td data-sheets-value="[null,3,null,0.24263381958]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
0.24
</td>
<td data-sheets-value="[null,3,null,0.6164]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.62
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Panns"]">
Panns
</td>
<pre><code><td data-sheets-value="[null,3,null,9.94446802139]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
9.94
</td>
<td data-sheets-value="[null,3,null,140.598635674]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
140.60
</td>
<td data-sheets-value="[null,3,null,0.9594]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.96
</td>
</code></pre>
</tr>
<tr>
<td>
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"n_samples: 10000 n_features: 100"]">
n_samples: 10000 n_features: 100
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"LSHF"]">
LSHF
</td>
<pre><code><td data-sheets-value="[null,3,null,0.247771978378]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.25
</td>
<td data-sheets-value="[null,3,null,8.01199913025]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
8.01
</td>
<td data-sheets-value="[null,3,null,0.609]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.61
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Annoy"]">
Annoy
</td>
<pre><code><td data-sheets-value="[null,3,null,3.16764092445]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
3.17
</td>
<td data-sheets-value="[null,3,null,0.44639110565199996]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
0.45
</td>
<td data-sheets-value="[null,3,null,0.9826]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.98
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Flann"]">
Flann
</td>
<pre><code><td data-sheets-value="[null,3,null,0.0239078998566]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.02
</td>
<td data-sheets-value="[null,3,null,0.203204154968]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
0.20
</td>
<td data-sheets-value="[null,3,null,0.6164]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.62
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Panns"]">
Panns
</td>
<pre><code><td data-sheets-value="[null,3,null,55.3448691368]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
55.34
</td>
<td data-sheets-value="[null,3,null,71.1174964905]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
71.12
</td>
<td data-sheets-value="[null,3,null,0.9594]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.96
</td>
</code></pre>
</tr>
<tr>
<td>
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"n_samples: 10000 n_features: 500"]">
n_samples: 10000 n_features: 500
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"LSHF"]">
LSHF
</td>
<pre><code><td data-sheets-value="[null,3,null,1.28864502907]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
1.29
</td>
<td data-sheets-value="[null,3,null,9.49889659882]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
9.50
</td>
<td data-sheets-value="[null,3,null,0.1492]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.15
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Annoy"]">
Annoy
</td>
<pre><code><td data-sheets-value="[null,3,null,10.4567909241]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
10.46
</td>
<td data-sheets-value="[null,3,null,1.1410522460899999]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
1.14
</td>
<td data-sheets-value="[null,3,null,0.5042]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.50
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Flann"]">
Flann
</td>
<pre><code><td data-sheets-value="[null,3,null,0.0745980739594]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.07
</td>
<td data-sheets-value="[null,3,null,0.243935585022]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
0.24
</td>
<td data-sheets-value="[null,3,null,0.1334]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.13
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Panns"]">
Panns
</td>
<pre><code><td data-sheets-value="[null,3,null,154.577830076]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
154.58
</td>
<td data-sheets-value="[null,3,null,139.825344086]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
139.83
</td>
<td data-sheets-value="[null,3,null,0.5372]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.54
</td>
</code></pre>
</tr>
<tr>
<td>
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"n_samples: 10000 n_features: 1000"]">
n_samples: 10000 n_features: 1000
</td>
<pre><code><td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
<td data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"LSHF"]">
LSHF
</td>
<pre><code><td data-sheets-value="[null,3,null,2.69871807098]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
2.70
</td>
<td data-sheets-value="[null,3,null,13.7353038788]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
13.74
</td>
<td data-sheets-value="[null,3,null,0.1588]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.16
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Annoy"]">
Annoy
</td>
<pre><code><td data-sheets-value="[null,3,null,18.2781989574]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
18.28
</td>
<td data-sheets-value="[null,3,null,2.32357978821]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
2.32
</td>
<td data-sheets-value="[null,3,null,0.4868]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.49
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Flann"]">
Flann
</td>
<pre><code><td data-sheets-value="[null,3,null,0.11420583725]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.11
</td>
<td data-sheets-value="[null,3,null,0.322947502136]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
0.32
</td>
<td data-sheets-value="[null,3,null,0.1242]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.12
</td>
</code></pre>
</tr>
<tr>
<td data-sheets-value="[null,2,"Panns"]">
Panns
</td>
<pre><code><td data-sheets-value="[null,3,null,278.210582972]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
278.21
</td>
<td data-sheets-value="[null,3,null,257.453641891]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]" data-sheets-formula="=R[-32]C[0]*1000">
257.45
</td>
<td data-sheets-value="[null,3,null,0.49]" data-sheets-numberformat="[null,2,&quot;0.00&quot;,1]">
0.49
</td>
</code></pre>
</tr>
The Filter Bubble is Silly and you Can't Guess What Happened Next2014-10-10T00:00:00Zhttps://erikbern.com/2014/10/10/the-filter-bubble-is-silly-and-you-cant-guess-what-happened-next.html<p>I'm at <a href="recsys.acm.org/recsys14/program/">RecSys 2014</a>, meeting a lot of people and hanging out at talks. Some of the discussions here were about the <a href="http://en.wikipedia.org/wiki/Filter_bubble">filter bubble</a>, which prompted me to formalize my own thoughts.</p>
<p>I firmly believe that it's the role of a system to respect the user's intent. Any sensible system will optimize for the user's long-term happiness by providing information the user finds useful. This holds true as long as the system isn't (a) stupid, recommending the wrong content, or (b) pushing its own agenda, which may or may not be hidden.</p>
<p>When I go to Google and search for “banana”, it tries to predict my intent. I expect to get results about bananas back. It's like going to an Indian restaurant and ordering something – I expect to get Indian food back.</p>
<p>Now, the chef could be bad, and I might not get what most people consider Indian food. But that's just a “bug”. The chef knows that he should respect my intent.</p>
<p>The other option is that I get some weird food because the chef has a hidden agenda. For instance, maybe s/he puts crack in my food so that I come back tomorrow. Or maybe s/he gives me a hamburger, tells me this is Indian food, and blatantly lies that there are no other Indian restaurants in the United States.</p>
<p>Luckily, we live in a free market, and as long as there's some competition, I believe that the truth will prevail. I'll eventually figure out that <em>there are other Indian restaurants</em> and I will discover that they are much better. This also means that there's really no incentive for the chef to serve something that's wildly different from what I'm asking for.</p>
<p>My point is, I think any system's role is to respect the intent of the user, and serve whatever you ask for. It's the <em>user's</em> role to decide what s/he wants to consume, and in a free market I think there will be those options out there.</p>
<p>Anyway, this story took a funny twist. As I was debating this over Twitter, some Twitter bot stepped in and started offering advice on Indian restaurants. Obviously unfiltered and neutral.</p>
<p><img src="https://erikbern.com/assets/2014/10/Screen-Shot-2014-10-09-at-9.41.47-PM.png" alt="image"></p>
Detecting corporate fraud using Benford's law2014-10-07T00:00:00Zhttps://erikbern.com/2014/10/07/detecting-corporate-fraud-using-benfords-law.html<p><strong>Note: This is a silly application. Don't take anything seriously.</strong></p>
<p><a href="http://en.wikipedia.org/wiki/Benford's_law">Benford's law</a> describes a phenomenon where numbers in any data series will exhibit patterns in their first digit. For instance, if you took a list of the 1,000 longest rivers of Mongolia, or the average daily calorie consumption of mammals, or the wealth distribution of German soccer players, you will on average see that these numbers start with “1” about 30% of the time. I won't attempt at proving this, but essentially it's a result of scale invariance. It doesn't apply to <em>all</em> numerical series, like IQ or shoe size, but this pattern turns out to pop up in a lot of places.</p>
<p>Since the theory predicts that the first digit follows a certain distribution, you can use it to find “strange” data sets that seem to disobey what we would expect statistically. The Wikipedia article mentions using Benford's law to detect accounting fraud, and Greece was busted by researchers noting that the Greek macroeconomic data <a href="http://www.forbes.com/fdc/welcome_mjx.shtml">had an abnormally large deviation from what Benford's law would predict</a>. There are another couple of papers and an interesting blog post <a href="http://econerdfood.blogspot.com/2011/10/benfords-law-and-decreasing-reliability.html">applying Benford's law to industry sectors.</a></p>
<p>For fun, I downloaded about 5,000 annual reports (10-K) for most publicly traded companies in the US, to see if there are big outliers.</p>
<p>Benford's law predicts that the probability of any first digit, 1-9, is</p>
<p>$$ Q(d) = \left( \log (d+1) - \log d \right) / \log 10 $$.</p>
<p>For every annual report, I calculate the empirical distribution $$ P(d) = n_d / \sum n_i $$, where $$ n_d $$ is just the number of occurrences of a dollar amount starting with digit $$ d $$. To correct for reports with few values, I smooth the measured digit distribution a bit and add $$ 100 \cdot Q(d) $$ “fake” counts to each $$ n_d $$.</p>
<p>To measure the difference between expected and actual distributions, I use the <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL-divergence</a>, which boils down to</p>
<p>$$ D_{P \mid Q} = \sum_i \log \left( P(i) / Q(i) \right) P(i) $$</p>
<p>I downloaded the annual reports from SEC and extracted all figures from all tables containing dollar amounts. Since some amounts may occur many times and skew the digit distribution, I only looked at the unique amounts that occurred in each report. I then extracted the first non-zero digit of each amount.</p>
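<p>Putting the pieces together, here's a sketch of the scoring step, where amounts is assumed to be the list of dollar amounts extracted from a single report:</p>
<pre>import math

def benford_q(d):
    # expected probability of first digit d under Benford's law
    return (math.log(d + 1) - math.log(d)) / math.log(10)

def kl_score(amounts, fake=100):
    counts = dict((d, 0) for d in range(1, 10))
    for a in set(amounts):  # unique amounts only, as described above
        first = next(c for c in str(a) if c in '123456789')
        counts[int(first)] += 1
    # smoothing: add fake * Q(d) pseudo-counts to each digit's count
    total = sum(counts.values()) + fake
    kl = 0.0
    for d in range(1, 10):
        p = (counts[d] + fake * benford_q(d)) / total
        kl += p * math.log(p / benford_q(d))
    return kl</pre>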
<p>The distributions of digits for the top five outlier entries illustrate Benford's law in practice:</p>
<p><img src="https://erikbern.com/assets/2014/10/plot.png" alt="image"></p>
<p>On closer inspection, some of these seem legit. For instance, the #1 spot on the list, <a href="http://www.sec.gov/Archives/edgar/data/912595/000091259514000003/maa12312013-10k.htm">Mid-America Apartment Communities, Inc.</a> has a long list of units across the country, and the average price per unit happens to cluster around $800.</p>
<p>Below is a list containing the 100 companies with the largest KL-divergence (most “fishy”). None of the companies stands out as having an outrageous distribution, and even the top companies on the list are very unlikely to have committed fraud. The prior probability of accounting fraud is basically extremely low. We would be committing the <a href="http://en.wikipedia.org/wiki/Prosecutor's_fallacy">prosecutor's fallacy</a> by singling out any of these numbers as fraudulent. Anyway, I'll follow up with a new blog post in five years to see if any of the companies below were actually caught:</p>
<table>
<tr><th>KL-divergence</th><th>Company</th></tr>
<tr><td>0.1311</td><td>Mid-America Apartment Communities, Inc.</td></tr>
<tr><td>0.0578</td><td>Power Integrations Inc.</td></tr>
<tr><td>0.0497</td><td>United Natural Foods, Inc.</td></tr>
<tr><td>0.0474</td><td>Lexicon Pharmaceuticals, Inc</td></tr>
<tr><td>0.0461</td><td>Pacific Office Properties Trust, Inc.</td></tr>
<tr><td>0.0414</td><td>Xilinx</td></tr>
<tr><td>0.0406</td><td>Host Hotels & Resorts Inc.</td></tr>
<tr><td>0.0391</td><td>World Acceptance Corp</td></tr>
<tr><td>0.0390</td><td>Immunomedics, Inc.</td></tr>
<tr><td>0.0388</td><td>Marriott International Inc.</td></tr>
<tr><td>0.0387</td><td>CVS Caremark Corporation</td></tr>
<tr><td>0.0382</td><td>Paychex, Inc.</td></tr>
<tr><td>0.0382</td><td>Luna Innovations Incorporated</td></tr>
<tr><td>0.0381</td><td>Capstead Mortgage Corporation</td></tr>
<tr><td>0.0370</td><td>Verso Paper Corp.</td></tr>
<tr><td>0.0370</td><td>Fastenal Co.</td></tr>
<tr><td>0.0364</td><td>Insperity, Inc.</td></tr>
<tr><td>0.0359</td><td>Diamond Hill Investment Group Inc.</td></tr>
<tr><td>0.0354</td><td>National Security Group Inc.</td></tr>
<tr><td>0.0345</td><td>GameStop Corp.</td></tr>
<tr><td>0.0342</td><td>Compass Minerals International Inc.</td></tr>
<tr><td>0.0340</td><td>SIRIUS XM Radio Inc.</td></tr>
<tr><td>0.0339</td><td>BP Prudhoe Bay Royalty Trust</td></tr>
<tr><td>0.0326</td><td>Investors Bancorp Inc.</td></tr>
<tr><td>0.0323</td><td>Kohlberg Capital Corporation</td></tr>
<tr><td>0.0319</td><td>Equity One</td></tr>
<tr><td>0.0319</td><td>Kona Grill Inc.</td></tr>
<tr><td>0.0313</td><td>Alliance Financial Corporation</td></tr>
<tr><td>0.0310</td><td>Zale Corporation</td></tr>
<tr><td>0.0310</td><td>Anadarko Petroleum Corporation</td></tr>
<tr><td>0.0308</td><td>Sigma-Aldrich Corp.</td></tr>
<tr><td>0.0304</td><td>Global Cash Access Holdings, Inc.</td></tr>
<tr><td>0.0300</td><td>Corcept Therapeutics</td></tr>
<tr><td>0.0294</td><td>Enbridge Energy Management LLC</td></tr>
<tr><td>0.0293</td><td>BJ's Restaurants Inc.</td></tr>
<tr><td>0.0293</td><td>Air Transport Services Group, Inc.</td></tr>
<tr><td>0.0292</td><td>Fairchild Semiconductor International Inc.</td></tr>
<tr><td>0.0292</td><td>Universal Electronics Inc.</td></tr>
<tr><td>0.0291</td><td>Espey Manufacturing & Electronics Corp.</td></tr>
<tr><td>0.0290</td><td>Inland Real Estate Corporation</td></tr>
<tr><td>0.0286</td><td>W. R. Berkley Corporation</td></tr>
<tr><td>0.0285</td><td>Albemarle Corp.</td></tr>
<tr><td>0.0282</td><td>Koss Corp.</td></tr>
<tr><td>0.0281</td><td>Leap Wireless International Inc.</td></tr>
<tr><td>0.0279</td><td>Encore Wire Corp.</td></tr>
<tr><td>0.0276</td><td>UQM Technologies, Inc.</td></tr>
<tr><td>0.0276</td><td>DuPont Fabros Technology Inc.</td></tr>
<tr><td>0.0276</td><td>Applied Materials Inc.</td></tr>
<tr><td>0.0275</td><td>Destination Maternity Corporation</td></tr>
<tr><td>0.0272</td><td>Pepsico, Inc.</td></tr>
<tr><td>0.0271</td><td>CorVel Corporation</td></tr>
<tr><td>0.0270</td><td>Nathan's Famous Inc.</td></tr>
<tr><td>0.0269</td><td>Sport Chalet, Inc.</td></tr>
<tr><td>0.0269</td><td>Key Technology Inc.</td></tr>
<tr><td>0.0268</td><td>Overhill Farms Inc.</td></tr>
<tr><td>0.0268</td><td>Digi International Inc.</td></tr>
<tr><td>0.0267</td><td>Materion Corporation</td></tr>
<tr><td>0.0265</td><td>DreamWorks Animation SKG Inc.</td></tr>
<tr><td>0.0265</td><td>NIC Inc.</td></tr>
<tr><td>0.0264</td><td>ANSYS Inc.</td></tr>
<tr><td>0.0258</td><td>Volterra Semiconductor Corporation</td></tr>
<tr><td>0.0258</td><td>Verenium Corporation</td></tr>
<tr><td>0.0258</td><td>KeyCorp</td></tr>
<tr><td>0.0255</td><td>Rockwell Collins Inc.</td></tr>
<tr><td>0.0254</td><td>Meritage Homes Corporation</td></tr>
<tr><td>0.0253</td><td>Perrigo Co.</td></tr>
<tr><td>0.0249</td><td>Zhone Technologies Inc</td></tr>
<tr><td>0.0249</td><td>McGrath RentCorp</td></tr>
<tr><td>0.0249</td><td>A.M. Castle & Co.</td></tr>
<tr><td>0.0248</td><td>Delta Natural Gas Co. Inc.</td></tr>
<tr><td>0.0247</td><td>Pervasive Software Inc.</td></tr>
<tr><td>0.0247</td><td>Senomyx</td></tr>
<tr><td>0.0247</td><td>ManTech International Corp.</td></tr>
<tr><td>0.0246</td><td>Ross Stores Inc.</td></tr>
<tr><td>0.0245</td><td>Bancorp Of New Jersey, Inc.</td></tr>
<tr><td>0.0245</td><td>Werner Enterprises</td></tr>
<tr><td>0.0244</td><td>Dillards Inc.</td></tr>
<tr><td>0.0244</td><td>Sparton Corp.</td></tr>
<tr><td>0.0243</td><td>Rudolph Technologies Inc.</td></tr>
<tr><td>0.0243</td><td>CyberOptics Corp.</td></tr>
<tr><td>0.0240</td><td>Hallador Energy Company</td></tr>
<tr><td>0.0238</td><td>DARA BioSciences, Inc</td></tr>
<tr><td>0.0238</td><td>Chico's FAS Inc.</td></tr>
<tr><td>0.0237</td><td>Delcath Systems Inc.</td></tr>
<tr><td>0.0236</td><td>Pure Cycle Corp.</td></tr>
<tr><td>0.0235</td><td>Cytori Therapeutics</td></tr>
<tr><td>0.0235</td><td>Vonage Holdings Corporation</td></tr>
<tr><td>0.0235</td><td>Spectranetics Corporation</td></tr>
<tr><td>0.0235</td><td>Regal-Beloit Corporation</td></tr>
<tr><td>0.0234</td><td>ScanSource, Inc.</td></tr>
<tr><td>0.0234</td><td>Weyco Group Inc</td></tr>
<tr><td>0.0232</td><td>Ambassadors Group Inc.</td></tr>
<tr><td>0.0232</td><td>Rent-A-Center Inc.</td></tr>
<tr><td>0.0232</td><td>Accenture plc</td></tr>
<tr><td>0.0231</td><td>Idenix Pharmaceuticals</td></tr>
<tr><td>0.0231</td><td>KAR Auction Services, Inc.</td></tr>
<tr><td>0.0230</td><td>Progressive</td></tr>
<tr><td>0.0230</td><td>BCSB Bankcorp Inc.</td></tr>
<tr><td>0.0229</td><td>PCTEL, Inc.</td></tr>
<tr><td>0.0229</td><td>Cincinnati Financial Corp.</td></tr>
</table>
<p>Again, a bunch of disclaimers: this is just a silly application, don't take it seriously, elevator inspection certificate available in the building manager's office, etc.</p>
Running Theano on EC22014-08-19T00:00:00Zhttps://erikbern.com/2014/08/19/running-theano-on-ec2.html<p>Inspired by <a href="http://benanne.github.io/2014/08/05/spotify-cnns.html">Sander Dieleman's internship</a> at Spotify, I've been playing around with deep learning using <a href="http://deeplearning.net/software/theano/">Theano</a>. Theano is this Python package that lets you define symbolic expressions (cool), does automatic differentiation (really cool), and compiles them down to fast code that runs on a CPU/GPU (super cool). It's built by Yoshua Bengio's deep learning team up in Montreal.</p>
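<p>To give a flavor of the "really cool" part, here's a tiny example of symbolic differentiation – Theano derives and compiles the gradient for you:</p>
<pre>import theano
import theano.tensor as T

x = T.dscalar('x')            # a symbolic scalar
y = x ** 2 + T.sin(x)         # a symbolic expression in x
dy = T.grad(y, x)             # symbolic derivative: 2x + cos(x)
f = theano.function([x], dy)  # compiled into fast code
print(f(0.0))                 # 1.0</pre>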
<p>This isn't going to be a long blog post – I just wanted to share two pro tips:</p>
<ol>
<li>I was messing around for <em>hours</em> trying to get Theano running on the GPU instances in EC2. Turns out <a href="https://twitter.com/ndreasa/">Andreas Jansson</a>, a coworker at Spotify, has already built a ready-to-use AMI. When you start an EC2 instance, search for the <em>gpu_theano</em> AMI. (AMI's are Amazon's virtual images that you boot your system from.) The gpu_theano AMI runs Ubuntu 14.04 and comes with a bunch of stuff pre-installed. Andreas also has <a href="https://github.com/andreasjansson/simple-aws-gpu-setup">this tool</a> to spin it up from the command line, but I couldn't get it working (somehow the instances it created weren't accessible over SSH), so I ended up just booting machines from the AWS Management Console.</li>
<li>The <a href="http://aws.amazon.com/ec2/pricing/">list price</a> for the <em>g2.2xlarge</em> instances (the ones with GPU's) is $0.65/h. If you end up running something for a week, then that's just above $100. The <a href="http://aws.amazon.com/ec2/purchasing-options/spot-instances/">spot instance price</a>, however, is (currently) only $0.0641/h – less than 10% of that. The downside with spot instances is that you're using EC2's excess capacity, so there's a small likelihood your machine will be taken down at any point. But so far supply generally seems to outstrip demand. The price looks fairly stable, and you can always checkpoint data to S3 to persist it.</li>
</ol>
<div>
<a href="/assets/2014/08/Screen-Shot-2014-08-19-at-9.18.15-AM.png">![image](/assets/2014/08/Screen-Shot-2014-08-19-at-9.18.15-AM.png)</a>
</div>
<div>
</div>
<div>
My deep learning model is about 50x faster on a <em>g2.2xlarge </em>(which has 1,536 GPU cores) compared to a <em>c3.4xlarge </em>(which has 16 CPU cores) so the speedup is pretty substantial.
</div>
In defense of false positives (why you can't fail with A/B tests)2014-07-30T00:00:00Zhttps://erikbern.com/2014/07/30/in-defense-of-false-positives-why-you-cant-fail-with-ab-tests.html<p>Many years ago, I used to think that A/B tests were foolproof and all you need to do is compare the metrics for the two groups. The group with the highest conversion rate wins, right?</p>
<p>Then, for a long period, I ran a lot of tests. I started using confidence intervals, and learned about all the pitfalls of A/B testing. What to <a href="http://en.wikipedia.org/wiki/Bonferroni_correction">think about</a> when <a href="http://xkcd.com/882/">running many A/B tests</a>, why you <a href="http://www.evanmiller.org/how-not-to-run-an-ab-test.html">shouldn't check your metrics every day</a>, why you shouldn't optimize for a local maximum, and so on. I started becoming paranoid.</p>
<p>There's about a million blog posts out there saying how everyone's doing A/B testing wrong and how you should do it instead. It's like there's some secret society of people and the only way to join this club is to sacrifice a limb. <em>You clearly have no idea what you're doing… why are you even thinking about A/B testing? Go back and hack on your cousin's webshop, or else pick up this 1,500 page book about Bayesian statistics and come back in two years.</em></p>
<p>The other side of this is about half a gazillion people who argue that A/B testing is inherently flawed. What I <em>used</em> to say was: <em>Don't throw out the baby with the bathwater. As long as you're aware of the pitfalls, it's a great tool.</em> I thought for many years that running A/B test without a thorough understanding of all its shortcomings was dangerous.</p>
<p>But then I changed. Here's the thing: <strong>Even if you're doing A/B testing completely wrong, you are probably benefitting from it.</strong> Even if you don't care about confidence intervals, multiple comparison corrections, or if you are basically too impatient to wait, you probably are still doing the right thing. The reason is that <strong>user metrics optimization is not a drug trial.</strong></p>
<p>What do I mean by this? There are just a few things that govern the success of an A/B test:</p>
<ol>
<li><em>The impact of a true positive.</em> Assuming you end up deploying the right thing, what's the business impact?</li>
<li><em>The cost of a false positive</em>. Assuming you end up deploying the wrong thing, what's the business impact?</li>
<li><em>The prior probability of success.</em> Before you start running the test, what's the probability of success? In the long run, what's the success rate of testing?</li>
</ol>
<p>For a drug trial, the impact of a true positive is huge. You found a cure for baldness! But the cost of a false positive is even bigger: it turns out your drug doesn't work, and it's also causing hallucinations. Finally, if you're a drug company, you probably evaluated 100 different drugs before finding one that seems to work, meaning the success rate of any specific drug is minimal.</p>
<p>This is why drug trials are subject to such intense scrutiny by government agencies. It's also <a href="http://www.plosmedicine.org/article/info%3Adoi%2F10.1371%2Fjournal.pmed.0020124">why most published research findings are false</a>.</p>
<p>But you're not a drug company, nor are you trying to find the <a href="http://en.wikipedia.org/wiki/Higgs_boson">Higgs boson</a>. You're basically evaluating whether a bigger “sign up” button leads to more conversions. In fact, most of your tests are driven by strong hypotheses with a large prior belief. You have a clever idea of how to impact users and historically few A/B tests show negative results.</p>
<p>The cost of deploying the wrong thing (false positives) is also low. You might end up with the wrong color button or some extra code that adds small tech debt. But not more than that. After all, a feature can't be horrible if metrics aren't tanking.</p>
<p>The other thing people argue a lot about is what success metric matters. In my experience, <em>it rarely matters</em>. I've very rarely seen statistically significant impacts going in two directions (one metric going up, the other going down) as long as you pick metrics in a sensible way (eg. avoid <a href="/2014/01/23/ratio-metrics.html">ratio metrics</a>). But what I have seen is insignificant tests. Lots of them. So if you have to pick a metric, the most important thing is <em>you should just pick the one with the largest signal to noise.</em> Just don't cherry-pick the metric after the test has run.</p>
<p>Conclusion: Don't listen to all the haters. Do more A/B testing.</p>
Recurrent Neural Networks for Collaborative Filtering2014-06-28T00:00:00Zhttps://erikbern.com/2014/06/28/recurrent-neural-networks-for-collaborative-filtering.html<p>I’ve been spending quite some time lately playing around with RNN’s for collaborative filtering. RNN’s are models that predict a <em>sequence</em> of something. The beauty is that this something can be anything really – as long as you can design an output gate with a proper loss function, you can model essentially anything.</p>
<p>In the case of collaborative filtering, we will predict the next item given the previous items. More specifically, we will predict the next artist, album, or track, given the history of streams. Without loss of generality, let’s assume we want to predict tracks only.</p>
<p>Note that we’re not trying to predict ratings or any explicit information – just what track the user chose to play.</p>
<p><strong>The data</strong></p>
<p>We use playlist data or session data, because it has an inherent sequence to it. Removing consecutive duplicates improves performance a lot, since otherwise the network just learns to predict the same item as it just predicted.</p>
<p>In our case we use a few billion playlists/sessions, and in total about ten to a hundred billion “words”.</p>
<p><strong>The model</strong></p>
<p>Recurrent neural networks have a simple model that tries to predict the next item given all previous ones. After predicting the item, the network gets to “know” what item it was, and incorporates this.</p>
<p>More formally, let’s assume we have time steps $$ 0 \ldots t-1 $$ . The model has a “hidden” internal state $$ h_0 \ldots h_{t-1} $$ . These are generally vectors of some dimension $$ k $$ . Every time step, we have two things going on</p>
<ul>
<li>Predict the output given the hidden state. We need to model a $$ P(y_i \mid h_i) $$ for this.</li>
<li>Observe the output $$ y_i $$ and feed it back into the next hidden state $$ h_{i+1} $$ . In the most general form, $$ h_{i+1} = f(a(h_i) + b(y_i)) $$ . In practice, $$ f $$ is generally some nonlinear function like sigmoid or tanh, whereas $$ a $$ and $$ b $$ are usually simple linear transforms. It depends a bit on the structure of the output.</li>
</ul>
<p><img src="https://erikbern.com/assets/2014/06/RNN-1.png" alt="image"></p>
<p>Now, all we need to do is write down the total likelihood and optimize for it!</p>
<p><strong>Wait a minute?</strong></p>
<p>Sorry about the extremely superficial introduction without much detail. Here's a more specific example:</p>
<p>Let's say we want to predict a sequence of daily stock returns. In that case, $$ y_i $$ is a vector of stock returns – maybe containing three values with the daily return for Apple, Google, and Microsoft. To get from the hidden state $$ h_i $$ to $$ y_i $$ let's just use a simple matrix multiplication: $$ y_i = W h_i $$</p>
<p>We can assume $$ P(y_i \mid h_i) $$ is a normal distribution, because then the log-likelihood is just the (negative) L2 loss: $$ -(y_i - h_i)^2 $$</p>
<p>We can specify that $$ h_{i+1} = \tanh(Uy_i + Vh_i) $$ and that $$ h_0 = 0 $$ (remember it's still a vector). If we want to be more fancy we could add bias terms and stuff but let's ignore that for the purpose of this example. Our model is now completely specified and we have $$ 3k^2 $$ unknown parameters: $$ U $$ , $$ V $$ , and $$ W $$ .</p>
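<p>As a sanity check, here's roughly what the forward pass of this toy model looks like in numpy – a sketch where the dimensions and initialization are made up, and the prediction of $$ y_i $$ is written explicitly as $$ W h_i $$:</p>
<pre>import numpy as np

k = 10  # hidden dimension; ys is a sequence of 3-dimensional return vectors
U = np.random.randn(k, 3) * 0.1   # output -> hidden
V = np.random.randn(k, k) * 0.1   # hidden -> hidden
W = np.random.randn(3, k) * 0.1   # hidden -> predicted output

def forward(ys):
    h = np.zeros(k)  # h_0 = 0
    hs, log_L = [h], 0.0
    for y in ys:
        # Gaussian log-likelihood (up to constants): -(y_i - W h_i)^2
        log_L += -np.sum((y - W.dot(h)) ** 2)
        h = np.tanh(U.dot(y) + V.dot(h))  # h_{i+1} = tanh(U y_i + V h_i)
        hs.append(h)
    return hs, log_L</pre>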
<p><strong>What are we optimizing?</strong></p>
<p>We want to find $$ U $$ , $$ V $$ , and $$ W $$ that maximizes the log-likelihood over all examples:</p>
<p>$$ \log L = \sum \limits_{\mbox{all examples}} \left( \sum \limits_{i=0}^{t-1} -(y_i - h_i)^2 \right) $$</p>
<p><strong>Backprop</strong></p>
<p>The way to maximize the log-likelihood is through <a href="http://en.wikipedia.org/wiki/Backpropagation">back-propagation</a>. This is a well-known method, and there are so many resources online that I'll be a bit superficial about the details.</p>
<p>Anyway, we need to do two passes through each sequence. First, propagation: for $$ i=0 \ldots t-1 $$, calculate all hidden states $$ h_i $$. Nothing magic going on here, we're just applying our rule.</p>
<p>Now it's time for backprop. This is essentially just the chain rule taken to the extreme. Remember that the total log-likelihood is the sum of all the individual log-probabilities of observing each output:</p>
<p>$$ L = \sum \limits_{i=0}^{t-1} \log P(y_i \mid h_i) $$</p>
<p>We define the derivatives $$ \delta_i $$ as the partial derivatives of the log-likelihood with respect to the hidden state:</p>
<p>$$ \delta_i = \frac{\partial \log L}{\partial h_i} $$</p>
<p>Since each hidden state only influences later hidden states, the $$ \delta $$'s are just a function of all future $$ \delta_j, j = i \ldots t-1 $$ .</p>
<p>$$ \delta_i = \sum \limits_{j=i}^{t-1}\frac{\partial \log P(y_j \mid h_j)}{\partial h_i} $$</p>
<p>We can now rewrite $$ \delta_i $$ as a function of $$ \delta_{i+1} $$ and use some chain rule magic:</p>
<p>$$ \frac{\partial \log L}{\partial h_i} = \delta_i = \frac{\partial}{\partial h_i} \log P(y_i \mid h_i) + \frac{\partial h_{i+1}}{\partial h_{i}} \delta_{i+1} $$</p>
<p>We can evaluate this backwards from $$ t-1 \ldots 0 $$ . Both $$ P(y_i \mid h_i) $$ and $$ \frac{\partial h_{i+1}}{\partial h_{i}} $$ are specified by our model, so we just need to plug in the expressions.</p>
<p>For the stock price example, we have the unknown parameters $$ U, V, W $$ where we can derive the gradients like this:</p>
<ul>
<li>$$ \frac{\partial \log L}{\partial U} = \frac{\partial \log L}{\partial h_{i+1}}\frac{\partial h_{i+1}}{\partial U} = \delta_{i+1} \frac{\partial}{\partial U} \tanh(Uy_i + Vh_i) = \delta_{i+1} \tanh'(Uy_i + Vh_i) y_i^T $$</li>
</ul>
<p>This looks intimidating, but it's really just a lot of chain rule applications and fairly straightforward math. You get similar gradients for $$ V, W $$ . Now that we have the gradients, we can optimize using stochastic gradient descent over lots of examples.</p>
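<p>Here's a matching sketch of the backward pass for the toy model above, accumulating the gradient for $$ U $$ (the gradients for $$ V $$ and $$ W $$ follow the same pattern):</p>
<pre>def backward(ys, hs):
    # hs = [h_0, ..., h_t] from the forward pass above
    dU = np.zeros_like(U)
    delta_next = np.zeros(k)  # delta_t: nothing comes after the last step
    for i in reversed(range(len(ys))):
        y, h, h_next = ys[i], hs[i], hs[i + 1]
        # gradient flowing back through h_{i+1} = tanh(U y_i + V h_i)
        back = (1 - h_next ** 2) * delta_next
        dU += np.outer(back, y)  # the delta_{i+1} tanh'(.) y_i^T term
        # delta_i = d log P(y_i|h_i)/d h_i + (d h_{i+1}/d h_i)^T delta_{i+1}
        delta_next = 2 * W.T.dot(y - W.dot(h)) + V.T.dot(back)
    return dU</pre>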
<p><strong>How does this relate to <a href="http://en.wikipedia.org/wiki/Hidden_Markov_models">Hidden Markov Models</a>?</strong></p>
<p>The nice thing about RNN's is that the relation between $$ h_{t+1} $$ and $$ h_t $$ is <em>exact</em> rather than being some probabilistic relationship. This means that the $$ h $$'s are not parameters in themselves, so we don't have to solve for them at all. This is usually the slow part of HMM's since figuring out the hidden values takes some slow iterative process like the <a href="http://en.wikipedia.org/wiki/Baum-Welch_algorithm">Baum-Welch algorithm</a>. For RNN's, we only need <em>two passes</em> through each sequence rather than iterating lots of times until the hidden states converge.</p>
<p>The other thing is that RNN's need some kind of nonlinearity, or else the magnitude of the hidden states will explode. This nonlinearity is usually taken to be sigmoid or tanh. I guess in theory HMM's could also use nonlinearities, but I've never heard of this.</p>
<p><strong>Predicting other things</strong></p>
<p>Let's focus on the collaborative filtering example. Given a sequence of watched movies, or tracks that the user has listened to, predict what the next one is going to be. Now $$ y_i $$ is not a scalar, but one out of many items. We need some kind of distribution $$ P(y_i \mid h_i) $$ . The one I've seen being used is the <a href="http://en.wikipedia.org/wiki/Softmax_function">Softmax distribution</a> over all possible outputs. This means we have to learn a vector $$ a_j $$ for each item. The probability $$ P(y_i \mid h_i) $$ is now proportional to $$ \exp(h_i^Ta_j) $$ :</p>
<p>$$ P(y_i \mid h_i) = \frac{\exp(h_i^Ta_j)}{\sum_k \exp(h_i^Ta_k)} $$</p>
<p>Notice that the summation part in the denominator is over <em>all</em> items – something that is pretty slow to compute. I'll get back to that.</p>
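<p>Just to spell out where the cost comes from – evaluating a single probability the naive way means touching every item vector. A quick sketch (the names are mine):</p>
<pre><code>import numpy as np

def p_item_given_h(h, A, j):
    # A has one row a_k per item, so this dot product is O(num items * d) --
    # the expensive part when the catalog has millions of items.
    scores = A.dot(h)
    scores -= scores.max()    # subtract the max for numerical stability
    e = np.exp(scores)
    return e[j] / e.sum()
</code></pre>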
<p>We also need something linking the output back to the next hidden state. For this, we will learn another set of vectors $$ b_j $$ – one for each item. With slight abuse of notation, here is how it looks:</p>
<p><img src="https://erikbern.com/assets/2014/06/RNN-2.png" alt="image"></p>
<p>Since we are learning a bunch of vectors, we don't need the matrices $$ U $$ and $$ V $$ . Our model now becomes:</p>
<p>$$ h_{i+1} = \tanh(W h_i + b_i) $$</p>
<p> </p>
<p>with some slight abuse of notation: $$ b_i $$ here means $$ b_{y_i} $$ , the vector of the item at step $$ i $$ , and these vectors are shared across all occurrences of each item. Again, we want to maximize the total log-likelihood</p>
<p>$$ \sum \limits_{\text{sequences}} \sum \limits_{i=0}^{t-1} \log P(y_i \mid h_i) $$</p>
<p> </p>
<p>We now end up with a ton of parameters because we have the unknowns $$ a_j $$ ‘s and $$ b_j $$ ‘s for each item.</p>
<p>Let's just pause here and reflect a bit: so far this is essentially a model that works for any sequence of items. There's been some research on how to use this for natural language processing. In particular, check out Tomas Mikolov's work on <a href="http://rnnlm.org/">RNN's</a> (this is the same guy that invented <a href="https://code.google.com/p/word2vec/">word2vec</a>, so it's pretty cool).</p>
<p><strong>The gnarly details</strong></p>
<p>If you have 10 different items, you can evaluate $$ P(y_i \mid h_i) $$ easily, but not if you have five million items. But do not despair! There are a lot of ways you can attack this:</p>
<ul>
<li>Take the right output and sample some random items from the entire “vocabulary”. Train the model to classify which one is the right output. This is sort of like a police lineup: one is the right suspect, the remaining people are just a random sample.</li>
<li><a href="http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf">Hierarchical softmax</a>: Put all items in a binary tree, and break it up into roughly $$ \log m $$ binary classification problems (where m is the size of the vocabulary).</li>
</ul>
<p>Instead of messing around with Huffman trees and the other things recommended in the literature, I ended up implementing a much simpler version of hierarchical softmax. Internally, all items are described as integers, so I build the tree implicitly. The root node is a binary classifier for the last bit of the item. The next level classifies the second to last bit, and so on.</p>
<p><img src="https://erikbern.com/assets/2014/06/hierarchical-softmax3.png" alt="image"></p>
<p>The idea is that you can calculate $$ P(y_i \mid h_i) $$ as the product of all the probabilities for each individual node on the path from the root to the leaf.</p>
<p>Instead of learning an $$ a_j $$ for every item, we just have to learn one for every internal node in the tree, in total $$ 2^{\lceil \log_2 m \rceil} - 1 $$ vectors. It doesn't really matter that much how we build the tree – we don't need to enforce that similar items are close to each other in the tree (although that would probably improve performance a bit).</p>
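<p>Here is a sketch of how evaluating a probability in this implicit tree could look. The heap-style node layout (root at index 1, children at <code>2 * node + bit</code>, slot 0 unused) is my own choice for the example:</p>
<pre><code>import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_item_given_h(h, nodes, item, depth):
    # Walk from the root to the leaf: one binary classifier per bit, starting
    # with the last bit of the item id. The probability of the item is the
    # product of the node probabilities along the path.
    p, node = 1.0, 1
    for level in range(depth):
        bit = (item >> level) & 1
        q = sigmoid(nodes[node].dot(h))   # P(bit = 1 | h) at this node
        p *= q if bit else (1.0 - q)
        node = 2 * node + bit
    return p
</code></pre>
<p>Since every item corresponds to exactly one root-to-leaf path, these probabilities sum to one over all items, and each evaluation is $$ O(\log m) $$ instead of $$ O(m) $$ .</p>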
<p>But in general, this problem pops up in a lot of places and here's some other crazy ideas I've thought about:</p>
<ul>
<li>If generating random “negative” samples, one way to mine better examples would be to sample some items from the $$ y_i $$ ‘s themselves. Since those values are probably pretty similar, that would force the model to discriminate between “harder” cases.</li>
<li>Assume the $$ a_j $$ ‘s have some kind of distribution like a Normal distribution, and calculate an expectation. This is just a bunch of integrals. We've had some success using this method in other cases.</li>
<li>Don't use softmax, but instead use L2 loss on $$ m $$ binary classification problems. All entries would be $$ 0 $$ except the right one, which is $$ 1 $$ . You can put more weight on the right one to address the class imbalance. The cool thing is that with L2 loss, everything becomes linear, and you can compute the sum over all items in constant time. This is essentially how <a href="http://dl.acm.org/citation.cfm?id=1511352">Yehuda Koren's 2008 paper</a> on implicit collaborative filtering works.</li>
</ul>
<p><strong>Implementing it</strong></p>
<p>I ended up building everything in C++ because it's fast and I'm pretty comfortable with it. It reads about 10k words per second on a single thread. A multi-threaded version I've built can handle 10x that amount and we can parse 10B “words” in about a day.</p>
<p><strong>Hyperparameters</strong></p>
<p>A beautiful thing with this model is that <em>there are basically no hyperparameters.</em> A larger number of factors is better – we typically use 40-200. With dropout (see below), overfitting is not a concern. It takes a little trial and error to get the step sizes right, though.</p>
<p><strong>Initialization</strong></p>
<p>As with most latent factor models, you need to initialize your parameters with random noise. Typically small Gaussian noise like $$ \mathcal{N}(0, 0.1^2) $$ works well.</p>
<p><strong>Nonlinearity</strong></p>
<p>I tried both sigmoid and tanh. Tanh makes more sense to me, because it's symmetric around 0, so you don't have to think too much about the bias term. Looking at some offline benchmarks, it seemed like tanh was slightly better than sigmoid.</p>
<p><strong>Dropout</strong></p>
<p>I also added <a href="http://arxiv.org/abs/1207.0580">dropout</a> to the hidden values since it seemed to improve the predictions of the model. After each $$ h_i $$ is calculated, I set half of its values to zero. This also seems to help with exploding gradients. What happens is that the $$ W $$ matrix essentially learns how to recombine the features.</p>
<p><strong>Gradient clipping</strong></p>
<p>Gradient clipping actually wasn't needed for sigmoid, but for tanh I had to add it. Basically, during backprop I cap the magnitude of the gradient at 1. This also helps equalize the impact of different examples, since longer sequences otherwise get bigger gradients.</p>
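<p>Both tricks are tiny. Here is what they amount to, sketched in numpy rather than the actual C++:</p>
<pre><code>import numpy as np

rng = np.random.RandomState(0)

def apply_dropout(h, p=0.5):
    # Zero out roughly half the hidden units after h_i is computed (training only).
    return h * (rng.rand(len(h)) > p)

def clip(gradient, max_norm=1.0):
    # Cap the magnitude of the gradient at 1 during backprop.
    norm = np.linalg.norm(gradient)
    return gradient if norm <= max_norm else gradient * (max_norm / norm)
</code></pre>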
<p><strong>Adagrad</strong></p>
<p>I use Adagrad on the item vectors but simple learning rate on the shared parameters $$ W $$ and the bias terms. Adagrad is fairly simple: in addition to each vector $$ a_j $$ and $$ b_j $$ you just store a single scalar with the sum of the squared magnitudes of the gradients. You then use that to normalize the gradient.</p>
<p>For a vector $$ x $$ with gradient $$ d $$ , Adagrad can be written as:</p>
<p>$$ x^{(n+1)} = x^{(n)} + \eta \frac{d^{(n)}}{\sqrt{\sum \limits_{i=1 \ldots n} \left( d^{(i)} \right)^2}} $$</p>
<p> </p>
<p>$$ \eta $$ is a hyperparameter that should be set to about half the final magnitude these vectors will have. I usually have had success just setting $$ \eta = 1 $$ .</p>
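<p>Sketched in numpy, the per-vector bookkeeping looks like this (the epsilon is just my way of avoiding division by zero on the first update):</p>
<pre><code>import numpy as np

class AdagradVector:
    def __init__(self, dim, eta=1.0):
        self.x = 0.1 * np.random.randn(dim)   # N(0, 0.1^2) init, as above
        self.sum_sq = 1e-8                    # running sum of squared gradient magnitudes
        self.eta = eta

    def update(self, gradient):
        # Gradient ascent step, normalized by the accumulated magnitude.
        self.sum_sq += gradient.dot(gradient)
        self.x += self.eta * gradient / np.sqrt(self.sum_sq)
</code></pre>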
<p><strong>Results</strong></p>
<p>Offline results show that the RNN is one of the best performing algorithms for collaborative filtering and A/B tests confirm this.</p>
<p><strong>More resources</strong></p>
<p>I can recommend <a href="http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf">A tutorial on training recurrent neural networks</a> as another starting point to read more about recurrent neural networks.</p>
<p>(Edit: fixed some minor errors in the math and a wrong reference to Baum-Welch)</p>
Where do locals go in NYC?2014-06-17T00:00:00Zhttps://erikbern.com/2014/06/17/where-do-locals-go-in-nyc.html<p>One obvious thing to anyone living in NYC is how tourists cluster in certain areas. I was curious about the larger patterns around this, so I spent some time looking at data. The thing I wanted to understand is: what areas are dominated by tourists? Or conversely, what areas are dominated by locals?</p>
<p>After some looking around, I found this <a href="http://www.public.asu.edu/~hgao16/dataset.html">Foursquare data dump</a> and analyzed about 200,000 check-ins in NYC. Time to crunch some data…</p>
<p>First of all, I split up check-ins into those done by (a) people living within 10,000 feet of the check-in, and (b) people living further away. As the next step, I broke up the check-ins by the <a href="https://gist.github.com/MonsieurCactus/7387071">2,166 census areas</a> of New York and calculated the ratio of locals. I color-coded each census area from green (100% locals) to purple (0% locals). Here is the result, for the five boroughs:</p>
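<p>For the curious, the aggregation is only a few lines of geopandas. The file and column names here are made up for illustration – in particular I'm assuming a boolean <code>local</code> column that marks check-ins made within 10,000 feet of home:</p>
<pre><code>import geopandas as gpd
import matplotlib.pyplot as plt

checkins = gpd.read_file("checkins.geojson")       # point geometry + `local` flag
tracts = gpd.read_file("census_areas.geojson")     # census area polygons

# Assign each check-in to the census area containing it, then take the
# fraction of local check-ins per area.
joined = gpd.sjoin(checkins, tracts, how="inner", predicate="within")
tracts["local_ratio"] = joined.groupby("index_right")["local"].mean()

# Green = 100% locals, purple = 0% locals.
tracts.plot(column="local_ratio", cmap="PRGn")
plt.show()
</code></pre>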
<p><img src="https://erikbern.com/assets/2014/06/locals1.png" alt="image"></p>
<p>Some obvious ones stand out, like the airports: JFK and LaGuardia, which are completely dominated by “non-locals”. Interestingly, Red Hook in Brooklyn is also dominated by non-locals (maybe because of IKEA?), as well as Prospect Park and some other areas in downtown Brooklyn. In the Bronx, the area surrounding Yankee Stadium is also nearly 100% non-locals.</p>
<p>Maybe not surprisingly, there seems to be some truth to the joke that real New Yorkers do not go above 14th Street. Zooming into lower Manhattan, you can clearly see a sharp dividing line cutting across Manhattan:</p>
<p><img src="https://erikbern.com/assets/2014/06/locals-zoomed-in1.png" alt="image"></p>
<p>Another thing that stands out is how North Williamsburg is completely dominated by non-locals, maybe because of the huge influx of people on weekends going out partying.</p>
<p>I then split up into (a) people who live in the five boroughs (b) tourists, and color-coded into blue (0% tourists) and red (100% tourists):</p>
<p><img src="https://erikbern.com/assets/2014/06/tourism1.png" alt="image"></p>
<p>Airports are slightly biased towards tourists. Interestingly Red Hook now becomes a New York affair.</p>
<p>Zooming in on Lower Manhattan again:</p>
<p><img src="https://erikbern.com/assets/2014/06/tourism-zoomed-in2.png" alt="image"></p>
<p>Another thing that stands out is how North Williamsburg is dominated by New Yorkers, meaning that even though most people are not local, they are still from the city.</p>
<p>In contrast, most of Lower Manhattan (East Village, West Village, etc) isn't just dominated by locals, it also has a very low ratio of tourists to New Yorkers.</p>
<p>The areas most dominated by tourists are, maybe not surprisingly:</p>
<ul>
<li>Central Park</li>
<li>Times Square and parts of Midtown</li>
<li>Financial district</li>
<li>Liberty Island and Ellis Island</li>
</ul>
<p>I did this in Python, mostly using <a href="http://geopandas.org/">geopandas</a>, Matplotlib, and a bunch of open source data sets. It was a fun weekend project and it ended up taking way too much time. And since I live in Lower East Side myself, I'm probably pretty biased…</p>
How to build up a data team (everything I ever learned about recruiting)2014-06-08T00:00:00Zhttps://erikbern.com/2014/06/08/how-to-build-up-a-data-team-everything-i-ever-learned-about-recruiting.html<p>During my time at Spotify, I've reviewed thousands of resumes and interviewed hundreds of people. Lots of them were rejected but lots of them also got offers. Finally, I've also had my share of offers rejected by the candidate.</p>
<p>Recruiting is one of those things where the <a href="http://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect">Dunning-Kruger effect</a> is the most pronounced: the more you do it, the more you realize how bad you are at it. Every time I look back a year, I realize 10 things I did wrong. Extrapolating this, I know in another year I'll realize all the stupid mistakes I'm doing right now. Anyway, that being said, here are some things I learned from recruiting.</p>
<p><strong>Getting the word out</strong></p>
<p>Depending on where you work, people might have no clue about your company. Why would they work for something they have never heard of? Or alternatively – something they know of, but don't necessarily associate with cutting edge tech? There are a million companies out there doing cool stuff, so make sure that people know your company stands out. Blog, talk at meetups, <a href="https://github.com/spotify/luigi">open source stuff</a>, <a href="http://www.slideshare.net/erikbern/music-recommendations-mlconf-2014">go to conferences</a>. I honestly don't know what works – I don't have any numbers. But you need to hedge your bets by attacking from all angles at the same time.</p>
<p>I think developers have a hard time justifying this just because success is not easily quantifiable – this is a branding exercise, and it's super hard to find out if you're doing the right thing. But over time if you do this right, you will get anecdotal feedback from candidates coming in saying they saw your presentation or read this cool story on Hacker News, or what not.</p>
<p><strong>Finding the people</strong></p>
<p>I don't think there's anything magic about this – just go through external recruiters, internal recruiters, job postings, connections, whatever.</p>
<p><strong>Presenting the opportunity</strong></p>
<p>I think most people in the industry are fed up with bad bulk messages over email/LinkedIn. Ideally, the hiring manager should introduce themselves, or for more senior roles, more senior people should reach out (all the way up to the CTO). If a recruiter is reaching out, it's super important that the recruiter can include a quick note on what's interesting about the team and why it's a good fit.</p>
<p><strong>Finding the right candidates</strong></p>
<p>Recruiting is some crazy type of <a href="http://en.wikipedia.org/wiki/Active_learning">active learning</a> problem with an observation bias where you only see how well the people you hire are doing. In particular, there was a lot of discussion a while back when <a href="https://www.linkedin.com/today/post/article/20130620142512-35894743-on-gpas-and-brain-teasers-new-insights-from-google-on-recruiting-and-hiring">Google claimed there was no correlation between test scores or GPA and job performance</a>. I think there totally are really strong correlations on a macro scale, but if you are already filtering out people based on those criteria, you will obviously reduce the strength, or even reverse it. Not that I claim to have found any magic criteria. I do however think the two most successful traits I've observed are (at the risk of sounding cheesy):</p>
<ol>
<li>Programming fluency (<a href="http://en.wikipedia.org/wiki/Outliers_(book)">10,000 hour rule</a> or whatever) – you need to be able to visualize large codebases, and understand how things fit together. I strongly believe that data engineers need to understand the full stack from idea, to machine learning algorithm, to code running in production. I've seen other companies with a “throw it over the fence” attitude, with one team brainstorming algorithms, and another team in another city implementing them. I think that's flawed – it makes it hard to have a tight learning cycle. In particular, I'm hesitant to hire candidates who are strong on the theoretical side, but with little experience writing code. That's why I really avoid the “data science” label – most people within this group are generally lacking on the core programming side. I don't think this necessarily means candidates have to have a solid understanding of the <a href="http://en.wikipedia.org/wiki/CAP_theorem">CAP theorem</a> and the Linux <a href="http://en.wikipedia.org/wiki/Page_cache">page cache</a>. The most important thing is they have written a lot of code, can work with nontrivial code bases, and can write clean, maintainable code. There is nothing magic to this – but a person who has only written Matlab scripts will probably have a harder time adjusting.</li>
<li>Understand the big picture – go from a vision to a set of tasks, to a bunch of code being written. People have to be able to go from an idea (“analyze this data set and build an algorithm that uses it as a signal for the recommender system”) to code, without having to hand hold them throughout every single step. People who need to be given small tasks rather than the underlying problem will never understand why we're working on things, and will inevitably end up doing the wrong thing.</li>
</ol>
<div>
</div>
<div>
<strong>Attracting the right candidates</strong>
</div>
<div>
</div>
<div>
Even if you find awesome candidates, your interview process, or your lack of selling, might destroy your prospects of hiring the right people. Here are some things I've learned:
</div>
<div>
</div>
<div>
<ul>
<li>
Smart people are actually really impressed by a good interview process. If some smart ML genius walks them through a super hard problem, the candidate will feel like they can learn from that interviewer. Conversely, giving interview problems that are not well structured, without any follow-up questions or discussion notes, will give a really bad impression. The best type of interview problems have (b), (c) and (d) versions that you can pull up in case the candidate nails (a) directly.
</li>
<li>
Understand the level of selling you need to do. If you get a super senior person in, spend the bare minimum of time establishing that this person actually lives up to their reputation. Then spend the rest of the time explaining the role and why you are so excited about it.
</li>
<li>
Explaining the role and why it's a good fit is everything. But obviously not superficial stuff. These are engineers and you need to impress them with the stuff they will learn. Talk about what your company is working on, what challenges you have, what technologies you are using.
</li>
<li>
It's often much easier to start by listening to the candidate. Let them ask questions and talk more than you. Try to understand what they are looking for. Then, explain how this position meets or does not meet those criteria.
</li>
<li>
Everyone has an angle. Someone coming from finance is probably more excited to hear about your product vision. Someone from a small failed startup is probably more excited to hear about your benefits.
</li>
<li>
Make the candidate meet the team they will be working with. There are so many companies failing at this. I remember going through a series of ten Google interviews many years ago and then the recruiter wouldn't tell me what team I would work on. I immediately turned down the offer. On the other side of the spectrum, I think the best thing you can do is to bring the candidate out to a team lunch to meet as many people as possible.
</li>
<li>
Make them understand you take this role seriously. Set up a quick chat with the CTO or some senior people.
</li>
</ul>
</div>
<div>
</div>
<div>
<strong>Hiring the right candidate</strong>
</div>
<div>
</div>
<div>
<ul>
<li>
If it's a senior person, and everyone is super impressed, hire the person. If it's a senior person, and people are on the fence, you shouldn't hire the person.
</li>
<li>
If it's a junior person, it's sometimes hard to know. One thing I've learned is that excitement really counts. For instance, if the candidate kicks ass during the ML interview, and clearly has solid programming fluency, but doesn't necessarily know much about scalability and version control, then I really think it boils down to how excited this person is.
</li>
<li>
At the end of the day, if you realize this candidate is brilliant, but not necessarily the right fit for your role, find something else. It's a sign of a great organization that you can always find a place for smart people.
</li>
</ul>
<div>
<strong>Building the right team</strong>
</div>
<div>
</div>
<ul>
<li>
Candidates will be on a spectrum between “theoretical” (more ML stuff) or “applied” (more scalability, backend, data processing). But just for the purpose of the argument, let's assume people are on each side of the spectrum. For a high performing machine learning team, I think a ratio of 1:2 is good. But everyone should be able to work on every part of the system, if needed.
</li>
<li>
If you're a small company, you will end up hiring lots of people who are similar to yourself. I don't necessarily mean people of the same gender or origin, but people who will all go out to the same bar and get the same beer. This might be good for cohesion early on, but as soon as you grow beyond the early stage, you might end up with a destructive monoculture that is super hard to break out of.
</li>
<li>
On the same note: if you're a big company and your team still skews a certain way, then it means you have prejudice problems you need to work on. Seriously. For instance, with the share of female software engineers <a href="http://www.techrepublic.com/blog/software-engineer/it-gender-gap-where-are-the-female-programmers/">being 20.2%</a>, you <a href="http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.binom.html">can derive</a> that 95% of all 50-person teams should contain at least five women, and 70% of all teams should contain at least eight. It's just plain math (see the snippet right after this list).
</li>
</ul>
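<p>If you want to check that arithmetic yourself, it's a two-liner with scipy:</p>
<pre><code>from scipy.stats import binom

n, p = 50, 0.202                # team of 50, 20.2% female software engineers
print(binom.sf(4, n, p))        # P(at least 5 women), roughly 0.98
print(binom.sf(7, n, p))        # P(at least 8 women), roughly 0.82
</code></pre>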
<p>
For more reading on the subject, I just came across this interesting blog post by Cheng-Tao Chu: <a href="http://www.codecademy.com/blog/142-why-building-a-data-science-team-is-deceptively-hard">Why building a data science team is deceptively hard</a>.
</p>
</div>The power of ensembles2014-04-24T00:00:00Zhttps://erikbern.com/2014/04/24/the-power-of-ensembles.html<p>From my presentation at MLConf, one of the points I think is worth stressing again is how extremely well combining different algorithms works.</p>
<p><img src="https://erikbern.com/assets/2014/04/ensembles.png" alt="image"></p>
<p>In this case, we're training machine learning algorithms on different data sets (playlists, play counts, sessions) and different objectives (least squares, max likelihood). Then we combine all the models using gradient boosted decision trees training on a smaller but higher quality data set. Finally, we validate on a third data set, in this case looking at recall for a ground truth data set of related artists.</p>
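<p>The combination step itself is not much code. Here is a scikit-learn sketch with stand-in random data – in reality each column would be one base model's score and the labels would come from the smaller, higher quality data set:</p>
<pre><code>import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)

# Stand-ins for per-example scores from three base models.
scores = rng.rand(1000, 3)
labels = (scores.mean(axis=1) + 0.1 * rng.randn(1000) > 0.5).astype(int)

# The combiner: gradient boosted decision trees on top of the base scores.
ensemble = GradientBoostingClassifier(n_estimators=100)
ensemble.fit(scores, labels)
blended = ensemble.predict_proba(scores)[:, 1]    # final score per example
</code></pre>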
<p>Ensembles were common knowledge throughout the Netflix Prize, and the winner ended up <a href="http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf">having many hundreds of models</a> in the ensemble. But measured in pure RMSE the results weren't mind-blowing – the best individual models weren't far from the best ensemble.</p>
<p>I think in a real world setting, the chances are higher that you have lots of models working on different data sets, optimizing for different things. The bias of each model will be much larger, so the gains from combining all models are even larger.</p>
<p>Another difference in our case is we basically <em>don't know what to optimize for.</em> There's no single golden metric such as RMSE. Instead, we train things on different data sets. Eg individual models may be trained on playlists data, and the ensemble on thumbs. We then have a bunch of offline metrics as well as A/B test metrics we use to determine success.</p>
<p>Some people have suggested that these large ensembles are not practical and have tried to find a single good algorithm. I think this is a bad idea. Ensembles scale really well because you can parallelize all the work, and with a good framework it's easy to add or remove models. Data changes all the time, and a good ensemble with a bunch of models makes recommendations robust and less sensitive to shifts in the underlying data.</p>
<p>Our ensemble looks quite different (and much bigger) than the picture above. We are thinking a lot about adding diversified signals from various machine learning algorithms applied to different data sets. These signals can be anything from collaborative filtering to dummy things, content-based, or editorial data. Pretty much anything that can be converted into a numerical signal can be used.</p>
MLConf 20142014-04-12T00:00:00Zhttps://erikbern.com/2014/04/12/mlconf-2014.html<p>Just spent a day at <a href="http://mlconf.com/">MLConf</a> where I was talking about how we do music recommendations. There was a whole range of great speakers (actually almost 2/3 women which was pretty cool in itself).</p>
<p>Here are my slides:</p>
<p><a href="https://twitter.com/JustinBasilico">Justin Basilico</a> from Netflix talked about how they deliver a personalized start page using lots of ranking</p>
<p><a href="https://sites.google.com/site/claudiaperlich/home">Claudia Perlich</a> also had a great presentation about ad targeting. A few things were really interesting: (a) using transfer learning to learn from web site visiting and apply it to add targeting (b) causality tests to figure out if a campaign actually made an impact (c) how they intentionally introduce bias in their classifiers by skewing subsampling. I think she also mentioned throwing in artificial negative data, which is something we do too.</p>
<p><a href="https://twitter.com/josh_wills">Josh Wills</a> had a hilarious slide about “ML as a tool” companies. His analogy was waiting two hours in a line for a roller coaster ride, and then someone jumps in front of you, rides the roller coaster, and gives you a video of the ride. His point was that ML is the 1% fun part, and 99% plumbing is the boring part. If you want to start a company, focus on the boring stuff no one else wants to deal with.</p>
Music recommendations using cover images (part 1)2014-04-01T00:00:00Zhttps://erikbern.com/2014/04/01/music-recommendations-using-cover-images-part-1.html<p>Scrolling through the <a href="https://play.spotify.com/discover">Discover page</a> on Spotify the other day it occurred to me that the album is in fact a fairly strong visual proxy for what kind of content you can expect from it. I started wondering if the album cover can in fact be used for recommendations. For many obvious reasons this is a kind of ridiculous idea, but still interesting enough that I just had to explore it a bit. So, I embarked on a journey to see how far I could get in a few hours.</p>
<p>First thing I did was to scrape all album covers. We have a few million of them (I don't think I could give you the exact number, or I would have to kill you). A full set of 64x64px images is 10 GB roughly, so not an insane amount.</p>
<p><img src="https://erikbern.com/assets/2014/04/covers.jpg" alt="image"><em>1024 random cover images in 16×16 px</em></p>
<p>My first attempt at defining “similarity” was simply to resize all images to 16×16, convert to grayscale, subtract the mean and normalize by the variance. Each cover image is then essentially a 256-vector in Euclidean space. Load those vectors into <a href="https://github.com/spotify/annoy">Annoy</a> and Bob's your uncle! Nice!</p>
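<p>The whole pipeline fits in a few lines. The paths are made up, but this is essentially all the first version did:</p>
<pre><code>import glob
import numpy as np
from PIL import Image
from annoy import AnnoyIndex

def cover_vector(path):
    # 16x16 grayscale, mean-subtracted and variance-normalized.
    img = Image.open(path).convert("L").resize((16, 16))
    v = np.asarray(img, dtype=float).flatten()
    v -= v.mean()
    return v / (v.std() + 1e-9)

index = AnnoyIndex(256, "euclidean")
for i, path in enumerate(glob.glob("covers/*.jpg")):
    index.add_item(i, cover_vector(path))
index.build(10)                        # 10 trees
print(index.get_nns_by_item(0, 5))     # the 5 covers most similar to cover 0
</code></pre>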
<p>These recommendations turn out to be pretty horrible, at best.</p>
<p><img src="https://erikbern.com/assets/2014/04/thumbs-auc-grayscale-16x161.png" alt="image"><em>Classification score on predicting thumbs, measured by Area Under Curve (AUC). Random algorithm = 50%, higher numbers are better.</em></p>
<p>In the chart above, “covers” is the image-based method. “rnn” <a href="http://en.wikipedia.org/wiki/Recurrent_neural_network">is recurrent neural networks</a>, “koren” is <a href="http://labs.yahoo.com/files/HuKorenVolinsky-ICDM08.pdf">this paper</a> and “vector_exp” is a model <a href="/?p=396">I've described before</a>.</p>
<p>This image similarity algorithm gave some fun results, like finding out that the most similar album to the left one was the right one:</p>
<p><img src="https://erikbern.com/assets/2014/04/similar-albums-maurizio.png" alt="image"></p>
<p>In general, the only type of music I could reliably see this working on was minimal techno. Pretty much all minimal techno albums have a completely identical layout: a white (or light) circle on a black background:</p>
<p><img src="https://erikbern.com/assets/2014/04/covers2.jpg" alt="image"><em>Most similar covers for Kassem Mosse – Workshop 12: <a href="http://open.spotify.com/album/0NQvm5y6CLtjtbyJNZltFg">http://open.spotify.com/album/0NQvm5y6CLtjtbyJNZltFg</a></em></p>
<p>This wasn't enough to resolve the question, so I kept searching for the answer. I stumbled across <a href="https://pypi.python.org/pypi/pyleargist/2.0.5">pyleargist</a> by <a href="http://ogrisel.com/">Oliver Grisel</a> and to my delight it was trivial to install. Essentially pyleargist is a wrapper around a C library that takes any image and generates an image descriptor, which is a vector with 960 elements. Evaluating it using the same metric actually yields some fairly decent results:</p>
<p><img src="https://erikbern.com/assets/2014/04/thumbs-auc-pyleargist.png" alt="image"><em>Classification score on predicting thumbs, measured by Area Under Curve (AUC). Random algorithm = 50%, higher numbers are better.</em></p>
<p>The results are already not that horrible (although with a very biased and quite unreliable metric). At least it's definitely better than pure random.</p>
<p><img src="https://erikbern.com/assets/2014/04/output4.jpg" alt="image"><em>Most similar album covers to Daft Punk's – Random Access Memories</em></p>
<p>At this point I decided the next step on this journey of overengineering would be one, or possibly both, of:</p>
<ol>
<li>Learning a mapping from image space to collaborative filtering space. That way we learn which features in the picture are relevant for the music. This is a similar idea to <a href="http://papers.nips.cc/paper/5004-deep-content-based-music-recommendation.pdf">this paper</a>.</li>
<li>Venture into the realms of <em>deep learning</em>. I went to <a href="https://twitter.com/bernhardsson/status/448888021828763648">the Twitters</a> for some assistance and got a ton of responses back.</li>
</ol>
<p>This is an ongoing project (and an ongoing source of frustrations) so I'll defer the updates to a second part. Stay tuned!</p>
<p>Edit: I was linking to the wrong paper – fixed the reference to Sander Dieleman's paper!</p>
Luigi success2014-03-22T00:00:00Zhttps://erikbern.com/2014/03/22/luigi-party.html<p>So <a href="https://github.com/spotify/luigi">Luigi</a>, our open sourced workflow engine in Python, just recently passed 1,000 stars on Github, then shortly after passed <a href="https://github.com/yelp/mrjob">mrjob</a> as (I think) the most popular Python package to do Hadoop stuff. This is exciting!</p>
<p><img src="https://erikbern.com/assets/2014/03/luigi-toy.jpg" alt="image"></p>
<p>A fun anecdote from last week: we accidentally deleted roughly 10TB of data on HDFS, and the output of 1,000s of jobs. This could have been a disaster, but luckily most of the data was intermediate, and luckily everything we do is powered by Luigi meaning it's encoded as a big huge dependency graph in Python. Some of it is Hadoop jobs, some of it inserts data in Cassandra, some of it trains machine learning models, and much more. The Hadoop jobs are a happy mixture between inline Python jobs and jobs using <a href="https://github.com/twitter/scalding">Scalding</a>.</p>
<p>So anyway, Luigi happily picked up that a bunch of data was missing, traversed the dependency graph backwards, and scheduled everything it needed. A few hours (and a heavily loaded cluster) later, everything was recreated.</p>
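<p>To give a flavor of how the dependency graph is encoded: every task declares what it requires and what it outputs, and Luigi schedules whatever outputs are missing. A toy sketch with made-up task names, not one of our actual pipelines:</p>
<pre><code>import luigi

class CleanedLogs(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("cleaned-%s.tsv" % self.date)

    def run(self):
        with self.output().open("w") as f:
            f.write("...")   # clean the raw logs here

class TrainModel(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return CleanedLogs(self.date)   # this edge is the dependency graph

    def output(self):
        return luigi.LocalTarget("model-%s.bin" % self.date)

    def run(self):
        with self.input().open("r") as f:
            pass   # train on the cleaned logs, then write self.output()
</code></pre>
<p>Delete the cleaned file and ask for <code>TrainModel</code> again, and Luigi rebuilds the missing piece first – which is exactly what happened here, just at 10TB scale.</p>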
<p><img src="https://erikbern.com/assets/2014/03/luigi-recovery.png" alt="image"></p>
Welcome Echo Nest!2014-03-22T00:00:00Zhttps://erikbern.com/2014/03/22/welcome-echo-nest.html<p>In case you missed it, we just acquired a company called <a href="http://echonest.com">Echo Nest</a> in Boston. These people have been obsessed with understanding music for the past 8 years since it was founded by <a href="https://twitter.com/bwhitman">Brian Whitman</a> and <a href="https://twitter.com/tjehan">Tristan Jehan</a> out of MIT Medialab.</p>
<p>We think this is such a great fit in a lot of ways. In particular, they have focused on very complementary things to what we have. While we have spent a lot of time on things like collaborative filtering, A/B testing, learning from user feedback, and scalability, they have spent time on audio analysis, cultural understanding through web scraping, playlisting, and building a <a href="http://developer.echonest.com/docs/v4">kick ass API</a>.</p>
<p><img src="https://erikbern.com/assets/2014/03/Screen-Shot-2014-03-22-at-11.29.42-AM.png" alt="image"></p>
<p>For some info about what Echo Nest has been up to, check out Paul Lamere's <a href="http://musicmachinery.com/">Music Machinery blog</a> as well as <a href="http://notes.variogr.am/">Brian Whitman's blog.</a></p>
<p>We're super excited on starting to incorporate Echo Nest's technology into our own product and we already have a couple of things in the pipe we might launch shortly. Stay tuned!</p>
Momentum strategies2014-03-03T00:00:00Zhttps://erikbern.com/2014/03/03/momentum-strategies.html<p>Haven't posted anything in ages, so here's a quick hack I threw together in Python on a Sunday night. Basically I wanted to know whether momentum strategies work well for international stock indexes. I spent a bit of time putting together a strategy that buys the stock index if the return during the previous n days was positive, otherwise doesn't do anything. I ran this strategy for a basket of approximately 20 stock markets.</p>
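<p>Condensed into a few lines of pandas, the strategy looks roughly like this – a sketch, where <code>prices</code> is a Series of daily closes (I pulled mine with ystockquote) and the window size is a parameter:</p>
<pre><code>import pandas as pd

def momentum_returns(prices, n=90):
    # Long the index when the trailing n-day return is positive, flat otherwise.
    daily = prices.pct_change()
    signal = prices.pct_change(n) > 0                        # momentum filter
    position = signal.shift(1).fillna(False).astype(float)   # trade on yesterday's signal
    return daily * position

# Cumulative growth of $1: (1 + momentum_returns(prices)).cumprod()
</code></pre>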
<p>Anyway, disregarding transaction costs, yada yada, historical returns are not a guarantee for future returns, blah blah, here are the results. It doesn't matter which window size we use, we still make a little bit more money from a momentum strategy:</p>
<p><img src="https://erikbern.com/assets/2014/03/indices.png" alt="image"><em>Returns for each strategy. Click picture to see source code</em></p>
<p>Minor note: the returns above ignore volatility, actually understating the impact. Looking at <a href="http://en.wikipedia.org/wiki/Sharpe_ratio">Sharpe ratios</a> instead, the momentum strategies have ratios between 1.08-1.40, whereas the “Buy and Hold” strategy has a Sharpe ratio of 0.49.</p>
<p>This whole exercise was mostly for fun, but it was a delight how easy it was. I just installed <a href="https://pypi.python.org/pypi/ystockquote">ystockquote</a> and was able to put everything together within an hour. I've worked in finance, where this was almost harder to do. With such a low barrier, I almost consider it a duty for any hacker with a trading account to spend some time hacking on their finances. I will certainly spend more time analyzing my trades. Not necessarily engaging in anything crazy though, and I certainly discourage anyone from doing so.</p>
Ratio metrics2014-01-23T00:00:00Zhttps://erikbern.com/2014/01/23/ratio-metrics.html<p>We run a ton of A/B tests at Spotify and we look at a ton of metrics. Defining metrics is a little bit of an art form. Ideally you want to define success metrics before you run a test to avoid cherry picking metrics. You also want to define a metric that has as high signal to noise ratio. And of course, most importantly, your metric should ideally correlate to high level business impact as much as possible.</p>
<p>One pet peeve I have is metrics <em>defined as ratios.</em> While some of them are useful, there are usually severe caveats that you can spot by just thinking about what goes in the numerator and what goes into the denominator.</p>
<p><strong>Example 1: Average session length</strong></p>
<p>Bad metric. What happens if you add a couple of short sessions on top of your existing numbers without changing anything else? E.g. you could improve the number of sessions by 10% but total session time by only 5%. That's a good thing, but average session length would drop and tell another story.</p>
<p><strong>Example 2: Number of clicks per user</strong></p>
<p>What if you launch a feature that sucks and you churn out a bunch of low-intent users? You might end up with high-intent users who drive this up, going against what you mean by “success”.</p>
<p><strong>Example 3: Repeat consumption (or bounce rate)</strong></p>
<p>If you encourage content discovery, you might hope that people enjoy new content so much they come back to it. But you might also improve superficial discovery even more so, meaning this metric goes down.</p>
<p><strong>Example 4: Skip rate</strong></p>
<p>Imagine Spotify Radio. Same thing as 2: churning out a bunch of low-intent users may actually improve the skip rate, although this is a bad thing. Conversely, building a better product might paradoxically <em>increase</em> skip rate, because of an influx of low-intent users who dig the feature.</p>
<p><strong>So what metric should you use?</strong></p>
<p>In general, unless you have a really good reason for it, avoid metrics that look at the <em>ratio between two quantities.</em> Instead, I prefer metrics such as total time, number of daily active users, total number of clicks. These metrics are pretty uninteresting in themselves (what does it tell you that the user base spent 10,000 years listening to playlists yesterday?) but they <em>let you draw conclusions about the differences.</em> Eg. if the total number of clicks went up by +5%, then that's a good thing.</p>
<p>For A/B tests where you have uneven proportions between groups, you can simply extrapolate to the whole population by dividing by the ratios. Eg. if 1% of the users are in test group A, and 2% in group B, multiply the metrics by 100x and 50x, respectively. Alternatively, just divide them by the total number of registered users in each bucket. That's a static denominator, so it's totally cool to do so.</p>
<p>There are a million pitfalls with A/B testing and using metrics. This is not an argument against any of it <em>per se</em>. Don't throw out the baby with the bathwater, just stay cool and make sure you <a href="http://www.imdb.com/title/tt0097216/">do the right thing</a> :)</p>
Benchmarking nearest neighbor libraries in Python2014-01-12T00:00:00Zhttps://erikbern.com/2014/01/12/benchmarking-nearest-neighbor-libraries-in-python.html<p><a href="https://twitter.com/RadimRehurek">Radim Rehurek</a> has put together an excellent summary of approximate nearest neighbor libraries in Python. This is exciting, because one of the libraries he's covering, <a href="https://github.com/spotify/annoy">annoy</a>, was built by me.</p>
<p>After <a href="http://radimrehurek.com/2013/11/performance-shootout-of-nearest-neighbours-intro/">introducing the problem</a>, he goes <a href="http://radimrehurek.com/2013/12/performance-shootout-of-nearest-neighbours-contestants">through the list of contestants</a> and sticks with five remaining ones. Finally, <a href="http://radimrehurek.com/2014/01/performance-shootout-of-nearest-neighbours-querying/">the benchmarks</a> pits annoy against <a href="https://github.com/mariusmuja/flann">FLANN</a>. Although FLANN seems to have roughly 4x better performance, somewhat surprisingly, Radim concludes annoy is the “winner”. Yay!<figure id="attachment_443" style="width: 1213px;" class="wp-caption alignnone"></p>
<p><img src="https://erikbern.com/assets/2014/01/flann_annoy1000.png" alt="image"><em>1000 nearest neighbors, performance vs accuracy, stolen from Radim's blog</em></p>
<p>I'm not surprised that FLANN is a bit faster. Annoy was mainly built having other goals in mind, primarily being able to use a file based memory range and mmap this quickly. I also think the current algorithm (random hyperplanes) works better with at most a hundred dimensions or so. That being said, at some point I think there's a lot of optimizations to do to Annoy some day when the urge to <a href="https://github.com/spotify/annoy/blob/master/src/annoylib.cc">write messy C++</a> comes back.</p>
<p>Radim does give some fair criticism of the state of these libraries (including annoy). It can be a pain to install any of them, and in annoy's case there seem to be some problems with certain architectures where it basically returns bogus data. Given that nearest neighbor search is such a fundamental problem, it's a little depressing how hard these libraries are to use. Hoping this can change soon with more and more competing libraries out there.</p>
More recommender algorithms2013-12-20T00:00:00Zhttps://erikbern.com/2013/12/20/more-insight-into-recommender-algorithms.html<p>I wanted to share some more insight into the algorithms we use at Spotify. One matrix factorization algorithm we have used for a while assumes that we have user vectors $$ \mathbf{a}_u $$ and item vectors $$ \mathbf{b}_i $$ . The next track $$ i $$ for a user is now given by the relation</p>
<p>$$ P(i \mid u) = \exp(\mathbf{a}_u^T \mathbf{b}_i) / Z_u $$</p>
<p> </p>
<p>Where $$ Z_u $$ is a normalization constant to make sure the probabilities add up to one. This essentially means we treat the track choices as outputs of a <a href="http://en.wikipedia.org/wiki/Softmax_activation_function">softmax</a>. You can think of it as a series of $$ n_u $$ choices for each user, where $$ n_u $$ is the number of tracks they played. Every track is a choice where the probability of each track is given by a softmax distribution. We don't care about the order of these choices, but we assume each choice is independent of the others.</p>
<p>Assuming we have a ton of historical data where user $$ u $$ listened $$ n_{ui} $$ times to track $$ i $$ , we get a total log-likelihood (after simplifying a bit).</p>
<p>$$ \log L = \sum_{u, i} n_{ui}\left(\mathbf{a}_u^T \mathbf{b}_i - \log Z_u\right) = \left(\sum_{u, i} n_{ui}\mathbf{a}_u^T \mathbf{b}_i\right) - \left( \sum_u n_u \log Z_u\right) $$</p>
<p> </p>
<p>Note that $$ Z_u $$ only has to be calculated once for every user. Although it's a sum over all tracks, in practice we can estimate it by sampling random tracks and extrapolating.</p>
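<p>The sampling trick for $$ Z_u $$ is simple enough to show inline – a numpy sketch where $$ B $$ is the matrix of all item vectors (my naming, not production code):</p>
<pre><code>import numpy as np

def estimate_Z(a_u, B, n_samples=1000, rng=np.random):
    # Average the exp-scores over a random subset of tracks, then scale up
    # to the full catalog: an unbiased estimate of sum_i exp(a_u . b_i).
    idx = rng.choice(len(B), n_samples, replace=False)
    return len(B) * np.mean(np.exp(B[idx].dot(a_u)))
</code></pre>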
<p>Turns out you can implement this as a Hadoop job (<a href="http://wwwconference.org/www2007/papers/paper570.pdf">this paper</a> is quite helpful) doing alternating gradient ascent. Convergence isn't super fast, but the results look qualitatively better than other methods like <a href="http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis">PLSA</a> and <a href="http://www2.research.att.com/~yifanhu/PUB/cf.pdf">the algorithm by Koren et al</a> that seems to be pretty popular.</p>
<p>Measured as a pure predictive model, the model described above doesn't perform super well. In the graph below, we look at how well the algorithm ranks future items in a validation set, including items the user didn't play. We use the Gini coefficient – the higher, the better. The model described above is “User MF beta=1.0, f=40” (I'll follow up and explain beta in a later post). <em>HKV</em> refers to the model by Koren et al (because the authors of that paper were Hu, Koren, and Volinsky).</p>
<p><img src="https://erikbern.com/assets/2013/12/giniall-convergence.png" alt="image"></p>
<p>In the plot above, the x-axis represents the number of iterations, whereas the y-axis gives the value of the Gini coefficient.</p>
<p>However, turns out if we look only at <em>items that the user played at least once</em>, the algorithm kicks ass (especially if you try different values for beta, which I'll get back to at some point). It also makes sense that this is a more useful metric to look at, since if a user plays a track zero times, it's quite likely they didn't actually know about the track in the first place.</p>
<p><img src="https://erikbern.com/assets/2013/12/ginipos-convergence.png" alt="image"></p>
<p>We also have a lot of other internal metrics that show that this model outperforms pretty much anything else. It is substantially better at predicting radio skips and thumbs, it's much better at ranking related artists, etc.</p>
<p>Here are some examples of related artists as ranked by the cosine between the vectors:</p>
<table>
<tr>
<th>Artist</th>
<th>Most related artists by this algo</th>
<th>Most related artists by Koren et al's algo</th>
</tr>
<tr>
<th>Daft Punk</th>
<td>Justice<br /> Gorillaz<br /> Deadmau5<br /> The Chemical Brothers<br /> Fatboy Slim</td>
<td>Florence + The Machine<br /> Red Hot Chili Peppers<br /> Kings of Leon<br /> Muse<br /> Lana Del Rey</td>
</tr>
<tr>
<th>Beach Boys</th>
<td>Simon & Garfunkel<br /> The Rolling Stones<br /> The Mamas & The Papas<br /> The Monkees<br /> Elton John</td>
<td>Simon & Garfunkel<br /> Bob Dylan<br /> David Bowie<br /> Billy Joel<br /> Fleetwood Mac</td>
</tr>
<tr>
<th>Eminem</th>
<td>Dr. Dre<br /> 50 Cent<br /> B.o.B<br /> Ludacris<br /> The Black Eyed Peas</td>
<td>Rihanna<br /> David Guetta<br /> Maroon 5<br /> Katy Perry<br /> Beyoncé</td>
</tr>
<tr>
<th>Beyoncé</th>
<td>Rihanna<br /> Alicia Keys<br /> Christina Aguilera<br /> Usher<br /> Bruno Mars</td>
<td>Rihanna<br /> Katy Perry<br /> David Guetta<br /> Maroon 5<br /> The Black Eyed Peas</td>
</tr>
</table>Microsoft's new marketing strategy: give up2013-12-12T00:00:00Zhttps://erikbern.com/2013/12/12/microsofts-new-marketing-strategy-give-up.html<p>I think it's funny how MS at some point realized they are not the cool kids and there's no reason to appeal to that target audience. Their new marketing strategy finally admits what's been long known: the correlation between “business casual” and using Microsoft products:</p>
<p><img src="https://erikbern.com/assets/2013/12/honestly_34.png" alt="image"></p>
<p>Apparently it's also for people in ties:</p>
<p><img src="https://erikbern.com/assets/2013/12/honestly_2.png" alt="image"></p>
<p>And let's add a (beige?) cardigan on top of that:</p>
<p><img src="https://erikbern.com/assets/2013/12/honestly_1.png" alt="image"></p>
<p>On top of that, let's be sexist and add a woman to the campaign who doesn't care about work but likes to take photos!</p>
<p><img src="https://erikbern.com/assets/2013/12/honestly_4.png" alt="image"></p>
<p>Yay forward thinking company!</p>
Bagging as a regularizer2013-12-06T00:00:00Zhttps://erikbern.com/2013/12/06/bagging-as-a-regularizer.html<p>One thing I encountered today was a trick using <a href="http://en.wikipedia.org/wiki/Bootstrap_aggregating">bagging</a> as a way to go beyond a point estimate and get an approximation for the full distribution. This can then be used to penalize predictions with larger uncertainty, which helps reducing false positives.</p>
<p>To me it sounds like a useful trick that I found roughly <em>0 hits on Google</em> for, so I thought I'd share it. Of course, it might be that I've completely gotten something backwards (my ML skills have some notable gaps), so let me know if that's the case.</p>
<p>Here's a little toy model to illustrate the idea. Let's assume we have observations $$ x_i $$ which are $$ \mathcal{N}(0, 1) $$ . We also have $$ y_i = 0.2 + 0.3 / (1 + x_i^2) $$ and labels $$ z_i $$ sampled from the Bernoulli distribution given by $$ P(z_i=1) = y_i $$ (i.e. just flipping a weighted coin where the odds are determined by $$ y_i $$ ).</p>
<p><a href="http://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting">Gradient Boosted Decision Trees</a> are among the state of the art in regression and classification and can have amazing performance. <a href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html">They are available in scikit-learn</a> and easy to plug into an existing script.</p>
<p>Let's say we want to find the best choices for $$ x $$ that gives large values of $$ y $$ . For instance, maybe we have click data and we want to find what items will get most clicks. We fit a GBDT to the $$ (x_i, z_i) $$ observations to train a model that predicts whether an item is going to get clicked or not based on the value of $$ x $$ .</p>
<p><img src="https://erikbern.com/assets/2013/12/plot5.png" alt="image"></p>
<p>Unfortunately, we might end up getting wild predictions. In this particular case, a single noisy point in the training data around $$ x_i=-4 $$ makes the model believe that large negative values of $$ x $$ are good for maximizing $$ y $$ .</p>
<p>Here's the trick. Instead of training a single GBDT, we train 100 smaller GBDT's on strongly subsampled data (instead of 1000 data points, we sample 100 with replacement). Now, we can use all the predicted values of each GBDT to give us an idea of the uncertainty of $$ y $$ . This is awesome, because (among other things) we can penalize uncertain estimates. In this case, I just picked the value at the 20th percentile. I have to confess I'm still a little unsure whether the distribution of $$ y $$ represents a probability distribution, but I think this is just an example of <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrapping</a>, and you could also use Bayes' rule with a prior to derive a posterior distribution.</p>
<p><img src="https://erikbern.com/assets/2013/12/plot41.png" alt="image"></p>
<p>Why could this be useful? For instance, when we recommend music at Spotify, it's much more important to err on the safe side and remove false positives at the cost of false negatives. If we can explicitly penalize uncertainty, then we can focus on recommendations that are safe bets and have more support in historical data.</p>
Model benchmarks2013-11-02T00:00:00Zhttps://erikbern.com/2013/11/02/model-benchmarks.html<p>A lot of people have asked me what models we use for recommendations at Spotify, so I wanted to share some insights. Here are benchmarks for some models. Note that we don't use all of them in production.</p>
<p><img src="https://erikbern.com/assets/2013/11/algo-performance.png" alt="image"><em>Performance for recommender models</em></p>
<p>This particular benchmark looks at how well we are able to rank “related artists”. More info about models:</p>
<ul>
<li>vector_exp: Our own method, a latent factor method trained on all log data using Hadoop (50B+ events).</li>
<li>word2vec: Google's open sourced <a href="https://code.google.com/p/word2vec/">word2vec</a>. We train a model on subsampled (5%) playlist data using skip-grams and 40 factors.</li>
<li>rnn: <a href="http://minds.jacobs-university.de/sites/default/files/uploads/papers/ESNTutorialRev.pdf">Recurrent Neural Networks</a> trained on session data (users playing tracks in a sequence). With 40 nodes in each layer, using <a href="http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf">Hierarchical Softmax</a> for the output layer and <a href="http://arxiv.org/pdf/1207.0580.pdf">dropout</a> for regularization.</li>
<li>koren: <a href="http://www2.research.att.com/~yifanhu/PUB/cf.pdf">Collaborative Filtering for Implicit Feedback Datasets</a>. Trained on same data as vector_exp. Running in Hadoop, 40 factors.</li>
<li>lda: <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">Latent Dirichlet Allocation</a> using 400 topics, same dataset as above, also running in Hadoop.</li>
<li>freebase: Training a latent factor model on artist entities in the <a href="http://www.freebase.com/">Freebase</a> dump.</li>
<li>plsa: <a href="http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis">Probabilistic Latent Semantic Analysis</a>, using 40 factors and same dataset/framework as above. More factors give significantly better results, but still nothing that can compete with the other models.</li>
</ul>
<p>Again, not all of these models are in production, and conversely, we have other algorithms not included above that are in production. This is just a selection of things we've experimented with. In particular, I think it's interesting to note that neither PLSA nor LDA perform very well. Taking sequence into account (rnn, word2vec) seems to add a lot of value, but our best model (vector_exp) is a pure <a href="http://en.wikipedia.org/wiki/Bag-of-words_model">bag-of-words</a> model.</p>
statself.com2013-10-18T00:00:00Zhttps://erikbern.com/2013/10/18/statself-com.html<p>Btw I just put something up online that I spent a couple of evenings in my couch putting together: it's a website where you can track any numerical data on the web. Want to know how many <a href="http://statself.com/series/3acd40e5">Twitter followers</a> you have? <a href="http://statself.com/series/5b46b219">Temperature in NYC</a>? Go to <a href="http://statself.com">statself.com</a> and start tracking it.</p>
<p>Actually statself.com was just a domain name I had lying around for something else, but it turned out to fit pretty well.</p>
<p>I'm not a web developer but sometimes it's nice to see how easy doing complex web apps has become. I put together something in a few afternoons using Tornado, Bootstrap, etc – something that would have taken weeks a few years ago.</p>
Implicit data and collaborative filtering2013-09-16T00:00:00Zhttps://erikbern.com/2013/09/16/implicit-data-and-collaborative-filtering.html<p>A lot of people these days know about collaborative filtering. It's that Netflix Prize thing, right? People rate things 1-5 stars and then you have to predict missing ratings.</p>
<p>While there's no doubt that the Netflix Prize was successful, I think it created an illusion that all recommender systems care about explicit 1-5 ratings and RMSE as the objective. Some people even distrust me when I talk about the approach we take at Spotify.</p>
<p><strong>Misconception 1: Recommender systems are about predicting missing ratings.</strong></p>
<p>This is not true. In our case at Spotify, we have a huge matrix with users and items and each element containing the number of times user <em>u</em> played track <em>i</em>. Note that <em>all of the matrix entries are known</em>. Zero is a zero, and it means that user <em>u</em> actually played track <em>i</em> exactly 0 times.</p>
<p>Actually, even Netflix themselves <a href="http://www.wired.com/underwire/2013/08/qq_netflix-algorithm/">have stated</a> that there's much more information in the implicit data than the explicit. Using implicit data has received a lot less attention, probably because the Netflix Prize was so successful.</p>
<p><strong>Misconception 2: Recommender systems use squared loss.</strong></p>
<p>This is one of my biggest pet peeves. Think about it – what does squared loss mean, from a Bayesian perspective? It means you assume that the errors are all from a normal distribution. This is a reasonable approximation for 1-5 star ratings (although questionable even there), but it's definitely a horrible way to fit play count data (a reasonable approximation would be Poisson). Some people's reaction to this is to transform the data to a more reasonable scale before taking the squared difference, but then your model gets even more complicated to interpret.</p>
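<p>To make that concrete, here is what the two assumptions look like as negative log-likelihoods for a play count <em>n</em> and a model prediction <em>mu</em> – just a sketch to show the difference:</p>
<pre><code>import numpy as np
from scipy.special import gammaln

def gaussian_nll(n, mu):
    # Squared loss is the Gaussian log-density up to a constant,
    # i.e. it assumes the errors are N(0, 1).
    return 0.5 * (n - mu) ** 2

def poisson_nll(n, mu):
    # The Poisson log-density actually treats n as count data.
    return mu - n * np.log(mu) + gammaln(n + 1)
</code></pre>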
<p>If there's one lesson here, it's definitely that <em>every loss function is an assumption about how data is generated.</em> That's why I prefer generative models in the first place, such as <a href="http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis">PLSA</a> or <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">LDA</a>. These methods were originally developed for text classification, but the <a href="http://en.wikipedia.org/wiki/Bag-of-words_model">“bag of words”</a> approach turns out to work great for implicit collaborative filtering. Note that there are <a href="http://www2.research.att.com/~yifanhu/PUB/cf.pdf">some algorithms</a> that use squared loss even for <em>implicit</em> collaborative filtering, but I'm not sure what they assume about the data really.</p>
<p><strong>Misconception 3: Recommender systems are predictive models.</strong></p>
<p>This is a subtle one. You can look at the Netflix Prize as a challenge to predict unknown values, and in the same way you can look at implicit collaborative filtering as essentially a predictive model where you are trying to predict what the user is going to do in the future. But just because you can predict that user <em>u</em> is going to play track <em>i</em>, does that mean it's a good recommendation? After all, there might be some super obscure track <em>j</em> that user <em>u</em> would love if they actually had found it. Just recommending most likely track <em>i</em> introduces a strong popularity bias.</p>
<p>This relates to the previous question. Even if we found a loss function that relates to the generative model, it doesn't mean we have a way of optimizing recommendation quality by minimizing some loss function. So what should we do? Luckily, it turns out there's some tricks you can do, like normalizing for popularity, that work reasonably well in practice.</p>
<p><strong>Misconception 4: Recommender systems are all about recommending items to users.</strong></p>
<p>I would actually argue that detecting item similarity using collaborative filtering is more important. This is another thing where I've hardly seen any research, and I don't really have a lot of good ideas, although for some reason item-item cosine works really well in latent factor models.</p>
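<p>In case it's not obvious what “item-item cosine in a latent factor model” means in practice, here's a minimal sketch, assuming you already have a factorization that gives you one vector per item:</p>
<pre>import numpy as np

def track_similarity(item_factors, i, j):
    # item_factors: (n_items, f) matrix of latent item vectors from
    # whatever factorization you ran. Cosine of rows i and j.
    a, b = item_factors[i], item_factors[j]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
</pre>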
Vote for our SXSW panel!2013-09-04T00:00:00Zhttps://erikbern.com/2013/09/04/vote-for-our-sxsw-panel.html<p>If you have a few minutes, you should check out the panel proposal from <a href="https://twitter.com/mrchrisjohnson">Chris Johnson</a> and me. Go here and vote: <a href="http://panelpicker.sxsw.com/vote/24504">http://panelpicker.sxsw.com/vote/24504</a></p>
<p><strong>Algorithmic Music Discovery at Spotify</strong></p>
<p>Spotify crunches hundreds of billions of streams to analyze users' music taste and provide music recommendations. We will discuss how the algorithms work, how they fit within the products, what the problems are, and where we think music discovery is going. The talk will be quite technical with a focus on the concepts and methods, mainly how we use large-scale machine learning, but we will also cover some aspects of music discovery from a user perspective that greatly influenced the design decisions.</p>
What's up with music recommendations?2013-08-17T00:00:00Zhttps://erikbern.com/2013/08/17/306.html<p>I just answered a Quora question about <a href="http://www.quora.com/Machine-Learning/What-if-any-are-the-differences-in-the-algorithms-that-are-behind-recommendations-for-music-and-movies/answer/Erik-Bernhardsson?__snids__=163790174&__nsrc__=1">what, if any, are the differences in the algorithms that are behind recommendations for music and movies</a>.</p>
<p>Of course, every media type is different. For instance, there are fundamental reasons why latent factor models work really well for music and movies, as opposed to <a href="http://www.scribd.com/doc/86498718/Machine-Learning-with-Large-Networks-of-People-and-Places">location recommendations</a>, where I suspect graph-based models are more powerful. <a href="http://www.stanford.edu/~rezab/papers/wtf_overview.pdf">People recommendation</a> is another animal, and I'm sure <a href="http://homepages.cae.wisc.edu/~jamieson/me/BeerMapper.html">beer recommendation</a> has its own domain-specific quirks.</p>
3D2013-08-12T00:00:00Zhttps://erikbern.com/2013/08/12/3d.html<p><a href="http://www.a1k0n.net">Andy Sloane</a> decided to call my 2D visualization and <a href="http://www.a1k0n.net/spotify/artist-viz/">raise it to 3D</a>.</p>
<p>(Looks a little weird in the iframe, but check out the link). It's based on an <a href="https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation">LDA</a> model with 200 topics, so the artists tend to stick to clusters where each cluster is a topic. The embedding also uses <a href="http://homepage.tudelft.nl/19j49/t-SNE.html">t-SNE</a> but in three dimensions (obviously).</p>
2D embedding of 5k artists = WIN2013-08-11T00:00:00Zhttps://erikbern.com/2013/08/11/2d-embedding-of-5k-artists-win.html<p>I'm at <a href="http://www.kdd.org/kdd2013/">KDD</a> in Chicago for a few days. We have a Spotify booth tomorrow, and I wanted to put together some cool graphics to show. I've been thinking about doing a 2D embedding of the top artists ever since I read about <a href="http://homepage.tudelft.nl/19j49/t-SNE.html">t-SNE</a>, so this was a perfect opportunity to spend some time on it.</p>
<p>So – I spent a couple of hours taking the lower-dimensional representation of all artists, plugging it into the C++ implementation the authors provide, then using <a href="http://matplotlib.org/">matplotlib</a> to render something cool. Like all good visualizations, it took me roughly 30 minutes to get something up, but then I spent another 3 hours tweaking the fonts, colors, sizes, all that stuff.</p>
<p><a href="http://www.erikbern.com/tsne/artists.pdf">Here</a> is the result, and I'm pretty happy with it. Check it out, and zoom in/out and scroll around. Here are some random screenshots:</p>
<p><img src="https://erikbern.com/assets/2013/08/overview.png" alt="image"><em>High-level overview</em></p>
<p>Zooming in reveals a lot of distinct clusters, mainly by genre, but there are also big islands of Swedish/Dutch/Spanish artists. Here's a close-up of the hip hop cluster:</p>
<p><img src="https://erikbern.com/assets/2013/08/hiphop.png" alt="image"><em>Hip hop cluster</em></p>
<p>Anyway, cool thing #2: I took the entire PDF to FedEx downstairs and had them print it on a 24” x 24” poster. Win!</p>
<p><img src="https://erikbern.com/assets/2013/08/poster.png" alt="image"><em>Printed poster</em></p>
<p>Btw, I also had to include the view from my room on the 34th floor. #humblebrag</p>
<p>And since you're asking, I generate the colors by a separate t-SNE embedding. What I do is I run two separate 2D embeddings, one to get the coordinates, and one to get the colors. The “color coordinates” are then mapped to coordinates in the <a href="http://en.wikipedia.org/wiki/HSL_and_HSV">HSV space</a> which is then <a href="http://docs.python.org/2/library/colorsys.html">transformed to RGB</a>.</p>
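<p>The recipe looks roughly like the sketch below. It uses scikit-learn's t-SNE rather than the C++ implementation I actually ran, and the angle-to-hue mapping is just one reasonable choice, not necessarily the exact one behind the poster:</p>
<pre>import colorsys
import numpy as np
from sklearn.manifold import TSNE

vectors = np.random.rand(1000, 40)      # stand-in for the artist factors

coords = TSNE(n_components=2).fit_transform(vectors)   # where to draw
color2d = TSNE(n_components=2).fit_transform(vectors)  # what color

# Interpret the second embedding in polar form: angle -> hue,
# radius -> saturation, so nearby artists get similar colors.
angle = np.arctan2(color2d[:, 1], color2d[:, 0])
radius = np.hypot(color2d[:, 0], color2d[:, 1])
colors = [colorsys.hsv_to_rgb((a + np.pi) / (2 * np.pi),
                              r / radius.max(), 0.9)
          for a, r in zip(angle, radius)]
# coords and colors can now go straight into matplotlib's scatter().
</pre>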
Delivering Music Recommendations2013-08-09T00:00:00Zhttps://erikbern.com/2013/08/09/delivering-music-recommendations.html<p>I've turned into a lazy bastard and I'm just posting presentations on this blog, but here's one from <a href="http://rohanradio.com">Rohan Singh</a> at Spotify talking about the backend infrastructure of the <a href="http://play.spotify.com/discover">Discover</a> page.</p>
ML+Hadoop at NYC Predictive Analytics2013-08-03T00:00:00Zhttps://erikbern.com/2013/08/03/mlhadoop-at-nyc-predictive-analytics.html<p>I was just at the <a href="http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/">NYC Predictive Analytics meetup</a> talking about how we build machine learning algorithms using Hadoop to power music recommendations.</p>
<p>Great meetup, where we had two speakers, me and <a href="http://www.metablake.com">Blake Shaw</a> from Foursquare. Blake talked about how they use machine learning at Foursquare, using Hadoop (and Luigi), and he uploaded his slides <a href="https://www.dropbox.com/s/tn4f81is4p1a5ds/HadoopML.pdf">here</a>!</p>
<p>Here's the full video of the talk (both mine and Blake's).</p>
HubSpot's Picture Shows how to Maintain Monocultures in the 21st Century2013-07-28T00:00:00Zhttps://erikbern.com/2013/07/28/hubspots-creepy-picture-shows-how-to-maintain-monocultures-in-the-21st-century.html<p>I thought <a href="http://www.businessinsider.com/hubspot-slidedeck-on-company-culture-2013-3">this article</a> about the company culture at HubSpot was kind of funny. “HubSpot's Awesome Presentation Shows how to Create a 21st Century Culture”.</p>
<p><img src="https://erikbern.com/assets/2013/07/Screen-Shot-2013-07-28-at-1.40.44-PM.png" alt="image"></p>
<p>Just FYI: You're not different. You're a bunch of white hipsters aged 25-30 dressed up in the same theme. That's not being different.</p>
<p>On a more serious note, this represents one of the most challenging aspects of scaling a company culture. You start with a bunch of friends with the same background. Nothing strange about this – friends hire friends. The fact that people have friends that are similar to them isn't great, but it reflects society. Spotify started off as a bunch of like-minded mostly male Swedish university dropouts, for instance. Anyway, at some point, you need to start hiring different people to avoid groupthink and to get access to a much bigger talent pool. Not saying it's easy.</p>
More Luigi: Presentation from OSCON2013-07-27T00:00:00Zhttps://erikbern.com/2013/07/27/more-luigi-presentation-from-oscon.html<p>I was in Portland, OR for a few days hanging out at <a href="https://www.oscon.com">OSCON</a>. Was fun. I also talked a bit about <a href="https://github.com/spotify/luigi">Luigi</a>:</p>
<p>Next week I'm presenting at the <a href="http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/">NYC Predictive Analytics meetup</a> together with Blake Shaw from Foursquare. The topic is ML + Hadoop. Will be fun!</p>
Optimizing over multinomial distributions2013-07-24T00:00:00Zhttps://erikbern.com/2013/07/24/normalizing-multinomial-distributions.html<p>Sometimes you have to maximize some function $$ f(w_1, w_2, \ldots, w_n) $$ where $$ w_1 + w_2 + \ldots + w_n = 1 $$ and $$ 0 \le w_i \le 1 $$. Usually, $$ f $$ is concave and differentiable, so there's one unique global maximum and you can solve it by applying <a href="http://en.wikipedia.org/wiki/Gradient_descent">gradient ascent</a>. The presence of the constraint makes it a little tricky, but we can solve it using the method of <a href="http://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>. In particular, since the surface $$ w_1 + w_2 + \ldots + w_n = 1 $$ has the normal $$ (1, 1, \ldots, 1) $$, the following optimization procedure works:</p>
<ol>
<li>Go one step in the direction of the gradient</li>
<li>Normalize the new point by projecting it orthogonally back onto the surface</li>
</ol>
<p>Note that we can't just normalize by dividing by the sum of the new vector. What we want to do is project it <em>orthogonally</em> back onto the surface. However, we need to do this without ending up with negative numbers. This turns out to be surprisingly difficult to implement, but let me spare you the agony and present one implementation in Python:</p>
<pre>def project(v):
    # Orthogonal projection of v back onto the simplex
    # {w : w_1 + ... + w_n = 1, w_i >= 0}. Walk through the
    # elements in ascending order; each element absorbs its share
    # of the excess, and elements too small to absorb a full share
    # get clamped to zero by the final max().
    excess = sum(v) - 1.0
    for i, elm in enumerate(sorted(v)):
        sub = excess / (len(v) - i)
        excess -= min(elm, sub)
    return [max(w - sub, 0) for w in v]
</pre>
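<p>For completeness, here's a sketch of the full procedure, plugging project() into the two steps above (grad is assumed to compute the gradient of $$ f $$):</p>
<pre>def maximize(grad, n, steps=1000, learning_rate=0.01):
    # Projected gradient ascent over the multinomial simplex:
    # take a gradient step, then project back onto the surface.
    w = [1.0 / n] * n                 # start at the uniform distribution
    for _ in range(steps):
        g = grad(w)
        w = project([wi + learning_rate * gi for wi, gi in zip(w, g)])
    return w
</pre>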
More Luigi!2013-06-26T00:00:00Zhttps://erikbern.com/2013/06/26/more-luigi.html<p>Continuing in the same spirit of shameless self-promotion, here's some recent <a href="https://github.com/spotify/luigi">Luigi</a> press:</p>
<ul>
<li><a href="http://www.reddit.com/r/Python/comments/1h1won/luigi_is_a_python_module_that_helps_you_build/">Reddit thread</a></li>
<li><a href="https://qconnewyork.com/sites/default/files/QConNY2013_UriLaserson_Python_Hadoop.pdf">A Guide to Python Frameworks for Hadoop</a> (slides from the <a href="http://www.meetup.com/Hadoop-NYC/events/118226212/">NYC Hadoop User Group</a>)</li>
<li>This presentation from the <a href="http://www.meetup.com/Open-Analytics-NYC/">Open Analytics NYC meetup</a> about how Foursquare uses Luigi</li>
</ul>
<p>Luigi is in the middle of a pretty massive refactoring of the visualizer. David Whiting at Spotify just ripped out the old visualizer (based on Graphviz) and <a href="https://github.com/spotify/luigi/commit/63c07b8ca9f3b1510c7b3b82d0fe91115aa5afe0">replaced it</a> with one based on <a href="http://d3js.org/">D3</a>. Some features were sacrificed, but I'm convinced the new visualizer has a lot more potential to grow into a full-fledged monitoring page. There are a ton of new visualization commits sitting in <a href="https://github.com/visualdna/luigi">VisualDNA's repo</a> that I'm really excited to merge back into the main repo soon.</p>
<p>Over and out.</p>
hdfs2cass2013-06-19T00:00:00Zhttps://erikbern.com/2013/06/19/hdfs2cass.html<p>Just open sourced <a href="https://github.com/spotify/hdfs2cass">hdfs2cass</a> which is a Hadoop job (written in Java) to do efficient Cassandra bulkloading. The nice thing is that it queries Cassandra for its topology and uses that to partition the data so that each reducer can upload data directly to a Cassandra node. It also builds SSTables locally etc. Not an expert at Cassandra so I'll stop describing those parts before I embarrass myself.</p>
<p>The way we use it, Cassandra almost becomes some kind of a CDN. All we want to do is to push out large amounts of semi-static data, with no random writes at all. Cassandra is pretty good at high write loads so it's a pretty good fit for the job.</p>
<p>Anyway, the reason why I think this is cool is that all of our previous bulk loading tools have had a single-machine bottleneck. Suddenly we can do many-to-many bulk loading. This means we now have a solution that <em>scales horizontally</em>. This starts to matter when you transfer lots of data. In contrast, for smaller files (< 50GB), we typically scp or even torrent files. This is pretty fast, but you are still constrained by the pipe of the source machine.</p>
<p>Another thing I like is that it shows how Hadoop (I'm talking MapReduce now) is great at sharding things, dealing with failures, retries, etc. It's awesome to rely on that framework when you are pushing a terabyte and you expect some fraction of the reduce tasks to fail.</p>
NoDoc2013-06-16T00:00:00Zhttps://erikbern.com/2013/06/16/nodoc.html<p>We had an <a href="http://en.wikipedia.org/wiki/Unconference">unconference</a> at <a href="http://spotify.com/">Spotify</a> last Thursday and I added a semi-trolling semi-serious topic about abolishing documentation. Or <em>NoDoc</em>, as I'm going to call this movement. This was meant to be mostly a thought experiment, but I don't see it as complete madness.</p>
<p>To be clear, I'm not talking about comments in the code here. I think those are great, and you should probably do more than you are already doing. Explaining non-obvious things and edge cases is a great thing to do. Including a link to a Stack Overflow thread discussing a bug and a workaround – awesome. What I'm referring to, and what I want to remove, is long introductions about class hierarchies, technical details about the code, or maybe even architecture. These can all go in comments and sometimes even in the symbol names.</p>
<p>Here's my point: Everybody says they like documentation. Nobody writes it. Even when it exists, I hardly read it. Everyone feels bad about the lack of documentation. Someone starts a workgroup about it every few months. They come up with guidelines. Then people write more docs in new formats, put them in some obscure place like a wiki, and the documentation grows outdated a few months later. Looking back, maybe documentation is an investment with a negative net return.</p>
<p>Here are my assumptions:</p>
<ul>
<li>Outdated documentation is worse than no documentation</li>
<li>Any documentation outside the repo will get lost and outdated</li>
<li>Talking to people and asking them questions is 100x more efficient than trying to figure something out from documentation</li>
<li>To find people to talk to, look at the git log</li>
<li>Examples are better than API docs</li>
<li>Ideally examples are a part of the unit tests so they are guaranteed to stay up to date</li>
<li>If you need API docs for a method, consider renaming/rewriting/refactoring the method. It should be obvious from the function name and arguments what's going on</li>
<li>In dynamically typed languages, type assertions are the best way to specify expected types (see the toy sketch after this list)</li>
<li>Having a “deployment book” (with detailed instructions on how to put the system in production) should be discouraged. If you need something like that, you haven't fully automated it</li>
<li>Who is the audience? If it's just three people, don't waste your time.</li>
</ul>
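<p>To illustrate the points about type assertions and examples doubling as tests, here's a toy sketch (the function itself is made up):</p>
<pre>def skip_rate(skips, plays):
    """Fraction of streams that were skipped.

    >>> skip_rate(1, 3)
    0.25
    """
    # A type assertion instead of API docs: fails fast and loudly.
    assert isinstance(skips, int) and isinstance(plays, int)
    return skips / float(skips + plays)

if __name__ == '__main__':
    import doctest
    doctest.testmod()   # the docstring example doubles as a unit test
</pre>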
<p>Obviously documentation is nice. Luigi has a <a href="https://github.com/spotify/luigi">long overview</a> at the GitHub page, for instance. It's about finding a balance; that's why we're discussing this in the first place. I'm arguing that there's no need to feel bad about advocating slightly <em>less</em> documentation overall. For internal projects, unless you are planning to leave, you probably don't need to overdo it. And it's probably a good idea not to encourage documenting the manual steps needed to deploy something.</p>
Wikiphilia2013-06-02T00:00:00Zhttps://erikbern.com/2013/06/02/wikiphilia.html<p>I've been obsessed with Wikipedia for the past ten years. Occasionally I find some good articles worth sharing and that's why I created the <a href="http://twitter.com/wikiphilia">wikiphilia</a> Twitter handle. Just a long stream of stuff that for one reason or another may be interesting.</p>
<p>It's also a bunch of friends posting links. Anyway, the tragedy is that there are 800 tweets but only 70 followers, so you should follow it now.</p>
<p>On a related note, <a href="http://en.wikipedia.org/w/index.php?title=Regulate_(song)&oldid=385860234">this version of Wikipedia article for Warren G's Regulate</a> was hilarious until someone removed it in 2010.</p>
Spotify's Discovery page2013-05-31T00:00:00Zhttps://erikbern.com/2013/05/31/spotifys-discovery-page.html<p>The Discovery page, the new start page in Spotify, is finally out to a fairly significant percentage of all users. Really happy since we have worked on it for the past six months. Here's a screen shot:</p>
<p><img src="https://erikbern.com/assets/2013/05/discovery.png" alt="image"></p>
<p>Some cool features:</p>
<ul>
<li>Artist/album/track recommendations based on stuff you've listened to before</li>
<li>New releases recommendations</li>
<li>Concert recommendations</li>
<li>Third party news stories</li>
</ul>
<p>There's a ton of reviews out there: <a href="http://howto.cnet.com/8301-11310_39-57586690-285/getting-started-with-spotify-discover/">cnet</a>, <a href="http://mashable.com/2013/05/29/spotify-discover-now-available-to-everyone/">mashable</a>. Some super positive, some kind of negative. Here's a super positive one that I'm mainly pointing out because I'm obviously biased:<a href="http://www.pcworld.com/article/2040066/awesome-new-discover-feed-adds-personalized-music-recommendations-to-spotify.html"> Awesome new Discover feed adds personalized music recommendations to Spotify</a>.</p>
<p>How does it work? A ton of matrix factorization algorithms power the collaborative filtering part, and then a bunch of Hadoop jobs generate the recommendations. Looking forward to talking more about this in detail at some point! Meanwhile, the Discover page will definitely improve over time as we keep tweaking the algorithms and content.</p>
Fermat's principle2013-05-21T00:00:00Zhttps://erikbern.com/2013/05/21/fermats-principle.html<p>I was browsing around on the Internet and the physics geek in me started reading about <a href="http://en.wikipedia.org/wiki/Fermat's_principle">Fermat's principle</a>. And suddenly something came back to me that I've been trying to suppress for many years – how I never understood why there's anything fundamental about the <strong>principle of least time</strong>.</p>
<p>The principle of least time states that the light will travel from A to B in such a way that <em>the time is minimized.</em> Using this principle you can derive a whole bunch of optics laws. But to me that doesn't make any sense.</p>
<p>I read a dozen articles about it online, and they all seem to do the following:</p>
<ol>
<li>Fermat's principle works for reflections</li>
<li>So let's solve it for the case when light enters another medium</li>
<li>Yay we just derived <a href="http://en.wikipedia.org/wiki/Snell%27s_law#Derivations_and_formula">Snell's law</a>!</li>
</ol>
<p>The problem is that they never derive Fermat's principle for anything other than reflection. And it hardly even works for reflections: why wouldn't the beam just travel straight from A to B without reflecting off the mirror?</p>
<p><img src="https://erikbern.com/assets/fermats-principle/mountain-reflected.jpeg" alt="image"></p>
<p>I suspect that Snell's law in itself actually “proves” the principle of least time, but that's the way the causation goes, not the other way around. Of course there's absolutely no way nature can “know” which path is the shortest. Fermat's principle <em>is true</em>, but the derivation in Wikipedia seems bogus to me.</p>
<p><img src="https://erikbern.com/assets/fermats-principle/refraction-reflection.png" alt="image"></p>
<p>The main problem with just saying that nature “chooses” the shortest path is that the light never traveled along all possible paths in the first place. Consider a laser beam fired from some point into glass: it obviously already has an angle. You can take any two points on the incoming and outgoing beams and verify that Fermat's principle holds, but that seems like a consequence of Snell's law. Causation is tricky sometimes.</p>
<p><img src="https://erikbern.com/assets/fermats-principle/snells-law.gif" alt="image"></p>
<p>Consider the same laser beam fired from air into glass at a tilted angle. <em>The light only goes in one direction in the first place</em>, so you can't talk about constructive interference until the beam actually hits the surface. At that point, Wikipedia mentions that Snell's law can be derived using the <a href="http://en.wikipedia.org/wiki/Huygens%27_principle">Huygens–Fresnel principle</a>, which seems totally legit to me.</p>
<p><img src="https://erikbern.com/assets/fermats-principle/huygens-fresnel-principle.png" alt="image"></p>
<p>If I'm not mistaken, Fermat's principle follows from that, but that's the way the causation goes, not the other way around. Unless I'm smoking crack here, this is something most standard physics textbooks probably do not understand.</p>
Snakebite2013-05-07T00:00:00Zhttps://erikbern.com/2013/05/07/snakebite.html<p>Just promoting Spotify stuff here: check out the <a href="https://github.com/spotify/snakebite">Snakebite</a> repo on GitHub, written by Wouter de Bie. It's a super fast tool for accessing HDFS from the CLI or Python, talking to the namenode directly over sockets/protobuf.</p>
<p>Spotify's developer blog features a <a href="http://labs.spotify.com/2013/05/07/snakebite/">nice post</a> outlining what it's useful for. I think this kicks ass, and there will definitely be some kind of <a href="https://github.com/spotify/luigi">Luigi</a> integration coming up at some point.</p>
Stuff that bothers me: “100x faster than Hadoop”2013-04-27T00:00:00Zhttps://erikbern.com/2013/04/27/stuff-that-bothers-me-100x-faster-than-hadoop.html<p>The simple way to get featured on big data blogs these days seems to be:</p>
<ol>
<li>Build something that does 1 thing super well but nothing else</li>
<li>Benchmark it against Hadoop</li>
<li>Publish stats showing that it's 100x faster than Hadoop</li>
<li>$$$</li>
</ol>
<p>Spark claims they're <a href="http://spark-project.org/">100x faster than Hadoop</a> and there are a lot of stats showing <a href="http://www.hapyrus.com/blog/posts/behind-amazon-redshift-is-10x-faster-and-cheaper-than-hadoop-hive-slides">Redshift is 10x faster than Hadoop</a>. There's a bunch of papers with <a href="http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf">similar claims</a>. I spent five minutes Googling “Xx faster than Hadoop” and found a ton of <a href="http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/">other stats</a>.</p>
<p>(Btw, when people say this, I generally take it to mean that X is y times faster than Hadoop <em>MapReduce</em>. Just nitpicking.)</p>
<p>Anyway, these stats bother me a lot because everyone knows that</p>
<ul>
<li>Horizontal scalability comes at a very high price, because things get I/O bound. That's fine, because you can always throw more hardware at the problem.</li>
<li>Flexibility comes at a price, and that's totally fine for most people. Hadoop supports pretty much anything that can be reduced to a series of MapReduce jobs, which in practice turns out to be most stuff.</li>
<li>Ease of use comes at a price, and that's fine. There's a reason a lot of people choose Python over C++, after all. Ok, writing MapReduce jobs in Java sucks, but there are a lot of nice tools out there to make it simple (subtle product placement: check out <a href="https://github.com/spotify/luigi">Luigi</a>)</li>
</ul>
<p>I think Spark is a really cool piece of technology, so don't get me wrong. I just think it's stupid to compare Hadoop and Spark when clearly they are two very different products with different use cases. Just as you wouldn't compare Tokyo Cabinet to MySQL or whatever. So please never ever say that something is X times faster than Hadoop again.</p>
Presentation about Luigi2013-04-26T00:00:00Zhttps://erikbern.com/2013/04/26/presentation-about-luigi.html<p>I like the editing!</p>
Being data driven2013-04-13T00:00:00Zhttps://erikbern.com/2013/04/13/being-data-driven.html<p>I picked up an issue of <em>Foreign Affairs</em> while flying back to NYC from SFO. It features <a href="http://www.foreignaffairs.com/discussions/interviews/generation-kill">this long interview with U.S. General Stanley McChrystal</a> and I thought it was pretty interesting how striking some of the similarities are between fighting in a war and developing software.</p>
<p>On cycle time and how it's important to learn and integrate quickly:</p>
<p><em>In 2003, in many cases we'd go after someone, we might locate them and capture or kill them, and it would be weeks until we took the intelligence we learned from that and were able to turn it into another operation. Within about two years, we could turn that cycle three times in a night. We could capture someone, gain intelligence from the experience, go after someone else, and do three of those in a row, the second two involving people we didn't even know existed at the beginning of the night.</em></p>
Annoy2013-04-12T00:00:00Zhttps://erikbern.com/2013/04/12/annoy.html<p><a href="https://github.com/spotify/annoy">Annoy</a> is a simple package to find approximate nearest neighbors (ANN) that I just put on GitHub. I'm not trying to compete with existing packages, but Annoy has a couple of features that make it pretty useful. Most importantly, it uses very little memory and can put everything in a contiguous blob that you can mmap from disk. This way multiple processes can share the same index.</p>
<p>We use it at Spotify to put a couple of million tracks in 40-dimensional space and then query for the most similar tracks. Using floats, 5M * 40 * 4 is already 800MB, so sharing memory across multiple processes makes sense.</p>
<p>I hate when people talk about the “multicore revolution”, but it kind of makes sense here. An interesting side effect of it is that memory capacity doesn't seem to grow as fast as the number of cores. Static file-based mmapped indexes are actually really useful for a wide range of features. We also use <a href="http://fallabs.com/tokyocabinet/">Tokyo Cabinet</a> a lot at Spotify, whenever data is at least a few hundred megs, at which point it does make sense to share the data across processes.</p>
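<p>Basic usage looks something like the sketch below (the API may have changed since this was written, so treat it as illustrative):</p>
<pre>import random
from annoy import AnnoyIndex

f = 40  # dimensionality of the track vectors
index = AnnoyIndex(f, 'angular')
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)           # 10 trees; more trees -> better accuracy
index.save('tracks.ann')

# Another process can mmap the same index without copying it:
index2 = AnnoyIndex(f, 'angular')
index2.load('tracks.ann')
print(index2.get_nns_by_item(0, 10))  # 10 nearest neighbors of item 0
</pre>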
More Luigi!2013-03-22T00:00:00Zhttps://erikbern.com/2013/03/22/more-luigi-pres.html<p>Elias Freider just talked about Luigi at PyData 2013:</p>
<p>The presentation above is much better than the one I put together a few weeks ago. In case anyone is interested, I'll include it too:</p>
ML at Twitter2013-02-27T00:00:00Zhttps://erikbern.com/2013/02/27/ml-at-twitter.html<p>I recently came across <a href="http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf">this paper describing how they do ML at Twitter</a>.</p>
<p>TL;DR: their approach is pretty interesting. Everything is a <a href="http://pig.apache.org/">Pig</a> workflow, and the ML pieces are implemented as <a href="http://pig.apache.org/docs/r0.9.1/udf.html">UDFs</a>.</p>
<p>As long as your ML logic can be expressed as small atomic functions, I'm sure this works great. But there's so much more to it than that. All the small slicing and transforming is so much easier to express in a language like Python. I'm still not really comfortable with Pig as a language to power these data flows.</p>
<p>John Cook wrote about <a href="http://www.johndcook.com/blog/2012/10/24/python-for-data-analysis/">math stuff in Python</a> a few months ago: <em>I find doing mathematical programming in a general-purpose language is easier than doing general-purpose programming in a mathematical language.</em> I could not agree more, and I want to generalize it further: <em>I'd rather do domain-specific programming in a general-purpose language than the other way around.</em></p>
<p>Pig definitely comes with a set of cool features. It handles joining automatically (not a trivial task), gives you trivial ways to transform your data, and does the query planning for you.</p>
<p>It's not a full-fledged workflow management tool though. At Spotify, we use <a href="https://github.com/spotify/luigi">Luigi</a> to define the workflows and the dependencies. Luigi gives you a way to write everything in 100% Python, including how different computational tasks are related and what is dependent on what. It comes with Hadoop support, but the abstraction layer is lower than Pig, and you have to implement your mapper and reducer yourself (rough sketch below). For skewed joins this could involve some engineering around things that Pig does automatically for you.</p>
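<p>For comparison, a Luigi Hadoop job has roughly the shape sketched below. This is from memory, and the upstream StreamLog task and paths are made up, so check the actual docs before copying anything:</p>
<pre>import luigi
import luigi.hadoop  # moved to luigi.contrib.hadoop in later versions

class ArtistStreamCounts(luigi.hadoop.JobTask):
    # Hypothetical job: count streams per artist from TSV logs.
    date = luigi.DateParameter()

    def requires(self):
        return StreamLog(self.date)      # made-up upstream task

    def output(self):
        return luigi.hdfs.HdfsTarget('artist_counts/%s' % self.date)

    def mapper(self, line):
        user, artist, track = line.strip().split('\t')
        yield artist, 1

    def reducer(self, key, values):
        yield key, sum(values)
</pre>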
<p>I'm quite curious how this works out for Twitter. Do they find that the abstraction makes it easy for them to focus on ML and not care about the execution details? Or does it add complexity because it forces the ML to be written as UDF primitives? I'll try to find out more.</p>
I'm featured in Mashable2013-02-06T00:00:00Zhttps://erikbern.com/2013/02/06/im-featured-in-mashable.html<p><a href="http://mashable.com/2013/02/05/10-awesome-stem-jobs/">This article</a> from today in Mashable describes some of the fun stuff I get to work with:</p>
<p><em><a href="http://www.linkedin.com/profile/view?id=12890189&locale=en_US&trk=tyah" target="_blank">Erik Bernhardsson</a> is technical lead at Spotify, where he helped to build a music recommendation system based on large-scale machine learning algorithms, mainly matrix factorization of big matrices using <a href="http://hadoop.apache.org/" target="_blank">Hadoop</a>. He moved into this role after heading the Business Intelligence team, where he collected, aggregated and made sense of all the data at Spotify, whether that's ad-hoc insights, A/B testing, visualization or ad optimization.</em></p>
<p><em>Bernhardsson's roots at Spotify date back to 2008, when he interned for the company while writing his master's thesis on systems for automatic music recommendations (he was awarded master's thesis of the year by Naturvetarna, and we all know where that led).</em></p>
Slides from NYC Machine Learning talk2013-01-27T00:00:00Zhttps://erikbern.com/2013/01/27/slides-from-nyc-machine-learning-talk.html<p>Slides from the talk. Slightly edited because (a) some of the slides make little sense taken out of context and (b) Slideshare seems to have problems converting some of the stuff.</p>
<p><a title="Collaborative filtering at Spotify" href="http://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818">Collaborative filtering at Spotify</a> from <a href="http://www.slideshare.net/erikbern">Erik Bernhardsson</a></p>
NYC Machine Learning meetup2013-01-22T00:00:00Zhttps://erikbern.com/2013/01/22/nyc-machine-learning-meetup.html<p>From the <a href="http://www.meetup.com/NYC-Machine-Learning/">NYC Machine Learning</a> talk I gave last week:</p>
<p>I've only glanced at it briefly so far. Unfortunately the quality isn't the best.</p>
Momentum and mean reversion might just be volatility bias2013-01-13T00:00:00Zhttps://erikbern.com/2013/01/13/momentum-and-mean-reversion-might-just-be-volatility-bias.html<p>The Economist just published an article called <a href="http://www.economist.com/news/finance-and-economics/21569397-art-picking-mutual-funds-best-worst-and-ugly">The best, the worst and the ugly</a>. By looking at historical performance for mutual funds, they find strong support for momentum and mean reversion. Picking the <em>best</em> or the <em>worst</em> fund over the previous five years gives great returns over the next five years.</p>
<p>I think this is just confusion about what risk/reward is. Selecting the worst and best performing mutual funds is basically a way of selecting funds with high volatility. Any risky asset with higher volatility will give a slightly higher return. This is predicted by the <a href="http://en.wikipedia.org/wiki/Capital_asset_pricing_model">capital asset pricing model</a>, which AFAIK is a reasonable approximation of reality.</p>
Calculating cosine similarities using dimensionality reduction2012-12-05T00:00:00Zhttps://erikbern.com/2012/12/05/calculating-cosine-similarities-using-dimensionality-reduction.html<p>This was posted on the Twitter Engineering blog a few days ago: <a href="http://engineering.twitter.com/2012/11/dimension-independent-similarity.html">Dimension Independent Similarity Computation (DISCO)</a></p>
<p>I just glanced at the paper, and there's some cool stuff going on from a theoretical perspective. What I'm curious about is why they didn't decide to use dimensionality reduction to solve such a big problem. The benefit of this approach is that it scales much better (linear in input data size) and produces much better results. The drawback is that it's much harder to implement.</p>
<p>Dimensionality reduction is a lot messier to implement, but basically it works like this: you take your matrix $$ M $$ and factor it into matrices $$ A $$ and $$ B $$ so that $$ M = A^T B $$. Denoting the user vectors by $$ \mathbf{a}_u $$ and the item vectors by $$ \mathbf{b}_i $$, the idea is that $$ M_{ui} \approx \mathbf{a}_u^T \mathbf{b}_i $$. Furthermore, the cosine between users can be approximated pretty well by their low-dimensional counterparts.</p>
<p>This is great for two reasons. First of all, you generally use only a handful of dimensions so all you have to deal with now is super trivial calculations. Taking the cosine of two users or items is $$ O(f) $$ where f is a small number denoting the number of dimensions. You can also calculate user-item score by just taking dot products, also in $$ O(f) $$ .</p>
<p>Second of all, forcing everything down onto a few dimensions is a great way to reduce noise. An intuitive way to see this is that in the original matrix, if user A had a lot of items in common with users B and C, but B and C didn't have any items (or very few) in common, we would draw the conclusion that $$ \cos(B, C) = 0 $$. Working in a reduced dimensionality, we would probably still assign a pretty high similarity between B and C.</p>
<p>Now, you just reduced the problem to a much lower dimensionality and you still need to apply hashing techniques to find similar pairs. But if we can bring it down to something like 10 or 50 dimensions this is <em>much</em> easier to implement. One way to do it is to <a href="http://en.wikipedia.org/wiki/Locality-sensitive_hashing#Random_projection">cut through the space using random hyperplanes and hash by that.</a></p>
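<p>Here's a small sketch of the random hyperplane idea, just to make it concrete (all numbers arbitrary):</p>
<pre>import numpy as np

rng = np.random.RandomState(0)
f, n_bits = 40, 16
hyperplanes = rng.randn(n_bits, f)   # one random hyperplane per bit

def signature(v):
    # Which side of each hyperplane the vector falls on. Vectors at
    # a small angle tend to get the same bits, so bucketing on the
    # signature groups similar items together as candidate pairs.
    return tuple(np.dot(hyperplanes, v) > 0)

a = rng.randn(f)
b = a + 0.1 * rng.randn(f)           # a slightly perturbed copy of a
agree = sum(x == y for x, y in zip(signature(a), signature(b)))
print('%d of %d bits agree' % (agree, n_bits))
</pre>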
<p>Dimensionality reduction (aka matrix factorization) is no easy task in itself. <a href="http://mahout.apache.org/">Mahout</a> provides some tools to do this, but they don't scale super well to the scale that Spotify or Twitter operates at. Instead, you are probably stuck having to build something from scratch. Be warned, my friend – but at least I'd recommend these two papers:</p>
<p><a href="http://www.stat.osu.edu/~dmsl/Das_2007.pdf">Google News Personalization: Scalable Online Collaborative Filtering</a> – describes how to scale PLSA (a factorization method) to large data sets</p>
<p><a href="http://www2.research.att.com/~yifanhu/PUB/cf.pdf">Collaborative Filtering for Implicit Feedback Datasets</a> – describes another factorization algorithm that converges very fast in practice</p>
Tumblr's awesome project names2012-11-18T00:00:00Zhttps://erikbern.com/2012/11/18/tumblrs-awesome-project-names.html<p><img src="https://erikbern.com/assets/2012/11/ad_2_13_7.jpg" alt="image"></p>
<p>Not sure how I managed to miss this, but I'm watching this <a href="http://www.infoq.com/presentations/Concurrency-Tumblr">Tumblr presentation</a> and they talk about their projects named after <a href="http://en.wikipedia.org/wiki/Arrested_Development_(TV_series)">Arrested Development</a> topics: Gob, Parmesan, Buster, <a href="https://github.com/tumblr/jetpants">Jetpants</a>, Oscar, George and Motherboy.</p>
<p>The best software project name, though, is probably still Apple's <a href="http://en.wikipedia.org/wiki/Apple_Inc._litigation#Libel_dispute_with_Carl_Sagan">BHA</a>.</p>
A neat little trick with time decay2012-10-29T00:00:00Zhttps://erikbern.com/2012/10/29/a-neat-little-trick-with-time-decay.html<p>Something that pops up pretty frequently is to implement time decay, especially where you have recursive chains of jobs. For instance, say you want to keep track of a popularity score. You calculate today's output by reading yesterday's output, discounting it by $$ \exp(-\lambda \Delta T) $$ and then adding some hit count for today. Typically you choose $$ \lambda $$ so that $$ \exp(-\lambda \Delta T) = 0.95 $$ for a day or something like that. We do this to generate popularity scores for every track at Spotify.</p>
<p>There is another approach that doesn't require you to do the discounting and gives you a bit more flexibility. If you think about it, essentially what you want to calculate is $$ \sum \exp(-\lambda(T - t_i)) $$ where $$ T $$ is the current time and the sum is over all hits since we started keeping track of the popularity. This is a single score that takes time decay into account. You can add new hits by just discounting the existing sum and adding $$ 1 $$.</p>
<p>The problem is that you have to keep track of the timestamps together with the scores, or else you can't do proper discounting when you add numbers that were calculated at different points in time. There is a trick to get around this. Notice that the current time only introduces a constant factor $$ \exp(-\lambda T) $$, so let's just factor it out. You get $$ \exp(-\lambda T) \sum \exp(\lambda t_i) $$, which means you don't have to keep track of what time it is. You can just apply the discount factor when you need it! In practice this means you don't have to keep track of any timestamps at all.</p>
<p>Now, you need to keep track of the following thing instead: $$ S = \sum \exp(\lambda t_i) $$. This is a <em>ridiculously</em> large number, so in practice you need to store its logarithm $$ s = \log S $$ instead.</p>
<p>But if you store the logarithm, how do you add a new term? You basically want to calculate something like $$ \log(\exp(s) + \exp(u)) $$ where $$ u = \log U = \lambda t_i $$ represents a new term. This isn't possible by the naive method because the intermediate sum will overflow, but it turns out there is a simple trick to calculate this that makes the computer happy. This identity can be derived by some simple substitution and some logarithm identities: $$ \log(\exp(s) + \exp(u)) = \max(s, u) + \log(1 + \exp(\min(s, u) - \max(s, u))) $$</p>
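<p>In Python, the whole bookkeeping might look like this little sketch (Python floats overflow around exp(710), hence the trick):</p>
<pre>import math

def log_add(s, u):
    # log(exp(s) + exp(u)) without forming the huge intermediate sum
    a, b = max(s, u), min(s, u)
    return a + math.log1p(math.exp(b - a))

lam = -math.log(0.95) / 86400.0      # ~5% decay per day, t in seconds
s = float('-inf')                    # log of an empty sum
for t in (0.0, 86400.0, 172800.0):   # one hit per day for three days
    s = log_add(s, lam * t)

# Evaluate the decayed score on day four: S * exp(-lambda * T)
T = 259200.0
print(math.exp(s - lam * T))         # 0.95^3 + 0.95^2 + 0.95 ~= 2.71
</pre>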
<p>Furthermore, say you want to evaluate the score at some later point $$ T $$ in time. This is equivalent to $$ S \exp(-\lambda T) = \exp(s - \lambda T) $$, which can be evaluated without any overflow problems.</p>
Luigi: complex pipelines of tasks in Python2012-10-21T00:00:00Zhttps://erikbern.com/2012/10/21/luigi-build-complex-pipelines-of-tasks.html<p><a href="https://github.com/spotify/luigi"><img src="https://erikbern.com/assets/luigi.png" alt=""></a></p>
<p>I'm shamelessly promoting my first major open source project. Luigi is a Python module that helps you build complex pipelines of batch jobs, handle dependency resolution, and create visualizations to help manage multiple workflows. It also comes with <a href="http://hadoop.apache.org/">Hadoop</a> support built in (because that's really where its strength becomes clear).</p>
<p>We use Luigi internally at Spotify to run thousands of tasks every day, organized in complex dependency graphs. Luigi provides an infrastructure that powers several Spotify features including recommendations, top lists, A/B test analysis, external reports, internal dashboards, and many more.</p>
<p>Conceptually, Luigi is similar to <a href="http://www.gnu.org/software/make/">GNU Make</a> where you have certain tasks and these tasks in turn may have dependencies on other tasks.</p>
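<p>A minimal example of what that looks like (a toy sketch; real tasks would read from HDFS, run Hadoop jobs, etc.):</p>
<pre>import luigi
from collections import Counter

class Streams(luigi.Task):
    # Hypothetical upstream task producing a TSV of (user, track) plays.
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(self.date.strftime('streams_%Y%m%d.tsv'))

    def run(self):
        with self.output().open('w') as f:
            f.write('user1\ttrackA\nuser2\ttrackA\nuser1\ttrackB\n')

class TopTracks(luigi.Task):
    # Depends on Streams, make-style: Luigi runs it first if its
    # output doesn't exist yet.
    date = luigi.DateParameter()

    def requires(self):
        return Streams(self.date)

    def output(self):
        return luigi.LocalTarget(self.date.strftime('top_%Y%m%d.tsv'))

    def run(self):
        counts = Counter()
        with self.input().open('r') as f:
            for line in f:
                counts[line.strip().split('\t')[1]] += 1
        with self.output().open('w') as f:
            for track, n in counts.most_common(10):
                f.write('%s\t%d\n' % (track, n))
</pre>
<p>Invoking the downstream task (e.g. via luigi.run()) builds the whole dependency chain, skipping any task whose output already exists.</p>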
<p>Read more about it on Github: <a href="https://github.com/spotify/luigi">https://github.com/spotify/luigi</a>.</p>