Erik Bernhardsson    About

Black Box Machine Learning in the Cloud


There’s a bunch of companies working on machine learning as a service. Some old companies like Google, but now also Amazon and Microsoft.

Then there’s a ton of startups: PredictionIO ($2.7M funding), BigML ($1.6M funding), Clarifai, etc, etc. Here’s a nice map from Bloomberg showing some of the landscape.

As much as I love ML, I’m not super bullish on these companies. I wrote a pretty cynical tweet the other day


Instead of the negative let’s go through the ways I think a machine learning API can actually be useful (ok full disclosure: I don’t think it’s very many)

Does it solve an immediate business problem?

In ML dreamtown, an engineer realizes one day: “Hey, I have all these feature vectors, and these target values, and a loss function, I just wish someone could approximate a function for me”.

In reality, ML problems are often super messy, and it can be pretty challenging to get from data into a regression/classification problem (or anything else). Model fitting isn’t the issue, getting to model fitting is the hard part. Here is a bunch of real-world scenarios I worked with over the last years:

  1. Ad targeting based on music consumption. This is a very ill-defined problem and we need to figure out what we really want to solve.
  2. Predict churn. We can build this complex model that takes user features and predict whether they are going to churn out. The resulting model is generally not that useful though – it doesn’t give us any actionable insight.
  3. Predict ad clickthrough rate. Yes – we can take historical data and train a binary classifier, but it suffers from a lot of issues (such as observation bias, feedback loops, etc).

Does it focus on a particular niche?

It’s a lot more likely you solve an immediate business need if you focus on a specific niche. Natural language processing for sales people? I don’t know.

Focusing on a particular niche makes it easier to build something that works off the shelf. A general purpose outlier detection is not as useful as a model to detect insurance fraud.

**Does it build on proprietary data sets? **

If you have amassed enormous troves of security footage, or aerial photography, or financial data, or whatever, then you can train a model that no one else can train. You can then sell this model for lots of money, because the cost of building up this data set is a huge barrier for anyone else.

Is there secret sauce?

Remember that trying to build something secret you are up against about 10,000 machine learning researchers in academia who spend all their time trying to come up with new methods. It’s true that lots of machine learning has a bit of a gap between academia an industry. But this is just things that are hard and messy. That’s not a defensible asset in the long run.

Convolutional neural networks for instance. It’s still pretty messy to get Theano or Torch working – I know because I spent a lot of time reading papers and documentation to get a simple image classifier working. But the complexity of this is going to go down very quickly. In a year’s time there will be open source libraries with pre-trained models for image classification that are on par with anything you can get through an API (probably better).

Does it solve infrastructural issues?

Scaling applications is still a hard problem. Similarly the use of GPU’s in deep learning creates an artificial barrier for many companies who don’t want to deal with Amazon instances etc – there is some value in abstracting this away from users.

The question is what companies have problems that require large scale machine learning that don’t have problems that require scalability.

Do you have smart people?

I actually think the biggest upside in many of these companies is the possibility of acqui-hire. It’s no secret that machine learning engineers are high in demand. So maybe the best strategy is try to attack some super hard problem, ignore whether it’s actually useful, and hire as many smart people as possible.



Erik Bernhardsson

... is the CTO at Better, which is a startup changing how mortgages are done. I write a lot of code, some of which ends up being open sourced, such as Luigi and Annoy. I also co-organize NYC Machine Learning meetup. You can follow me on Twitter or see some more facts about me.