Stuff that bothers me: “100x faster than Hadoop”

The simplest way to get featured on big data blogs these days seems to be

  1. Build something that does 1 thing super well but nothing else
  2. Benchmark it against Hadoop
  3. Publish stats showing that it’s 100x faster than Hadoop
  4. $$$

Spark claims it’s 100x faster than Hadoop, and there are a lot of stats showing Redshift is 10x faster than Hadoop. There are a bunch of papers with similar claims. I spent five minutes Googling “Xx faster than Hadoop” and found a ton of other stats.


Annoy

Annoy is a simple package for finding approximate nearest neighbors (ANN) that I just put on GitHub. I’m not trying to compete with existing packages, but Annoy has a couple of features that make it pretty useful. Most importantly, it uses very little memory and can put everything in a contiguous blob that you can mmap from disk. This way multiple processes can share the same index.
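To make the mmap idea concrete, here is a minimal sketch using only the Python standard library. It is not Annoy's code or API: the names `write_blob`, `open_blob`, `nearest` and the fixed `DIM` are made up for illustration, and the search is a plain brute-force scan rather than an approximate one. It only shows why a contiguous, fixed-size-record file can be mapped read-only and queried in place.

```python
import mmap
import os
import struct
import tempfile

DIM = 3  # hypothetical vector dimensionality, chosen for the example
RECORD = struct.Struct(f"<{DIM}f")  # one fixed-size record of DIM float32 values


def write_blob(path, vectors):
    """Write all vectors back to back into one contiguous file."""
    with open(path, "wb") as f:
        for v in vectors:
            f.write(RECORD.pack(*v))


def open_blob(path):
    """Map the file read-only. Processes mapping the same file can share
    the underlying pages through the OS page cache."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def nearest(blob, query):
    """Exact brute-force scan over the mapped records (illustration only,
    not an approximate search). Returns the index of the closest vector."""
    best_i, best_d = -1, float("inf")
    for i in range(len(blob) // RECORD.size):
        v = RECORD.unpack_from(blob, i * RECORD.size)
        d = sum((a - b) ** 2 for a, b in zip(v, query))
        if d < best_d:
            best_i, best_d = i, d
    return best_i


# Usage: build the file once, then map and query it.
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
write_blob(path, [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (5.0, 5.0, 5.0)])
blob = open_blob(path)
print(nearest(blob, (0.9, 1.2, 1.0)))  # index of the closest stored vector
```

Because the records are read straight out of the mapping, nothing is deserialized into a per-process copy, which is the property that lets several processes work from one index file.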


ML at Twitter

I recently came across this paper describing how they do ML at Twitter.

TL;DR Their approach is pretty interesting. Everything is a Pig workflow, and all the machine learning steps are implemented as UDFs.

As long as your data flow can be expressed as small atomic machine learning functions, I’m sure it works great. But there’s so much more to it than that. All the small slicing, transforming, etc. is so much easier to express in a language like Python. I’m still not really comfortable with Pig as a language to power these data flows.


I'm featured in Mashable

This article from today in Mashable describes some of the fun stuff I get to work with:

Erik Bernhardsson is technical lead at Spotify, where he helped to build a music recommendation system based on large-scale machine learning algorithms, mainly matrix factorization of big matrices using Hadoop. He moved into this role after heading the Business Intelligence team, where he collected, aggregated and made sense of all the data at Spotify, whether that’s ad-hoc insights, A/B testing, visualization or ad optimization.
