Erik Bernhardsson

Stuff that bothers me: “100x faster than Hadoop”

The simplest way to get featured on a big data blog these days seems to be:

  1. Build something that does 1 thing super well but nothing else
  2. Benchmark it against Hadoop
  3. Publish stats showing that it’s 100x faster than Hadoop
  4. $$$

Spark claims it’s 100x faster than Hadoop, and there are a lot of stats showing Redshift is 10x faster than Hadoop. There’s a bunch of papers with similar claims. I spent five minutes Googling “Xx faster than Hadoop” and found a ton of other stats.

(Btw, when people say this, I generally take it to mean that X is Y times faster than Hadoop MapReduce. Just nitpicking.)

Anyway, these stats bother me a lot because everyone knows that

  • Horizontal scalability comes at a very high price, because things get I/O bound. That’s fine, because you can always throw more hardware at the problem.
  • Flexibility comes at a price, and that’s totally fine for most people. Hadoop supports pretty much anything that can be reduced to a series of MapReduce jobs, which in practice turns out to be most stuff.
  • Ease of use comes at a price, and that’s fine. There’s a reason a lot of people choose Python over C++, after all. OK, writing MapReduce jobs in Java sucks, but there are a lot of nice tools out there to make it simple (subtle product placement: check out Luigi).
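
To make the “series of MapReduce jobs” point a bit more concrete, here’s a toy sketch in plain Python. It is not Hadoop’s API: `run_mapreduce` and the mapper/reducer names are made up for illustration, and everything runs in memory on one machine. It chains two jobs, a word count followed by a count of how many words share each frequency.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one in-memory job: map each record, group by key, reduce each group.

    Hypothetical helper for illustration only, not part of Hadoop.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sort by key so the output order is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: classic word count.
def split_words(line):
    for word in line.lower().split():
        yield word, 1


def sum_counts(key, values):
    return key, sum(values)


# Job 2: consumes the output of job 1 and counts how many
# distinct words occur exactly n times.
def key_by_count(pair):
    _word, count = pair
    yield count, 1


lines = ["spark is fast", "hadoop is flexible", "fast is relative"]
word_counts = run_mapreduce(lines, split_words, sum_counts)
histogram = run_mapreduce(word_counts, key_by_count, sum_counts)
print(word_counts)
print(histogram)
```

The only thing each job has to provide is a mapper and a reducer, and the output of one job is just the input of the next. That generality is what you pay for.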

I think Spark is a really cool piece of technology, so don’t get me wrong. I just think it’s stupid to compare Hadoop and Spark when they are clearly two very different products with different use cases, just as you wouldn’t compare Tokyo Cabinet to MySQL or whatever. So please never ever say that something is X times faster than Hadoop again.

Erik Bernhardsson

... is the CTO at Better, which is a startup changing how mortgages are done. I write a lot of code, some of which ends up being open sourced, such as Luigi and Annoy. I also co-organize the NYC Machine Learning meetup. You can follow me on Twitter or see some more facts about me.