Stuff that bothers me: “100x faster than Hadoop”

The simplest way to get featured on a big data blog these days seems to be

  1. Build something that does 1 thing super well but nothing else
  2. Benchmark it against Hadoop
  3. Publish stats showing that it's 100x faster than Hadoop
  4. $$$

Spark claims it's 100x faster than Hadoop, and there are a lot of stats showing Redshift is 10x faster than Hadoop. There are a bunch of papers with similar claims. I spent five minutes Googling “Xx faster than Hadoop” and found a ton of other stats.

(Btw, when people say this, I generally take it to mean that X is Y times faster than Hadoop MapReduce. Just nitpicking.)

Anyway, these stats bother me a lot because everyone knows that

  • Horizontal scalability comes at a very high price, because things get I/O bound. That's fine, because you can always throw more hardware at the problem.
  • Flexibility comes at a price, and that's totally fine for most people. Hadoop supports pretty much anything that can be reduced to a series of MapReduce jobs, which in practice turns out to be most stuff.
  • Ease of use comes at a price, and that's fine. There's a reason a lot of people choose Python over C++, after all. Ok, writing MapReduce jobs in Java sucks, but there are a lot of nice tools out there to make it simple (subtle product placement: check out Luigi).
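To make the "series of MapReduce jobs" point a bit more concrete, here's a toy sketch in plain Python. It is not Hadoop and it is not Luigi; every name in it (`run_job`, `tokenize`, `words_by_frequency`, and so on) is made up for illustration, and everything runs in memory on one machine. It only shows the shape of the idea: a problem gets expressed as a mapper plus a reducer, and a slightly bigger problem gets expressed as two such jobs chained together.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory map -> shuffle -> reduce pass (a toy, not Hadoop)."""
    # Map: every record emits zero or more (key, value) pairs.
    pairs = []
    for record in records:
        pairs.extend(mapper(record))

    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Job 1: classic word count.
def tokenize(line):
    return [(word.lower(), 1) for word in line.split()]


def total(word, ones):
    return sum(ones)


# Job 2: invert the output of job 1, grouping words by how often they occur.
def by_count(item):
    word, count = item
    return [(count, word)]


def collect(count, words):
    return sorted(words)


def words_by_frequency(lines):
    """Chain the two jobs: the output of job 1 is the input of job 2."""
    counts = run_job(lines, tokenize, total)
    return run_job(counts.items(), by_count, collect)
```

Calling `words_by_frequency(["faster than Hadoop", "Faster than Spark"])` groups the words by how many times they appear. The second job knows nothing about text; it just consumes whatever the first job produced, which is roughly the flexibility being paid for.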

I think Spark is a really cool piece of technology, so don't get me wrong. I just think it's stupid to compare Hadoop and Spark when clearly they are two very different products with different use cases. Just as you wouldn't compare Tokyo Cabinet to MySQL or whatever. So please never ever say that something is X times faster than Hadoop again.
