Stuff that bothers me: “100x faster than Hadoop”
The simple way to get featured on big data blogs these days seems to be
- Build something that does 1 thing super well but nothing else
- Benchmark it against Hadoop
- Publish stats showing that it’s 100x faster than Hadoop
Spark claims it’s 100x faster than Hadoop and there’s a lot of stats showing Redshift is 10x faster than Hadoop. There’s a bunch of papers with similar claims. I spent five minutes Googling “Xx faster than Hadoop” and found a ton of other stats.
(Btw, when people say this, I generally take it to mean that X is Y times faster than Hadoop MapReduce. Just nitpicking.)
Anyway, these stats bother me a lot because everyone knows that
- Horizontal scalability comes at a very high price, because things get I/O bound. That’s fine, because you can always throw more hardware at the problem.
- Flexibility comes at a price, and that’s totally fine for most people. Hadoop supports pretty much anything that can be reduced to a series of MapReduce jobs, which in practice turns out to be most stuff.
- Ease of use comes at a price, and that’s fine. There’s a reason a lot of people choose Python over C++, after all. Ok, writing MapReduce jobs in Java sucks, but there’s a lot of nice tools out there to make it simple (subtle product placement: check out Luigi).
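To make the "series of MapReduce jobs" point concrete, here's a toy sketch in plain Python. It is not Hadoop and not Luigi: `run_mapreduce` and the mapper/reducer names are made up for illustration, and everything runs in memory on one machine. It just shows how a slightly less trivial question (which words occur how many times?) falls out of chaining two dumb map/reduce passes.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map/shuffle/reduce pass in memory (hypothetical stand-in for a real job)."""
    groups = defaultdict(list)
    for record in records:
        # Map: each record emits zero or more (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle: collect all values that share a key.
            groups[key].append(value)
    # Reduce: one output per key; sorted keys keep the result deterministic.
    return [reducer(key, groups[key]) for key in sorted(groups)]


# Job 1: classic word count.
def count_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, ones):
    return word, sum(ones)


# Job 2: consumes job 1's output and groups words by their count.
def freq_mapper(pair):
    word, count = pair
    yield count, word


def freq_reducer(count, words):
    return count, sorted(words)


lines = ["spark is fast", "hadoop is flexible", "fast is relative"]
word_counts = run_mapreduce(lines, count_mapper, count_reducer)
words_by_count = run_mapreduce(word_counts, freq_mapper, freq_reducer)
print(words_by_count)
```

Neither job is clever on its own. The flexibility comes from being able to feed the output of one into the next, and that's exactly the part you pay for.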
I think Spark is a really cool piece of technology, so don’t get me wrong. I just think it’s stupid to compare Hadoop and Spark when they are clearly two very different products with different use cases. Just as you wouldn’t compare Tokyo Cabinet to MySQL or whatever. So please never ever say that something is X times faster than Hadoop again.