Erik Bernhardsson    About


Just open sourced hdfs2cass which is a Hadoop job (written in Java) to do efficient Cassandra bulkloading. The nice thing is that it queries Cassandra for its topology and uses that to partition the data so that each reducer can upload data directly to a Cassandra node. It also builds SSTables locally etc. Not an expert at Cassandra so I’ll stop describing those parts before I embarrass myself.

The way we use it, Cassandra almost becomes some kind of a CDN. All we want to do is to push out large amounts of semi-static data, with no random writes at all. Cassandra is pretty good at high write loads so it’s a pretty good fit for the job.

Anyway, the reason why I think this is cool is that all of our previous bulk loading tool has had a single-machine bottleneck. Suddenly we can do many-to-many bulk loading. This means we now have a solution that scales horizontally. This starts to matter when you transfer lots of data. In contrast, for smaller files (< 50GB), we typically scp or even torrent files. This is pretty fast, but you are still constrained by the pipe of the source machine.

Another thing I like is that shows how Hadoop (I’m talking mapreduce now) is great at sharding up things, dealing with failures, retries, etc. It’s awesome to rely on that framework when you are pushing a terabyte and you expect some fraction of failed reduce tasks.

Want to get blog posts over email?

Enter your email address and get weekly emails with new articles!

Erik Bernhardsson

... is the CTO at Better, which is a startup changing how mortgages are done. I write a lot of code, some of which ends up being open sourced, such as Luigi and Annoy. I also co-organize NYC Machine Learning meetup. You can follow me on Twitter or see some more facts about me.