A brief history of Hadoop at Spotify
I was talking with some data engineers at Spotify and had a moment of nostalgia.
I was writing my master’s thesis at Spotify and had to run a Hadoop job to extract some data from the logs. Every time I started running the job, I kept hearing this subtle noise. I kept noticing the correlation for a few days but I was too intimidated to ask. Finally people starting cursing that their machines had gotten really slow lately and I realized we were running Hadoop on the developer’s desktop machines. No one had told me. I think back then we had only GB’s of log data. I remember running less on the log and I would recognize half the usernames because they were my friends.
We took a bunch of machines and put them on a pallet in the foosball room. It was a super hot Swedish summer and I kept running this matrix factorization job in Hadoop that would fail halfway through. The node on the top of the pile would crash and you had to reboot it. I suspected overheating. We had a fan running in the room but it wasn’t helping. Finally I realized the problem was the sun was shining in through the window.
Jon Åslund with our Hadoop cluster
I found a big sheet or blanket and some nails and a hammer and put it up over the window. I was finally able to run my matrix factorization job to completion after doing this. This is probably going to be my favorite bug fix until the day I die.
In the summer of 2009, we installed a 30-node Hadoop cluster in our data center in Stockholm. Finally a “real” cluster.
More and more people started using Hadoop so we decided to move to Elastic Mapreduce. I uploaded all our logs to S3 and we put together some tooling so that you could run things on our own Hadoop cluster or on EC2 using the same source code. It was pretty beautiful but the performance wasn’t super great compared to how much we were paying.
Later in 2011 we had grown even more. We decided to move back to our own data center. We installed 500 nodes in our data center in London, later upgrading it to 700 and then 900 nodes.
Our fifth Hadoop cluster
I also implemented Luigi as a workflow engine with Mapreduce support in late 2011.
There was this long-standing assumption (at least I had) that Hadoop jobs were I/O bound and thus the language didn’t matter. We were using Python for probably 95% of all jobs, with some stuff in Hive by the analytics team. During 2012 and forward, we started realizing Python isn’t the ideal language, both from a performance and usability point of view. Eventually we would end up switching to Crunch and Scalding. We still use Luigi as the workflow engine to glue everything together.