ML at Twitter
I recently came across this paper describing how they do ML at Twitter.
TL;DR: Their approach is pretty interesting. Everything is a Pig workflow, and all the ML is implemented as UDFs.
As long as your problem can be expressed as small, atomic machine learning functions, I'm sure it works great. But there's so much more to it than that. All the small slicing and transforming steps are so much easier to express in a language like Python. I'm still not really comfortable with Pig as a language to power these data flows.
John Cook wrote about math stuff in Python a few months ago: "I find doing mathematical programming in a general-purpose language is easier than doing general-purpose programming in a mathematical language." I could not agree more, and I want to generalize it further: I'd rather do domain-specific programming in a general-purpose language than the other way around.
Pig definitely comes with a set of cool features. It handles joins automatically (not a trivial task), gives you simple ways to transform your data, and does the query planning for you.
It's not a full-fledged workflow management tool though. At Spotify, we use Luigi to define the workflows and the dependencies. Luigi gives you a way to write everything in 100% Python, including how different computational tasks are related and what depends on what. It comes with Hadoop support, but the abstraction layer is lower than Pig's, and you have to implement your mapper and reducer yourself. For skewed joins this could involve some engineering around things that Pig does automatically for you.
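To make "implement your mapper and reducer yourself" a bit more concrete, here is a rough sketch of a hand-written reduce-side join in plain Python. It is not Luigi's API and not how Pig plans a join; the names (`mapper`, `reducer`, `run_join`, the `"left"`/`"right"` tags) and the sample data are made up for illustration, and the shuffle is simulated with an in-memory dict.

```python
from collections import defaultdict

LEFT, RIGHT = "left", "right"  # hypothetical tags marking which input a record came from


def mapper(source, record):
    """Emit (join_key, (source, value)) so the reducer can tell the two sides apart."""
    key, value = record
    yield key, (source, value)


def reducer(key, tagged_values):
    """Pair every left value with every right value for one key (inner join)."""
    left = [v for s, v in tagged_values if s == LEFT]
    right = [v for s, v in tagged_values if s == RIGHT]
    for l in left:
        for r in right:
            yield key, l, r


def run_join(left_records, right_records):
    """Simulate map -> shuffle -> reduce in memory."""
    groups = defaultdict(list)  # stands in for the shuffle: all values for a key end up together
    for source, records in ((LEFT, left_records), (RIGHT, right_records)):
        for record in records:
            for key, tagged in mapper(source, record):
                groups[key].append(tagged)
    joined = []
    for key in sorted(groups):
        joined.extend(reducer(key, groups[key]))
    return joined


# Made-up sample data: (user_id, name) and (user_id, track)
users = [(1, "alice"), (2, "bob")]
plays = [(1, "track_a"), (1, "track_b"), (3, "track_c")]
joined = run_join(users, plays)
print(joined)
```

Note that every value for a given key ends up in a single `reducer` call. If one key is far more common than the rest, that one call does most of the work, which is the kind of skew you would have to engineer around by hand here.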
I'm quite curious how this works out for Twitter. Do they find that the abstraction makes it easy for them to focus on ML and not care about the execution details? Or does it add complexity because it forces the ML to be written as UDF primitives? I'll try to find out more.