Implicit data and collaborative filtering
A lot of people these days know about collaborative filtering. It's that Netflix Prize thing, right? People rate things 1-5 stars and then you have to predict missing ratings.
While there's no doubt that the Netflix Prize was successful, I think it created an illusion that all recommender systems care about explicit 1-5 ratings, with RMSE as the objective. Some people are even skeptical when I describe the approach we take at Spotify.
Misconception 1: Recommender systems are about predicting missing ratings.
This is not true. In our case at Spotify, we have a huge matrix of users and items, where each element contains the number of times user u played track i. Note that all of the matrix entries are known: a zero is a zero, and it means that user u played track i exactly 0 times.
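To make that concrete, here's a tiny sketch of what such a play-count matrix looks like as data. The numbers and dimensions are completely made up, and scipy is just one convenient way to hold a sparse matrix like this:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy play log: (user index, track index, play count).
# In reality this would come from aggregated listening logs.
plays = [
    (0, 0, 17),   # user 0 played track 0 seventeen times
    (0, 3, 2),
    (1, 0, 1),
    (2, 5, 240),
]

rows, cols, counts = zip(*plays)
n_users, n_tracks = 3, 6

# Every entry of this matrix is "known": anything not listed is an
# explicit zero, meaning that user played that track exactly 0 times.
play_counts = coo_matrix(
    (counts, (rows, cols)), shape=(n_users, n_tracks)
).tocsr()

print(play_counts.toarray())
```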
Actually, even Netflix themselves have stated that there's much more information in the implicit data than the explicit. Using implicit data has received a lot less attention, probably because the Netflix Prize was so successful.
Misconception 2: Recommender systems use squared loss.
This is one of my biggest pet peeves. Think about it – what does squared loss mean, from a Bayesian perspective? It means you assume the errors are drawn from a normal distribution. That's a reasonable approximation for 1-5 star ratings (although questionable even there), but it's a horrible way to fit play-count data (a Poisson distribution would be a far more reasonable assumption). Some people's reaction to this is to transform the data to a more reasonable scale before taking the squared difference, but then your model becomes even harder to interpret.
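Here's a small illustrative sketch of what that difference looks like: minimizing squared loss is, up to constants, minimizing a Gaussian negative log-likelihood, while a Poisson assumption gives a rather different objective for count data. The numbers are made up and this isn't anyone's production model:

```python
import numpy as np

def gaussian_nll(observed, predicted, sigma=1.0):
    # Up to constants, the Gaussian negative log-likelihood is just
    # squared loss: minimizing it assumes normally distributed errors.
    return np.sum((observed - predicted) ** 2) / (2 * sigma ** 2)

def poisson_nll(observed, predicted_rate):
    # Poisson negative log-likelihood (dropping the log-factorial
    # constant): a more natural fit for non-negative count data.
    return np.sum(predicted_rate - observed * np.log(predicted_rate))

plays = np.array([0, 0, 1, 3, 250])      # heavily skewed play counts
rate = np.array([0.1, 0.2, 1.0, 2.0, 200.0])

print(gaussian_nll(plays, rate))  # dominated by the one heavy listener
print(poisson_nll(plays, rate))
```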
If there's one lesson here, it's that every loss function is an assumption about how the data is generated. That's why I prefer generative models in the first place, such as PLSA or LDA. These methods were originally developed for modeling text, but the "bag of words" approach turns out to work great for implicit collaborative filtering. Note that there are some algorithms that use squared loss even for implicit collaborative filtering, but I'm not sure what they really assume about the data.
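As a hedged sketch of what the "bag of words" analogy looks like in practice: treat each user's play history as a document and each track ID as a word, then fit an off-the-shelf topic model. This uses gensim's LDA purely for illustration, with made-up track IDs; it's not a description of any production system:

```python
from gensim import corpora, models

# Each "document" is one user's listening history, with track IDs as words
# (a track played n times simply appears n times).
user_histories = [
    ["track_a", "track_a", "track_b", "track_c"],
    ["track_b", "track_c", "track_c", "track_d"],
    ["track_a", "track_d", "track_e"],
]

dictionary = corpora.Dictionary(user_histories)
corpus = [dictionary.doc2bow(history) for history in user_histories]

# Fit LDA; the "topics" end up looking like musical styles or niches.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# A user's topic mixture is a compact taste vector for that user.
print(lda[corpus[0]])
```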
Misconception 3: Recommender systems are predictive models.
This is a subtle one. You can look at the Netflix Prize as a challenge to predict unknown values, and in the same way you can look at implicit collaborative filtering as essentially a predictive model where you're trying to predict what the user is going to do in the future. But just because you can predict that user u is going to play track i, does that mean it's a good recommendation? After all, there might be some super obscure track j that user u would love if they had actually found it. Just recommending the most likely track i introduces a strong popularity bias.
This relates to the previous misconception. Even if we have a loss function that corresponds to a generative model, that doesn't mean we have a way of optimizing recommendation quality by minimizing that loss. So what should we do? Luckily, it turns out there are some tricks, like normalizing for popularity, that work reasonably well in practice.
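Here's a rough sketch of one such trick, under the illustrative assumption that we already have raw affinity scores for each track: dampen each score by the track's global popularity raised to some exponent, trading off pure prediction against discovery. The function, parameter names, and exponent are all made up for illustration:

```python
import numpy as np

def popularity_normalized_scores(raw_scores, track_popularity, alpha=0.5):
    """Down-weight popular tracks so obscure-but-relevant ones can surface.

    raw_scores: predicted affinity of one user for each track.
    track_popularity: global play counts (or listener counts) per track.
    alpha: 0 means pure prediction, 1 means fully popularity-corrected.
    """
    return raw_scores / np.power(track_popularity, alpha)

raw = np.array([0.9, 0.5, 0.4])           # model thinks track 0 is the best bet
popularity = np.array([1e6, 1e3, 50.0])   # ...but track 0 is also a huge hit

print(popularity_normalized_scores(raw, popularity))
# The obscure track 2 now ranks highest, which may be a better recommendation.
```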
Misconception 4: Recommender systems are all about recommending items to users.
I would actually argue that detecting item similarity using collaborative filtering is more important. This is another area where I've seen hardly any research, and I don't have a lot of good ideas myself, although for some reason item-item cosine similarity works really well in latent factor models.
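For what it's worth, the item-item cosine trick is simple to write down: take the latent factor vectors learned by whatever model you use and rank items by cosine similarity. A minimal numpy sketch, where the factor matrix and dimensions are made up:

```python
import numpy as np

def most_similar_items(item_factors, item_index, top_k=5):
    """Rank items by cosine similarity to one item's latent factor vector."""
    # Normalize each item's factor vector to unit length.
    norms = np.linalg.norm(item_factors, axis=1, keepdims=True)
    unit = item_factors / norms
    # Cosine similarity is then just a dot product against the query item.
    sims = unit @ unit[item_index]
    ranked = np.argsort(-sims)
    return [i for i in ranked if i != item_index][:top_k]

# Pretend latent factors for 1000 tracks in a 40-dimensional space.
rng = np.random.default_rng(0)
item_factors = rng.normal(size=(1000, 40))

print(most_similar_items(item_factors, item_index=123))
```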