In Chapter 7, we studied personalization through feature-based regression models. We reviewed models based on item-item similarity, user-user similarity, and matrix factorization in Chapter 2. In practice, matrix factorization has better prediction accuracy in warm-start scenarios, that is, when predicting responses for (user, item) pairs where both the user and the item have a large number of observations in the training data. However, prediction accuracy deteriorates for users (or items) with few or no responses in the training data. Such cases are often referred to as cold-start scenarios. In Section 8.1, we describe the regression-based latent factor model (RLFM), which extends matrix factorization to leverage available user and item features simultaneously. This strategy improves performance in both cold-start and warm-start scenarios within a single framework. We then develop the model-fitting algorithms in Section 8.2 and illustrate the performance of RLFM on a number of data sets in Section 8.3. Finally, in Section 8.4, we discuss model-fitting strategies for training RLFM on the very large data sets that are typical of modern web recommendation systems, and we evaluate their performance in Section 8.5.
Regression-Based Latent Factor Model (RLFM)
RLFM extends matrix factorization to leverage both features and users’ past responses to items within a single modeling framework (Agarwal and Chen, 2009; Zhang et al., 2011). It provides a principled way to combine the strengths of collaborative filtering and content-based methods. When past response data are not sufficient to determine the latent factors of a user or an item, RLFM estimates them through a regression model based on features; otherwise, the latent factors are estimated as in matrix factorization. The key property of RLFM is its ability to transition seamlessly, along a continuum, from cold start to warm start as more response data accumulate.
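The transition from cold start to warm start can be illustrated with a deliberately simplified sketch. It assumes a single scalar user factor with a Gaussian prior centered on a linear regression of the user's features, and Gaussian responses centered on that factor. This is a one-dimensional stand-in chosen for illustration, not the full RLFM. The function name, the regression weights `g`, and the variance parameters are all hypothetical.

```python
import numpy as np

def posterior_mean_factor(x_i, g, responses, prior_var=1.0, noise_var=1.0):
    """Posterior mean of a scalar user factor (illustrative assumption).

    Prior:      factor ~ N(g . x_i, prior_var)   (regression on features)
    Likelihood: y      ~ N(factor, noise_var)    for each observed response
    """
    prior_mean = float(np.dot(g, x_i))
    responses = np.asarray(responses, dtype=float)
    n = responses.size
    # Precision-weighted average of the feature-based prior and the data.
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + responses.sum() / noise_var) / precision

x_i = np.array([1.0, 0.0])   # hypothetical user feature vector
g = np.array([0.5, -0.2])    # hypothetical regression weights

cold = posterior_mean_factor(x_i, g, [])             # no responses yet
one = posterior_mean_factor(x_i, g, [2.0])           # a single response
warm = posterior_mean_factor(x_i, g, [2.0] * 1000)   # many responses
```

With no responses, the estimate equals the regression prediction from features. As responses accumulate, the data term dominates and the estimate approaches the average observed response, which is the behavior one would expect from a factor estimated purely from past responses.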
Data and Notation. As usual, we use $i$ to denote a user and $j$ to denote an item. Let $y_{ij}$ denote the response that user $i$ gives item $j$. This response can be of various types, as described in Section 2.3.2, including numeric response (e.g., numeric ratings) modeled using the Gaussian response model and binary response (e.g., click or not) modeled using the logistic response model.
In addition to response data, we also have features available for each user and item. Let $x_i$, $x_j$, and $x_{ij}$ denote feature vectors for user $i$, item $j$, and pair $(i, j)$, respectively.
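As a rough illustration of the two response types mentioned above, the following sketch computes the log-likelihood of a single response under a Gaussian model and under a logistic model. It assumes that each (user, item) pair has some real-valued score that the model produces. The name `score`, the function names, and the `noise_var` parameter are assumptions made for this example and are not defined in the text.

```python
import math

def gaussian_log_likelihood(y, score, noise_var=1.0):
    """Log-density of a numeric response y under N(score, noise_var)."""
    return (-0.5 * math.log(2.0 * math.pi * noise_var)
            - (y - score) ** 2 / (2.0 * noise_var))

def logistic_log_likelihood(y, score):
    """Log-probability of a binary response y in {0, 1}.

    Assumes P(y = 1) = 1 / (1 + exp(-score)), so that
    log P(y = 1) = -log(1 + exp(-score)) and
    log P(y = 0) = -log(1 + exp(score)).
    """
    if y == 1:
        return -math.log1p(math.exp(-score))
    return -math.log1p(math.exp(score))

# Hypothetical observations stored as (user i, item j, response y_ij).
ratings = [(0, 2, 4.0), (1, 2, 3.0)]   # numeric responses
clicks = [(0, 1, 1), (1, 3, 0)]        # binary responses

rating_ll = sum(gaussian_log_likelihood(y, 3.5) for _, _, y in ratings)
click_ll = sum(logistic_log_likelihood(y, 0.0) for _, _, y in clicks)
```

Here every pair is given the same placeholder score purely to keep the example short; in a fitted model the score would depend on the features and latent factors of the specific user and item.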