probabilistic Matrix Factorization: Difference between revisions

From statwiki
Jump to navigation Jump to search
Line 15: Line 15:
=Probabilistic Matrix Factorization=
=Probabilistic Matrix Factorization=


Given the preference matrix described above with entries <math>R_{ij}</math>. The aim is to find a factorization that minimizes the root mean squared error (RMSE) on the test set, an initial attempt is to use linear model where we assume that there is Gaussian noise in the data ('''Figure 1'''). Define <math>I_{ij}</math> to be 1 if <math>R_{ij}</math> is known and 0 otherwise. Further, let <math>\,\mathcal N(x|\mu,\sigma^2)</math> = f_X(x)</math> with <math>X \sim  \mathcal N(\mu,\sigma^2)</math>. Then, we can define a conditional probability of the ratings and priors on <math>U</math> and <math>V</math>.
Given the preference matrix described above with entries <math>R_{ij}</math>. The aim is to find a factorization that minimizes the root mean squared error (RMSE) on the test set, an initial attempt is to use linear model where we assume that there is Gaussian noise in the data ('''Figure 1'''). Define <math>I_{ij}</math> to be 1 if <math>R_{ij}</math> is known and 0 otherwise. Further, let <math>\,\mathcal N(x|\mu,\sigma^2) = f_X(x)</math> with <math>X \sim  \mathcal N(\mu,\sigma^2)</math>. Then, we can define a conditional probability of the ratings with hyperparameter <math>\sigma^2</math>  


<math> p(R|U,V,\sigma^2) = \prod_{i=1}^N \prod_{j=1}^M \left[ \mathcal N(R_{ij}|U_i^TV_j,\sigma^2)\right]^{I_{ij}}\,\,\,\,\,\,\,\,(1)</math>
<math> p(R|U,V,\sigma^2) = \prod_{i=1}^N \prod_{j=1}^M \left[ \mathcal N(R_{ij}|U_i^TV_j,\sigma^2)\right]^{I_{ij}}\,\,\,\,\,\,\,\,(1)</math>


<math>p(U|\sigma^2_U) = \prod_{i=1}^N \mathcal N(U_i|0,\sigma^2_U\textbf{I})\,\,\,\,\,\,\,\,p(V|\sigma^2_V) = \prod_{i=1}^M \mathcal N(V_i|0,\sigma^2_V\textbf{I})</math>
and priors on <math>U</math> and <math>V</math> with hyperparameters <math>\sigma^2_U, \sigma^2_V</math>
 
<math>p(U|\sigma^2_U) = \prod_{i=1}^N \mathcal N(U_i|0,\sigma^2_U\textbf{I})\,\,\,\,\,\,\,\,p(V|\sigma^2_V) = \prod_{i=1}^M \mathcal N(V_i|0,\sigma^2_V\textbf{I})\,\,\,\,\,\,\,\,(2)</math>
 
Then we aim to maximize the log posterior over <math>U</math> and <math>V</math>
<math>\ln p(U,V,\sigma^2,\sigma^2_V,\sigma^2_U) = - \frac{1}{2\sigma^2}\sum_{i=1}^N\sum_{j=1}^M I_{ij}(R_{ij} - U_i^TV_j)^2 - \frac{1}{2\sigma^2_U} \sum_{i=1}^N U_i^TU_i - \frac{1}{2\sigma^2_V} \sum_{i=1}^N V_i^TV_i </math><math> - \frac{1}{2}\left(K\ln\sigma^2 + ND\ln\sigma^2_U + MD\ln\sigma^2_V\right) + C\,\,\,\,\,\,\,\,(4)</math>
 
where <math>K</math> is the number of entries that are known and <math>C</math> is a constant independent of the parameters. We can fix the three variance hyperparameters as constants which reduces the optimization to the first three terms. Then defining <math>\lambda_M = \sigma^2/\sigma^2_M</math> for <math>M=U,V</math> and multiplying by <math>-\sigma^2< 0</math> results in a goal of minimizing
 
<math>E = \frac{1}{2}\left(\sum_{i=1}^N\sum_{j=1}^M I_{ij}(R_{ij} - U_i^TV_j)^2 + \lambda_U \sum_{i=1}^N \parallel U_i \parallel^2_F + \lambda_V \sum_{i=1}^M \parallel V_i \parallel^2_F \right)\,\,\,\,\,\,\,(5) </math>
 
where <math>\parallel A \parallel^2_F = \|A\|_F=\sqrt{\sum_{i=1}^m\sum_{j=1}^n |a_{ij}|^2}</math> is the [http://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm| Frobenius norm].

Revision as of 15:55, 22 November 2010

UNDER CONSTRUCTION

Introduction and Motivation (Netflix Dataset)

The problem provides an approach to collaborative filtering that can scale linearly and, thus, handle large datasets. The methods are discussed in the context of the Netflix dataet.

This dataset is [math]\displaystyle{ N \times M }[/math] matrix where each row corresponds to a user, each column to a movie and the entries to a rating. Ratings take on values [math]\displaystyle{ 1,...,K }[/math]. The difficulty in this problem arises from the fact that each user has only rated a subset of movies. If there is some latent rank-[math]\displaystyle{ D }[/math] structure to these ratings (that is, underlying variables that determine how much a user enjoys a movie) then it would beneficial to factor the matrix.

Formally, given a preference matrix [math]\displaystyle{ R }[/math] which is [math]\displaystyle{ N \times M }[/math], we wish to find an [math]\displaystyle{ N \times D }[/math] matrix [math]\displaystyle{ U^T }[/math] and an [math]\displaystyle{ D \times M }[/math] matrix [math]\displaystyle{ V }[/math]. [math]\displaystyle{ V }[/math] is a factor matrix (i.e. for the [math]\displaystyle{ D }[/math] latent variables, it gives the value for each movie) and [math]\displaystyle{ U^T }[/math] gives coefficients of these factors for each user. The actual "Prize" dataset has [math]\displaystyle{ N=480,189, M = 17,770 }[/math] with over 100 million of these entries having values.

There are several existing ways to solve this problem but they often require approximations that are slow and, at times, inaccurate. Another instinctive approach might be to use SVD. However, since there is a lot of missing data in this case, the optimization problem that needs to be solved is not convex. Thus, existing methods do not scale well and often have lower accuracy in cases where a row is sparse.

The proposed solution is to use a probabilistic model.

Probabilistic Matrix Factorization

Given the preference matrix described above with entries [math]\displaystyle{ R_{ij} }[/math]. The aim is to find a factorization that minimizes the root mean squared error (RMSE) on the test set, an initial attempt is to use linear model where we assume that there is Gaussian noise in the data (Figure 1). Define [math]\displaystyle{ I_{ij} }[/math] to be 1 if [math]\displaystyle{ R_{ij} }[/math] is known and 0 otherwise. Further, let [math]\displaystyle{ \,\mathcal N(x|\mu,\sigma^2) = f_X(x) }[/math] with [math]\displaystyle{ X \sim \mathcal N(\mu,\sigma^2) }[/math]. Then, we can define a conditional probability of the ratings with hyperparameter [math]\displaystyle{ \sigma^2 }[/math]

[math]\displaystyle{ p(R|U,V,\sigma^2) = \prod_{i=1}^N \prod_{j=1}^M \left[ \mathcal N(R_{ij}|U_i^TV_j,\sigma^2)\right]^{I_{ij}}\,\,\,\,\,\,\,\,(1) }[/math]

and priors on [math]\displaystyle{ U }[/math] and [math]\displaystyle{ V }[/math] with hyperparameters [math]\displaystyle{ \sigma^2_U, \sigma^2_V }[/math]

[math]\displaystyle{ p(U|\sigma^2_U) = \prod_{i=1}^N \mathcal N(U_i|0,\sigma^2_U\textbf{I})\,\,\,\,\,\,\,\,p(V|\sigma^2_V) = \prod_{i=1}^M \mathcal N(V_i|0,\sigma^2_V\textbf{I})\,\,\,\,\,\,\,\,(2) }[/math]

Then we aim to maximize the log posterior over [math]\displaystyle{ U }[/math] and [math]\displaystyle{ V }[/math] [math]\displaystyle{ \ln p(U,V,\sigma^2,\sigma^2_V,\sigma^2_U) = - \frac{1}{2\sigma^2}\sum_{i=1}^N\sum_{j=1}^M I_{ij}(R_{ij} - U_i^TV_j)^2 - \frac{1}{2\sigma^2_U} \sum_{i=1}^N U_i^TU_i - \frac{1}{2\sigma^2_V} \sum_{i=1}^N V_i^TV_i }[/math][math]\displaystyle{ - \frac{1}{2}\left(K\ln\sigma^2 + ND\ln\sigma^2_U + MD\ln\sigma^2_V\right) + C\,\,\,\,\,\,\,\,(4) }[/math]

where [math]\displaystyle{ K }[/math] is the number of entries that are known and [math]\displaystyle{ C }[/math] is a constant independent of the parameters. We can fix the three variance hyperparameters as constants which reduces the optimization to the first three terms. Then defining [math]\displaystyle{ \lambda_M = \sigma^2/\sigma^2_M }[/math] for [math]\displaystyle{ M=U,V }[/math] and multiplying by [math]\displaystyle{ -\sigma^2\lt 0 }[/math] results in a goal of minimizing

[math]\displaystyle{ E = \frac{1}{2}\left(\sum_{i=1}^N\sum_{j=1}^M I_{ij}(R_{ij} - U_i^TV_j)^2 + \lambda_U \sum_{i=1}^N \parallel U_i \parallel^2_F + \lambda_V \sum_{i=1}^M \parallel V_i \parallel^2_F \right)\,\,\,\,\,\,\,(5) }[/math]

where [math]\displaystyle{ \parallel A \parallel^2_F = \|A\|_F=\sqrt{\sum_{i=1}^m\sum_{j=1}^n |a_{ij}|^2} }[/math] is the Frobenius norm.